[GH-ISSUE #15017] nemotron-cascade-2 not working in parallel #56161

Closed
opened 2026-04-29 10:20:21 -05:00 by GiteaMirror · 2 comments

Originally created by @lclrd on GitHub (Mar 22, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15017

What is the issue?

When using Docker Compose with the following environment variables, nemotron-cascade-2 still processes requests serially rather than in parallel. I have tested gpt-oss:20b and it handles parallel requests fine; it appears that only nemotron-cascade-2 is affected.

    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - OLLAMA_KEEP_ALIVE=99999999m
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KV_CACHE_TYPE=f16 # q8_0
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_HOST=0.0.0.0:11434
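
A quick way to see whether requests are actually being served in parallel is to fire several at once and compare wall times: with working parallelism they overlap, while a serial runner finishes them one after another. A minimal sketch in Go (the model name, host, and prompt are assumptions, not taken from the original report):

    // main.go: fire N requests at the Ollama API and time each one.
    // If requests run in parallel the wall times overlap; if they are
    // processed serially, each request takes roughly one extra
    // single-request duration per queue position.
    package main

    import (
    	"bytes"
    	"fmt"
    	"io"
    	"net/http"
    	"sync"
    	"time"
    )

    func main() {
    	const n = 4 // matches OLLAMA_NUM_PARALLEL=4
    	// Model name, host, and prompt are illustrative assumptions.
    	body := []byte(`{"model":"nemotron-cascade-2","prompt":"Count to 20.","stream":false}`)

    	var wg sync.WaitGroup
    	for i := 0; i < n; i++ {
    		wg.Add(1)
    		go func(id int) {
    			defer wg.Done()
    			start := time.Now()
    			resp, err := http.Post("http://localhost:11434/api/generate",
    				"application/json", bytes.NewReader(body))
    			if err != nil {
    				fmt.Printf("request %d failed: %v\n", id, err)
    				return
    			}
    			io.Copy(io.Discard, resp.Body) // drain the response before timing
    			resp.Body.Close()
    			fmt.Printf("request %d finished in %s\n", id, time.Since(start))
    		}(i)
    	}
    	wg.Wait()
    }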

Relevant log output

time=2026-03-22T23:50:53.437Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 42335"
time=2026-03-22T23:50:53.988Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-03-22T23:50:54.084Z level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=nemotron_h_moe
time=2026-03-22T23:50:54.183Z level=INFO source=server.go:246 msg="enabling flash attention"
time=2026-03-22T23:50:54.183Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-9e0c827cfd6a6d000032be3da3d0914668b0c1112977e927186d29c4487466c4 --port 37627"
time=2026-03-22T23:50:54.184Z level=INFO source=sched.go:484 msg="system memory" total="62.7 GiB" free="33.1 GiB" free_swap="7.2 GiB"
time=2026-03-22T23:50:54.184Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 library=CUDA available="23.1 GiB" free="23.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-22T23:50:54.184Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6 library=CUDA available="22.9 GiB" free="23.3 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-22T23:50:54.184Z level=INFO source=server.go:757 msg="loading model" "model layers"=53 requested=-1
time=2026-03-22T23:50:54.207Z level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-03-22T23:50:54.207Z level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:37627"
time=2026-03-22T23:50:54.217Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType:f16 NumThreads:16 GPULayers:53[ID:GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 Layers:53(0..52)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-22T23:50:54.276Z level=INFO source=ggml.go:136 msg="" architecture=nemotron_h_moe file_type=Q4_K_M name="" description="" num_tensors=401 num_key_values=45
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2026-03-22T23:50:54.522Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-03-22T23:50:55.729Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType:f16 NumThreads:16 GPULayers:53[ID:GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 Layers:27(0..26) ID:GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6 Layers:26(27..52)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-22T23:50:56.718Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType:f16 NumThreads:16 GPULayers:53[ID:GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 Layers:27(0..26) ID:GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6 Layers:26(27..52)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-22T23:50:57.891Z level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType:f16 NumThreads:16 GPULayers:53[ID:GPU-93b61cac-f5db-49f5-17f4-1ed78856e2b5 Layers:27(0..26) ID:GPU-d3b5fad5-12b0-0e51-7181-da25cb711fd6 Layers:26(27..52)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-22T23:50:57.891Z level=INFO source=ggml.go:482 msg="offloading 52 repeating layers to GPU"
time=2026-03-22T23:50:57.891Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-03-22T23:50:57.891Z level=INFO source=ggml.go:494 msg="offloaded 53/53 layers to GPU"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="10.8 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="11.5 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:245 msg="model weights" device=CPU size="231.0 MiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.6 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="1.1 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="852.5 MiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="418.0 MiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.2 MiB"
time=2026-03-22T23:50:57.891Z level=INFO source=device.go:272 msg="total memory" size="26.5 GiB"
time=2026-03-22T23:50:57.891Z level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-03-22T23:50:57.891Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-03-22T23:50:57.892Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-03-22T23:51:00.654Z level=INFO source=server.go:1388 msg="llama runner started in 6.47 seconds"

OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.18.2

GiteaMirror added the bug label 2026-04-29 10:20:21 -05:00

@rick-github commented on GitHub (Mar 23, 2026):

time=2026-03-22T23:50:54.084Z level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=nemotron_h_moe
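
That warning is the scheduler clamping the slot count: every subsequent load request in the log shows Parallel:1, even though OLLAMA_NUM_PARALLEL=4 was set. A minimal sketch of that kind of architecture gate follows; the names and structure are illustrative, not Ollama's actual sched.go:

    // Hypothetical sketch of the gate implied by the warning above;
    // names and structure are assumptions, not Ollama's real code.
    package sched

    import "log/slog"

    // Architectures that cannot yet serve multiple request slots.
    var noParallelArchs = map[string]bool{
    	"nemotron_h_moe": true,
    }

    // effectiveParallel clamps the requested slot count to 1 for
    // architectures that do not support concurrent requests.
    func effectiveParallel(arch string, requested int) int {
    	if noParallelArchs[arch] && requested > 1 {
    		slog.Warn("model architecture does not currently support parallel requests",
    			"architecture", arch)
    		return 1
    	}
    	return requested
    }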

@lclrd commented on GitHub (Mar 23, 2026):

oops... sorry for the dupe issue!

Reference: github-starred/ollama#56161