[GH-ISSUE #14621] Qwen3.5:9b concurrent call BUG #55987

Open
opened 2026-04-29 10:06:37 -05:00 by GiteaMirror · 14 comments
Owner

Originally created by @BARERM on GitHub (Mar 4, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14621

What is the issue?

Summary:

Despite having 128 GB of unified memory on an ARM-based NVIDIA DGX Spark (GB10) and setting OLLAMA_NUM_PARALLEL correctly, Ollama (v0.17.6) fails to handle concurrent requests for the qwen3.5 architecture.

Details:

  1. Parallelism Downgraded: The server logs show a warning: model architecture does not currently support parallel requests for architecture=qwen35. It overrides the environment variable and forces Parallel: 1.

  2. Crash (SIGABRT): When attempting to force concurrent calls or during the model loading phase for parallel execution, the runner crashes with a SIGABRT during ggml_backend_sched_reserve.

  3. Platform Specificity: This issue occurs on the NVIDIA DGX Spark (ARM64). The device is detected as iGPU with 119.7 GiB VRAM. Similar configurations work on Apple Silicon (macOS) but fail here, suggesting a backend/scheduling bug in the Linux-ARM64 CUDA runner for this specific architecture.
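
For reference, on a systemd-managed Linux install OLLAMA_NUM_PARALLEL is typically set via a drop-in override (unit and path names follow the standard Ollama install; adjust for your setup):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
```

followed by `systemctl daemon-reload && systemctl restart ollama`. Per the logs below, the scheduler still overrides this and forces Parallel: 1 for the qwen35 architecture.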

Relevant log output

#### 1. The Architecture Warning (Parallelism Blocked):

level=WARN source=sched.go:450 msg="model architecture does not currently support parallel requests" architecture=qwen35
level=INFO source=runner.go:1302 msg=load request="{Operation:fit ... Parallel:1 ... FlashAttention:Enabled KvSize:4096 ...}"

#### 2. The Backend Crash (SIGABRT):

SIGABRT: abort
PC=0xfe7256047608 m=11 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 20 [syscall]:
runtime.cgocall(...)
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_reserve(0xfe71e10b9950, 0xfe6f5eceda10)
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0x400043c100)
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0x400024f0e0, 0x1)

#### 3. Hardware Environment:

level=INFO source=types.go:42 msg="inference compute" id=GPU-... library=CUDA compute=12.1 name=CUDA0 description="NVIDIA GB10" libdirs=ollama,cuda_v13 driver=13.0 pci_id=000f:01:00.0 type=iGPU total="119.7 GiB" available="61.4 GiB"

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.17.6

GiteaMirror added the bug label 2026-04-29 10:06:37 -05:00
Author
Owner

@rick-github commented on GitHub (Mar 4, 2026):

  1. Some model architectures [don't currently support parallelism](https://github.com/ollama/ollama/issues/4165).
  2. How are you forcing concurrent calls?

<!-- gh-comment-id:3999930591 -->
Author
Owner

@BARERM commented on GitHub (Mar 5, 2026):

Hi @rick-github,

I am simply opening multiple terminal windows and executing `curl` commands against the Ollama API at roughly the same time.

Here is how I triggered it:

  1. Start the Ollama server on DGX Spark.
  2. Open 2 or 3 terminal windows.
  3. Run the standard curl request in them simultaneously.

For example:

curl http://172.31.0.51:11434/api/generate -d '{
  "model": "qwen3.5:9b", 
  "prompt": "你是什么模型?", 
  "stream": true
}'
<!-- gh-comment-id:4002684818 -->
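
The reproduction above can be scripted rather than done by hand in separate terminals. A minimal sketch (the model name matches the report; the server URL is an assumption, point it at your own instance; the Chinese prompt, "What model are you?", is translated here):

```shell
#!/bin/sh
# Fire N concurrent /api/generate requests at an Ollama server.
# OLLAMA_URL is an assumption -- replace with your server address.
OLLAMA_URL="${OLLAMA_URL:-http://127.0.0.1:11434}"
N="${1:-3}"

i=1
while [ "$i" -le "$N" ]; do
  # Each curl runs in the background so the requests overlap in flight.
  curl -s "$OLLAMA_URL/api/generate" \
    -d '{"model": "qwen3.5:9b", "prompt": "What model are you?", "stream": false}' \
    > "resp_$i.json" &
  i=$((i + 1))
done
wait  # block until every background request has finished
```

Responses land in `resp_1.json` … `resp_N.json`; with the reported bug, some of these come back empty or as 500 errors when the runner aborts.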
Author
Owner

@rick-github commented on GitHub (Mar 5, 2026):

This works as expected on the DGX Spark in the lab - running 4 simultaneous curl requests shows them completing in serial operation with no crashes. [Server logs](https://docs.ollama.com/troubleshooting) may aid in debugging.

<!-- gh-comment-id:4004816611 -->
Author
Owner

@NixerWong commented on GitHub (Mar 6, 2026):

The OLLAMA_NUM_PARALLEL parameter has no effect; Ollama still works serially across multiple terminals.

<!-- gh-comment-id:4009366192 -->
Author
Owner

@BARERM commented on GitHub (Mar 6, 2026):

> This works as expected on the DGX Spark in the lab - running 4 simultaneous curl requests shows them completing in serial operation with no crashes. Server logs may aid in debugging.

Hi @rick-github, thanks for testing this. I think we are looking at two different things here:

True concurrency (My main concern): My goal is to get actual parallel processing, not serial queuing. The logs show it forces Parallel: 1 because the qwen35 architecture "does not currently support parallel requests". Is there a plan to support true parallelism for Qwen models?

The SIGABRT crash: I understand it should gracefully fall back to serial processing (like it did in your lab), but on my specific DGX Spark setup, triggering these simultaneous requests causes it to crash with SIGABRT instead of queuing them.

I will run OLLAMA_DEBUG=1 ollama serve and attach the full crash logs in a moment to help figure out why the serial fallback is breaking on my machine.

<!-- gh-comment-id:4009381034 -->
Author
Owner

@NixerWong commented on GitHub (Mar 6, 2026):

All models in the Qwen 3.5 series have the same problem: the OLLAMA_NUM_PARALLEL parameter has no effect, and inference only runs serially.

<!-- gh-comment-id:4009776609 -->
Author
Owner

@scmarvin commented on GitHub (Mar 6, 2026):

I am also seeing the parallelism problem in all Qwen 3.5 integrations with Ollama's current version (v0.17.7), while Qwen 3 models function as expected under Ollama. This is not an inherent limitation of the model, however: from my research, the Qwen 3.5 models do support parallel inference. From what I can see, the issue is specific to Ollama's current integration of this Qwen version.

<!-- gh-comment-id:4014007714 -->
Author
Owner

@rick-github commented on GitHub (Mar 7, 2026):

> Is there a plan to support true parallelism for Qwen models?

There's an open ticket so I expect it will be addressed eventually.

> on my specific DGX Spark setup, triggering these simultaneous requests causes it to crash with SIGABRT

This is likely independent of the parallelism issue. Logs may help in understanding what's going on.

<!-- gh-comment-id:4017006735 -->
Author
Owner

@KevinTurnbull commented on GitHub (Mar 8, 2026):

I naively removed the safety check for qwen35, and here's what I get in the debug logs. Long story short: it's the graph itself that fails, not just a precautionary check about VL models.

FlashAttention:Enabled KvSize:140000 KvCacheType: NumThreads:6 GPULayers:26[ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Layers:26(6..31)] MultiUserCache:false ProjectorPath:       
  MainGPU:0 UseMmap:false}"                                                                                                                                                           
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.208-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32              
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0             
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0  
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0             
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default=""       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default=""               
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1      
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found"                                               
  key=qwen35.rope.scaling.original_context_length default=0                                                                                                                           
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0          
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0             
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0        
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false    
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found"                                               
  key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07                                                                                                        
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base              
  default=10000                                                                                                                                                                       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings   
  default=2304                                                                                                                                                                        
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token              
  default=false                                                                                                                                                                       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0     
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.eos_token_ids              
  default="&{size:0 values:[]}"                                                                                                                                                       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.469-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442                                   
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.157-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=16775 splits=107                                  
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.164-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2463 splits=3                                     
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="3.1 GiB"                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.0 GiB"                             
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="7.6 GiB"                                
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="1.6 GiB"                                  
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="2.6 GiB"                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="629.1 MiB"                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:272 msg="total memory" size="18.5 GiB"                                        
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=591052800                   
  required.CPU.Weights="[268028672 135971584 135971584 132057088 122995456 122995456 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1747209152]" required.CPU.Cache="[219545600  
  219545600 219545600 573571072 219545600 219545600 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=659647456                                              
  required.CUDA0.ID=GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 required.CUDA0.Weights="[0 0 0 0 0 0 135971584 119080960 122995456 135971584 122995456 119080960 135971584 122995456     
  122995456 132057088 122995456 122995456 135971584 119080960 122995456 135971584 122995456 119080960 135971584 122995456 122995456 132057088 135971584 135971584 135971584 132057088 
   0]" required.CUDA0.Cache="[0 0 0 0 0 0 219545600 573571072 219545600 219545600 219545600 573571072 219545600 219545600 219545600 573571072 219545600 219545600 219545600 573571072 
   219545600 219545600 219545600 573571072 219545600 219545600 219545600 573571072 219545600 219545600 219545600 573571072 0]" required.CUDA0.Graph=2787689984                        
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=server.go:976 msg="available gpu" id=GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7           
  library=CUDA "available layer vram"="11.0 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="2.6 GiB"                                                                      
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=server.go:793 msg="new layout created"                                                  
  layers="26[ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Layers:26(6..31)]"                                                                                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=runner.go:1302 msg=load request="{Operation:commit LoraPath:[] Parallel:4 BatchSize:512  
  FlashAttention:Enabled KvSize:140000 KvCacheType: NumThreads:6 GPULayers:26[ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Layers:26(6..31)] MultiUserCache:false ProjectorPath:       
  MainGPU:0 UseMmap:false}"                                                                                                                                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=ggml.go:482 msg="offloading 26 repeating layers to GPU"                                  
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"                                         
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=ggml.go:494 msg="offloaded 26/33 layers to GPU"                                          
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="3.1 GiB"                            
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="3.0 GiB"                              
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="7.6 GiB"                                 
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="1.6 GiB"                                   
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="2.6 GiB"                            
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="629.1 MiB"                            
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:272 msg="total memory" size="18.5 GiB"                                         
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=sched.go:565 msg="loaded runners" count=1                                                
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=sched.go:729 msg="evaluating already loaded"                                            
  model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                                                                
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"                        
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.166-04:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading   
  model"                                                                                                                                                                              
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.167-04:00 level=DEBUG source=server.go:1394 msg="model load progress 0.00"                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.417-04:00 level=DEBUG source=server.go:1394 msg="model load progress 0.33"                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.669-04:00 level=DEBUG source=server.go:1394 msg="model load progress 0.65"                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.921-04:00 level=DEBUG source=server.go:1394 msg="model load progress 0.92"                                           
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.126-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0             
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.172-04:00 level=INFO source=server.go:1388 msg="llama runner started in 6.96 seconds"                                
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.172-04:00 level=DEBUG source=sched.go:577 msg="finished setting up"                                                  
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000                                                                                                                                                                
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.172-04:00 level=DEBUG source=sched.go:729 msg="evaluating already loaded"                                            
  model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                                                                
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.233-04:00 level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=14416 format=""                 
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.233-04:00 level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=14439 format=""                 
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.233-04:00 level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=14312 format=""                 
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.264-04:00 level=DEBUG source=cache.go:151 msg="loading cache slot" id=0 cache=0 prompt=2961 used=0 remaining=2961    
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.265-04:00 level=DEBUG source=cache.go:151 msg="loading cache slot" id=1 cache=0 prompt=2973 used=0 remaining=2973    
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.265-04:00 level=DEBUG source=cache.go:151 msg="loading cache slot" id=2 cache=0 prompt=2980 used=0 remaining=2980    
  Mar 08 10:33:11 linai ollama[680738]: panic: failed to build graph: model does not support operation                                                                                
  Mar 08 10:33:11 linai ollama[680738]: goroutine 11 [running]:                                                                                                                       
  Mar 08 10:33:11 linai ollama[680738]: github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc0005565a0, {0x1b15c30, 0xc0005b26e0})                                           
  Mar 08 10:33:11 linai ollama[680738]:         /code/ollama/runner/ollamarunner/runner.go:455 +0x325                                                                                 
  Mar 08 10:33:11 linai ollama[680738]: created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1                                                                
  Mar 08 10:33:11 linai ollama[680738]:         /code/ollama/runner/ollamarunner/runner.go:1442 +0x4c9                                                                                
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:33475/completion\":    
  EOF"                                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: [GIN] 2026/03/08 - 10:33:12 | 500 | 20.552741187s |   192.168.69.16 | POST     "/api/chat"                                                    
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:33475/completion\":    
  EOF"                                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: [GIN] 2026/03/08 - 10:33:12 | 500 | 20.522634051s |   192.168.69.16 | POST     "/api/chat"                                                    
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=DEBUG source=sched.go:585 msg="context for request finished"                                         
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=DEBUG source=sched.go:354 msg="after processing request finished event"                              
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000 refCount=2                                                                                                                                                     
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=DEBUG source=sched.go:431 msg="context for request finished"                                         
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=DEBUG source=sched.go:354 msg="after processing request finished event"                              
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000 refCount=1                                                                                                                                                     
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.219-04:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:33475/completion\":    
  EOF"                                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: [GIN] 2026/03/08 - 10:33:12 | 500 | 20.610894761s |   192.168.69.16 | POST     "/api/chat"                                                    
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.219-04:00 level=DEBUG source=sched.go:431 msg="context for request finished"                                         
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.220-04:00 level=DEBUG source=sched.go:336 msg="runner with non-zero duration has gone idle, adding timer"            
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000 duration=5m0s                                                                                                                                                  
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.220-04:00 level=DEBUG source=sched.go:354 msg="after processing request finished event"                              
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000 refCount=0                                                                                                                                                     

The key line:

  • panic: failed to build graph: model does not support operation

This happened when the runner tried to process 3 concurrent requests (Parallel:4). The qwen3.5 DeltaNet architecture genuinely can't build a compute graph for multiple parallel sequences - it's not just a conservative block.
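The downgrade described above can be sketched as a simple scheduler guard. This is a hypothetical illustration (the function and map names are invented, not Ollama's actual `sched.go` code) of what forcing `Parallel: 1` for an unsupported architecture looks like:

```go
package main

import "fmt"

// effectiveParallel is an illustrative sketch of the guard implied by the
// sched.go warning: architectures whose recurrent/DeltaNet state cannot be
// batched across independent sequences get their requested parallelism
// forced down to 1 before the runner loads. Names here are assumptions.
func effectiveParallel(arch string, requested int) int {
	// Denylist of architectures that cannot build a multi-sequence graph.
	noParallel := map[string]bool{"qwen35": true}
	if noParallel[arch] && requested > 1 {
		fmt.Printf("WARN: model architecture does not currently support parallel requests architecture=%s\n", arch)
		return 1
	}
	return requested
}

func main() {
	fmt.Println(effectiveParallel("qwen35", 4)) // downgraded to 1
	fmt.Println(effectiveParallel("llama", 4))  // requested value honored
}
```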


@KevinTurnbull commented on GitHub (Mar 8, 2026):

I've got a technical fix. It keeps the runner from crashing under concurrent load, but because of the nature of the graph the total throughput is essentially the same (on my NVIDIA Quadro RTX 5000 from 2018, at least).

40s per request for 3 serialized requests == 120s for 3 concurrent requests. On the plus side, they do stream concurrently, which can have perceived performance benefits.
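The timing math above can be checked with a back-of-the-envelope sketch (the 40s/3-request numbers are taken from this comment; the function is purely illustrative): serialized requests finish at 40s, 80s, and 120s, while interleaved (not truly parallel) execution finishes all three near 120s, so total wall-clock time is unchanged and only streaming latency improves.

```go
package main

import "fmt"

// finishTimes computes when each of n requests completes, given a fixed
// per-request cost. Interleaved mode models round-robin decoding: every
// request completes at roughly n * perRequest, matching the observation
// that concurrency here improves streaming, not throughput.
func finishTimes(perRequest float64, n int, interleaved bool) []float64 {
	times := make([]float64, n)
	for i := 0; i < n; i++ {
		if interleaved {
			times[i] = perRequest * float64(n)
		} else {
			// Back-to-back: request i waits for the i earlier ones.
			times[i] = perRequest * float64(i+1)
		}
	}
	return times
}

func main() {
	fmt.Println(finishTimes(40, 3, false)) // [40 80 120]
	fmt.Println(finishTimes(40, 3, true))  // [120 120 120]
}
```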

But - here's a patch for you. Notably I did not test it on anything other than qwen35 (although I suspect it would work for the whole family) -- nor did I test it for images.

qwen35-parallel.patch

@rick-github -- Ball's in your court. :)


@NixerWong commented on GitHub (Mar 18, 2026):

Any update on this issue? When can we expect an official fix?


@NixerWong commented on GitHub (Apr 6, 2026):

Any update on this issue? When can we expect an official fix?


@bingoct commented on GitHub (Apr 8, 2026):

Still has the problem in v0.20.3.


@KevinTurnbull commented on GitHub (Apr 10, 2026):

I think the holdup is that there's a non-trivial rewrite of the graph runner needed to support real concurrency. My patch doesn't really move the needle in terms of throughput.
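The flat-throughput behavior can be illustrated with a toy model (all names are invented; this is not Ollama's actual runner code): a runner that cannot batch a recurrent architecture's state ends up building and executing one compute graph per sequence per decoding step, so "concurrency" is interleaving and the total graph work is unchanged.

```go
package main

import "fmt"

// sequence is a toy stand-in for one in-flight request's decoding state.
type sequence struct{ id, remaining int }

// stepAll advances every active sequence by one token, one graph at a
// time. It returns the number of single-sequence graph executions, which
// is why aggregate throughput matches the serialized case.
func stepAll(seqs []*sequence) int {
	graphs := 0
	for _, s := range seqs {
		if s.remaining > 0 {
			// A batched runner would build one graph for all active
			// sequences here; this runner pays the cost per sequence.
			s.remaining--
			graphs++
		}
	}
	return graphs
}

func main() {
	seqs := []*sequence{{0, 2}, {1, 2}, {2, 2}}
	total := 0
	for {
		g := stepAll(seqs)
		if g == 0 {
			break
		}
		total += g
	}
	fmt.Println(total) // 6 graph executions: 3 sequences x 2 tokens each
}
```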

Reference: github-starred/ollama#55987