[GH-ISSUE #14621] Qwen3.5:9b concurrent call BUG #55987

Open
opened 2026-04-29 10:06:37 -05:00 by GiteaMirror · 14 comments
Owner

Originally created by @BARERM on GitHub (Mar 4, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14621

What is the issue?

Summary:

Despite having 128 GB of unified memory on an ARM-based NVIDIA DGX Spark (GB10) and setting OLLAMA_NUM_PARALLEL correctly, Ollama (v0.17.6) fails to handle concurrent requests for the qwen3.5 architecture.

Details:

  1. Parallelism Downgraded: The server logs show a warning: model architecture does not currently support parallel requests for architecture=qwen35. It overrides the environment variable and forces Parallel: 1.

  2. Crash (SIGABRT): When attempting to force concurrent calls or during the model loading phase for parallel execution, the runner crashes with a SIGABRT during ggml_backend_sched_reserve.

  3. Platform Specificity: This issue occurs on the NVIDIA DGX Spark (ARM64). The device is detected as iGPU with 119.7 GiB VRAM. Similar configurations work on Apple Silicon (macOS) but fail here, suggesting a backend/scheduling bug in the Linux-ARM64 CUDA runner for this specific architecture.
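
For reference, on a systemd-managed Linux install OLLAMA_NUM_PARALLEL is typically set via a drop-in override (unit and path names follow the standard Ollama install; adjust for your setup):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
```

followed by `systemctl daemon-reload && systemctl restart ollama`. Per the logs below, the scheduler still overrides this and forces Parallel: 1 for the qwen35 architecture.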

Relevant log output

#### 1. The Architecture Warning (Parallelism Blocked):

level=WARN source=sched.go:450 msg="model architecture does not currently support parallel requests" architecture=qwen35
level=INFO source=runner.go:1302 msg=load request="{Operation:fit ... Parallel:1 ... FlashAttention:Enabled KvSize:4096 ...}"

#### 2. The Backend Crash (SIGABRT):

SIGABRT: abort
PC=0xfe7256047608 m=11 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 20 [syscall]:
runtime.cgocall(...)
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_reserve(0xfe71e10b9950, 0xfe6f5eceda10)
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0x400043c100)
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0x400024f0e0, 0x1)

#### 3. Hardware Environment:

level=INFO source=types.go:42 msg="inference compute" id=GPU-... library=CUDA compute=12.1 name=CUDA0 description="NVIDIA GB10" libdirs=ollama,cuda_v13 driver=13.0 pci_id=000f:01:00.0 type=iGPU total="119.7 GiB" available="61.4 GiB"

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.17.6

GiteaMirror added the bug label 2026-04-29 10:06:37 -05:00
Author
Owner

@rick-github commented on GitHub (Mar 4, 2026):

  1. Some model architectures [don't currently support parallelism](https://github.com/ollama/ollama/issues/4165).
  2. How are you forcing concurrent calls?

<!-- gh-comment-id:3999930591 -->
Author
Owner

@BARERM commented on GitHub (Mar 5, 2026):

Hi @rick-github,

I am simply opening multiple terminal windows and executing `curl` commands against the Ollama API at roughly the same time.

Here is how I triggered it:

  1. Start the Ollama server on DGX Spark.
  2. Open 2 or 3 terminal windows.
  3. Run the standard curl request in them simultaneously.

For example:

curl http://172.31.0.51:11434/api/generate -d '{
  "model": "qwen3.5:9b", 
  "prompt": "你是什么模型?", 
  "stream": true
}'
<!-- gh-comment-id:4002684818 -->
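
The reproduction above can be scripted rather than done by hand in separate terminals. A minimal sketch (the model name matches the report; the server URL is an assumption, point it at your own instance; the Chinese prompt, "What model are you?", is translated here):

```shell
#!/bin/sh
# Fire N concurrent /api/generate requests at an Ollama server.
# OLLAMA_URL is an assumption -- replace with your server address.
OLLAMA_URL="${OLLAMA_URL:-http://127.0.0.1:11434}"
N="${1:-3}"

i=1
while [ "$i" -le "$N" ]; do
  # Each curl runs in the background so the requests overlap in flight.
  curl -s "$OLLAMA_URL/api/generate" \
    -d '{"model": "qwen3.5:9b", "prompt": "What model are you?", "stream": false}' \
    > "resp_$i.json" &
  i=$((i + 1))
done
wait  # block until every background request has finished
```

Responses land in `resp_1.json` … `resp_N.json`; with the reported bug, some of these come back empty or as 500 errors when the runner aborts.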
Author
Owner

@rick-github commented on GitHub (Mar 5, 2026):

This works as expected on the DGX Spark in the lab - running 4 simultaneous curl requests shows them completing in serial operation with no crashes. [Server logs](https://docs.ollama.com/troubleshooting) may aid in debugging.

<!-- gh-comment-id:4004816611 -->
Author
Owner

@NixerWong commented on GitHub (Mar 6, 2026):

The OLLAMA_NUM_PARALLEL parameter has no effect; Ollama still works serially across multiple terminals.

<!-- gh-comment-id:4009366192 -->
Author
Owner

@BARERM commented on GitHub (Mar 6, 2026):

> This works as expected on the DGX Spark in the lab - running 4 simultaneous curl requests shows them completing in serial operation with no crashes. Server logs may aid in debugging.

Hi @rick-github, thanks for testing this. I think we are looking at two different things here:

True concurrency (My main concern): My goal is to get actual parallel processing, not serial queuing. The logs show it forces Parallel: 1 because the qwen35 architecture "does not currently support parallel requests". Is there a plan to support true parallelism for Qwen models?

The SIGABRT crash: I understand it should gracefully fall back to serial processing (like it did in your lab), but on my specific DGX Spark setup, triggering these simultaneous requests causes it to crash with SIGABRT instead of queuing them.

I will run OLLAMA_DEBUG=1 ollama serve and attach the full crash logs in a moment to help figure out why the serial fallback is breaking on my machine.

<!-- gh-comment-id:4009381034 -->
Author
Owner

@NixerWong commented on GitHub (Mar 6, 2026):

All models in the Qwen 3.5 series have the same problem: the OLLAMA_NUM_PARALLEL parameter has no effect, and inference only runs serially.

<!-- gh-comment-id:4009776609 -->
Author
Owner

@scmarvin commented on GitHub (Mar 6, 2026):

I am also seeing the parallelism problem in all Qwen 3.5 integrations with Ollama's current version (v0.17.7), while Qwen 3 models function as expected under Ollama. This is not an inherent limitation of the model, however: from my research, the Qwen 3.5 models do support parallel inference. From what I can see, the issue is specific to Ollama's current integration of this Qwen version.

<!-- gh-comment-id:4014007714 -->
Author
Owner

@rick-github commented on GitHub (Mar 7, 2026):

> Is there a plan to support true parallelism for Qwen models?

There's an open ticket so I expect it will be addressed eventually.

> on my specific DGX Spark setup, triggering these simultaneous requests causes it to crash with SIGABRT

This is likely independent of the parallelism issue. Logs may help in understanding what's going on.

<!-- gh-comment-id:4017006735 -->
Author
Owner

@KevinTurnbull commented on GitHub (Mar 8, 2026):

I naively removed the safety check for qwen35, and here's what I get in the debug logs. Long story short: it's the graph itself that fails, not just a precautionary check about VL models.

FlashAttention:Enabled KvSize:140000 KvCacheType: NumThreads:6 GPULayers:26[ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Layers:26(6..31)] MultiUserCache:false ProjectorPath:       
  MainGPU:0 UseMmap:false}"                                                                                                                                                           
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.208-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32              
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0             
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0  
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0             
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default=""       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default=""               
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1      
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found"                                               
  key=qwen35.rope.scaling.original_context_length default=0                                                                                                                           
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0          
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0             
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0        
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false    
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found"                                               
  key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07                                                                                                        
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base              
  default=10000                                                                                                                                                                       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings   
  default=2304                                                                                                                                                                        
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token              
  default=false                                                                                                                                                                       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0     
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.214-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.eos_token_ids              
  default="&{size:0 values:[]}"                                                                                                                                                       
  Mar 08 10:33:01 linai ollama[680738]: time=2026-03-08T10:33:01.469-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442                                   
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.157-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=16775 splits=107                                  
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.164-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2463 splits=3                                     
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="3.1 GiB"                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="3.0 GiB"                             
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="7.6 GiB"                                
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="1.6 GiB"                                  
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="2.6 GiB"                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="629.1 MiB"                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=device.go:272 msg="total memory" size="18.5 GiB"                                        
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=591052800                   
  required.CPU.Weights="[268028672 135971584 135971584 132057088 122995456 122995456 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1747209152]" required.CPU.Cache="[219545600  
  219545600 219545600 573571072 219545600 219545600 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=659647456                                              
  required.CUDA0.ID=GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 required.CUDA0.Weights="[0 0 0 0 0 0 135971584 119080960 122995456 135971584 122995456 119080960 135971584 122995456     
  122995456 132057088 122995456 122995456 135971584 119080960 122995456 135971584 122995456 119080960 135971584 122995456 122995456 132057088 135971584 135971584 135971584 132057088 
   0]" required.CUDA0.Cache="[0 0 0 0 0 0 219545600 573571072 219545600 219545600 219545600 573571072 219545600 219545600 219545600 573571072 219545600 219545600 219545600 573571072 
   219545600 219545600 219545600 573571072 219545600 219545600 219545600 573571072 219545600 219545600 219545600 573571072 0]" required.CUDA0.Graph=2787689984                        
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=server.go:976 msg="available gpu" id=GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7           
  library=CUDA "available layer vram"="11.0 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="2.6 GiB"                                                                      
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=server.go:793 msg="new layout created"                                                  
  layers="26[ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Layers:26(6..31)]"                                                                                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=runner.go:1302 msg=load request="{Operation:commit LoraPath:[] Parallel:4 BatchSize:512  
  FlashAttention:Enabled KvSize:140000 KvCacheType: NumThreads:6 GPULayers:26[ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Layers:26(6..31)] MultiUserCache:false ProjectorPath:       
  MainGPU:0 UseMmap:false}"                                                                                                                                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=ggml.go:482 msg="offloading 26 repeating layers to GPU"                                  
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"                                         
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=ggml.go:494 msg="offloaded 26/33 layers to GPU"                                          
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="3.1 GiB"                            
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="3.0 GiB"                              
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="7.6 GiB"                                 
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="1.6 GiB"                                   
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="2.6 GiB"                            
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="629.1 MiB"                            
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=device.go:272 msg="total memory" size="18.5 GiB"                                         
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=sched.go:565 msg="loaded runners" count=1                                                
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=DEBUG source=sched.go:729 msg="evaluating already loaded"                                            
  model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                                                                
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.165-04:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"                        
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.166-04:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading   
  model"                                                                                                                                                                              
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.167-04:00 level=DEBUG source=server.go:1394 msg="model load progress 0.00"                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.417-04:00 level=DEBUG source=server.go:1394 msg="model load progress 0.33"                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.669-04:00 level=DEBUG source=server.go:1394 msg="model load progress 0.65"                                           
  Mar 08 10:33:03 linai ollama[680738]: time=2026-03-08T10:33:03.921-04:00 level=DEBUG source=server.go:1394 msg="model load progress 0.92"                                           
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.126-04:00 level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0             
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.172-04:00 level=INFO source=server.go:1388 msg="llama runner started in 6.96 seconds"                                
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.172-04:00 level=DEBUG source=sched.go:577 msg="finished setting up"                                                  
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000                                                                                                                                                                
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.172-04:00 level=DEBUG source=sched.go:729 msg="evaluating already loaded"                                            
  model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                                                                
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.233-04:00 level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=14416 format=""                 
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.233-04:00 level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=14439 format=""                 
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.233-04:00 level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=14312 format=""                 
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.264-04:00 level=DEBUG source=cache.go:151 msg="loading cache slot" id=0 cache=0 prompt=2961 used=0 remaining=2961    
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.265-04:00 level=DEBUG source=cache.go:151 msg="loading cache slot" id=1 cache=0 prompt=2973 used=0 remaining=2973    
  Mar 08 10:33:04 linai ollama[680738]: time=2026-03-08T10:33:04.265-04:00 level=DEBUG source=cache.go:151 msg="loading cache slot" id=2 cache=0 prompt=2980 used=0 remaining=2980    
  Mar 08 10:33:11 linai ollama[680738]: panic: failed to build graph: model does not support operation                                                                                
  Mar 08 10:33:11 linai ollama[680738]: goroutine 11 [running]:                                                                                                                       
  Mar 08 10:33:11 linai ollama[680738]: github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc0005565a0, {0x1b15c30, 0xc0005b26e0})                                           
  Mar 08 10:33:11 linai ollama[680738]:         /code/ollama/runner/ollamarunner/runner.go:455 +0x325                                                                                 
  Mar 08 10:33:11 linai ollama[680738]: created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1                                                                
  Mar 08 10:33:11 linai ollama[680738]:         /code/ollama/runner/ollamarunner/runner.go:1442 +0x4c9                                                                                
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:33475/completion\":    
  EOF"                                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: [GIN] 2026/03/08 - 10:33:12 | 500 | 20.552741187s |   192.168.69.16 | POST     "/api/chat"                                                    
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:33475/completion\":    
  EOF"                                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: [GIN] 2026/03/08 - 10:33:12 | 500 | 20.522634051s |   192.168.69.16 | POST     "/api/chat"                                                    
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=DEBUG source=sched.go:585 msg="context for request finished"                                         
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=DEBUG source=sched.go:354 msg="after processing request finished event"                              
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000 refCount=2                                                                                                                                                     
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=DEBUG source=sched.go:431 msg="context for request finished"                                         
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.162-04:00 level=DEBUG source=sched.go:354 msg="after processing request finished event"                              
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000 refCount=1                                                                                                                                                     
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.219-04:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:33475/completion\":    
  EOF"                                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: [GIN] 2026/03/08 - 10:33:12 | 500 | 20.610894761s |   192.168.69.16 | POST     "/api/chat"                                                    
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.219-04:00 level=DEBUG source=sched.go:431 msg="context for request finished"                                         
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000                                                                                                                                                                
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.220-04:00 level=DEBUG source=sched.go:336 msg="runner with non-zero duration has gone idle, adding timer"            
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000 duration=5m0s                                                                                                                                                  
  Mar 08 10:33:12 linai ollama[680738]: time=2026-03-08T10:33:12.220-04:00 level=DEBUG source=sched.go:354 msg="after processing request finished event"                              
  runner.name=registry.ollama.ai/library/qwen3.5:9b runner.inference="[{ID:GPU-3a4bb711-3dff-4796-72b9-7ad73c0f74d7 Library:CUDA}]" runner.size="18.5 GiB" runner.vram="13.3 GiB"     
  runner.parallel=4 runner.pid=687768 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-dec52a44569a2a25341c4e4d3fee25846eed4f6f0b936278e3a3c900bb99d37c                     
  runner.num_ctx=35000 refCount=0                                                                                                                                                     

The key line:

  • panic: failed to build graph: model does not support operation

This happened when the runner tried to process 3 concurrent requests (Parallel:4). The qwen3.5 DeltaNet architecture genuinely can't build a compute graph for multiple parallel sequences - it's not just a conservative block.
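The downgrade described above can be sketched as a simple scheduler guard. This is a hypothetical illustration (the function and map names are invented, not Ollama's actual `sched.go` code) of what forcing `Parallel: 1` for an unsupported architecture looks like:

```go
package main

import "fmt"

// effectiveParallel is an illustrative sketch of the guard implied by the
// sched.go warning: architectures whose recurrent/DeltaNet state cannot be
// batched across independent sequences get their requested parallelism
// forced down to 1 before the runner loads. Names here are assumptions.
func effectiveParallel(arch string, requested int) int {
	// Denylist of architectures that cannot build a multi-sequence graph.
	noParallel := map[string]bool{"qwen35": true}
	if noParallel[arch] && requested > 1 {
		fmt.Printf("WARN: model architecture does not currently support parallel requests architecture=%s\n", arch)
		return 1
	}
	return requested
}

func main() {
	fmt.Println(effectiveParallel("qwen35", 4)) // downgraded to 1
	fmt.Println(effectiveParallel("llama", 4))  // requested value honored
}
```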


@KevinTurnbull commented on GitHub (Mar 8, 2026):

I've got a technical fix. It keeps the runner from crashing under concurrent load, but because of the nature of the graph the total throughput is essentially the same (on my NVIDIA Quadro RTX 5000 from 2018, at least).

40s per request for 3 serialized requests == 120s for 3 concurrent requests. On the plus side, they do stream concurrently, which can have perceived performance benefits.
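The timing math above can be checked with a back-of-the-envelope sketch (the 40s/3-request numbers are taken from this comment; the function is purely illustrative): serialized requests finish at 40s, 80s, and 120s, while interleaved (not truly parallel) execution finishes all three near 120s, so total wall-clock time is unchanged and only streaming latency improves.

```go
package main

import "fmt"

// finishTimes computes when each of n requests completes, given a fixed
// per-request cost. Interleaved mode models round-robin decoding: every
// request completes at roughly n * perRequest, matching the observation
// that concurrency here improves streaming, not throughput.
func finishTimes(perRequest float64, n int, interleaved bool) []float64 {
	times := make([]float64, n)
	for i := 0; i < n; i++ {
		if interleaved {
			times[i] = perRequest * float64(n)
		} else {
			// Back-to-back: request i waits for the i earlier ones.
			times[i] = perRequest * float64(i+1)
		}
	}
	return times
}

func main() {
	fmt.Println(finishTimes(40, 3, false)) // [40 80 120]
	fmt.Println(finishTimes(40, 3, true))  // [120 120 120]
}
```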

But - here's a patch for you. Notably I did not test it on anything other than qwen35 (although I suspect it would work for the whole family) -- nor did I test it for images.

qwen35-parallel.patch

@rick-github -- Ball's in your court. :)


@NixerWong commented on GitHub (Mar 18, 2026):

Any update on this issue? When can we expect an official fix?


@NixerWong commented on GitHub (Apr 6, 2026):

Any update on this issue? When can we expect an official fix?


@bingoct commented on GitHub (Apr 8, 2026):

Still has the problem in v0.20.3.


@KevinTurnbull commented on GitHub (Apr 10, 2026):

I think the holdup is that there's a non-trivial rewrite of the graph runner needed to support real concurrency. My patch doesn't really move the needle in terms of throughput.
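The flat-throughput behavior can be illustrated with a toy model (all names are invented; this is not Ollama's actual runner code): a runner that cannot batch a recurrent architecture's state ends up building and executing one compute graph per sequence per decoding step, so "concurrency" is interleaving and the total graph work is unchanged.

```go
package main

import "fmt"

// sequence is a toy stand-in for one in-flight request's decoding state.
type sequence struct{ id, remaining int }

// stepAll advances every active sequence by one token, one graph at a
// time. It returns the number of single-sequence graph executions, which
// is why aggregate throughput matches the serialized case.
func stepAll(seqs []*sequence) int {
	graphs := 0
	for _, s := range seqs {
		if s.remaining > 0 {
			// A batched runner would build one graph for all active
			// sequences here; this runner pays the cost per sequence.
			s.remaining--
			graphs++
		}
	}
	return graphs
}

func main() {
	seqs := []*sequence{{0, 2}, {1, 2}, {2, 2}}
	total := 0
	for {
		g := stepAll(seqs)
		if g == 0 {
			break
		}
		total += g
	}
	fmt.Println(total) // 6 graph executions: 3 sequences x 2 tokens each
}
```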

Reference: github-starred/ollama#55987