[GH-ISSUE #14715] Qwen3.5 crashes on NVIDIA Turing GPUs (RTX 2080 Ti) #9513

Closed
opened 2026-04-12 22:26:12 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @airhand on GitHub (Mar 8, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14715

What is the issue?

Title:
[Bug] Qwen3.5 crashes on NVIDIA Turing GPUs (RTX 2080 Ti) with Xid 43/31; Compiler warning in llama-graph.cpp suggests undefined behavior

Description:

1. Summary
Running Qwen3.5 models (e.g., qwen3.5:9b) on NVIDIA Turing architecture GPUs (specifically RTX 2080 Ti) causes immediate system instability, driver resets (Xid 43, Xid 31), or silent process termination during inference.
Additionally, compiling the latest source code triggers a severe GCC warning (-Waggressive-loop-optimizations) in llama-graph.cpp, indicating potential Undefined Behavior (UB) in the computation graph logic. While newer architectures (Ampere/Ada) seem unaffected, Turing cards fail consistently.

2. Environment
OS: Linux (Ubuntu/Debian based)
GPU: NVIDIA GeForce RTX 2080 Ti (22GB VRAM, Modified)
Architecture: Turing (Compute Capability 7.5)
Driver Version: [Insert your driver version, e.g., 535.xx or 550.xx]
Ollama Version: Latest source build (post-v0.17.5) / v0.17.5 binary
Model: qwen3.5:9b (GGUF)
Compiler: GCC (version [e.g., 11.4.0])

3. Symptoms
Driver Crash: Upon initiating inference (often after the first token or during KV cache expansion), the GPU drops off the bus.
dmesg logs show: NVRM: Xid (PCI:0000:xx:xx.x): 43, pid=xxxx, Ch 00, [...] or Xid 31.
The system often requires a hard reboot; nvidia-smi fails to respond.
Silent Termination: In some cases, the ollama_llama_server process dies without a clear error message in the application log, just before the driver reset.
Compilation Warning: Building from source reveals a critical logic flaw warning:

github.com/ollama/ollama/llama/llama.cpp/src
llama-graph.cpp: In member function ‘virtual void llm_graph_input_attn_cross::set_input(const llama_ubatch*)’:
llama-graph.cpp:473:9: warning: iteration 2147483645 invokes undefined behavior [-Waggressive-loop-optimizations]
| for (int i = n_tokens; i < n_tokens; ++i) {
| ^~~
llama-graph.cpp:473:34: note: within this loop
| for (int i = n_tokens; i < n_tokens; ++i) {
| ~~^~~~~~~~~~
4. Steps to Reproduce
Install Ollama on a machine with an RTX 2080 Ti (Turing).
Pull the model: ollama pull qwen3.5:9b.
Run a simple generation: ollama run qwen3.5:9b "Hello".
Observe the system hang, driver reset, or process crash.
(Optional) Compile from source to see the llama-graph.cpp warning.

5. Technical Analysis & Hypothesis
The Loop Logic: The code for (int i = n_tokens; i < n_tokens; ++i) is logically a no-op (condition is initially false). However, the GCC warning about "iteration 2147483645" suggests the compiler detects a path where integer overflow or aggressive optimization leads to Undefined Behavior.
Impact on Turing: In C++, UB can cause the compiler to generate optimized machine code that behaves unpredictably. It appears that Turing GPUs (or the specific CUDA kernel generation for CC 7.5) are extremely sensitive to this malformed control flow or the resulting memory layout, leading to illegal memory access or invalid kernel launches.
Qwen3.5 Specifics: This model uses new attention mechanisms (Hybrid/MROPE). The llm_graph_input_attn_cross class is likely heavily utilized. If the graph construction is flawed due to this UB, the resulting CUDA graph sent to the Turing GPU may contain invalid instructions, causing the Xid 43 (GPU dropped off bus) error.
Why not Ampere?: Newer architectures might have more robust error handling or the specific instruction sequence generated by the optimizer happens to be "safe enough" on CC 8.0+, masking the underlying bug.

6. Expected Behavior
The model should run stably on Turing GPUs, utilizing the available 22GB VRAM.
No compiler warnings regarding undefined behavior should exist in critical graph construction paths.

7. Suggested Fix
Immediate Code Fix: Inspect and correct line 473 in llama-graph.cpp. If the loop is intended to be empty, remove it entirely or wrap it in an explicit if (false) block to prevent compiler misinterpretation.

// Current problematic code:
// for (int i = n_tokens; i < n_tokens; ++i) { ... }

Proposed fix:
// Remove the loop if it serves no purpose, or fix the logic if it was meant to iterate.
Turing-Specific Testing: Add CI tests or manual verification steps specifically for Compute Capability 7.5 (Turing) when running Qwen3.5 series models.
Kernel Validation: Ensure that the computed graph splits and memory offsets do not exceed 32-bit integer limits or align poorly on older architectures.

8. Logs
Journalctl / Ollama Log Snippet (before crash):

Mar 08 19:00:30 aiserver ollama[6612]: level=DEBUG source=ggml.go:852 msg="compute graph" nodes=16775 splits=4
Mar 08 19:00:30 aiserver ollama[6612]: level=INFO source=ggml.go:494 msg="offloaded 33/33 layers to GPU"
Mar 08 19:00:33 aiserver ollama[6612]: level=INFO source=server.go:1388 msg="llama runner started in 5.53 seconds"
... Log cuts off abruptly or followed by Xid error in dmesg ...
dmesg Error:

NVRM: Xid (PCI:0000:09:00.0): 43, pid=XXXX, Ch 00, [XXX]

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.17.0

GiteaMirror added the bug label 2026-04-12 22:26:12 -05:00
@rick-github commented on GitHub (Mar 8, 2026):

The code path for inference with qwen3.5:9b doesn't include this function, it's fixed upstream (https://github.com/ggml-org/llama.cpp/commit/076b0faf7ddbcf9c1c68b90be2a47497a57141c2), and the next vendor sync will make the warning go away.

XID 31 (https://docs.nvidia.com/deploy/xid-errors/analyzing-xid-catalog.html) indicates an illegal address access. XID 43 indicates a software-induced fault.

Post a full log, preferably with OLLAMA_DEBUG=2.


Reference: github-starred/ollama#9513