[GH-ISSUE #15918] Silent CPU fallback after driver update: CUDA forward compatibility error causes 100% CPU usage with no user warning #87826

Open
opened 2026-05-10 06:25:52 -05:00 by GiteaMirror · 0 comments

Originally created by @0xpietri on GitHub (May 1, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15918

What is the issue?

After updating NVIDIA drivers today, ollama silently fell back to CPU-only inference instead of failing fast or warning the user. The model (qwen3.5:9b, Q4_K_M, 6.1 GiB) loaded entirely on CPU, consuming 766% CPU and 16 GB RAM for ~38 minutes, causing the system to reach 101°C, while the API kept returning 500 errors after 60s timeouts.

The fallback happened 3 times at 30-minute intervals (12:28, 12:58, 13:28), each time triggered by an external tool (OpenAI Codex) calling POST /api/chat. No warning was shown anywhere; ollama appeared to be running normally.

Expected behavior: ollama should either refuse to load the model with a clear error, or emit a visible WARNING log when falling back to CPU due to a CUDA initialization failure.
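Until something like the behavior requested above exists, a calling tool could guard itself before sending work. The sketch below is only an illustration under stated assumptions: it assumes the response of `GET /api/ps` lists loaded models with a `size_vram` field, and it treats a value of 0 (or a missing field) as "running on CPU only". The helper name `cpu_only_models` and the sample payload are hypothetical, not part of ollama.

```python
import json


def cpu_only_models(ps_response: dict) -> list[str]:
    """Return names of loaded models that report no bytes resident in VRAM."""
    names = []
    for model in ps_response.get("models", []):
        # Assumption: size_vram is the portion of the model held in GPU memory.
        # 0 or a missing field is treated here as CPU-only.
        if model.get("size_vram", 0) == 0:
            names.append(model.get("name", "<unknown>"))
    return names


# Hypothetical payload shaped like a GET /api/ps response (sizes are made up).
sample = json.loads("""
{"models": [
  {"name": "qwen3.5:9b", "size": 18000000000, "size_vram": 0},
  {"name": "other:latest", "size": 5000000000, "size_vram": 5000000000}
]}
""")

print(cpu_only_models(sample))
```

A client could run this check after the first load and stop retrying if its model shows up in the list. It would also flag a deliberately CPU-only setup, so it only makes sense where GPU offload is expected.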

Ollama version: 0.20.6
NVIDIA driver updated today (NVML library version: 580.159)
nvidia-smi itself fails with "Driver/library version mismatch": the kernel module is still loaded from the previous driver version and a reboot is pending.
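The mismatch described above can be detected without nvidia-smi by comparing the loaded kernel module version with the installed user-space library version. The following is a rough sketch, assuming the usual format of `/proc/driver/nvidia/version` on Linux. The function names are hypothetical, and the sample text uses a placeholder old version because the report does not state the previous driver version.

```python
import re
from typing import Optional


def kernel_module_version(proc_text: str) -> Optional[str]:
    """Extract the loaded NVIDIA kernel module version from the proc file text."""
    # Assumed format: "NVRM version: NVIDIA UNIX x86_64 Kernel Module  <version>  ..."
    # The optional " for <arch>" part covers the open kernel module wording.
    match = re.search(r"Kernel Module(?: for \S+)?\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def reboot_pending(proc_text: str, library_version: str) -> bool:
    """True when the loaded kernel module differs from the user-space library."""
    loaded = kernel_module_version(proc_text)
    return loaded is not None and loaded != library_version


# Placeholder text: 570.00.00 is illustrative only, not taken from the report.
sample_proc = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  570.00.00  (build info)\n"

print(reboot_pending(sample_proc, "580.159"))
```

On a real system the first argument would be the contents of `/proc/driver/nvidia/version`, and the second the library version reported after the update (580.159 in this report).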

Relevant log output

ggml_cuda_init: failed to initialize CUDA: forward compatibility was attempted on non supported HW
time=... source=ggml.go:494 msg="offloaded 0/33 layers to GPU"
time=... source=device.go:245 msg="model weights" device=CPU size="6.1 GiB"
time=... source=device.go:256 msg="kv cache" device=CPU size="9.2 GiB"
time=... source=device.go:267 msg="compute graph" device=CPU size="1.5 GiB"
time=... source=device.go:272 msg="total memory" size="16.8 GiB"
[GIN] 2026/05/01 - 13:29:04 | 500 | 59.999378025s | 192.168.0.38 | POST "/api/chat"
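The log lines above already contain enough information to flag the fallback after the fact. As a minimal sketch, assuming the two message formats shown in this report stay stable, a log watcher could pair the CUDA initialization error with a "0 of N layers offloaded" line. The function name `find_cpu_fallback` is hypothetical.

```python
import re
from typing import Iterable, Optional

# Patterns copied from the log excerpt in this report; other versions may differ.
CUDA_INIT_FAILED = re.compile(r"ggml_cuda_init: failed to initialize CUDA: (?P<reason>.+)")
OFFLOADED = re.compile(r'msg="offloaded (?P<gpu>\d+)/(?P<total>\d+) layers to GPU"')


def find_cpu_fallback(log_lines: Iterable[str]) -> Optional[str]:
    """Return a reason string if a model loaded with zero GPU layers, else None."""
    reason = None
    zero_offload = False
    for line in log_lines:
        failed = CUDA_INIT_FAILED.search(line)
        if failed:
            reason = failed.group("reason").strip()
        offloaded = OFFLOADED.search(line)
        if offloaded and int(offloaded.group("total")) > 0 and int(offloaded.group("gpu")) == 0:
            zero_offload = True
    if not zero_offload:
        return None
    return reason or "unknown (no CUDA init error seen)"


log = [
    "ggml_cuda_init: failed to initialize CUDA: forward compatibility was attempted on non supported HW",
    'time=... source=ggml.go:494 msg="offloaded 0/33 layers to GPU"',
]

print(find_cpu_fallback(log))
```

A zero-layer offload without a preceding CUDA error may be an intentional CPU-only configuration, which is why that case is reported with a separate "unknown" reason instead of being treated as a driver failure.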

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-05-10 06:25:52 -05:00
Reference: github-starred/ollama#87826