[GH-ISSUE #14258] GPU-to-CPU fallback happens silently with no user-visible warning #9284

Open
opened 2026-04-12 22:09:14 -05:00 by GiteaMirror · 2 comments

Originally created by @akuligowski9 on GitHub (Feb 14, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14258

Problem

When Ollama cannot fit a model (or any layers) into GPU VRAM, it silently falls back to CPU execution. Users experience unexpected slowness and have no way to know why — the only indication is buried in debug-level logs that are invisible by default.

This affects three specific code paths in llm/server.go (a paraphrased sketch follows the list):

  1. macOS full disable (lines 1048-1052): When a model's VRAM requirement exceeds total system memory on Darwin, NumGPU is silently set to 0 with only a code comment — no log message at all.

  2. Zero GPU layers (lines 1056-1057): When GPUs exist but no layers fit, the message "insufficient VRAM to load any model layers" is logged at slog.Debug level — invisible unless OLLAMA_DEBUG=1 is set.

  3. Partial offload reduction: When the layer assignment algorithm reduces the number of GPU layers to fit available VRAM, there is no message indicating how many layers ended up on CPU vs GPU.
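Roughly, the shape of these paths looks like the sketch below. This is a paraphrase of the behavior described above, not the actual llm/server.go code; the function and variable names (decideOffload, vramNeeded, layersThatFit, and so on) are illustrative assumptions.

```go
package llm

import (
	"log/slog"
	"runtime"
)

// decideOffload is an illustrative stand-in for the fallback logic described
// above, not the actual code in llm/server.go; all names here are assumptions.
func decideOffload(numGPU *int, gpuCount, layersThatFit int, vramNeeded, systemMemory uint64) {
	// Path 1: on macOS, a model larger than total system memory disables GPU
	// offload entirely, with no log message at all.
	if runtime.GOOS == "darwin" && vramNeeded > systemMemory {
		*numGPU = 0
		return
	}

	// Path 2: GPUs are present but no layers fit; logged only at debug level,
	// so it is invisible unless OLLAMA_DEBUG=1 is set.
	if gpuCount > 0 && layersThatFit == 0 {
		slog.Debug("insufficient VRAM to load any model layers")
		*numGPU = 0
		return
	}

	// Path 3: partial offload; the layer count was already reduced to fit the
	// available VRAM, and nothing reports the resulting CPU/GPU split.
	*numGPU = layersThatFit
}
```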

Why this matters

This is arguably the single most common source of user confusion in Ollama. Users run ollama run llama3:70b on a machine with 8GB VRAM, get slow responses, and file an issue saying "Ollama is very slow" or "GPU not being used." The diagnosis almost always requires a maintainer to manually ask "what does ollama ps show?" and explain the CPU/GPU split.

Searching GitHub issues for related patterns returns 500+ results across queries like "GPU not detected," "fallback CPU," "very slow," and "GPU not used" — most still open.

Example issues:

  • #12976 — Performance Regression on Apple Silicon M1: GPU to CPU Fallback
  • #13589 — gfx1151 silent fallback to CPU
  • #12197 — CPU processing despite GPU-loaded model
  • #13814 — Ollama stopped using the GPU
  • #5923 — Slow Model Loading Speed on macOS System
  • #4996 — Apple Silicon 8/16GB slow down with larger models
  • #10707 — Add option to disable CPU fallback when GPU memory insufficient
  • #9948 — Default Num_GPU causes garbage output on Apple Metal

Proposed change

Upgrade the logging in these three paths from silent/debug to slog.Warn, so users see a clear message in normal server logs when fallback occurs. No behavior change — just visibility.
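Against the illustrative sketch above, the change amounts to replacing the silent and debug-level branches with visible warnings, roughly as follows. The exact message wording and attribute names are open to maintainer preference; they are shown here only to make the intent concrete.

```go
// The same branches as in the sketch above, upgraded to user-visible warnings.

// Path 1: previously silent on macOS.
if runtime.GOOS == "darwin" && vramNeeded > systemMemory {
	slog.Warn("model requires more memory than the system has; disabling GPU offload",
		"required", vramNeeded, "system_memory", systemMemory)
	*numGPU = 0
	return
}

// Path 2: previously slog.Debug, now visible at the default log level.
if gpuCount > 0 && layersThatFit == 0 {
	slog.Warn("insufficient VRAM to load any model layers; falling back to CPU")
	*numGPU = 0
	return
}

// Path 3: report the resulting split instead of staying silent.
slog.Warn("partial GPU offload; remaining layers will run on CPU", "layers_on_gpu", layersThatFit)
*numGPU = layersThatFit
```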

How it will be tested

  • Existing tests continue to pass (no behavior change)
  • Manual verification: load a model larger than available VRAM and confirm the warning appears in server logs at the default log level (a test-style sketch of this check follows this list)
  • The warning messages can be cross-checked against ollama ps, which should show a CPU/GPU split matching the logged information
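If an automated check is wanted, a minimal sketch could route the default slog output to a buffer, trigger the fallback path, and assert the message is visible at the default level. This assumes the illustrative decideOffload above with the proposed slog.Warn upgrade applied; it is not an existing test in the repository.

```go
package llm

import (
	"bytes"
	"log/slog"
	"strings"
	"testing"
)

// TestCPUFallbackWarning is a sketch of how the new warning could be asserted:
// capture slog output and check the fallback message is present at the
// default level (Info), where a Debug-level message would not appear.
func TestCPUFallbackWarning(t *testing.T) {
	var buf bytes.Buffer
	prev := slog.Default()
	slog.SetDefault(slog.New(slog.NewTextHandler(&buf, nil)))
	defer slog.SetDefault(prev)

	// One GPU detected, zero layers fit: the "insufficient VRAM" path.
	var numGPU int
	decideOffload(&numGPU, 1, 0, 16<<30, 32<<30)

	if numGPU != 0 {
		t.Fatalf("expected full CPU fallback, got NumGPU=%d", numGPU)
	}
	if !strings.Contains(buf.String(), "insufficient VRAM") {
		t.Fatalf("expected a visible CPU-fallback warning, got: %q", buf.String())
	}
}
```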
GiteaMirror added the documentation label 2026-04-12 22:09:14 -05:00

@rick-github commented on GitHub (Feb 14, 2026):

There is no need to look in the logs; ollama ps will show that the model has spilled to system RAM.


@akuligowski9 commented on GitHub (Feb 16, 2026):

Thanks for the feedback @rick-github — that's a fair point and I've scaled back the PR accordingly.

I removed the general GPU/CPU layer split logging (logGPUOffloadStatus) since ollama ps already shows that breakdown in the PROCESSOR column.

That said, I kept two specific warnings where ollama ps doesn't tell the full story:

  1. macOS silently disabling GPU offload when the model exceeds total system memory — the system sets NumGPU = 0 to prevent lockup, but ollama ps just shows "100% CPU" with no indication that GPU was deliberately disabled or why. A user with a GPU-capable Mac seeing "100% CPU" has no way to know this was a safety decision vs. a bug.

  2. "insufficient VRAM to load any model layers" (upgraded from Debug → Warn) — same situation: ollama ps shows "100% CPU" but doesn't explain that GPUs were detected and couldn't fit any layers. At Debug level this was invisible in default logs.

In both cases, ollama ps answers what (CPU) but not why — and the "why" is the part users struggle with when filing issues. The updated PR is now just a 3-line change focused on these two edge cases: https://github.com/ollama/ollama/pull/14261
