[GH-ISSUE #9750] prefer offloading model layers over kv cache when both do not fit in VRAM. #68427

Open
opened 2026-05-04 13:55:39 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @apt-install-coffee on GitHub (Mar 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9750

Currently, when a model and a large context size do not both fit entirely in VRAM, ollama significantly reduces the number of offloaded model layers rather than keeping the kv cache in main memory (as `--no-kv-offload` does).
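
For illustration, here is a minimal sketch (in Go, since that is what ollama is written in) of the placement preference this issue is asking for. The sizes, layer count, and function name are hypothetical stand-ins, not ollama's actual scheduler code:

```go
package main

import "fmt"

// Illustrative sizes only; real values would come from the GGUF metadata,
// the chosen quantization, and the requested context length.
const (
	vramBudget int64 = 16 << 30 // 16 GiB card
	modelBytes int64 = 14 << 30 // ~14 GiB of IQ3_M weights
	kvBytes    int64 = 5 << 30  // hypothetical kv cache size at num_ctx=131072
	layerCount int64 = 64       // hypothetical layer count
)

// placement decides what stays in VRAM when weights + kv cache overflow it.
// Preference requested here: keep all model layers on the GPU and spill the
// kv cache to host memory before dropping any GPU layers.
func placement() (gpuLayers int64, kvOnGPU bool) {
	switch {
	case modelBytes+kvBytes <= vramBudget:
		return layerCount, true // everything fits in VRAM
	case modelBytes <= vramBudget:
		return layerCount, false // all layers on GPU, kv cache in RAM
	default:
		// Even the weights alone do not fit: fall back to partial offload.
		perLayer := modelBytes / layerCount
		return vramBudget / perLayer, false
	}
}

func main() {
	layers, kv := placement()
	fmt.Printf("gpu layers: %d, kv cache on gpu: %v\n", layers, kv)
}
```

The middle case is the one this issue cares about: when the weights alone fit in VRAM, keep every layer on the GPU and accept a host-resident kv cache, which is the configuration the numbers below suggest is much faster.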

In my experiments running QwQ-32B quantized to IQ3_M (~14 GB) at num_ctx=131072 on a 16 GB 7800 XT, applying `--no-kv-offload` and allowing all model layers to be offloaded significantly improved token generation, reducing thinking time from ~2 hours to ~35 minutes.
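
For anyone wanting to reproduce this outside of ollama, llama.cpp exposes both knobs directly; a rough equivalent of the experiment above might look like the following (the GGUF filename and prompt are placeholders):

```sh
llama-cli -m QwQ-32B-IQ3_M.gguf -c 131072 -ngl 99 --no-kv-offload -p "your prompt here"
```

Here `-ngl 99` requests offloading all model layers to the GPU, and `--no-kv-offload` keeps the kv cache in host memory.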

GiteaMirror added the feature request label 2026-05-04 13:55:39 -05:00

@apt-install-coffee commented on GitHub (Mar 14, 2025):

I've opened #9751 to discuss what tweaking this manually might look like.

