[GH-ISSUE #7198] num_ctx forces entire model to CPU #51084

Closed
opened 2026-04-28 18:18:01 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @jimwashbrook on GitHub (Oct 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7198

What is the issue?

Apologies if this is covered somewhere, but I couldn't find any documentation for it and it doesn't seem intended.

For context, my GPU has 8 GB of VRAM, and the model I "discovered" this with was `llama3.2:3b-instruct-q8_0`, but it seems to occur with any other model.

`ollama ps` shows the model as `3.4 GB`.

When doing an API request without `num_ctx`, 100% of the model is loaded on the GPU.

When setting `num_ctx` to a number that forces it to exceed my VRAM size, I get something like `10 GB 33%/67% CPU/GPU`, as expected (`num_ctx` of 40000 in that instance).

However, when I set it to 128000, the result is:
(screenshot: `ollama ps` output showing the model loaded 100% on CPU)
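
For anyone trying to reproduce this, a minimal request that triggers the behaviour looks roughly like the sketch below. It sets `num_ctx` through the `options` field of `/api/generate`; the prompt is just a placeholder and it assumes Ollama is listening on its default `localhost:11434` endpoint:

```python
import requests

# Ask for a very large context so the context-dependent buffers no longer
# fit in 8 GB of VRAM. Assumes Ollama is running on the default endpoint.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:3b-instruct-q8_0",
        "prompt": "Hello",          # placeholder prompt
        "stream": False,
        "options": {"num_ctx": 128000},  # large context -> large KV cache
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
# Then check how the model was placed:  ollama ps
```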

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

No response

GiteaMirror added the question label 2026-04-28 18:18:01 -05:00
Author
Owner

@dhiltgen commented on GitHub (Oct 17, 2024):

The context size has a direct impact on the amount of VRAM required to run the model. Ollama supports spilling over into system memory and splitting inference between the GPU and CPU, up to a point. Some data structures have to be allocated on the GPU to support any GPU inference at all. Before loading the model, we calculate how much VRAM is required, and if those required structures cannot fit, we fall back to 100% CPU-based inference.
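
To make the growth concrete, here is a rough back-of-the-envelope sketch in Python. It is not Ollama's actual scheduler: the layer/KV-head/head-dimension numbers are hypothetical (roughly 3B-class), compute buffers are ignored, and it assumes for the sake of illustration that the context-dependent buffers are the "required structures" that must be GPU-resident. It only shows why the KV cache, which grows linearly with `num_ctx`, can by itself exceed 8 GB of VRAM at 128k context:

```python
# Illustrative sketch only -- not Ollama's actual estimator. Layer/head counts
# are hypothetical and compute buffers are ignored, so the numbers will not
# match `ollama ps` exactly.

def kv_cache_bytes(num_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # Keys and values (2x) for every layer, position, and KV head; FP16 assumed.
    return 2 * n_layers * num_ctx * n_kv_heads * head_dim * bytes_per_elem

def placement(num_ctx: int, weight_bytes: int, vram_bytes: int) -> str:
    kv = kv_cache_bytes(num_ctx, n_layers=28, n_kv_heads=8, head_dim=128)
    required = weight_bytes + kv
    if required <= vram_bytes:
        return "100% GPU"
    # Assume (for the illustration) the context-dependent buffers must be
    # GPU-resident for any GPU inference; otherwise fall back to CPU.
    if kv <= vram_bytes:
        gpu = round(100 * vram_bytes / required)
        return f"{100 - gpu}%/{gpu}% CPU/GPU"
    return "100% CPU"

GiB = 1024 ** 3
for ctx in (8192, 64000, 128000):
    print(ctx, placement(ctx, weight_bytes=int(3.4 * GiB), vram_bytes=8 * GiB))
```

With these made-up shapes, the three regimes from the report fall out: a small context fits entirely on the GPU, a mid-size context spills into a CPU/GPU split, and a 128k context pushes the load to 100% CPU.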

Author
Owner

@jimwashbrook commented on GitHub (Oct 21, 2024):

Ah, I didn't realise that the structures would grow so large that they would exceed VRAM. Good to know, thank you!

Reference: github-starred/ollama#51084