[GH-ISSUE #1906] Ollama not respecting num_gpu to load entire model into VRAM for a model that I know should fit into 24GB. #47608

Closed
opened 2026-04-28 04:30:02 -05:00 by GiteaMirror · 1 comment

Originally created by @madsamjp on GitHub (Jan 10, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1906

Originally assigned to: @jmorganca on GitHub.

Somewhat related to this issue: https://github.com/jmorganca/ollama/issues/1374

I have a model that I have configured to fit almost exactly into my 4090's VRAM. Prior to v0.1.13, this model ran fine: I could fit all layers into VRAM and fill the context, and it used a total of 23697MiB (leaving about 900MiB of headroom). After 0.1.13, this model caused an OOM. I found a temporary workaround by building Ollama from source with the `-DLLAMA_CUDA_FORCE_MMQ=on` flag.
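For reference, that flag comes from llama.cpp's CUDA backend. A minimal sketch of passing it to a plain llama.cpp CMake build is shown below; Ollama's own build scripts wire CMake defines differently, so treat this as an illustration of the flag itself rather than the exact Ollama build command:

```
# Illustrative llama.cpp build with cuBLAS and forced MMQ kernels
# (option names as they existed in llama.cpp around this time).
cmake -B build -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
cmake --build build --config Release
```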

Now, after the recent update to 0.1.19, there was a PR that claimed to fix this issue: https://github.com/jmorganca/ollama/pull/1850.

I can now run that model without the OOM; however, Ollama never offloads more than 49 of the 63 layers to the GPU. Even if I set the num_gpu parameter to 63 or higher in interactive mode, it still loads only 49 layers and uses only 21027MiB of the 24564MiB total (about 86% of VRAM).
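For clarity, setting the parameter in interactive mode looks roughly like this (a minimal sketch using the `/set parameter` command in `ollama run`):

```
ollama run deepseek-coder:33b-instruct-q5_K_S
>>> /set parameter num_gpu 63
>>> /set parameter num_ctx 2048
```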

Here is the Modelfile:

```
FROM deepseek-coder:33b-instruct-q5_K_S

PARAMETER num_gpu 63
PARAMETER num_ctx 2048
```
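From that Modelfile, the model is built and run in the usual way (a sketch; `deepseek-33b-full` is just an illustrative name, and the quoted log line is the llama.cpp loader's offload report):

```
ollama create deepseek-33b-full -f Modelfile
ollama run deepseek-33b-full
# The server log then reports the actual offload, e.g.
#   llm_load_tensors: offloaded 49/63 layers to GPU
# and nvidia-smi shows the resulting VRAM usage.
```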

Is it possible to force Ollama to load the entire model into VRAM, as it did before v0.1.11? Do models now take up MORE VRAM than before?

VRAM is expensive and scarce at this time. I feel it is imperative to give users fine-grained control over how models are loaded so they can make the most of their VRAM.

GiteaMirror added the bug label 2026-04-28 04:30:02 -05:00

@jmorganca commented on GitHub (May 10, 2024):

Hi there, this should be improved now as memory utilization estimates are more accurate – let me know if you're still seeing this issue and I can re-open.
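One way to re-check this on newer builds is to pass num_gpu explicitly through the REST API and watch the offload in the server log (a sketch using the /api/generate endpoint; the prompt is arbitrary):

```
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder:33b-instruct-q5_K_S",
  "prompt": "Write hello world in C.",
  "options": { "num_gpu": 63, "num_ctx": 2048 }
}'
```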

Reference: github-starred/ollama#47608