[GH-ISSUE #1385] Update VRAM layer offloading to account for context size #62767

Closed
opened 2026-05-03 10:15:47 -05:00 by GiteaMirror · 9 comments

Originally created by @BruceMacD on GitHub (Dec 5, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1385

Originally assigned to: @BruceMacD on GitHub.

Our current calculation for the amount of layers to offload to VRAM does not account for context size, so large context models may fail to load if too many layers are off-loaded.
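A back-of-envelope sketch of the failure mode (all sizes below are illustrative assumptions, not values from Ollama's actual estimator): if the layer count is derived from free VRAM and per-layer size alone, the KV cache needed for the requested context is never reserved, so the chosen count can overshoot what actually fits.

```
# Illustrative numbers only (assumed, not measured).
FREE_VRAM_MIB=6144   # free VRAM on the GPU
LAYER_MIB=800        # approximate VRAM per offloaded layer
CTX_MIB=2500         # approximate KV-cache size for the requested num_ctx

# Today: context is ignored, so too many layers get offloaded and the load can OOM.
echo "layers picked (context ignored): $(( FREE_VRAM_MIB / LAYER_MIB ))"

# Desired: reserve the context first, then fit layers into what is left.
echo "layers that actually fit:        $(( (FREE_VRAM_MIB - CTX_MIB) / LAYER_MIB ))"
```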

GiteaMirror added the bug label 2026-05-03 10:15:47 -05:00

@m0wer commented on GitHub (Dec 15, 2023):

Yup, happens often.


@m0wer commented on GitHub (Dec 15, 2023):

Is there any workaround until the fix?


@BruceMacD commented on GitHub (Dec 15, 2023):

I believe setting the `num_gpu` parameter so it does not exceed the number of layers could help, since that prevents the context from being loaded into VRAM.

Here's what doing that would look like:

1. [Search the logs](https://github.com/jmorganca/ollama/blob/main/docs/faq.md#how-can-i-view-the-logs) for "layers to GPU" to see how many layers the model has.

   Here is what that looks like on my Linux instance:

   ```
   $ journalctl -u ollama -r -g "layers to GPU"
   Dec 15 11:22:00: llama_model_load_internal: offloaded 35/35 layers to GPU
   ```

2. Start a new interactive session and explicitly set `num_gpu` to a value no larger than the number of layers (35 in my example):

   ```
   $ ollama run llama2
   >>> /set parameter num_gpu 32
   Set parameter 'num_gpu' to '32'
   ```

This should prevent context off-loading to VRAM.
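If you drive Ollama through its REST API rather than the interactive prompt, the per-request `options` object should accept the same parameter; the payload below is only a sketch with example values, not something verified in this thread.

```
# Example request only; the model name and num_gpu value are placeholders.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 32 }
}'
```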


@m0wer commented on GitHub (Dec 18, 2023):

That worked! Thanks a lot @BruceMacD

If it helps anyone, what I did was:

```
# /tmp/mixtral_less_gpu.gguf
FROM mixtral:latest
TEMPLATE """ [INST] {{ .System }} {{ .Prompt }} [/INST]"""
PARAMETER num_ctx 32768
PARAMETER stop "</s>"
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
PARAMETER num_gpu 4
```

```
ollama create mixtral_less_gpu -f /tmp/mixtral_less_gpu.gguf
```

For `mixtral:latest` each layer requires around 800 MB of VRAM, and the context around 2.5 GB. So with 6 GB I end up with 3 layers offloaded and `total VRAM used: 4895.82 MiB (model: 2347.78 MiB, context: 2548.04 MiB)`. By default it was trying to fit 5 layers (out of the total of 33) into VRAM.
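To confirm the resulting split, the same log-search approach from the earlier comment works; on a systemd-based Linux install that would look something like:

```
# Assumes the systemd service install on Linux, as in the earlier journalctl example.
journalctl -u ollama -r -g "total VRAM used" | head -n 1
```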


@mongolu commented on GitHub (Jan 3, 2024):

Can we `/set parameter num_gpu 32` at runtime? It would save a lot of attempts at `ollama create [name] -f [modelfile]`.

I'm using `litellm` and `autogen`, so I'm not sure what "at runtime" would mean for me.
What I would like to do, though, is to also pass `num_gpu` when starting `litellm ... --model ollama/[name]`.
But right now, as I write this, it seems more like a feature for the `litellm` project, doesn't it?


@iplayfast commented on GitHub (Jan 3, 2024):

I wonder if that has anything to do with this bug https://github.com/jmorganca/ollama/issues/1691


@m0wer commented on GitHub (Jan 4, 2024):

> Can we `/set parameter num_gpu 32` at runtime? It would save a lot of attempts at `ollama create [name] -f [modelfile]`.
>
> I'm using `litellm` and `autogen`, so I'm not sure what "at runtime" would mean for me. What I would like to do, though, is to also pass `num_gpu` when starting `litellm ... --model ollama/[name]`. But right now, as I write this, it seems more like a feature for the `litellm` project, doesn't it?

Currently not, but it would be great.


@m0wer commented on GitHub (Jan 4, 2024):

> I wonder if that has anything to do with this bug #1691

I don't think so. This bug occurs even on a fresh installation.


@dhiltgen commented on GitHub (May 2, 2024):

We do take context size into consideration now, so I'm going to close this issue as fixed. If folks are still hitting OOMs please make sure to upgrade to the latest release 0.1.33 and share your logs so we can take another look.


Reference: github-starred/ollama#62767