[GH-ISSUE #12315] Add Required (V)RAM Calculator to Ollama's Website #33942

Closed
opened 2026-04-22 17:07:28 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @Panican-Whyasker on GitHub (Sep 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12315

Given the recent developments leading to drastically increasing (V)RAM use by Ollama,

please add a Required (V)RAM Calculator feature to the Ollama website showing the Minimum System Requirements, in terms of free (V)RAM, for each model VARIANT (# of parameters, Quantization, and Default Context Length).

See Issue #12305 (Bug report "Too Much RAM Eaten in Ollama 0.11.11").
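For reference, a minimal sketch of what such a calculator might compute: a weights term from the parameter count and quantization, plus a context (KV-cache) term that grows with the default context length. The formula and all constants below are illustrative assumptions, not Ollama's actual estimation logic or model-library data, and the result is a lower bound that ignores graph and runtime overhead.

```python
# Minimal sketch of a (V)RAM estimator; the formula and constants are
# simplifying assumptions, not Ollama's actual estimation code:
#   weights  ~= parameters * bytes_per_parameter (depends on quantization)
#   KV cache ~= 2 * layers * kv_heads * head_dim * bytes_per_element * context

BYTES_PER_PARAM = {      # rough per-quantization averages, for illustration only
    "F16": 2.0,
    "Q8_0": 1.0625,      # 32 int8 weights + 1 f16 scale per block
    "Q4_K_M": 0.56,      # roughly 4.5 bits per weight
}

def estimate_vram_gb(params_billions, quant, context,
                     layers, kv_heads, head_dim, kv_bytes=2.0):
    """Lower-bound (V)RAM estimate in GB for one loaded model."""
    weights = params_billions * 1e9 * BYTES_PER_PARAM[quant]
    kv_cache = 2 * layers * kv_heads * head_dim * kv_bytes * context
    return (weights + kv_cache) / 1e9

# Hypothetical 12B model at Q4_K_M with the 4096-token default context
# (layer/head counts here are placeholders, not any specific model's values).
print(f"{estimate_vram_gb(12, 'Q4_K_M', 4096, layers=40, kv_heads=8, head_dim=128):.1f} GB")
```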

Quoting rick-github's last explanation below:

But we are certain that, until recently, all models run with Ollama took about the same amount of (V)RAM as the model's file size on the hard drive.

This has never been the case when using a context larger than the default, as shown in https://github.com/ollama/ollama/issues/12305#issuecomment-3299975820.

The default context was 2048 until around 0.7.0 when it was increased to 4096. At the same time the default value for OLLAMA_NUM_PARALLEL was reduced, so while the size of an individual context buffer was increased, the overall memory allocation stayed the same.

But there is a clear difference in Ollama's (V)RAM use before and after, perhaps some key Config setting was changed fairly recently.

The size of your model loads is explained by the size of the context buffer, and the failure of etoulas's model load is explained by the lack of free VRAM.

There has been a recent change in ollama's memory estimation logic when using the new engine. It should now be more accurate and reduces the need to override `num_gpu` when loading the model, although there are corner cases where this remains a valid tactic.

File size is 7 GB, but Ollama runs it in 28 GB of RAM (larger by a factor of 4). That's ridiculous!

The context buffer is the space where tokens are stored. If you want to store more tokens by increasing the context, you need more space. Different models require a different amount of space per token, so some models will be able to sustain a larger context than others. There are additional data structures that are proportional to the size of the context as well, so as context grows, so do they (primarily referencing the memory graph). gpt-oss, for example, has a quite efficient tokenization and can support 128k in 17GB (13G weights + 4G context/graph). Compare to mistral-nemo using 28G (7G weights + 21G context/graph).

```console
NAME                   ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b            aa4295ac10c3    17 GB    100% GPU     131072     Forever
mistral-nemo:latest    e7e06d107c6c    28 GB    100% GPU     131072     Forever
```

_Originally posted by @rick-github in [#12305](https://github.com/ollama/ollama/issues/12305#issuecomment-3302194339)_
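The weights-plus-context breakdown quoted above can be sanity-checked with the usual KV-cache size formula. The sketch below plugs in Mistral Nemo's published architecture (40 layers, 8 KV heads, head dim 128; these figures come from the model card, not from this issue) and assumes an f16 cache, and lands close to the 21G context/graph figure rick-github reports.

```python
# Sanity check of the mistral-nemo figure quoted above, assuming the standard
# KV-cache formula and Mistral Nemo's published architecture; the compute-graph
# overhead that Ollama also reserves is not modelled here.

layers, kv_heads, head_dim = 40, 8, 128   # assumed from the model card
kv_bytes = 2                              # assumed f16 keys/values
context = 131072                          # 128k tokens, as in `ollama ps` above

per_token = 2 * layers * kv_heads * head_dim * kv_bytes   # K and V per token
kv_cache_gb = per_token * context / 1e9

print(f"{per_token} bytes/token -> {kv_cache_gb:.1f} GB KV cache at 128k")
# -> 163840 bytes/token -> ~21.5 GB, close to the quoted 21G context/graph,
#    which on top of ~7 GB of weights explains the 28 GB total.
```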

GiteaMirror added the feature request label 2026-04-22 17:07:28 -05:00
Author
Owner

@rick-github commented on GitHub (Sep 17, 2025):

#9774

Author
Owner

@pdevine commented on GitHub (Sep 17, 2025):

Going to close as a dupe.

Reference: github-starred/ollama#33942