[GH-ISSUE #9774] Estimate of VRAM needs based on context length and quantization #52903

Open
opened 2026-04-29 01:19:31 -05:00 by GiteaMirror · 3 comments

Originally created by @mmb78 on GitHub (Mar 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9774

It would really help to know how much VRAM is necessary to load and run the models available on the Ollama.com site. The requirements grow enormously when a larger context window is set with the num_ctx parameter, and they also depend on the quantization of the model. Just a few example VRAM figures would be helpful, e.g. at 2k, 8k, 32k, and 128k tokens.
Thank you!
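For rough planning, total VRAM is approximately the quantized model weights plus the KV cache, and the KV cache grows linearly with num_ctx. Below is a minimal back-of-the-envelope sketch (not Ollama's actual accounting), assuming an illustrative 8B Llama-style model with 32 layers, 8 KV heads, head dimension 128, Q4_K_M weights, and an fp16 KV cache; it ignores activation and compute buffers, so treat the numbers as lower bounds:

```python
# Back-of-the-envelope VRAM estimate: quantized weights + KV cache.
# All model figures below (parameter count, layers, KV heads, head dim)
# are illustrative values for a hypothetical 8B Llama-style model.

def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, kv_bytes=2):
    # One K and one V entry per layer per token; fp16 cache -> 2 bytes each.
    return 2 * n_layers * num_ctx * n_kv_heads * head_dim * kv_bytes

def weights_bytes(n_params, bits_per_weight):
    # Quantization sets bits per weight: ~16 for F16, roughly 4.5-5 for Q4_K_M.
    return n_params * bits_per_weight / 8

GiB = 1024 ** 3
weights = weights_bytes(8e9, 4.5)  # ~4.2 GiB of weights
for num_ctx in (2048, 8192, 32768, 131072):
    kv = kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128)
    # Ignores activation/compute buffers, so real usage will be somewhat higher.
    print(f"num_ctx={num_ctx:>6}: ~{(weights + kv) / GiB:.1f} GiB")
```

At long contexts the KV cache term dominates; if the KV cache is quantized rather than kept in fp16, that term shrinks roughly proportionally.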

GiteaMirror added the ollama.com and feature request labels 2026-04-29 01:19:32 -05:00

@rick-github commented on GitHub (Mar 14, 2025):

I did this for a few models a while back [here](https://github.com/ollama/ollama/issues/6852#issuecomment-2440229918).


@RustoMCSpit commented on GitHub (May 10, 2025):

> I did this for a few models a while back [here](https://github.com/ollama/ollama/issues/6852#issuecomment-2440229918).

Are these displayed on the website?


@RustoMCSpit commented on GitHub (May 10, 2025):

I imagine the best way to go about this is to simply auto-run each uploaded model on GPUs from weakest to strongest until one works. You'd need access to such GPUs, but after that this becomes a fully automated process.
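A minimal sketch of that probing loop follows, assuming a hypothetical try_load() callable that attempts to fully load a given model and context size into a fixed amount of VRAM and reports success; the GPU sizes listed are illustrative, not a proposed fleet:

```python
# Sketch of probing GPUs from weakest to strongest until the model fits.
# try_load() is a hypothetical helper, not an existing Ollama API.

GPU_VRAM_GIB = [8, 12, 16, 24, 48, 80]  # weakest to strongest

def smallest_gpu_for(model, num_ctx, try_load):
    for vram in GPU_VRAM_GIB:
        if try_load(model, num_ctx, vram_gib=vram):
            return vram  # first (smallest) GPU the model fully fits on
    return None          # does not fit on any listed GPU
```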

Reference: github-starred/ollama#52903