[GH-ISSUE #10177] mistral-small3.1 using too much VRAM #6677

Closed
opened 2026-04-12 18:24:18 -05:00 by GiteaMirror · 11 comments

Originally created by @arty-hlr on GitHub (Apr 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10177

What is the issue?

Loading mistral-small3.1 24b in Q4 takes double the VRAM it should with the default 4096 context:

![Image](https://github.com/user-attachments/assets/4d684d12-f78f-47a3-8a0a-0c07242c4e4f)

![Image](https://github.com/user-attachments/assets/fdae78e7-46e8-494d-aa4c-1935513dd8c5)

![Image](https://github.com/user-attachments/assets/774e8842-7c31-42b5-bc06-dbeb0a330c6f)

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.6.5
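
A minimal way to reproduce the measurement (a sketch, assuming the `mistral-small3.1:24b` tag and an NVIDIA GPU; exact numbers will vary by setup):

```shell
# Load the model with default settings, then compare Ollama's size estimate
# with what the driver actually reports.
ollama run mistral-small3.1:24b "hello" >/dev/null

# Ollama's view: estimated size and GPU/CPU placement of the loaded model.
ollama ps

# Driver's view: actual VRAM in use per GPU.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```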

GiteaMirror added the bug label 2026-04-12 18:24:18 -05:00

@oblq commented on GitHub (Apr 8, 2025):

Same here: `level=WARN source=server.go:133 msg="model request too large for system" requested="49.8 GiB"`.
This is for the q8 version.

OS
Linux

GPU
Nvidia

CPU
Intel

Ollama version
0.6.5
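
One workaround for the "model request too large" path is to request a smaller context window, since the KV cache and graph estimate scale with it. A minimal sketch using the standard `num_ctx` option (the model tag below is illustrative; substitute the q8 tag you actually pulled):

```shell
# Request a smaller context window for this call only; num_ctx is a standard
# Ollama model option. The model tag here is illustrative -- use your q8 tag.
curl http://localhost:11434/api/generate -d '{
  "model": "mistral-small3.1:24b",
  "prompt": "hello",
  "options": { "num_ctx": 8192 }
}'
```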

@toastloaf commented on GitHub (Apr 8, 2025):

Same issue here, it's using 26 GB with 16k context, flash attention, and q8 KV cache quantization.

@maglat commented on GitHub (Apr 8, 2025):

Same issue sadly. A 15k context size uses 31 GB of VRAM on my side. The model is now split across two GPUs (two RTX 3090s).

@mmb78 commented on GitHub (Apr 8, 2025):

Maybe you have OLLAMA_NUM_PARALLEL=2?

@toastloaf commented on GitHub (Apr 8, 2025):

> Maybe you have OLLAMA_NUM_PARALLEL=2?

Not sure what the default is, but manually setting it to 1 doesn't change anything.
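
For reference, one way to check what the server actually sees and pin it to 1 (a sketch assuming a Linux install where ollama runs as a systemd service named `ollama`; adapt if yours differs):

```shell
# Show the environment variables the ollama systemd service was started with.
systemctl show ollama --property=Environment

# Pin parallelism to 1 and restart; add the line under [Service] in the override.
sudo systemctl edit ollama
#   Environment="OLLAMA_NUM_PARALLEL=1"
sudo systemctl restart ollama
```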

@Fade78 commented on GitHub (Apr 8, 2025):

I have the same bug but another observation: while ollama reports using 32 GB (I adjusted the context so mistral-small3.1:24b fits) on my 2x 4060 Ti 16 GB, it in fact only takes around 20 GB.

**ollama version is 0.6.5**

```
NAME                    ID              SIZE     PROCESSOR    UNTIL
mistral-small3.1:24b    b9aaf0c2586a    32 GB    100% GPU     36 hours from now
```

```
31099 root   1  Compute  79%  15264MiB  93%     0%   1596MiB /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models
31099 root   0  Compute  20%   4628MiB  28%   101%   1596MiB /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models
```

If I push the context a little higher, it dispatches computation to the CPU even though there is still VRAM available.
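
A quick way to quantify the gap between the estimate and the actual allocation (a sketch; output formats may vary by driver version):

```shell
# Ollama's estimate for the loaded model (the 32 GB figure above).
ollama ps

# Actual VRAM held by the ollama runner process on each GPU.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```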

@maglat commented on GitHub (Apr 8, 2025):

> Maybe you have OLLAMA_NUM_PARALLEL=2?

No, I haven't set that. But even if I had, the VRAM consumption would be way too high. I was able to run Mistral-Small-24B (the one before v3.1) with an even higher context size of 22k, with a total VRAM consumption of 21 GB.

@jessegross commented on GitHub (Apr 8, 2025):

Several notes:

  • mistral-small3.1 is a vision model, whereas mistral-small is text only.
  • The amount of VRAM required is not just the on-disk size of the weights; it also includes the computation graph and KV cache. Ollama must account for all of these when allocating memory.
  • The computation graph for vision models is quite large. Furthermore, the graph can vary in size based on factors such as context length and image size.
  • Ollama must account for the worst case of the computation graph in its estimates, which means that currently allocated memory might well be less than what `ollama ps` shows.
  • You can force more to be offloaded by setting `num_gpu`, and this may work in non-worst-case scenarios. However, Ollama may crash depending on the inputs (see the sketch below).

I don't see anything here that actually looks incorrect, so I am going to close this but feel free to keep commenting.
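
For completeness, a minimal sketch of the `num_gpu` override via the REST API; 99 is just a conventional "offload everything" value that exceeds the model's layer count, and, as noted above, this can crash if the worst-case graph does not fit:

```shell
# Force all layers onto the GPU for this request; num_gpu is the number of
# layers to offload, and 99 simply exceeds the model's layer count.
curl http://localhost:11434/api/generate -d '{
  "model": "mistral-small3.1:24b",
  "prompt": "hello",
  "options": { "num_gpu": 99 }
}'
```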

@arty-hlr commented on GitHub (Apr 9, 2025):

@jessegross It is incorrect because other vision models (gemma3 or llama3.2 for example) don't have this issue.

@Fade78 commented on GitHub (Apr 29, 2025):

This issue affects Qwen3 too. For example, qwen3:30b uses 80% of my VRAM while ollama says it uses 100%, and if I compensate, it splits between GPU and CPU. Same problem with 32b (but I didn't note the proportion).

@ilionroot commented on GitHub (May 3, 2025):

Same for Qwen3 4B Q4_K_M, taking 11 GB with an 8192 context. Other models such as gemma3 4B take only half the memory.

Reference: github-starred/ollama#6677