[GH-ISSUE #10359] Memory allocation or estimation problem #6806

Closed
opened 2026-04-12 18:35:21 -05:00 by GiteaMirror · 11 comments
Owner

Originally created by @apunkt on GitHub (Apr 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10359

What is the issue?

Not sure if this is a real issue or just me not fully understanding how things work inside Ollama, but I am experiencing the following:

Setup:
2x NVIDIA M40 24GB + 1x NVIDIA T4 16GB

Ollama config:
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KEEP_ALIVE=10m
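
For reference, on a Linux systemd install these variables are typically applied via a drop-in override for the ollama service; a minimal sketch, assuming the standard install layout:

```shell
# Sketch: apply the config above on a systemd-based install (ollama.service)
sudo systemctl edit ollama.service
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KEEP_ALIVE=10m"
sudo systemctl restart ollama
```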

Running mistral-small3.1:latest with the standard context window size of 2048:
ollama ps reports 10 GB VRAM usage, which nvidia-smi confirms.

Running mistral-small3.1:latest with an extended context window size of 32768:
ollama ps reports:

mistral-small3.1:latest b9aaf0c2586a 59 GB 100% GPU 9 minutes from now

while nvidia-smi reports:

0 N/A N/A 1442364 | C /usr/local/bin/ollama | 11856MiB
1 N/A N/A 1442364 | C /usr/local/bin/ollama | 8624MiB
2 N/A N/A 1442364 | C /usr/local/bin/ollama | 6214MiB

which sums up to 26694 MiB VRAM usage.
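
For anyone wanting to reproduce the comparison, this is roughly how the two numbers above were gathered; a sketch (exact nvidia-smi output may differ by driver version):

```shell
# What Ollama's scheduler thinks the model occupies
ollama ps

# What the ollama process actually has allocated, per GPU
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# Sum the used_memory column across GPUs to get the real-world total
```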

Not sure if this is expected due to Flash Attention usage.
Is nvidia-smi reporting real-world usage and ollama ps the maximum allocation?
Either way, ollama ps reports roughly 2.2x the real-world VRAM usage.

The problem for me arises when trying to load another model at the same time.
There is plenty of VRAM left to load another model and keep Mistral loaded, but when loading another model (also with an extended context window) Ollama unloads Mistral first and then loads the other model, which would have fit perfectly in the remaining VRAM.
This behaviour is independent of the model: it is not a Mistral problem, as the same happens with other models too. Furthermore, the second model is not loaded onto the CPU either. If it were a memory allocation problem, my expectation would be that it gets loaded into CPU and RAM, but instead one model is unloaded and the other is loaded 100% on GPU.

This causes the following problems for my use case:

- I can only have a single model loaded in VRAM when using an extended context window.
- As my users use a variety of models, Ollama is constantly loading and unloading models, causing heavy delay and latency.
- VRAM usage is far from efficient.

If I am using the standard 2048-token context window size, I can load 6 different models in VRAM in parallel, all working simultaneously.
So my expectation for bigger context windows would be that I can load as many models in parallel as fit into VRAM, just as with the standard context window size.
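
For reference, an extended context window like the one above can be set either per request via the API or baked into a derived model via a Modelfile; a minimal sketch (prompt and derived model name are just examples):

```shell
# Option 1: per request, via the API options field
curl http://localhost:11434/api/generate -d '{
  "model": "mistral-small3.1:latest",
  "prompt": "Hello",
  "options": { "num_ctx": 32768 }
}'

# Option 2: bake the context size into a derived model
cat > Modelfile.32k <<'EOF'
FROM mistral-small3.1:latest
PARAMETER num_ctx 32768
EOF
ollama create mistral-small3.1-32k -f Modelfile.32k
```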

Relevant log output


OS

Linux

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 18:35:21 -05:00
Author
Owner

@apunkt commented on GitHub (Apr 21, 2025):

Adding a 4th card causes the size estimate for the same model with the same settings to change:

3 Cards:

| NAME | ID | SIZE | PROCESSOR | UNTIL |
| --- | --- | --- | --- | --- |
| cogito:14b | d0cac86a2347 | 44 GB | 100% GPU | 9 minutes from now |

4 Cards:

| NAME | ID | SIZE | PROCESSOR | UNTIL |
| --- | --- | --- | --- | --- |
| cogito:14b | d0cac86a2347 | 52 GB | 15%/85% CPU/GPU | 9 minutes from now |

So more cards => more VRAM required, with the exact same settings?
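
A quick way to test whether the card count itself changes the estimate is to hide GPUs from Ollama and re-check the report; a sketch, assuming the CUDA device indices match nvidia-smi:

```shell
# Restrict Ollama to three of the four cards (indices are an example)
# and compare the ollama ps size estimate against the four-card run.
CUDA_VISIBLE_DEVICES=0,1,2 ollama serve
# in another shell:
ollama run cogito:14b "hi"
ollama ps
```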

Author
Owner

@c0008 commented on GitHub (Apr 21, 2025):

I ran into the same problem after I enabled flash attention. I have two graphics cards installed (2x 16GB) and the VRAM usage reported by Ollama is too high, which causes unnecessary CPU offloading.
Ollama version is 0.6.5.

For example, ollama ps was reporting 33 GB memory usage for Qwen2.5 Coder 32B Q5, but according to the system monitor the actual VRAM usage was much lower:

GPU1: 13.2 GiB
GPU2: 12.6 GiB
Total: 25.8 GiB

Without flash attention these numbers did match.

This is the memory calculation from the logs:

level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=63 layers.split=30,33 memory.available="[14.6 GiB 15.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="30.8 GiB" memory.required.partial="29.8 GiB" memory.required.kv="3.1 GiB" memory.required.allocations="[14.3 GiB 15.5 GiB]" memory.weights.total="21.2 GiB" memory.weights.repeating="20.6 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="2.5 GiB" memory.graph.partial="2.5 GiB"
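
Since the numbers matched without flash attention, a simple A/B check is to toggle the flag and compare the same msg=offload log line for the same model and context; a sketch, assuming a systemd install:

```shell
# A/B test: same model, same context, flash attention on vs. off
sudo systemctl edit ollama.service   # set Environment="OLLAMA_FLASH_ATTENTION=0"
sudo systemctl restart ollama
# then load the model again and compare the offload estimate in the log
journalctl -u ollama --no-pager | grep 'msg=offload'
```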

Author
Owner

@rick-github commented on GitHub (Apr 22, 2025):

#6160

Author
Owner

@sunhy0316 commented on GitHub (Apr 22, 2025):

Encountered the same issue, where only one model can be loaded. When loading a second model, shouldn't it first check the actual remaining available VRAM rather than use Ollama's predicted VRAM size? If the actual remaining VRAM is sufficient, it should not unload the original model but load the new model directly.

For example, with two GPUs totaling 160GB of VRAM, there is still nearly 80GB of VRAM remaining when trying to load a second model. Even with this much available VRAM, Ollama still unloads the first model, even though both models could easily fit together.
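
For what it's worth, the actual remaining VRAM the scheduler would need to consult can be read directly from the driver; a minimal sketch:

```shell
# Actual free VRAM per GPU as reported by the driver
nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv
```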

Author
Owner

@bitcandy commented on GitHub (Apr 29, 2025):

Devs, until the new memory-management code is available, could we please have a version of Ollama that lets us manually set/increase memory utilization? Ideally it would be a parameter we can adjust to ensure that the model fits fully into GPU VRAM.

Author
Owner

@sunhy0316 commented on GitHub (May 8, 2025):

Writing it in the Modelfile would be a good choice.

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

```
PARAMETER num_gpu xxx
```
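
For completeness, a minimal sketch of using that parameter in a derived model (model name and layer count are just examples; num_gpu sets how many layers are offloaded to GPU):

```shell
# Sketch: force (up to) all layers onto the GPU for a derived model
cat > Modelfile.gpu <<'EOF'
FROM mistral-small3.1:latest
PARAMETER num_gpu 99
EOF
ollama create mistral-small3.1-gpu -f Modelfile.gpu
ollama run mistral-small3.1-gpu "hi"
```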
Author
Owner

@MarkWard0110 commented on GitHub (May 26, 2025):

There is something off with Ollama and the Mistral model now:
https://github.com/ollama/ollama/issues/10553

Author
Owner

@c0008 commented on GitHub (Jun 15, 2025):

When I use Qwen3 32B Q5_K_M with Ollama, it limits me to a context length of 14000 before offloading kicks in. At that point I still have 7 GB of 32 GB VRAM unused, so the VRAM estimate for the context is way off: Ollama assumes about 9 GB while it is actually only about 2 GB.
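
One partial workaround until the estimates are fixed is to shrink the real KV-cache cost with quantization; a sketch, assuming flash attention stays enabled (as far as I know it is required for KV-cache quantization):

```shell
# Quantize the KV cache to roughly halve its VRAM footprint
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```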

Author
Owner

@jessegross commented on GitHub (Jun 16, 2025):

There is an early preview of Ollama's new memory management with the goal of comprehensively fixing these issues. It is still in development; however, if you want to compile from source and try it out, you can find it here: https://github.com/ollama/ollama/pull/11090

Please leave any feedback on that PR.
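
For anyone wanting to try it, a rough sketch of checking out and building that PR from source (the general Go workflow; GPU runners may need extra steps per the repository's developer docs):

```shell
git clone https://github.com/ollama/ollama.git
cd ollama
git fetch origin pull/11090/head:pr-11090
git checkout pr-11090
go build .
./ollama serve
```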

Author
Owner

@jessegross commented on GitHub (Sep 24, 2025):

I'm going to go ahead and close this now that the new memory management logic is on by default. If you continue to see problems, please file a new issue.

Reference: github-starred/ollama#6806