[GH-ISSUE #12941] Ollama uses 40GB RAM in qwen3-4b-instruct #55093

Closed
opened 2026-04-29 08:19:25 -05:00 by GiteaMirror · 9 comments
Owner

Originally created by @owenzhao on GitHub (Nov 4, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12941

What is the issue?

See. It use 40GB on Qwen3-4bit-instruct with 256K context. Is this normal?

Image

Relevant log output


OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.12.9

Originally created by @owenzhao on GitHub (Nov 4, 2025). Original GitHub issue: https://github.com/ollama/ollama/issues/12941 ### What is the issue? See. It use 40GB on Qwen3-4bit-instruct with 256K context. Is this normal? <img width="1175" height="758" alt="Image" src="https://github.com/user-attachments/assets/402560d5-f8c6-48f6-8bae-5f2d2c480ef0" /> ### Relevant log output ```shell ``` ### OS macOS ### GPU Apple ### CPU Apple ### Ollama version 0.12.9
GiteaMirror added the bug label 2026-04-29 08:19:25 -05:00
Author
Owner

@owenzhao commented on GitHub (Nov 4, 2025):

This issue is related to the cache size. This happens when cache is set to 256K in ollama app's settings.

Image
<!-- gh-comment-id:3488379673 --> @owenzhao commented on GitHub (Nov 4, 2025): This issue is related to the cache size. This happens when cache is set to 256K in ollama app's settings. <img width="912" height="712" alt="Image" src="https://github.com/user-attachments/assets/35c353f7-20c6-4348-9eba-8b6324a64f20" />
Author
Owner

@rick-github commented on GitHub (Nov 4, 2025):

Different models use a different amount of space per token in the context buffer. More buffer, more RAM required.

Image
<!-- gh-comment-id:3488381846 --> @rick-github commented on GitHub (Nov 4, 2025): Different models use a different amount of space per token in the context buffer. More buffer, more RAM required. <img width="916" height="600" alt="Image" src="https://github.com/user-attachments/assets/f07ef3d4-1024-4969-9b15-4f4b5ab89b09" />
Author
Owner

@owenzhao commented on GitHub (Nov 4, 2025):

Maybe. But many model providers can dynamically choose the context basing on the input. For example, if you use Kimi-last as your model, it will automatically choose the context base on your input and prices of different contexts are different.

So I had been thinking Ollama's context was dynamic as well. The context size we set should be max size that we could keep, not every time it used.

Is there any limit forbiding Ollama to do that?

<!-- gh-comment-id:3488411101 --> @owenzhao commented on GitHub (Nov 4, 2025): Maybe. But many model providers can dynamically choose the context basing on the input. For example, if you use Kimi-last as your model, it will automatically choose the context base on your input and prices of different contexts are different. So I had been thinking Ollama's context was dynamic as well. The context size we set should be max size that we could keep, not every time it used. Is there any limit forbiding Ollama to do that?
Author
Owner

@rick-github commented on GitHub (Nov 4, 2025):

The size of the context can be set by the client when it sends an API request, via "options":{"num_ctx":context_value}. Ollama doesn't currently support dynamic context.

<!-- gh-comment-id:3488430297 --> @rick-github commented on GitHub (Nov 4, 2025): The size of the context can be set by the client when it sends an API request, via `"options":{"num_ctx":context_value}`. Ollama doesn't currently support dynamic context.
Author
Owner

@owenzhao commented on GitHub (Nov 4, 2025):

I have never used "num_ctx" before. I think this works the same way as the settings in Ollama. And how could I know the exact context size so I can set this value?

<!-- gh-comment-id:3488455762 --> @owenzhao commented on GitHub (Nov 4, 2025): I have never used "num_ctx" before. I think this works the same way as the settings in Ollama. And how could I know the exact context size so I can set this value?
Author
Owner

@rick-github commented on GitHub (Nov 4, 2025):

The response from ollama contains the number of tokens in the previous prompt and the inference. Add them together, add the length of the additional prompt being sent, set num_ctx. Note that resizing the context on each API call will result in a model reload. So the more efficient way is to choose an initial context buffer that will hold a few rounds of conversation, and then resize when it's close to full.

Or just choose a context buffer suitable for the entirety of the interaction.

<!-- gh-comment-id:3488468097 --> @rick-github commented on GitHub (Nov 4, 2025): The response from ollama contains the number of tokens in the previous prompt and the inference. Add them together, add the length of the additional prompt being sent, set `num_ctx`. Note that resizing the context on each API call will result in a model reload. So the more efficient way is to choose an initial context buffer that will hold a few rounds of conversation, and then resize when it's close to full. Or just choose a context buffer suitable for the entirety of the interaction.
Author
Owner

@owenzhao commented on GitHub (Nov 5, 2025):

In that case, I think the best practices of save RAM, is set the max size that a conversation will be use in my app and then set the context size to minimum(4k), after the conversation. That will free the previous occupied RAM.

<!-- gh-comment-id:3488494793 --> @owenzhao commented on GitHub (Nov 5, 2025): In that case, I think the best practices of save RAM, is set the max size that a conversation will be use in my app and then set the context size to minimum(4k), after the conversation. That will free the previous occupied RAM.
Author
Owner

@rick-github commented on GitHub (Nov 5, 2025):

Unloading the model (setting "keep_alive":0) will also free up RAM.

<!-- gh-comment-id:3488500670 --> @rick-github commented on GitHub (Nov 5, 2025): Unloading the model (setting `"keep_alive":0`) will also free up RAM.
Author
Owner

@pdevine commented on GitHub (Nov 5, 2025):

I'm going to close this as answered (thank you @rick-github !)

<!-- gh-comment-id:3493813323 --> @pdevine commented on GitHub (Nov 5, 2025): I'm going to close this as answered (thank you @rick-github !)
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/ollama#55093