[GH-ISSUE #11335] Memory increases to max value when MAX_QUEUE and NUM_PARALLEL are set #69534

Closed
opened 2026-05-04 18:23:39 -05:00 by GiteaMirror · 5 comments

Originally created by @aaronpliu on GitHub (Jul 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11335

What is the issue?

I am running Ollama (v0.9.5) on a Mac M3 Ultra with 512 GB of memory and would like to take advantage of the machine's hardware to support access by multiple people.
When I set `MAX_QUEUE=20` and `NUM_PARALLEL=10`, memory usage keeps growing.
With Ollama's default settings, memory usage stays low, but responses are slow as well.
Is there a best practice for matching Ollama's configuration to the machine's hardware? Thanks
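
A minimal sketch of the setup being described, assuming the settings refer to the `OLLAMA_MAX_QUEUE` and `OLLAMA_NUM_PARALLEL` environment variables and the macOS app, where `launchctl setenv` is the documented way to set them:

```shell
# Settings from the report above (assuming the macOS Ollama app).
launchctl setenv OLLAMA_MAX_QUEUE 20
launchctl setenv OLLAMA_NUM_PARALLEL 10
# Restart the Ollama app for the new values to take effect.
```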

Relevant log output


OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.9.5

GiteaMirror added the bug label 2026-05-04 18:23:39 -05:00

@rick-github commented on GitHub (Jul 9, 2025):

Each increment in `OLLAMA_NUM_PARALLEL` needs to allocate another context buffer ([`OLLAMA_CONTEXT_LENGTH`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size)). If the amount of memory required by the model is larger than the memory assigned to the GPU, some layers will be loaded in system RAM, where inference is slower. Increasing the parallelism won't increase the response rate for any single completion. If you want a faster response, you need to use a smaller model or a faster GPU.
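
A back-of-envelope sketch of that scaling (illustrative only: the real per-token cost depends on the model's layer count, head dimensions, and KV-cache quantization, and `PER_TOKEN_KB` below is a hypothetical figure):

```shell
# Rough KV-cache estimate: one context buffer per parallel slot.
NUM_PARALLEL=10
CONTEXT_LENGTH=4096   # assumed default context window
PER_TOKEN_KB=512      # hypothetical per-token KV cost for a large model
echo "$(( NUM_PARALLEL * CONTEXT_LENGTH * PER_TOKEN_KB / 1024 / 1024 )) GiB of KV cache"
# -> 20 GiB on top of the model weights, vs. 2 GiB at NUM_PARALLEL=1:
#    each additional parallel slot adds a full context buffer's worth of memory.
```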


@aaronpliu commented on GitHub (Jul 11, 2025):

So I need to adjust `OLLAMA_NUM_PARALLEL` and `OLLAMA_CONTEXT_LENGTH` at the same time? Where is `OLLAMA_CONTEXT_LENGTH`? I don't see it in `ollama serve -h`.
What's the suggested way to set its value for a given machine and model?


@rick-github commented on GitHub (Jul 11, 2025):

> Where is `OLLAMA_CONTEXT_LENGTH`?

[`OLLAMA_CONTEXT_LENGTH`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size)

> What's the suggested way to set its value for a given machine and model?

Set `OLLAMA_NUM_PARALLEL` and `OLLAMA_CONTEXT_LENGTH` such that they use as much of the VRAM as possible without spilling into system RAM. You can check the amount of VRAM being used with `ollama ps`.
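
A sketch of that check (the exact `ollama ps` column layout may vary across versions):

```shell
# After loading a model under the new settings, verify that nothing has
# spilled into system RAM: the PROCESSOR column should read "100% GPU".
# A split such as "40%/60% CPU/GPU" means some layers run on the CPU,
# and inference will be slower.
ollama ps
```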


@pdevine commented on GitHub (Jul 12, 2025):

I'm going to go ahead and close this as answered (thank you @rick-github!).


@aaronpliu commented on GitHub (Aug 5, 2025):

Then it did not solve the issue.
