[GH-ISSUE #9246] question about "ollama num parallel" #6025

Closed
opened 2026-04-12 17:21:54 -05:00 by GiteaMirror · 2 comments

Originally created by @A3shTnT on GitHub (Feb 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9246

How does"ollama num parallel"work,and does the default value of 0 mean there is no limit?
I have eight L20 GPUs and want to run the 202GB DeepSeek-Q2 model. Should I adjust the OLLAMA_NUM_PARALLEL to a more reasonable value?


@rick-github commented on GitHub (Feb 20, 2025):

`OLLAMA_NUM_PARALLEL` creates a context for each parallel completion. So if you set `num_ctx=4096` and `OLLAMA_NUM_PARALLEL=4`, ollama allocates enough VRAM to hold 16384 tokens. If it is unset, ollama picks a value based on the available memory: if there is plenty, the default is 4; otherwise ollama uses a value of 1.
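
To make the arithmetic concrete, here is a minimal sketch (the model name and the 4096/4 values are placeholders for illustration, not a recommendation for the 202GB DeepSeek setup):

```sh
# Start the server with 4 parallel request slots (example value).
OLLAMA_NUM_PARALLEL=4 ollama serve

# Each slot gets its own context, so a request with num_ctx=4096 makes the
# server reserve KV-cache VRAM for 4096 * 4 = 16384 tokens in total.
curl http://localhost:11434/api/generate -d '{
  "model": "your-model",
  "prompt": "hello",
  "options": { "num_ctx": 4096 }
}'
```

Setting the variable explicitly simply overrides the memory-based default described above.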


@A3shTnT commented on GitHub (Feb 20, 2025):

> `OLLAMA_NUM_PARALLEL` creates a context for each parallel completion. So if you set `num_ctx=4096` and `OLLAMA_NUM_PARALLEL=4`, ollama allocates enough VRAM to hold 16384 tokens. If it is unset, ollama picks a value based on the available memory: if there is plenty, the default is 4; otherwise ollama uses a value of 1.

got it, thank you.😘
