[GH-ISSUE #4894] Feature: Allow setting OLLAMA_NUM_PARALLEL per model #3090

Open
opened 2026-04-12 13:31:46 -05:00 by GiteaMirror · 5 comments

Originally created by @sammcj on GitHub (Jun 7, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4894

It would be great if you could set OLLAMA_NUM_PARALLEL per model.

Example use case:

  • You have one large "smart" model that you only ever want to send one request at a time, to avoid using all your memory.
  • You have a smaller "fast" model (or just one with a smaller context) that you might want to allow a number of parallel requests to.

Perhaps this could be configured with a [modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md) and corresponding [API parameter](https://github.com/ollama/ollama/blob/main/docs/api.md#parameters) rather than at launch time?
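For illustration, here is the current launch-time setting alongside what a per-model override might look like. Today only the environment variable exists; the `num_parallel` Modelfile parameter shown below is hypothetical:

```shell
# Today: one global value, fixed when the server starts, applied to every model
OLLAMA_NUM_PARALLEL=4 ollama serve
```

```
# Hypothetical Modelfile override (num_parallel is not a real parameter yet)
FROM llama3
PARAMETER num_parallel 1
```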

GiteaMirror added the feature request label 2026-04-12 13:31:46 -05:00

@JordanDalton commented on GitHub (Oct 24, 2024):

+1


@forReason commented on GitHub (Mar 27, 2025):

This is important; it does not make sense to use the same parallelisation for all models.
If I have a processing model with a short context length, I may want many parallel instances, but for a very large model with medium context, I may require less parallelisation.


@JoshJarabek7 commented on GitHub (Jun 11, 2025):

Is this a thing yet? The embedding model can handle hundreds of requests in parallel, but Llama 4 Scout with its 10M context would probably need parallelization set to 1.
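In the meantime, one possible workaround is to run a second Ollama instance on another port with its own parallelism setting, using the existing `OLLAMA_HOST` and `OLLAMA_NUM_PARALLEL` environment variables (the port below is arbitrary, and each instance keeps its own copy of any loaded models in memory):

```shell
# High-throughput instance dedicated to the embedding model
OLLAMA_HOST=127.0.0.1:11435 OLLAMA_NUM_PARALLEL=32 ollama serve

# Default instance on :11434, kept at one request at a time for the large model
OLLAMA_NUM_PARALLEL=1 ollama serve
```

Clients would then point at whichever port serves the model they need.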


@megafetis commented on GitHub (Aug 15, 2025):

+1


@fighter3005 commented on GitHub (Aug 22, 2025):

+1

Reference: github-starred/ollama#3090