[GH-ISSUE #5693] Per-Model Concurrency #29308

Closed
opened 2026-04-22 08:04:06 -05:00 by GiteaMirror · 3 comments

Originally created by @ProjectMoon on GitHub (Jul 15, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5693

I like the new concurrency features. Would it be possible to add a new Modelfile parameter to control parallel requests on a per-model basis? It would override `OLLAMA_NUM_PARALLEL` if set. The primary use case is to let small models, like embedding models, serve many quick requests at once, while larger models that take longer serve fewer requests at a time. Larger models could then be loaded more fully into the GPU, while the embedding models work much faster.

Embedding creation from OpenWebUI is much faster when hitting a batch of documents (~45) with high parallelism (20 in my test). However, that high parallelism forces the LLM that will generate text to be loaded mostly into CPU memory, because it also reserves capacity for 20 parallel requests (when in reality it will serve one).
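For illustration, the override might look like a hypothetical `num_parallel` Modelfile parameter (this parameter does not exist in Ollama today; the syntax below is only a sketch of the proposal):

```
# Hypothetical Modelfile for an embedding model: many cheap requests at once
FROM nomic-embed-text
PARAMETER num_parallel 20
```

```
# Hypothetical Modelfile for a large text-generation model: keep the KV cache
# small so more layers fit on the GPU
FROM llama3:70b
PARAMETER num_parallel 1
```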

GiteaMirror added the feature request label 2026-04-22 08:04:07 -05:00

@ProjectMoon commented on GitHub (Jul 15, 2024):

Based on a quick skim of the code, this could perhaps live in `sched.go` under `processPending`: just set `numParallel` to the Modelfile's value, if it exists?
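A rough sketch of that resolution order, assuming a hypothetical per-model value passed in as `modelNumParallel` (the function name and fallback logic here are illustrative, not the actual scheduler code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// resolveNumParallel picks the parallelism for one model: a per-model
// Modelfile value (hypothetical) wins over the server-wide
// OLLAMA_NUM_PARALLEL environment variable, which wins over a default.
func resolveNumParallel(modelNumParallel, defaultParallel int) int {
	if modelNumParallel > 0 { // 0 means "not set in the Modelfile"
		return modelNumParallel
	}
	if v := os.Getenv("OLLAMA_NUM_PARALLEL"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultParallel
}

func main() {
	fmt.Println(resolveNumParallel(20, 4)) // embedding model: per-model value wins
	fmt.Println(resolveNumParallel(0, 4))  // LLM: env var if set, else default of 4
}
```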


@ProjectMoon commented on GitHub (Jul 15, 2024):

https://github.com/ollama/ollama/pull/5657 also relevant.


@pdevine commented on GitHub (Sep 17, 2024):

This is definitely something that we're thinking about, but I'm going to close it as a dupe of #4894

cc @dhiltgen
