[GH-ISSUE #8778] Feature: Parallel processing of embeddings? #5700

Closed
opened 2026-04-12 16:59:38 -05:00 by GiteaMirror · 1 comment

Originally created by @AncientMystic on GitHub (Feb 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8778

When using embedding models for documents in UIs for Ollama, they are pretty slow; I am seeing 1.2 GB of VRAM usage from the model and only 20% CPU and 15% GPU total usage on the whole system.

Would it be possible to allow parallel processing of requests for documents?

Even if we cannot increase the speed of processing a single request, could we at least process a few at once, maybe 2-4+ instead of one? (Even if that means loading the model more than once, one copy per instance.)

(I also have OLLAMA_NUM_PARALLEL set, along with max queue, max loaded models, etc., but it doesn't seem to work the same as it did in older versions; even multiple small models won't load together, it seems, but that's another issue.)

GiteaMirror added the feature request label 2026-04-12 16:59:38 -05:00

@rick-github commented on GitHub (Feb 3, 2025):

Ollama currently [doesn't allow](https://github.com/ollama/ollama/blob/ad22ace439eb3fab7230134e56bb6276a78347e4/server/sched.go#L196) parallel completions for embedding models, which is why `OLLAMA_NUM_PARALLEL` has no effect. Loading a model multiple times is not supported, but there is a [ticket](https://github.com/ollama/ollama/issues/3902) open for this issue. The only way to accomplish what you want at the moment is to run multiple servers and offer a uniform interface with a reverse proxy (e.g. [litellm](https://github.com/BerriAI/litellm), [ollama_proxy](https://github.com/ParisNeo/ollama_proxy_server), [nginx](https://github.com/ollama/ollama/issues/8186#issuecomment-2560443545)).

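As a rough illustration of the multi-server workaround described above (a sketch, not an official recipe): assuming a second instance was started with `OLLAMA_HOST=127.0.0.1:11435 ollama serve`, a small client can round-robin documents across the instances via ollama's `/api/embed` endpoint. The server addresses and model name below are illustrative assumptions.

```python
# Sketch: fan embedding requests out across several ollama servers.
# Assumes a second instance was started with, e.g.:
#   OLLAMA_HOST=127.0.0.1:11435 ollama serve
import itertools
from concurrent.futures import ThreadPoolExecutor

import requests

SERVERS = ["http://127.0.0.1:11434", "http://127.0.0.1:11435"]  # assumed ports
MODEL = "nomic-embed-text"  # illustrative model name

def embed(server: str, text: str) -> list[float]:
    """Request an embedding for one document from one server."""
    resp = requests.post(
        f"{server}/api/embed",
        json={"model": MODEL, "input": text},
        timeout=120,
    )
    resp.raise_for_status()
    # /api/embed returns {"embeddings": [[...]]}; one vector per input.
    return resp.json()["embeddings"][0]

def embed_all(documents: list[str]) -> list[list[float]]:
    """Round-robin documents across servers, one worker per server."""
    targets = itertools.cycle(SERVERS)
    with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
        futures = [pool.submit(embed, next(targets), doc) for doc in documents]
        return [f.result() for f in futures]

if __name__ == "__main__":
    vectors = embed_all(["first document", "second document", "third document"])
    print(len(vectors), "embeddings,", len(vectors[0]), "dimensions each")
```

A reverse proxy such as litellm or nginx accomplishes the same distribution at the HTTP layer, so existing clients can keep pointing at a single base URL.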

Reference: github-starred/ollama#5700