[GH-ISSUE #3902] Model Pooling and Instance Management #48927

Open
opened 2026-04-28 10:07:40 -05:00 by GiteaMirror · 5 comments

Originally created by @saul-jb on GitHub (Apr 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3902

Originally assigned to: @dhiltgen on GitHub.

This builds on model concurrency (#3418) and the `keep_alive` option.

First of all, it would be great to be able to load multiple instances of the same model; if I'm not mistaken, model concurrency currently only works across different models.

Of course, loading multiple instances would require a new way of managing them, since each model name could then refer to many instances; instances would therefore need to be individually identifiable.

In addition, having more ways to inspect the state of the system would be important. Here are some things a user might want to do (a rough sketch of what such an API could look like follows this list):

  • Get the ID of a model instance (when generating or preloading).
  • List the models/instances currently loaded into memory.
  • Get information about each instance (keep_alive, expiry, queue length, RAM/VRAM usage etc.)
  • Unload specific instances from memory.
  • Queue requests to any instance of a model.
  • Queue requests to a particular instance of a model.
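
Something like the following is what I have in mind. This is purely a hypothetical sketch: none of these endpoints or parameters exist in Ollama today, and the paths, the `instance` field and the instance IDs are invented here only to illustrate the requested surface.

```python
# Hypothetical sketch only: the /api/instances endpoints and the "instance"
# parameter do NOT exist in Ollama today; they illustrate what this issue asks for.
import json
import urllib.request

BASE = "http://localhost:11434"

def get(path):
    with urllib.request.urlopen(BASE + path) as resp:
        return json.load(resp)

def post(path, payload):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# List loaded instances: each model name may map to several instance IDs,
# each reporting keep_alive, expiry, queue length and RAM/VRAM usage.
for inst in get("/api/instances")["instances"]:            # hypothetical endpoint
    print(inst["id"], inst["model"], inst["queue_length"], inst["size_vram"])

# Queue a request to any instance of a model, or pin it to a specific instance.
post("/api/generate", {"model": "llama3", "prompt": "hi", "stream": False})
post("/api/generate", {"model": "llama3", "instance": "llama3/0",   # hypothetical field
                       "prompt": "hi", "stream": False})

# Unload one specific instance from memory.
post("/api/instances/unload", {"instance": "llama3/0"})    # hypothetical endpoint
```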

My specific use case is that I have different models for different applications (multi-modal LLM, smarter LLM, faster LLM, etc.), and I would like to have pools of these models (ideally dynamic ones) so that each model has several instances and can handle multiple concurrent requests on the same machine, much like a thread pool.
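
Until something like this exists natively, the closest client-side approximation I can think of is a small pool over several independent `ollama serve` processes, each bound to its own port via `OLLAMA_HOST` and each holding its own copy of the model. A minimal sketch (hosts, ports and model name below are placeholders):

```python
# Minimal round-robin pool over several independent Ollama servers.
# Assumes extra servers were started on their own ports, e.g.
# `OLLAMA_HOST=127.0.0.1:11435 ollama serve`; no health checks or retries.
import itertools
import json
import urllib.request

class OllamaPool:
    def __init__(self, hosts):
        self._hosts = itertools.cycle(hosts)  # rotate through servers round-robin

    def generate(self, model, prompt):
        host = next(self._hosts)
        req = urllib.request.Request(
            f"{host}/api/generate",
            data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["response"]

pool = OllamaPool(["http://127.0.0.1:11434", "http://127.0.0.1:11435"])
print(pool.generate("llama3", "Say hello."))
```

The obvious downside is that each server loads its own copy of the model weights, which is exactly the duplication a native instance pool could manage more intelligently.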

References:

  • https://github.com/ollama/ollama/pull/3418
  • https://github.com/ollama/ollama/issues/2431
  • https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-pre-load-a-model-to-get-faster-response-times
GiteaMirror added the feature request label 2026-04-28 10:07:41 -05:00

@jeepshop commented on GitHub (Jun 25, 2025):

Please, please, please implement this. The parallel features don't really provide the same functionality: making use of multiple GPUs to serve the same model simultaneously.


@davonisher commented on GitHub (Jan 16, 2026):

Has this been implemented yet? Consistent outputs with better efficiency are very important, and running multiple instances of the same model on one or more GPUs would improve this.


@renatomaluhy commented on GitHub (Apr 2, 2026):

Real-world impact: OpenClaw multi-user deployment

We're running Ollama in production with OpenClaw (personal AI assistant across Telegram, WhatsApp, Slack, Discord) and hitting the same concurrency wall.

Our setup:

  • Mac mini M-series, Ollama 0.18.3+
  • Multiple users sending messages simultaneously via Telegram
  • Models: qwen3.5:9b, qwen3.5:27b, gemma3

What we're experiencing:

  • Requests queue for 6-10 hours instead of processing in parallel (see #15195 for logs)
  • Setting has no observable effect
  • Different models serialize instead of running concurrently
  • Users see massive delays during peak hours

Why this matters:
This isn't just about throughput — it's about usability. When 3-5 people message their assistant within minutes, everyone waits hours. Makes multi-user deployments effectively unusable.

We need:

  1. True parallel request handling (not just queuing)
  2. Model instance pooling (multiple copies of same model)
  3. Visibility into queue state / instance status
  4. Roadmap timeline — even rough ETA helps us plan

The 33 reactions on this issue suggest this is a common production blocker, not an edge case.

Any update from the Ollama team on when this might land? Happy to help test PRs or provide more logs.


Related: #14510, #12107, #15195


@CastelDazur commented on GitHub (Apr 2, 2026):

Running a 27B + 7B simultaneously on a single 24GB card basically forces you to pick between context length and parallelism since there's no per-model VRAM budget in the scheduler. Even just exposing current slot usage and memory per instance via /api/ps would make external load balancing way more practical than the nginx workaround in #4165.
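
For reference, this is roughly what /api/ps reports today (a quick sketch; field names reflect my understanding of the current API), which is the surface I'd like to see extended with per-instance slot and queue information:

```python
# Query the existing /api/ps endpoint and print memory usage per loaded model.
# Today it reports per-model totals only, not per-instance slots or queue depth.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    running = json.load(resp)["models"]

for m in running:
    vram_gb = m.get("size_vram", 0) / 1e9
    print(f"{m['name']}: {vram_gb:.1f} GB in VRAM, expires {m.get('expires_at')}")
```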


@Robinsane commented on GitHub (Apr 3, 2026):

@renatomaluhy
OLLAMA_NUM_PARALLEL doesn't work with the Qwen MoE architecture at the moment; see the following issue:
https://github.com/ollama/ollama/issues/4165
