[GH-ISSUE #11810] Clarification/feature request: utilization‑aware multi‑GPU scheduling with a single model instance (and multi‑model placement) #54350

Open
opened 2026-04-29 05:47:19 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @pfcouto on GitHub (Aug 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11810

I have a machine with 2 GPUs (each 40+ GB VRAM) and two applications that both use qwen3:14b (~10 GB). I’m running a single Ollama daemon with default settings.

I have not tested concurrent access yet. I’d like to understand the default behavior when multiple clients use the same model, and whether Ollama can be configured to achieve the desired utilization‑aware behavior below. If not supported, I’d appreciate guidance, roadmap info, and alternatives.

Desired behavior

  • Avoid duplicate VRAM copies where possible (single model instance per model).
  • Prioritize inference latency/throughput while keeping both GPUs effectively utilized.
  • Dynamic, utilization‑aware scheduling:

Same model, two apps:

  • App A loads qwen3:14b on GPU 0 and runs inference there.
  • When App B starts (same model), route its inference based on current utilization:
    • If GPU 0 is busy with A, prefer GPU 1 for B.
    • If GPU 0 becomes idle, allow B to switch to GPU 0 (if GPU 0 is the better choice), or span GPUs 0+1 if that improves latency/throughput.
    • If B is spanning GPUs 0+1 and A needs to run, let A take one GPU; B should release that GPU and continue on the other.
  • Ideally, all of this happens without duplicating the model in VRAM (one logical model instance), unless duplication is required.

Adding a different model:

  • If a new App C uses a different model, place that model on whichever GPU has capacity (e.g., GPU 1 if GPU 0 already holds qwen3:14b).
  • Route inference for each app/model to the least‑busy GPU(s), allowing elastic use of single or multiple GPUs when beneficial.
  • Maintain a preference for avoiding duplicate model loads across GPUs, unless duplication is needed for performance (a rough sketch of the routing policy I have in mind follows below).
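
To make the scheduling I'm describing more concrete, here is a rough, purely illustrative sketch of the per‑request placement policy (hypothetical types, function names, and thresholds; nothing here corresponds to actual Ollama internals):

```python
# Hypothetical, illustrative sketch of the utilization-aware placement policy
# described above. None of these types or functions exist in Ollama today.
from dataclasses import dataclass

@dataclass
class GPU:
    index: int
    utilization: float    # 0.0..1.0, e.g. sampled from NVML
    free_vram_gb: float
    resident_models: set  # model names already loaded on this GPU

def pick_gpus_for_request(model: str, model_size_gb: float, gpus: list[GPU]) -> list[int]:
    """Return the GPU index/indices a new request for `model` should run on."""
    # 1. Prefer a GPU that already holds the model and is not busy
    #    (reuses the existing copy, no duplication).
    resident = [g for g in gpus if model in g.resident_models]
    idle_resident = [g for g in resident if g.utilization < 0.5]
    if idle_resident:
        return [min(idle_resident, key=lambda g: g.utilization).index]

    # 2. Otherwise, fall back to the least-busy GPU with enough free VRAM,
    #    even if that means loading a second copy of the model there.
    candidates = [g for g in gpus if g.free_vram_gb >= model_size_gb]
    if candidates:
        return [min(candidates, key=lambda g: g.utilization).index]

    # 3. If no single GPU has capacity, span the model across all GPUs
    #    (roughly what OLLAMA_SCHED_SPREAD does statically today).
    return [g.index for g in gpus]
```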

Questions

  1. With a single Ollama daemon and multiple clients using the same model, what is the default GPU selection/scheduling behavior? Does Ollama reuse an existing model instance on a specific GPU, and can requests be routed to another GPU without duplicating the model?
  2. Can Ollama do utilization‑aware, per‑request GPU routing and/or dynamic multi‑GPU (sharded) inference for a single model while avoiding duplicate VRAM copies?
  3. If not supported today, is this planned? Is there a recommended configuration to approximate it?
  4. For multiple models, does Ollama have a placement/scheduling strategy (per‑GPU model residency, load‑aware routing, migration/preemption)?
  5. If Ollama cannot provide this behavior, are there alternative runtimes/frameworks you recommend for multi‑GPU scheduling with minimal model duplication?

Notes / Considered approaches

  • Separate daemons per GPU (on different ports) to split the load across GPUs, at the cost of duplicating the model in VRAM. Each app would target a different port, but I don’t think this scales well (see the client‑side routing sketch after these notes).

```bash
# Session 1: pin this daemon to GPU 0, listening on port 11434
export CUDA_VISIBLE_DEVICES=0
export OLLAMA_HOST=127.0.0.1:11434
ollama serve
```

```bash
# Session 2: pin this daemon to GPU 1, listening on port 11435
export CUDA_VISIBLE_DEVICES=1
export OLLAMA_HOST=127.0.0.1:11435
ollama serve
```
  • Single daemon sharding a model across GPUs (e.g., split tensors across GPUs) so all requests use both GPUs; this improves utilization but doesn’t provide dynamic, per‑request routing. Using the variables below would unnecessarily split the weights across both GPUs (~5 GB VRAM on each).
```bash
# Spread the model's weights across both GPUs in a single daemon
export CUDA_VISIBLE_DEVICES=0,1
export OLLAMA_SCHED_SPREAD=1
ollama serve
```
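
As a possible workaround on top of the two‑daemon setup above, a thin client‑side router could send each request to whichever instance’s GPU is currently less busy. A rough, untested sketch (Python standard library only; assumes the two daemons above on ports 11434/11435 pinned to GPUs 0/1, `nvidia-smi` on PATH, and Ollama’s standard /api/generate endpoint):

```python
# Untested sketch: route each request to whichever Ollama instance sits on the
# less-utilized GPU. Assumes the two daemons above (GPU 0 -> :11434, GPU 1 -> :11435)
# and that nvidia-smi is available on PATH.
import json
import subprocess
import urllib.request

INSTANCES = {0: "http://127.0.0.1:11434", 1: "http://127.0.0.1:11435"}

def gpu_utilization() -> dict[int, int]:
    # Ask nvidia-smi for the current utilization (percent) of each GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    util = {}
    for line in out.strip().splitlines():
        idx, pct = (int(x) for x in line.split(","))
        util[idx] = pct
    return util

def generate(model: str, prompt: str) -> str:
    # Pick the instance whose GPU currently reports the lowest utilization.
    util = gpu_utilization()
    gpu = min(INSTANCES, key=lambda i: util.get(i, 0))
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{INSTANCES[gpu]}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(generate("qwen3:14b", "Hello"))
```

The obvious downside is that each daemon keeps its own copy of the weights, which is exactly the duplication I’d like to avoid.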

Environment

  • OS: Windows
  • GPUs: 2× (40+ GB VRAM each)
  • Model: qwen3:14b
  • Ollama version: 0.11.3
  • Driver/CUDA: 12.4
GiteaMirror added the feature request label 2026-04-29 05:47:19 -05:00
Author
Owner

@UMDSmith commented on GitHub (Nov 20, 2025):

This is a feature I am interested in as well. I have 2 instances of ollama running on ports 11434 and 11435. I want to bind certain models to specific instances (can be done now), but I want to bind those specific instances to a specific GPU.

In my case, I have an RTX5090 and an RTX3080ti. I want to run 11435 on the 3080ti only, and 11434 on the 5090. This way I can control which models go where, as I run a multi-stage flow that calls models at specific times to handle different functions of my application.
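
The closest I’ve found so far is pinning each instance to a device via CUDA_VISIBLE_DEVICES when it is launched. A rough, untested sketch of what I mean (Python launcher; device indices and ports are only examples, and I haven’t verified this combination on Windows with Ollama):

```python
# Untested sketch: start two Ollama daemons, each pinned to one GPU via
# CUDA_VISIBLE_DEVICES and listening on its own port. Device indices and ports
# are examples; adjust them to match nvidia-smi's ordering on your machine.
import os
import subprocess

def launch(gpu_index: int, port: int) -> subprocess.Popen:
    env = dict(os.environ,
               CUDA_VISIBLE_DEVICES=str(gpu_index),
               OLLAMA_HOST=f"127.0.0.1:{port}")
    return subprocess.Popen(["ollama", "serve"], env=env)

procs = [launch(0, 11434),   # e.g. the 5090
         launch(1, 11435)]   # e.g. the 3080ti
for p in procs:
    p.wait()
```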

Reference: github-starred/ollama#54350