[GH-ISSUE #6091] Parallel Bug: Would rather queue than reload on another GPU #50319

Closed
opened 2026-04-28 15:06:10 -05:00 by GiteaMirror · 1 comment

Originally created by @txd0213 on GitHub (Jul 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6091

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Experimental environment: 8 x A6000 GPUs
LLM: qwen2:7b
Environment variables:

Environment="OLLAMA_NUM_PARALLEL=16"
Environment="OLLAMA_MAX_LOADED_MODELS=4"

When concurrency is 4 or lower, parallel processing works as expected. Once it exceeds 4, however, Ollama queues the extra requests instead of reloading the same model on another GPU.

[Image: https://github.com/user-attachments/assets/de471e9e-1e21-4e5c-969f-8a17d0a0db19]

Although I sent 16 requests simultaneously, the graph shows that the model's actual concurrency is only 4.
[Image: https://github.com/user-attachments/assets/0e0893c0-84c5-47cf-9bdb-528c1b24cd92]
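
For anyone trying to reproduce this, here is a minimal sketch of how 16 simultaneous requests might be sent. It is not the reporter's actual test script. It assumes an Ollama server at the default local address `http://localhost:11434` and uses the `/api/generate` endpoint with streaming disabled; the helper names `generate` and `fire_concurrently` and the prompt text are made up for illustration.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(prompt, model="qwen2:7b", url=OLLAMA_URL):
    """Send one non-streaming generate request and return the response text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["response"]


def fire_concurrently(n, send=generate):
    """Submit n prompts at once, one thread per request.

    `send` is injectable so the fan-out logic can be exercised
    without a running server. Results come back in prompt order.
    """
    prompts = [f"Request {i}: say hello." for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(send, prompts))
```

Calling `fire_concurrently(16)` against a running server while watching GPU utilization (for example with `nvidia-smi`) should show whether the requests are spread across GPUs or served by a single loaded copy of the model.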

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.0

GiteaMirror added the feature request label 2026-04-28 15:06:10 -05:00

@dhiltgen commented on GitHub (Aug 1, 2024):

I think what you're asking for is the ability to load the same model multiple times on different GPUs. That's not currently supported by our concurrency implementation and is tracked via #3902, so I'll close this as a dup.

If that's not what you're asking, can you clarify and I'll reopen the issue.


Reference: github-starred/ollama#50319