[GH-ISSUE #12107] Concurrent requests take longer than sequential #70106

Open
opened 2026-05-04 20:21:57 -05:00 by GiteaMirror · 5 comments

Originally created by @Cdany2001 on GitHub (Aug 28, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12107

[OllamaDebugLogs.txt](https://github.com/user-attachments/files/22120575/OllamaDebugLogs.txt)

What is the issue?

I'm attempting to run multiple requests in parallel, but the processing time per request is higher than when running them sequentially (e.g., running 3 requests sequentially takes around 30s, in parallel around 43s). I tried several configurations and optimizations that I found in closed issues on this topic. In the end I don't see an increase in memory use or in computation when increasing OLLAMA_NUM_PARALLEL and sending multiple requests. The system has plenty of resources for this task. Is there a configuration I'm missing? Should I allocate more resources?

Model: gpt-oss:20b
Context window: 20k
Avg request size: 8k tokens
Used memory: ~14GB

Configurations:
OLLAMA_NUM_PARALLEL=2 (some tests with higher numbers)
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_FLASH_ATTENTION=1
OLLAMA_NEW_ESTIMATES=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_KEEP_ALIVE=10m
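
For reference, a minimal sketch of how these settings might be applied in a Docker deployment; the container name, volume, and port mapping are illustrative assumptions, not taken from this report:

```shell
# Hypothetical example: passing the settings above to the ollama/ollama
# Docker image. Name, volume, and port mapping are illustrative.
docker run -d --gpus=all \
  -e OLLAMA_NUM_PARALLEL=2 \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -e OLLAMA_NEW_ESTIMATES=1 \
  -e OLLAMA_KV_CACHE_TYPE=q8_0 \
  -e OLLAMA_KEEP_ALIVE=10m \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```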

Resources:
Over 100GB RAM
Over 45GB VRAM
CUDA utilization during processing: 40-55%
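
A simple way to reproduce the sequential-vs-parallel comparison against the default endpoint; the prompt and the num_predict cap are illustrative (pinning num_predict keeps the generated lengths comparable across runs):

```shell
# Sequential: three requests back to back.
time for i in 1 2 3; do
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"gpt-oss:20b","prompt":"Summarize the history of GPUs.","stream":false,"options":{"num_predict":256}}' \
    > /dev/null
done

# Parallel: the same three requests at once (needs OLLAMA_NUM_PARALLEL >= 3).
time ( for i in 1 2 3; do
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"gpt-oss:20b","prompt":"Summarize the history of GPUs.","stream":false,"options":{"num_predict":256}}' \
    > /dev/null &
done; wait )
```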

Relevant log output

(empty; debug logs were attached separately as OllamaDebugLogs.txt above)
OS

Linux, Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.11.7

GiteaMirror added the bug label 2026-05-04 20:21:57 -05:00

@rick-github commented on GitHub (Aug 28, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.

Concurrency depends on hardware. For example, gpt-oss:20b on an RTX6000:

![gpt-oss:20b concurrency on an RTX6000](https://github.com/user-attachments/assets/42d23fff-3b57-4c03-a1c0-388cf26f4590)

But on an A100:

![gpt-oss:20b concurrency on an A100](https://github.com/user-attachments/assets/016c6f8f-4edd-48d4-8048-297c08586aec)
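
Both lines on charts like these can also be measured directly from the API's own counters; a sketch, assuming the default endpoint, an illustrative prompt, and that the "Evaluation" line is the per-request rate while the "Effective" line is the combined tokens-over-wall-clock rate:

```shell
# Fire N identical requests in parallel; report each request's own eval
# rate (eval_count tokens over eval_duration, which is in nanoseconds)
# and the aggregate rate: total tokens over wall-clock seconds.
N=3
start=$(date +%s)
tmp=$(mktemp -d)
for i in $(seq "$N"); do
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"gpt-oss:20b","prompt":"Write a haiku about GPUs.","stream":false}' \
    > "$tmp/$i.json" &
done
wait
elapsed=$(( $(date +%s) - start ))
jq -r '"per-request tok/s: \(.eval_count / (.eval_duration / 1e9))"' "$tmp"/*.json
total=$(jq -s 'map(.eval_count) | add' "$tmp"/*.json)
echo "aggregate tok/s over ${elapsed}s: $(( total / elapsed ))"
```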

@Scronkfinkle commented on GitHub (Sep 1, 2025):

I've noticed something similar. I have a pair of identical cards with 24 GiB of VRAM on each (RTX 3090).

I have a set of independent large queries (about 60) that I use the model to evaluate for a reasoning task. When I run gpt-oss:20b, the model fits on a single card, and when configured with OLLAMA_NUM_PARALLEL=1 it takes about 10 minutes to process.

When I scale it up to 2, the processing jobs still finish at about the same time, which surprised me because I figured that'd make the task at least noticeably faster. ollama reports that everything still runs 100% on GPU, and it also only runs on a single card.

Now, if I turn parallelization up to 4, the system spins up another gpt-oss model on the other card, but performance actually degrades at that point. ollama ps also reports 2% CPU / 98% GPU, which I'm guessing is causing some of the issue.

It would be nice if I didn't have to stand up a second ollama instance on a separate port just to run a separate processing job on the other GPU. With two jobs working at a time I'd get true parallelism instead of concurrency, effectively halving the processing time for the evaluations I'm running.
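
A sketch of that workaround, assuming a Docker setup: one container pinned to each GPU, each mapped to its own host port (names, volumes, and ports are illustrative; CUDA_VISIBLE_DEVICES would work in place of the --gpus device filter):

```shell
# Two independent Ollama instances, one per RTX 3090. A client can then
# shard the ~60 queries across the two endpoints for true parallelism.
docker run -d --gpus device=0 -p 11434:11434 \
  -v ollama0:/root/.ollama --name ollama-gpu0 ollama/ollama
docker run -d --gpus device=1 -p 11435:11434 \
  -v ollama1:/root/.ollama --name ollama-gpu1 ollama/ollama
```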


@Cdany2001 commented on GitHub (Sep 2, 2025):

Hi @rick-github! Thanks for your reply! I'm working on a set of debug info on a new container and will update the issue with it soon (I have some sensitive information on the original container and I'm not sure what gets logged).
I have some trouble understanding the provided charts. Does the Evaluation line refer to the average TPS for one request, and the Effective line to the total TPS output by all the requests combined? I have a GPU comparable to the RTX6000, so that helps me a lot.


@Cdany2001 commented on GitHub (Sep 3, 2025):

Hi @rick-github! I've added the mentioned debug logs.


@rick-github commented on GitHub (Sep 8, 2025):

The logs show three parallel requests running in separate slots, taking about 45s to complete, and then a final request that took 15s.

From my experiments, individual requests take longer but the aggregate tokens/sec goes up (see the first graph). In this case it looks like the aggregate tokens/sec stays about the same, as the 3 parallel requests take 3 times longer than the single request. The only unusual thing I see is that your CUDA driver is version 13. Ollama hasn't been officially released with v13 support [yet](https://github.com/ollama/ollama/pull/12000), so there may be an incompatibility. Would it be possible to downgrade to a [v12](https://developer.nvidia.com/cuda-12-9-0-download-archive) driver and test to see if there's a difference?
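
The driver's CUDA level is easy to confirm with nvidia-smi (the "CUDA Version" in its header is the highest CUDA runtime the installed driver supports):

```shell
nvidia-smi | head -n 4   # header shows "CUDA Version: 13.x" on a v13 driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```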
