[GH-ISSUE #2282] Slow response with concurrent requests #1313

Closed
opened 2026-04-12 11:08:36 -05:00 by GiteaMirror · 6 comments

Originally created by @oxaronick on GitHub (Jan 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2282

Originally assigned to: @bmizerany on GitHub.

Ollama is great. It makes deploying LLMs easy. However, I have an issue with sending two requests to Ollama within a second or so of each other.

When I do this, Ollama usually responds to one of the requests fine, but CPU usage jumps by at least 100% and the other request doesn't get a response. Sometimes one does arrive after many minutes, but I don't always wait around to find out. Responses are normally returned within 2s of a request.

I'm running Ollama on an A100 with 80GB of VRAM, and according to `nvidia-smi` Ollama is only using ~7GB.

I would expect it to handle one request, then handle the other, both on the GPU, but I'm wondering if the second request is causing Ollama to try to run something on the CPU.

How can I configure Ollama to handle concurrent (or near-concurrent) requests better?
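
For reference, here's roughly how I'm triggering it - a minimal sketch that fires two generate calls at the local API almost simultaneously (the model names, prompts, and default port are just placeholders for my setup):

```python
import threading

import requests  # assumes the requests package is installed

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port

def ask(model: str, prompt: str) -> None:
    # One blocking generate call; stream=False returns a single JSON body.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    print(model, resp.status_code, len(resp.json().get("response", "")))

# Two near-simultaneous requests, e.g. the chat model and the code model.
threads = [
    threading.Thread(target=ask, args=("chat-model", "Say hello.")),
    threading.Thread(target=ask, args=("code-model", "Write hello world in Python.")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```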

GiteaMirror added the feature request label 2026-04-12 11:08:36 -05:00

@nathanpbell commented on GitHub (Jan 31, 2024):

Note: I'm just a user, not a contributor. But I've played a bit with this.

My understanding is that Ollama does not currently support concurrent requests. I believe it blocks the second request until the first request is completed. You'll need to build your own queue in front of Ollama.

llama.cpp, which Ollama uses to run model generation, does support what you are wanting to do - it's called continuous batching. And there's a feature request to support that mode in Ollama [here](https://github.com/ollama/ollama/issues/1396).

As to why it's running the second request on CPU - are you requesting the same model for each? If you are (it's not unloading one model to load the next model), then there may be a bug there.
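
For what it's worth, the "queue in front of Ollama" can be as simple as a client-side lock so only one generate call is in flight at a time. A rough sketch (the endpoint, helper name, and model handling are placeholders, not anything Ollama provides):

```python
import threading

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port
_ollama_lock = threading.Lock()  # serializes every call to Ollama

def generate(model: str, prompt: str) -> str:
    # Callers queue up on the lock, so Ollama only ever sees one
    # in-flight request at a time.
    with _ollama_lock:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=600,
        )
        resp.raise_for_status()
        return resp.json()["response"]
```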


@oxaronick commented on GitHub (Jan 31, 2024):

Thanks, @nathanpbell, that's helpful.

> As to why it's running the second request on CPU - are you requesting the same model for each? If you are (it's not unloading one model to load the next model), then there may be a bug there.

I was sending concurrent requests for different models. I'll try with just a single model.


@oxaronick commented on GitHub (Jan 31, 2024):

I haven't been able to reproduce with one model, but using a single instance of Ollama for chat and code completion causes the issue pretty reliably for me.

Is there a way to disable CPU processing? I can find docs on disabling the GPU but not the CPU. Even if one client got an error message instead of a response, that would be preferable to Ollama leaving requests hanging until it's restarted.
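
The closest thing I've seen is the per-request `num_gpu` option, which asks Ollama to offload layers to the GPU. I'm not sure it actually prevents the CPU fallback, so this is just what I'm experimenting with, not a confirmed workaround:

```python
import requests

# Assumption: asking Ollama to offload as many layers as possible to the GPU
# via the num_gpu option. Not confirmed to disable the CPU fallback.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "chat-model",        # placeholder model name
        "prompt": "Say hello.",
        "stream": False,
        "options": {"num_gpu": 999},  # request all layers on the GPU
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```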


@nathanpbell commented on GitHub (Jan 31, 2024):

It will fall back to CPU if it doesn't think you have enough VRAM. Are the models you're trying to load the same size?
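
A quick way to compare them is the tags endpoint, which lists each local model with its on-disk size in bytes - something like this (assuming the default port):

```python
import requests

# List locally available models and their on-disk sizes (reported in bytes).
tags = requests.get("http://localhost:11434/api/tags", timeout=30).json()
for m in tags.get("models", []):
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")
```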


@oxaronick commented on GitHub (Feb 1, 2024):

I have 80GB of VRAM, with over 70GB free. I'm not even sure it's trying to run on the CPU, I just see the CPU usage spike.


@bmizerany commented on GitHub (Mar 11, 2024):

@nathanpbell is correct. Ollama currently serializes prompt requests.

Closing as dup of #358
