[GH-ISSUE #1400] How to serve multiple simultaneous requests in Ollama? #26503

Closed
opened 2026-04-22 02:48:25 -05:00 by GiteaMirror · 10 comments

Originally created by @austin-starks on GitHub (Dec 6, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1400

Hello!

I want to deploy Ollama on a cloud server. The server I'm renting is big enough to handle multiple requests at the same time with the models I'm using. However, Ollama queues the requests. What specific changes do I need to make for this to be possible? And is there any way for this to be added to the Ollama repo as an additional configuration option?
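
One way to confirm the queuing behaviour described above is to fire two requests at the API concurrently and compare wall-clock times. A minimal sketch, assuming an Ollama instance on the default port 11434 and a placeholder model name (`llama2`):

```python
# Sketch: send two generation requests concurrently and time them.
# Assumes Ollama is listening on localhost:11434 and that the model
# named below has already been pulled; adjust both as needed.
import json
import threading
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama2"  # placeholder model name

def generate(prompt: str) -> None:
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    print(f"{prompt!r} finished after {time.time() - start:.1f}s")

threads = [
    threading.Thread(target=generate, args=(p,))
    for p in ("Why is the sky blue?", "Write a haiku about GPUs.")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With a single instance, the second request's latency roughly includes the
# first one's, because the requests are processed one after the other.
```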


@technovangelist commented on GitHub (Dec 6, 2023):

For now, Ollama is designed to provide a great experience on your local machine for a single user. It queues requests and moves on to the next one after the current one is complete. We intend to look at making a better server experience in the future.


@phalexo commented on GitHub (Dec 7, 2023):

If you wanted to do the work, you could probably set up a process pool that takes work off its own queue and manages multiple Ollama instances running against different GPUs. It is messy, of course.
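
One way to read this suggestion: run one Ollama instance per GPU (each on its own port) and have a fixed set of worker processes drain a shared job queue, each worker talking to its own instance. A rough sketch under those assumptions; the instance addresses, model name, and prompts are placeholders:

```python
# Sketch of a process pool that drains a shared work queue, with each
# worker bound to its own Ollama instance (one per GPU). The instance
# addresses, model name, and prompts are illustrative placeholders.
import json
import multiprocessing as mp
import urllib.request

INSTANCES = ["http://localhost:11434", "http://localhost:11435"]  # one per GPU
MODEL = "llama2"

def worker(base_url: str, jobs: "mp.Queue", results: "mp.Queue") -> None:
    while True:
        prompt = jobs.get()
        if prompt is None:          # sentinel: no more work
            break
        payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            f"{base_url}/api/generate", data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            results.put((prompt, json.load(resp).get("response", "")))

if __name__ == "__main__":
    jobs, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(url, jobs, results)) for url in INSTANCES]
    for p in procs:
        p.start()
    prompts = ["Summarise RFC 2616.", "Explain quantization.", "Name three GPUs."]
    for prompt in prompts:
        jobs.put(prompt)
    for _ in procs:                 # one sentinel per worker
        jobs.put(None)
    for _ in prompts:
        print(results.get()[0], "-> done")
    for p in procs:
        p.join()
```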


@ciliamadani commented on GitHub (Dec 19, 2023):

Hello @austin-starks, what alternatives are you considering, if I may ask? I'm in the same situation and would like to know the available options.
Thank you in advance.


@ParisNeo commented on GitHub (Jan 9, 2024):

Maybe you can create multiple instances of the server with different port numbers.
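
For reference, `ollama serve` reads its bind address from the `OLLAMA_HOST` environment variable, so launching several instances on different ports could look roughly like the sketch below. Pinning each instance to its own GPU with `CUDA_VISIBLE_DEVICES` is an assumption that only applies to NVIDIA setups; the ports and GPU indices are placeholders:

```python
# Sketch: launch two independent `ollama serve` processes on different
# ports, optionally pinning each to its own GPU. OLLAMA_HOST controls the
# bind address; CUDA_VISIBLE_DEVICES applies to NVIDIA GPUs only.
import os
import subprocess

instances = [
    {"port": 11434, "gpu": "0"},
    {"port": 11435, "gpu": "1"},
]

procs = []
for inst in instances:
    env = os.environ.copy()
    env["OLLAMA_HOST"] = f"127.0.0.1:{inst['port']}"
    env["CUDA_VISIBLE_DEVICES"] = inst["gpu"]
    procs.append(subprocess.Popen(["ollama", "serve"], env=env))

# Each instance now answers on its own port, e.g.
#   http://127.0.0.1:11434 and http://127.0.0.1:11435
# and a reverse proxy or client-side dispatcher can spread requests across them.
for p in procs:
    p.wait()
```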


@ParisNeo commented on GitHub (Jan 9, 2024):

> For now, Ollama is designed to provide a great experience on your local machine for a single user. It queues requests and moves on to the next one after the current one is complete. We intend to look at making a better server experience in the future.

It is fine to have a single generation with a queue. But can you add an endpoint to ask for the queue status?

I want to launch multiple instances of the server to solve this multi-user problem. But I need to put a dispatcher in front of them, and I need to be able to know the state of each queue to dispatch the connections.
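
Since there is no queue-status endpoint (that is what is being requested here), a dispatcher can instead track in-flight requests per backend on the client side and route each new request to the least-loaded instance. A minimal sketch, assuming two instances like those above and placeholder names throughout:

```python
# Sketch of a least-busy dispatcher sitting in front of several Ollama
# instances. It counts in-flight requests per backend itself, because
# Ollama exposes no queue-status endpoint; backends and model are placeholders.
import json
import threading
import urllib.request

BACKENDS = ["http://127.0.0.1:11434", "http://127.0.0.1:11435"]
MODEL = "llama2"

_lock = threading.Lock()
_in_flight = {url: 0 for url in BACKENDS}

def dispatch(prompt: str) -> str:
    with _lock:                                  # pick the least-loaded backend
        backend = min(BACKENDS, key=_in_flight.get)
        _in_flight[backend] += 1
    try:
        payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            f"{backend}/api/generate", data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp).get("response", "")
    finally:
        with _lock:
            _in_flight[backend] -= 1

# Usage: call dispatch() from as many threads as you have concurrent users;
# each request goes to whichever instance currently has the fewest in flight.
```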


@m0wer commented on GitHub (Jan 17, 2024):

Sounds like something https://github.com/vllm-project/vllm has sorted out (queuing + configurable number of workers).


@ParisNeo commented on GitHub (Jan 18, 2024):

Except vLLM doesn't know how to run GGUF models and is very hungry in terms of memory consumption.


@m0wer commented on GitHub (Jan 18, 2024):

> Except vLLM doesn't know how to run GGUF models and is very hungry in terms of memory consumption.

Agreed. Would be great to have parallelism in Ollama instead.


@easp commented on GitHub (Jan 18, 2024):

Depending on what level of concurrency you are after, quantization can be counterproductive. Using quantized models requires more compute, which ends up becoming the bottleneck at higher levels of concurrency.


@pdevine commented on GitHub (Jan 26, 2024):

Going to close this as a dupe of #358
