[GH-ISSUE #2004] Any plans to add a queue status endpoint? #47670

Open
opened 2026-04-28 04:49:47 -05:00 by GiteaMirror · 6 comments

Originally created by @ParisNeo on GitHub (Jan 15, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2004

Originally assigned to: @jessegross on GitHub.

Hi.

Thank you for this cool server. I am developing an open source AI tool that is compatible with multiple services/models, and Ollama is one of them, except that I need to use it in a multi-client setting.

To do that, I run multiple servers (for example, the ollama service) and want to use the queue status to decide which server to route each request to.

Is there a way to get an endpoint to show how many requests are in the queue when dealing with multiple connections?

I need this to share the load between multiple servers. My client needs to ask each server the status of its queue in order to know which server can handle the load.

For example, if I have three servers, and the first one has two requests in the queue, the second has one request, and the last has zero, then I'll take the third one. The idea is that the client seeks the server with the fewest requests in the queue, allowing me to simultaneously serve multiple lollms clients.

This could be really helpful.
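To make the idea concrete, here is a minimal sketch of the routing logic described above. It assumes a hypothetical `GET /api/queue` endpoint returning `{"queued": <int>}`; no such endpoint exists in Ollama today, which is exactly what this issue is requesting. The server addresses are placeholders.

```python
# Sketch only: /api/queue is the *requested* endpoint, not a real Ollama API.
import requests

SERVERS = [
    "http://10.0.0.1:11434",  # placeholder addresses
    "http://10.0.0.2:11434",
    "http://10.0.0.3:11434",
]

def queue_depth(server: str) -> float:
    """Return the number of queued requests reported by a server."""
    try:
        resp = requests.get(f"{server}/api/queue", timeout=2)
        resp.raise_for_status()
        return resp.json()["queued"]
    except requests.RequestException:
        # Treat unreachable servers as infinitely busy so they are never chosen.
        return float("inf")

def pick_server() -> str:
    """Route to the server with the fewest queued requests."""
    return min(SERVERS, key=queue_depth)
```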

Also, if you can add lollms to the list of frontends that can use the ollama server, it would be cool: [LoLLMS](https://github.com/ParisNeo/lollms-webui).

Thanks

GiteaMirror added the feature request and api labels 2026-04-28 04:49:48 -05:00

@pdevine commented on GitHub (Jan 15, 2024):

There isn't a way to tell that right now, unfortunately. The server will just block each of the connections while one is being serviced, and then each of those connections will race to try to be serviced next. It's not ideal. We'll definitely be looking at improving this in the future.
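Given that behavior, one client-side approximation is to count your own in-flight requests per server and route to the least busy one. This only sees requests issued through this client, not ones from other clients, so it is a rough heuristic rather than a true queue depth. A minimal sketch:

```python
# Approximate per-server load by tracking our own in-flight requests,
# since the server does not expose its queue.
import threading
from collections import defaultdict

class InFlightCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def acquire(self, server: str) -> None:
        """Call just before sending a request to `server`."""
        with self._lock:
            self._counts[server] += 1

    def release(self, server: str) -> None:
        """Call once the response from `server` has completed."""
        with self._lock:
            self._counts[server] -= 1

    def least_busy(self, servers) -> str:
        """Pick the server with the fewest requests we currently have in flight."""
        with self._lock:
            return min(servers, key=lambda s: self._counts[s])
```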


@ParisNeo commented on GitHub (Jan 15, 2024):

I guess I have to handle this on my end then. I'll add a proxy that counts the connections and routes them to multiple servers.
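For illustration, a minimal sketch of that counting-proxy idea: forward each request to the backend with the fewest active connections. Flask and requests are used here for brevity, and the backend addresses are placeholders; a production proxy would also need response streaming, header forwarding, and proper timeouts.

```python
# Minimal counting reverse proxy sketch (not production-ready).
from flask import Flask, request, Response
import requests, threading
from collections import defaultdict

app = Flask(__name__)
BACKENDS = ["http://10.0.0.1:11434", "http://10.0.0.2:11434"]  # placeholders
active = defaultdict(int)
lock = threading.Lock()

@app.route("/<path:path>", methods=["GET", "POST"])
def proxy(path):
    # Choose the backend with the fewest connections we are tracking.
    with lock:
        backend = min(BACKENDS, key=lambda b: active[b])
        active[backend] += 1
    try:
        upstream = requests.request(
            request.method,
            f"{backend}/{path}",
            data=request.get_data(),
            headers={"Content-Type": request.content_type or "application/json"},
            timeout=600,
        )
        return Response(upstream.content, status=upstream.status_code)
    finally:
        with lock:
            active[backend] -= 1
```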


@ParisNeo commented on GitHub (Jan 16, 2024):

OK, it is done. I have created a separate repository for it. It also handles permissions and user authentication using a key (just like the OpenAI API):
https://github.com/ParisNeo/ollama_proxy_server
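As a purely hypothetical sketch of how a key-based scheme like the one described might look from the client side (the exact header format and proxy address are assumptions; check the ollama_proxy_server README for the real scheme):

```python
# Hypothetical client usage; header format and URL are assumptions.
import requests

resp = requests.post(
    "http://proxy.example.com:8000/api/generate",       # assumed proxy address
    headers={"Authorization": "Bearer MY_SECRET_KEY"},  # assumed key format
    json={"model": "llama2", "prompt": "Hello", "stream": False},
)
print(resp.json())
```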


@remy415 commented on GitHub (Jan 31, 2024):

@ParisNeo You could also run it behind a load balancer in Kubernetes. It's fairly easy to configure an nginx proxy to connect even to bare-metal hosts, and it can be configured with SSL passthrough or SSL termination. A Kubernetes cluster will also let you integrate an OAuth solution to manage connections.
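A minimal sketch of what the nginx side of such a setup might look like, assuming two bare-metal Ollama hosts and SSL termination at the proxy (hostnames and certificate paths are placeholders). `least_conn` picks the backend with the fewest active connections, which approximates the least-queue routing discussed above:

```nginx
upstream ollama_backends {
    least_conn;                 # route to the least-busy backend
    server 10.0.0.1:11434;      # placeholder bare-metal hosts
    server 10.0.0.2:11434;
}

server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/certs/ollama.crt;  # assumed cert paths
    ssl_certificate_key /etc/nginx/certs/ollama.key;

    location / {
        proxy_pass http://ollama_backends;
        proxy_read_timeout 600s;  # long-running generations
        proxy_buffering off;      # needed for streamed responses
    }
}
```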


@uzumakinaruto19 commented on GitHub (Jun 24, 2024):

Is there any update on this? @dhiltgen


@Thf772 commented on GitHub (Mar 14, 2026):

I'd like to see this feature added. My use case is more for debugging: I'm experimenting with some AI apps that generate multiple parallel queries, and I'd like to see which queries are still pending, to get an idea of how much processing is left for each experiment.
