[GH-ISSUE #788] feat: multiple Ollama servers in one webui + load balancing #12215

Closed
opened 2026-04-19 19:05:32 -05:00 by GiteaMirror · 20 comments

Originally created by @nick-tonjum on GitHub (Feb 18, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/788

Originally assigned to: @tjbck on GitHub.

The #1 downside to Ollama is the fact that it can only process one request at a time, no matter what hardware is available. Right now my solution is to have multiple Ollama instances running, so when one is in use on one graphics card I can use the other instance on the other card.

It would be nice to see open-webui allow multiple Ollama server connections, as I think an environment with multiple users using it simultaneously could really benefit from this. I myself have one machine with two Ollama instances (one per graphics card), and a whole different machine off-site that also has an Ollama instance. Utilizing all three of them would be nice for multiple concurrent users.
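
For context, here is a minimal sketch of the multi-instance setup described above (one `ollama serve` per GPU on the same machine), assuming NVIDIA GPUs and `ollama` on the PATH; the port numbers and device indices are only examples:

```python
import os
import subprocess

# One `ollama serve` per GPU, each bound to its own port.
# OLLAMA_HOST sets the bind address/port; CUDA_VISIBLE_DEVICES pins the GPU.
instances = [
    {"port": 11434, "gpu": "0"},
    {"port": 11435, "gpu": "1"},
]

procs = []
for inst in instances:
    env = os.environ.copy()
    env["OLLAMA_HOST"] = f"0.0.0.0:{inst['port']}"
    env["CUDA_VISIBLE_DEVICES"] = inst["gpu"]
    procs.append(subprocess.Popen(["ollama", "serve"], env=env))

# Each instance is then reachable at http://<host>:<port> as a separate connection.
for p in procs:
    p.wait()
```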


@justinh-rahb commented on GitHub (Feb 19, 2024):

This might interest you, somebody made a proxy load-balancer specifically for Ollama: https://github.com/ParisNeo/ollama_proxy_server

That said, I think our present direction will eventually bring us here, we're planning to support much more than one backend. If we can do internal load-balancing that'd be neat too.
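
For anyone who wants to experiment with the proxy idea in the meantime, here is a minimal round-robin sketch. This is not how Open WebUI routes internally; the endpoint URLs and model name are placeholders:

```python
import itertools

import requests

# Placeholder endpoints; replace with your own Ollama servers.
ENDPOINTS = ["http://gpu-box-1:11434", "http://gpu-box-2:11434"]
_cycle = itertools.cycle(ENDPOINTS)

def generate(model: str, prompt: str) -> str:
    """Send each request to the next endpoint in round-robin order."""
    base = next(_cycle)
    resp = requests.post(
        f"{base}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("llama3:8b", "Why is the sky blue?"))
```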


@UberMetroid commented on GitHub (Feb 27, 2024):

> This might interest you, somebody made a proxy load-balancer specifically for Ollama: https://github.com/ParisNeo/ollama_proxy_server
>
> That said, I think our present direction will eventually bring us here, we're planning to support much more than one backend. If we can do internal load-balancing that'd be neat too.

When you get to that point I have systems that can test.


@BigBIueWhale commented on GitHub (Oct 22, 2024):

I created a solution: https://github.com/BigBIueWhale/ollama_load_balancer
It's a Rust utility that load-balances multiple Ollama servers.


@QuifixOfficial commented on GitHub (Nov 25, 2024):

OpenWebUI does support load balancing on its own. You do not need to use any third party plugins as of now!


@stanthewizzard commented on GitHub (Jan 17, 2025):

> OpenWebUI does support load balancing on its own. You do not need to use any third party plugins as of now!

Sorry, could you explain how?
Thanks


@xuyangbocn commented on GitHub (Feb 1, 2025):

If using AWS, below is what I did:
https://github.com/xuyangbocn/terraform-aws-self-host-llm
https://youtu.be/hRJEREemyos


@stanthewizzard commented on GitHub (Feb 1, 2025):

Self-hosted and not on AWS :(


@xuyangbocn commented on GitHub (Feb 1, 2025):

@stanthewizzard
Under Open WebUI Admin settings >> Settings >> Connections >> Ollama API, you can specify multiple endpoints, one for each Ollama deployment.

![Image](https://github.com/user-attachments/assets/fe8b7c2f-9c4c-45fb-90df-d2e01a5a3ef0)
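
If you want to confirm which of the configured endpoints actually serve a given model, here is a small sketch that queries Ollama's `/api/tags` on each one (the endpoint URLs and model name are placeholders). As far as I know, the same list of URLs can also be supplied to Open WebUI via the `OLLAMA_BASE_URLS` environment variable, separated by semicolons:

```python
import requests

# Placeholder list; use the URLs you configured under Connections.
ENDPOINTS = ["http://host1:11434", "http://host2:11434"]

def endpoints_with_model(model: str) -> list[str]:
    """Return the endpoints whose /api/tags listing includes `model`."""
    hits = []
    for base in ENDPOINTS:
        try:
            tags = requests.get(f"{base}/api/tags", timeout=5).json()
        except requests.RequestException:
            continue  # endpoint unreachable, skip it
        names = {m["name"] for m in tags.get("models", [])}
        if model in names:
            hits.append(base)
    return hits

print(endpoints_with_model("llama3:8b"))
```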

@stanthewizzard commented on GitHub (Feb 1, 2025):

Thanks for the advice 😇
I already have that, but it doesn't load balance?


@xuyangbocn commented on GitHub (Feb 1, 2025):

Really? I haven't tried it myself, but seemingly this allows different models to be directed to different endpoints.


@stanthewizzard commented on GitHub (Feb 1, 2025):

Same model on 2 computers and only one is used. Always the same one, btw.


@mateuszdrab commented on GitHub (Feb 3, 2025):

I think Open WebUI needs a smarter endpoint selection algorithm that can consider allowed models, loaded models, preferred instances, etc.
Ollama now supports parallelism and queuing, and deciding where to send a request needs to take into account which models are loaded, the state of the queue, etc.
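
As an illustration of that kind of selection (not what Open WebUI does today, and with placeholder endpoint URLs), here is a sketch that prefers a server which already has the requested model loaded, using Ollama's `/api/ps`:

```python
import random

import requests

ENDPOINTS = ["http://host1:11434", "http://host2:11434"]  # placeholders

def pick_endpoint(model: str) -> str:
    """Prefer a server that already has `model` loaded; otherwise pick at random."""
    warm = []
    for base in ENDPOINTS:
        try:
            running = requests.get(f"{base}/api/ps", timeout=5).json()
        except requests.RequestException:
            continue  # skip unreachable servers
        if model in {m["name"] for m in running.get("models", [])}:
            warm.append(base)
    return random.choice(warm or ENDPOINTS)

print(pick_endpoint("llama3:8b"))
```

Queue depth isn't exposed by the Ollama API, so that part of the decision would need external bookkeeping.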


@stanthewizzard commented on GitHub (Feb 3, 2025):

exactly what I'm looking for


@filviu commented on GitHub (Feb 6, 2025):

I'm adding a silly question - if I have multiple ollama connections configured, is there a way to know which one I'm using? E.g. if host1 and host2 both have "modelX", can I select on which one the query will be executed?


@gonzalu commented on GitHub (Mar 4, 2025):

> I'm adding a silly question - if I have multiple ollama connections configured, is there a way to know which one I'm using? E.g. if host1 and host2 both have "modelX", can I select on which one the query will be executed?

I have the exact same question :D Seems like it should be as easy as picking a server but I am puzzled.

I have Ollama running on a Jetson Orin Nano 8GB Dev Plus and also on an AMD 7940HS 32GB RAM CPU system. I have both configured in my OpenWebUI (see below), but for the life of me, I can't figure out how to pick one over the other.

![Image](https://github.com/user-attachments/assets/3ab6f10e-22a7-4fea-b215-20fcd4040e41)

Probably easy and obvious but I am a complete n00b :P

![Image](https://github.com/user-attachments/assets/a4ffc7c4-ef6f-47e3-8826-755beb085638)

Thanks.


@d-shehu commented on GitHub (Mar 12, 2025):

What is the algorithm for picking the server if there are multiple endpoints/servers? Does it go down the list as configured in "Connections" and pick the first working connection?

Fallback logic would be nice in lieu of something as complicated as a gateway. Thanks.
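
I don't know the exact selection logic off-hand, but as a sketch of the fallback behaviour you describe, one could try the endpoints in configured order and use the first that answers (endpoint URLs are placeholders):

```python
import requests

ENDPOINTS = ["http://primary:11434", "http://backup:11434"]  # placeholders

def generate_with_fallback(model: str, prompt: str) -> str:
    """Try each endpoint in configured order; return the first successful response."""
    last_err = None
    for base in ENDPOINTS:
        try:
            resp = requests.post(
                f"{base}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=300,
            )
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException as err:
            last_err = err  # fall through to the next endpoint
    raise RuntimeError(f"All endpoints failed; last error: {last_err}")
```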


@gonzalu commented on GitHub (Mar 17, 2025):

I found that you can prefix each server with a tag and then when you HOVER OVER the models in the selection dropdown, the tooltip will show you the server it is from. Good enough for me until a more elegant way is available.

![Image](https://github.com/user-attachments/assets/14a92ec2-0dc2-4e16-9ccb-11e6d0226f57)

![Image](https://github.com/user-attachments/assets/fc754429-0a9f-4f29-b53f-0acf2bc1c48b)


@WyattLiu commented on GitHub (Apr 19, 2025):

Seconding this. Right now I have 2 servers with the same model, and what Open WebUI currently offers is failover: one backs up the other. If we could somehow use both, with a small queue, that would be good.


@GTez commented on GitHub (Aug 5, 2025):

I'd love this. I'm using HAProxy right now to proxy the Ollama calls in order to load balance, but I would much rather it be smart: understand what's loaded, then farm out the requests properly.


@apunkt commented on GitHub (Sep 1, 2025):

I used HAProxy until now, which is great, but it is not aware of which model is loaded on which Ollama server, so results were mixed.

I am having success now with [NOMYO Router](https://github.com/nomyo-ai/nomyo-router).


Reference: github-starred/open-webui#12215