[GH-ISSUE #788] feat: multiple Ollama servers in one webui + load balancing #12215
Originally created by @nick-tonjum on GitHub (Feb 18, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/788
Originally assigned to: @tjbck on GitHub.
The #1 downside to Ollama is the fact that it can only process one request at a time, no matter what hardware is available. Right now my solution is to have multiple Ollama instances running, so when one is in use on one graphics card I can use the other instance on the other card.
It would be nice to see open-webui allow multiple Ollama server connections, as I think an environment with multiple users using it simultaneously could really benefit from this. I myself have one machine with two Ollama instances (one per graphics card), and a whole different machine off-site that also has an Ollama instance. Utilizing all three of them for multiple concurrent users would be nice.
@justinh-rahb commented on GitHub (Feb 19, 2024):
This might interest you, somebody made a proxy load-balancer specifically for Ollama: https://github.com/ParisNeo/ollama_proxy_server
That said, I think our present direction will eventually bring us here; we're planning to support much more than one backend. If we can do internal load balancing, that'd be neat too.
@UberMetroid commented on GitHub (Feb 27, 2024):
When you get to that point, I have systems that can test.
@BigBIueWhale commented on GitHub (Oct 22, 2024):
I created a solution: https://github.com/BigBIueWhale/ollama_load_balancer
It's a Rust utility that load-balances multiple Ollama servers.
@QuifixOfficial commented on GitHub (Nov 25, 2024):
Open WebUI does support load balancing on its own. You do not need to use any third-party plugins as of now!
@stanthewizzard commented on GitHub (Jan 17, 2025):
Sorry, could you explain how?
Thanks
@xuyangbocn commented on GitHub (Feb 1, 2025):
If using AWS, below is what I did:
https://github.com/xuyangbocn/terraform-aws-self-host-llm
https://youtu.be/hRJEREemyos
@stanthewizzard commented on GitHub (Feb 1, 2025):
Self-hosted and not on AWS :(
@xuyangbocn commented on GitHub (Feb 1, 2025):
@stanthewizzard
Under Open WebUI Admin Settings >> Settings >> Connections >> Ollama API, you can specify multiple endpoints, one for each Ollama deployment.
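For reference, the same endpoint list can reportedly also be supplied at startup via the OLLAMA_BASE_URLS environment variable, semicolon-separated. A minimal Docker sketch (hostnames are placeholders; exact behavior may vary by Open WebUI version):

```
docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URLS="http://gpu-box-1:11434;http://gpu-box-2:11434" \
  ghcr.io/open-webui/open-webui:main
```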
@stanthewizzard commented on GitHub (Feb 1, 2025):
Thanks for the advice 😇
I already have that, but it doesn't load balance?
@xuyangbocn commented on GitHub (Feb 1, 2025):
Really? I haven't tried it on my own, but seemingly this allows different models to be directed to different endpoints.
@stanthewizzard commented on GitHub (Feb 1, 2025):
Same model on 2 computers, and only one is used. Always the same one, btw.
@mateuszdrab commented on GitHub (Feb 3, 2025):
I think Open WebUI needs a smarter endpoint selection algorithm, one that can consider allowed models, loaded models, preferred instances, and so on.
Ollama now supports parallelism and queuing, so deciding where to send a request needs to take into account which models are loaded, the state of the queue, and so on.
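Not an official design, just a minimal Python sketch of what a model-aware picker could look like, using Ollama's /api/ps endpoint (lists currently loaded models); the hostnames are placeholders:

```python
import requests

# Candidate Ollama servers (placeholder hostnames).
ENDPOINTS = ["http://gpu-box-1:11434", "http://gpu-box-2:11434"]

def loaded_models(endpoint: str) -> set[str]:
    """Ask one server which models it currently has in memory via /api/ps."""
    try:
        r = requests.get(f"{endpoint}/api/ps", timeout=2)
        r.raise_for_status()
        return {m["name"] for m in r.json().get("models", [])}
    except requests.RequestException:
        return set()  # unreachable server: treat as having nothing loaded

def pick_endpoint(model: str) -> str:
    """Prefer a server that already has the model loaded; else fall back to the first."""
    warm = [ep for ep in ENDPOINTS if model in loaded_models(ep)]
    return (warm or ENDPOINTS)[0]
```

A real implementation would also have to weigh queue depth and per-endpoint model allowlists, which /api/ps alone doesn't expose.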
@stanthewizzard commented on GitHub (Feb 3, 2025):
exactly what I'm looking for
@filviu commented on GitHub (Feb 6, 2025):
I'm adding a silly question: if I have multiple Ollama connections configured, is there a way to know which one I'm using? E.g. if host1 and host2 both have "modelX", can I select which one the query will be executed on?
@gonzalu commented on GitHub (Mar 4, 2025):
I have the exact same question :D It seems like it should be as easy as picking a server, but I am puzzled.
I have Ollama running on a Jetson Orin Nano 8GB Dev Plus and also on an AMD 7940HS system with 32GB RAM. I have both configured in my Open WebUI (see below), but for the life of me I can't figure out how to pick one over the other.
Probably easy and obvious, but I am a complete n00b :P
Thanks.
@d-shehu commented on GitHub (Mar 12, 2025):
What is the algorithm for picking the endpoint if a model is available on multiple endpoints/servers? Does it go down the list as configured in "Connections" and pick the first working connection?
Fallback logic would be nice, in lieu of something as complicated as a gateway. Thanks.
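For what it's worth, the "go down the list and pick the first working connection" behavior is simple to sketch outside Open WebUI; whether Open WebUI does exactly this internally is the open question here. Placeholder URLs; /api/version is a cheap Ollama liveness check:

```python
import requests

ENDPOINTS = ["http://primary:11434", "http://backup:11434"]

def first_working_endpoint() -> str:
    """Walk the configured list in order and return the first server that responds."""
    for ep in ENDPOINTS:
        try:
            requests.get(f"{ep}/api/version", timeout=2).raise_for_status()
            return ep
        except requests.RequestException:
            continue  # dead or unreachable: try the next one
    raise RuntimeError("no Ollama endpoint reachable")
```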
@gonzalu commented on GitHub (Mar 17, 2025):
I found that you can prefix each server with a tag, and then when you hover over a model in the selection dropdown, the tooltip will show you which server it is from. Good enough for me until a more elegant way is available.
@WyattLiu commented on GitHub (Apr 19, 2025):
Seconding this. Right now I have 2 servers with the same model, and what Open WebUI currently offers is a failsafe: one backs up the other... If we could somehow use both and have a little queue, that would be good.
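The "use both and have a little queue" idea could look roughly like this as an external dispatcher; this is a sketch only, and the endpoint URLs and model name are made up:

```python
import queue
import requests

# Pool of servers that all host the same model (placeholder hostnames).
free = queue.Queue()
for ep in ["http://gpu-box-1:11434", "http://gpu-box-2:11434"]:
    free.put(ep)

def generate(prompt: str) -> str:
    ep = free.get()  # blocks while both servers are busy: the "little queue"
    try:
        r = requests.post(
            f"{ep}/api/generate",
            json={"model": "llama3", "prompt": prompt, "stream": False},
            timeout=300,
        )
        r.raise_for_status()
        return r.json()["response"]
    finally:
        free.put(ep)  # hand the server back to the pool

# Calling generate() from multiple threads spreads concurrent requests
# across both machines instead of always hitting the first one.
```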
@GTez commented on GitHub (Aug 5, 2025):
I'd love this. I'm using HAProxy right now to proxy the Ollama calls in order to load balance, but I would much rather it be smart: understand what's loaded, then farm out the requests properly.
@apunkt commented on GitHub (Sep 1, 2025):
I used HAProxy until now, which is great, but it is not aware of which model is loaded on which Ollama server, so I was having mixed results.
I am having success now with NOMYO Router.