[GH-ISSUE #788] feat: multiple Ollama servers in one webui + load balancing #12215

Closed
opened 2026-04-19 19:05:32 -05:00 by GiteaMirror · 20 comments

Originally created by @nick-tonjum on GitHub (Feb 18, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/788

Originally assigned to: @tjbck on GitHub.

The #1 downside to Ollama is the fact that it can only process one request at a time, no matter what hardware is available. Right now my solution is to have multiple Ollama instances running, so when one is in use on one graphics card I can use the other instance on the other card.

It would be nice to see open-webui allow multiple Ollama server connections, as I think an environment with multiple users using it simultaneously could really benefit from this. I myself have one machine with two Ollama instances (one per graphics card), and a whole different machine off-site that also has an Ollama instance. Utilizing all three of them would be nice for multiple concurrent users.
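
For context, here is a minimal sketch of the multi-instance setup described above (one `ollama serve` per GPU on the same machine), assuming NVIDIA GPUs and `ollama` on the PATH; the port numbers and device indices are only examples:

```python
import os
import subprocess

# One `ollama serve` per GPU, each bound to its own port.
# OLLAMA_HOST sets the bind address/port; CUDA_VISIBLE_DEVICES pins the GPU.
instances = [
    {"port": 11434, "gpu": "0"},
    {"port": 11435, "gpu": "1"},
]

procs = []
for inst in instances:
    env = os.environ.copy()
    env["OLLAMA_HOST"] = f"0.0.0.0:{inst['port']}"
    env["CUDA_VISIBLE_DEVICES"] = inst["gpu"]
    procs.append(subprocess.Popen(["ollama", "serve"], env=env))

# Each instance is then reachable at http://<host>:<port> as a separate connection.
for p in procs:
    p.wait()
```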


@justinh-rahb commented on GitHub (Feb 19, 2024):

This might interest you, somebody made a proxy load-balancer specifically for Ollama: https://github.com/ParisNeo/ollama_proxy_server

That said, I think our present direction will eventually bring us here, we're planning to support much more than one backend. If we can do internal load-balancing that'd be neat too.
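
For anyone who wants to experiment with the proxy idea in the meantime, here is a minimal round-robin sketch. This is not how Open WebUI routes internally; the endpoint URLs and model name are placeholders:

```python
import itertools

import requests

# Placeholder endpoints; replace with your own Ollama servers.
ENDPOINTS = ["http://gpu-box-1:11434", "http://gpu-box-2:11434"]
_cycle = itertools.cycle(ENDPOINTS)

def generate(model: str, prompt: str) -> str:
    """Send each request to the next endpoint in round-robin order."""
    base = next(_cycle)
    resp = requests.post(
        f"{base}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("llama3:8b", "Why is the sky blue?"))
```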


@UberMetroid commented on GitHub (Feb 27, 2024):

> This might interest you, somebody made a proxy load-balancer specifically for Ollama: https://github.com/ParisNeo/ollama_proxy_server
>
> That said, I think our present direction will eventually bring us here, we're planning to support much more than one backend. If we can do internal load-balancing that'd be neat too.

When you get to that point I have systems that can test.


@BigBIueWhale commented on GitHub (Oct 22, 2024):

I created a solution: https://github.com/BigBIueWhale/ollama_load_balancer
It's a Rust utility that load-balances multiple Ollama servers.


@QuifixOfficial commented on GitHub (Nov 25, 2024):

OpenWebUI does support load balancing on its own. You do not need to use any third party plugins as of now!


@stanthewizzard commented on GitHub (Jan 17, 2025):

> OpenWebUI does support load balancing on its own. You do not need to use any third party plugins as of now!

Sorry, could you explain how?
Thanks


@xuyangbocn commented on GitHub (Feb 1, 2025):

If using AWS, below is what I did:
https://github.com/xuyangbocn/terraform-aws-self-host-llm
https://youtu.be/hRJEREemyos


@stanthewizzard commented on GitHub (Feb 1, 2025):

Self-hosted and not on AWS :(


@xuyangbocn commented on GitHub (Feb 1, 2025):

@stanthewizzard
Under Open WebUI Admin settings >> Settings >> Connections >> Ollama API, you can specify multiple endpoints, one for each Ollama deployment.

![Image](https://github.com/user-attachments/assets/fe8b7c2f-9c4c-45fb-90df-d2e01a5a3ef0)
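
If you want to confirm which of the configured endpoints actually serve a given model, here is a small sketch that queries Ollama's `/api/tags` on each one (the endpoint URLs and model name are placeholders). As far as I know, the same list of URLs can also be supplied to Open WebUI via the `OLLAMA_BASE_URLS` environment variable, separated by semicolons:

```python
import requests

# Placeholder list; use the URLs you configured under Connections.
ENDPOINTS = ["http://host1:11434", "http://host2:11434"]

def endpoints_with_model(model: str) -> list[str]:
    """Return the endpoints whose /api/tags listing includes `model`."""
    hits = []
    for base in ENDPOINTS:
        try:
            tags = requests.get(f"{base}/api/tags", timeout=5).json()
        except requests.RequestException:
            continue  # endpoint unreachable, skip it
        names = {m["name"] for m in tags.get("models", [])}
        if model in names:
            hits.append(base)
    return hits

print(endpoints_with_model("llama3:8b"))
```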

@stanthewizzard commented on GitHub (Feb 1, 2025):

Thanks for the advice 😇
I already have that, but it doesn't load balance?


@xuyangbocn commented on GitHub (Feb 1, 2025):

Really? I haven't tried it myself, but seemingly this allows different models to be directed to different endpoints.


@stanthewizzard commented on GitHub (Feb 1, 2025):

Same model on 2 computers and only one is used. Always the same one, btw.


@mateuszdrab commented on GitHub (Feb 3, 2025):

I think Open WebUI needs a smarter endpoint selection algorithm that can consider allowed models, loaded models, preferred instances, etc.
Ollama now supports parallelism and queuing, and deciding where to send a request needs to take into account which models are loaded, the state of the queue, etc.
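
As an illustration of that kind of selection (not what Open WebUI does today, and with placeholder endpoint URLs), here is a sketch that prefers a server which already has the requested model loaded, using Ollama's `/api/ps`:

```python
import random

import requests

ENDPOINTS = ["http://host1:11434", "http://host2:11434"]  # placeholders

def pick_endpoint(model: str) -> str:
    """Prefer a server that already has `model` loaded; otherwise pick at random."""
    warm = []
    for base in ENDPOINTS:
        try:
            running = requests.get(f"{base}/api/ps", timeout=5).json()
        except requests.RequestException:
            continue  # skip unreachable servers
        if model in {m["name"] for m in running.get("models", [])}:
            warm.append(base)
    return random.choice(warm or ENDPOINTS)

print(pick_endpoint("llama3:8b"))
```

Queue depth isn't exposed by the Ollama API, so that part of the decision would need external bookkeeping.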


@stanthewizzard commented on GitHub (Feb 3, 2025):

exactly what I'm looking for


@filviu commented on GitHub (Feb 6, 2025):

I'm adding a silly question - if I have multiple ollama connections configured, is there a way to know which one I'm using? E.g. if host1 and host2 both have "modelX", can I select on which one the query will be executed?


@gonzalu commented on GitHub (Mar 4, 2025):

> I'm adding a silly question - if I have multiple ollama connections configured, is there a way to know which one I'm using? E.g. if host1 and host2 both have "modelX", can I select on which one the query will be executed?

I have the exact same question :D Seems like it should be as easy as picking a server but I am puzzled.

I have Ollama running on a Jetson Orin Nano 8GB Dev Plus and also on an AMD 7940HS 32GB RAM CPU system. I have both configured in my OpenWebUI (see below), but for the life of me, I can't figure out how to pick one over the other.

![Image](https://github.com/user-attachments/assets/3ab6f10e-22a7-4fea-b215-20fcd4040e41)

Probably easy and obvious but I am a complete n00b :P

![Image](https://github.com/user-attachments/assets/a4ffc7c4-ef6f-47e3-8826-755beb085638)

Thanks.


@d-shehu commented on GitHub (Mar 12, 2025):

What is the algorithm for picking the server if there are multiple endpoints/servers? Does it go down the list as configured in "Connections" and pick the first working connection?

Fallback logic would be nice in lieu of something as complicated as a gateway. Thanks.
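
I don't know the exact selection logic off-hand, but as a sketch of the fallback behaviour you describe, one could try the endpoints in configured order and use the first that answers (endpoint URLs are placeholders):

```python
import requests

ENDPOINTS = ["http://primary:11434", "http://backup:11434"]  # placeholders

def generate_with_fallback(model: str, prompt: str) -> str:
    """Try each endpoint in configured order; return the first successful response."""
    last_err = None
    for base in ENDPOINTS:
        try:
            resp = requests.post(
                f"{base}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=300,
            )
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException as err:
            last_err = err  # fall through to the next endpoint
    raise RuntimeError(f"All endpoints failed; last error: {last_err}")
```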


@gonzalu commented on GitHub (Mar 17, 2025):

I found that you can prefix each server with a tag and then when you HOVER OVER the models in the selection dropdown, the tooltip will show you the server it is from. Good enough for me until a more elegant way is available.

![Image](https://github.com/user-attachments/assets/14a92ec2-0dc2-4e16-9ccb-11e6d0226f57)

![Image](https://github.com/user-attachments/assets/fc754429-0a9f-4f29-b53f-0acf2bc1c48b)


@WyattLiu commented on GitHub (Apr 19, 2025):

Seconding this. Right now I have 2 servers with the same model, and what Open WebUI currently offers is failover: one backs up the other. If we could somehow use both, with a small queue, that would be good.


@GTez commented on GitHub (Aug 5, 2025):

I'd love this. I'm using HAProxy right now to proxy the Ollama calls in order to load balance, but I would much rather it be smart: understand what's loaded, then farm out the requests properly.


@apunkt commented on GitHub (Sep 1, 2025):

I used HAProxy until now, which is great, but it is not aware of which model is loaded on which Ollama server, so results were mixed.

I am having success now with [NOMYO Router](https://github.com/nomyo-ai/nomyo-router).


Reference: github-starred/open-webui#12215