[GH-ISSUE #1081] feat: smarter ollama load balancing #12327

Closed
opened 2026-04-19 19:13:52 -05:00 by GiteaMirror · 9 comments

Originally created by @tjbck on GitHub (Mar 7, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/1081

right now it just uses `random.choices`, refer to my comment here: https://github.com/open-webui/open-webui/blob/8ed5759d0e9424f87d01fe3f8013116c4ba2004f/backend/apps/ollama/main.py#L39

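Since the issue is about replacing the bare `random.choices` call, here is a minimal, hypothetical sketch of what a configurable selection strategy could look like. The strategy names, `pick_url`, and the example URLs are illustrative assumptions, not Open WebUI's actual configuration or API.

```
import itertools
import random

# Hypothetical sketch only: the backend currently just does a uniform random
# pick over the configured base URLs at the line referenced above.
OLLAMA_BASE_URLS = [
    "http://ollama-1:11434",
    "http://ollama-2:11434",
]

_round_robin = itertools.cycle(OLLAMA_BASE_URLS)


def pick_url(strategy: str = "random") -> str:
    """Return the Ollama base URL the next request should be forwarded to."""
    if strategy == "round_robin":
        # Cycle through the configured URLs in order.
        return next(_round_robin)
    # Default: uniform random choice, mirroring the current behaviour.
    return random.choice(OLLAMA_BASE_URLS)


if __name__ == "__main__":
    for _ in range(4):
        print(pick_url("round_robin"))
```

A configuration parameter, as proposed in the comments below, would simply select which branch of such a picker is used.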
GiteaMirror added the enhancement, good first issue, help wanted labels 2026-04-19 19:13:52 -05:00

@asedmammad commented on GitHub (Mar 17, 2024):

@tjbck I can work on this one. I am considering implementing a configuration parameter that allows users to select from various strategies.


@tjbck commented on GitHub (Mar 17, 2024):

@asedmammad Feel free to create a draft PR! I'll also get actively involved in this one!


@lewismacnow commented on GitHub (Apr 12, 2024):

> @tjbck I can work on this one. I am considering implementing a configuration parameter that allows users to select from various strategies.

I'd be very thankful for the ability to restrict models and/or users to a specific connection (i.e. prevent/stop load balancing in open-webui in some situations).
I gave some examples of these situations in #1527

My reason for this is:

The loading/unloading process adds seconds to the response, so if multiple users are actively using multiple models which are shared between connections, we will potentially **_add time to responses_** (whilst the model is removed/added to memory) instead of benefiting from load balancing over multiple connections.


@longregen commented on GitHub (Apr 30, 2024):

I'd like to bump this since now Ollama has support for multiple loaded models at the same time: https://github.com/ollama/ollama/pull/3418

In my testing with the "rc5" of 0.33, it worked just fine without changes on open-webui


@bkev commented on GitHub (Aug 11, 2024):

I think it would be really useful to have a preference for a certain node when it's online, if that's possible. For example: Open WebUI running on a low-powered server with Ollama, but a higher-powered node, when powered up and available, is preferred for better performance.

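As a rough illustration of the "preferred node when it's online" idea above, here is a minimal sketch. The hostnames, `is_online`, and `pick_preferred` are all hypothetical and not part of Open WebUI; a real implementation would more likely probe the Ollama API (e.g. `/api/tags`) than do a bare TCP check.

```
import socket
from urllib.parse import urlparse

# Hypothetical priority list: the fast GPU box first, the always-on server last.
PREFERRED_URLS = [
    "http://desktop-3090:11434",  # only powered on some of the time
    "http://epyc-server:11434",   # always-on fallback
]


def is_online(url: str, timeout: float = 0.5) -> bool:
    """Cheap TCP reachability check against the Ollama host/port."""
    parsed = urlparse(url)
    try:
        with socket.create_connection((parsed.hostname, parsed.port or 11434), timeout):
            return True
    except OSError:
        return False


def pick_preferred() -> str:
    """Return the first reachable URL in priority order, else the last entry."""
    for url in PREFERRED_URLS:
        if is_online(url):
            return url
    return PREFERRED_URLS[-1]


if __name__ == "__main__":
    print("routing requests to:", pick_preferred())
```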

@haydonryan commented on GitHub (Sep 23, 2024):

> I think it would be really useful to have a preference for a certain node when it's online, if that's possible. For example: Open WebUI running on a low-powered server with Ollama, but a higher-powered node, when powered up and available, is preferred for better performance.

100% came here to post this. I have an epyc server that stays on 24/7 and gives ok performance. I'd much rather my desktop (with a 3090) serve the model when it's on. At the moment I have found a workaround to copy the model name to "llama3.1:latest-desktop" so it appears as something different in the list and manually select that, but would love something more automated.

Keep up the great work with openwebui!


@bkev commented on GitHub (Sep 28, 2024):

> > I think it would be really useful to have a preference for a certain node when it's online, if that's possible. For example: Open WebUI running on a low-powered server with Ollama, but a higher-powered node, when powered up and available, is preferred for better performance.
>
> 100% came here to post this. I have an epyc server that stays on 24/7 and gives ok performance. I'd much rather my desktop (with a 3090) serve the model when it's on. At the moment I have found a workaround to copy the model name to "llama3.1:latest-desktop" so it appears as something different in the list and manually select that, but would love something more automated.
>
> Keep up the great work with openwebui!

If there are 2 of the same models on 2 different servers, how do you rename them so it's clear which server they are on?


@haydonryan commented on GitHub (Sep 30, 2024):

> > > I think it would be really useful to have a preference for a certain node when it's online, if that's possible. For example: Open WebUI running on a low-powered server with Ollama, but a higher-powered node, when powered up and available, is preferred for better performance.
> >
> > 100% came here to post this. I have an epyc server that stays on 24/7 and gives ok performance. I'd much rather my desktop (with a 3090) serve the model when it's on. At the moment I have found a workaround to copy the model name to "llama3.1:latest-desktop" so it appears as something different in the list and manually select that, but would love something more automated.
> >
> > Keep up the great work with openwebui!
>
> If there are 2 of the same models on 2 different servers, how do you rename them so it's clear which server they are on?

On the command line:

```
$ ollama cp --help
Copy a model

Usage:
  ollama cp SOURCE DESTINATION [flags]

Flags:
  -h, --help   help for cp

Environment Variables:
      OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
```

Just call it something like blah-desktop (or whatever you want to identify it by). Note it just copies the ID; it does not copy all the bits, so it's space efficient.

```
$ ollama list
NAME                     ID            SIZE    MODIFIED
llama3.2-desktop:latest  a80c4f17acd5  2.0 GB  4 days ago
llama3.2:latest          a80c4f17acd5  2.0 GB  4 days ago
llama3.1:latest-desktop  42182419e950  4.7 GB  7 days ago
llama3.1:70b             c0df3564cfe8  39 GB   2 weeks ago
llama3.1:latest          42182419e950  4.7 GB  2 weeks ago
```

@uocnb commented on GitHub (Jan 9, 2025):

Hi there, I would like to bump this feature. It would be very useful if we had an Ollama server / connection name. That would make priority ordering easy (conditions would be better, but more complex), as well as prefixes / suffixes in the model selection.

Reference: github-starred/open-webui#12327