[PR #21535] [CLOSED] feat: preload Ollama models on initial model load and model switch to prevent queued/lost requests #26128
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/21535
Author: @blboyko
Created: 2/17/2026
Status: ❌ Closed
Base: dev ← Head: feature/ollama-preload-on-switch

📝 Commits (10+)
fe6783c Merge pull request #19030 from open-webui/dev
fc05e0a Merge pull request #19405 from open-webui/dev
e3faec6 Merge pull request #19416 from open-webui/dev
9899293 Merge pull request #19448 from open-webui/dev
140605e Merge pull request #19462 from open-webui/dev
6f1486f Merge pull request #19466 from open-webui/dev
d95f533 Merge pull request #19729 from open-webui/dev
a727153 0.6.43 (#20093)
6adde20 Merge pull request #20394 from open-webui/dev
f9b0534 Merge pull request #20522 from open-webui/dev

📊 Changes
2 files changed (+81 additions, -0 deletions)
View changed files
📝 backend/open_webui/config.py (+7 -0)
📝 backend/open_webui/routers/ollama.py (+74 -0)

📄 Description
Pull Request Checklist
- Target branch: this PR targets the dev branch.
- Description: the OLLAMA_PRELOAD_ON_SWITCH option is documented in the description.
- Changelog: a feat: entry is included below.
Description
When a user switches between Ollama models in OpenWebUI, chat requests are sent immediately without checking if the target model is loaded into memory. On single-GPU setups, model loading takes 20-60 seconds, during which requests queue inside Ollama and return out of order, responses are lost or attributed to the wrong model, and users see hangs with no feedback.
This adds a preload check in generate_chat_completion() that ensures the requested model is fully loaded before sending the chat request.

How it works (a sketch follows this list):
- Uses the Ollama connection configured on request.app.state and queries /api/ps to see if the target model is resident in memory
- If it is not, sends /api/generate with no prompt (Ollama's official preload mechanism) in a background thread
- Polls /api/ps every 2 seconds until the model appears (180 s timeout)
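The helpers listed in the changelog below suggest roughly the following shape. This is a minimal illustrative sketch, assuming an aiohttp session and a single Ollama base URL; the PR itself runs the preload in a background thread, and its exact signatures may differ:

```python
import asyncio
import time

import aiohttp


async def is_model_loaded(session: aiohttp.ClientSession, base_url: str, model: str) -> bool:
    # /api/ps lists the models currently resident in memory.
    async with session.get(f"{base_url}/api/ps") as resp:
        data = await resp.json()
    return any(m.get("name") == model or m.get("model") == model
               for m in data.get("models", []))


async def _preload(session: aiohttp.ClientSession, base_url: str, model: str) -> None:
    # /api/generate with no prompt is Ollama's documented way to load a
    # model into memory without generating any tokens.
    async with session.post(f"{base_url}/api/generate", json={"model": model}) as resp:
        await resp.read()


async def wait_for_model_loaded(session: aiohttp.ClientSession, base_url: str,
                                model: str, timeout: float = 180.0) -> bool:
    # Kick off the load in the background, then poll /api/ps every 2 s
    # until the model appears or the timeout expires.
    asyncio.create_task(_preload(session, base_url, model))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if await is_model_loaded(session, base_url, model):
            return True
        await asyncio.sleep(2)
    return False
```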
Design decisions:
- /api/ps calls return in ~25 μs, so the residency check is effectively free on every request
- The feature is gated behind the OLLAMA_PRELOAD_ON_SWITCH env var (default: true)

Related issues:
Changelog Entry

Added
- OLLAMA_PRELOAD_ON_SWITCH environment variable to enable/disable model preloading (default: true); a guessed sketch of the config flag follows this list
- is_model_loaded() helper: checks /api/ps for model residency
- wait_for_model_loaded() helper: polls /api/ps every 2 s with a 180 s timeout
- A preload check in generate_chat_completion() that triggers a model load on model switch
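The +7-line config.py change is not shown in this page; a plain boolean env-var flag would be the conventional shape (this is a guess, not the PR's actual diff):

```python
import os

# Hypothetical reconstruction of the config.py flag: enabled unless
# the env var is explicitly set to "false".
OLLAMA_PRELOAD_ON_SWITCH = (
    os.environ.get("OLLAMA_PRELOAD_ON_SWITCH", "true").lower() == "true"
)
```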
Changed
- generate_chat_completion() now waits for the model to be loaded before sending the chat request (when enabled; see the sketch below)
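Building on the helpers sketched earlier, the gate inside generate_chat_completion() might look roughly like this (names and placement are assumptions, not the PR's actual code):

```python
from fastapi import HTTPException


async def ensure_model_loaded(session, base_url: str, model_id: str) -> None:
    # Hypothetical gate called before the chat request is forwarded.
    if not OLLAMA_PRELOAD_ON_SWITCH:
        return  # feature disabled via env var
    if await is_model_loaded(session, base_url, model_id):
        return  # already resident; the /api/ps check costs ~25 μs
    if not await wait_for_model_loaded(session, base_url, model_id):
        raise HTTPException(
            status_code=503,
            detail=f"Model {model_id!r} did not load within the 180 s timeout",
        )
```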
Deprecated

Removed
Fixed
Security
Breaking Changes
- The new behavior can be disabled by setting OLLAMA_PRELOAD_ON_SWITCH=false.

Additional Information
Tested on an RTX 4070 Ti (12 GB VRAM), switching between 7B models.
Zero 499 errors observed across all test sessions with the feature enabled.
Screenshots or Videos
N/A — backend-only change, no UI modifications.
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.