[PR #21535] [CLOSED] feat: preload Ollama models on initial model load and model switch to prevent queued/lost requests #41758

Closed
opened 2026-04-25 13:54:33 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/21535
Author: @blboyko
Created: 2/17/2026
Status: Closed

Base: dev ← Head: feature/ollama-preload-on-switch


📝 Commits (10+)

  • fe6783c: Merge pull request #19030 from open-webui/dev
  • fc05e0a: Merge pull request #19405 from open-webui/dev
  • e3faec6: Merge pull request #19416 from open-webui/dev
  • 9899293: Merge pull request #19448 from open-webui/dev
  • 140605e: Merge pull request #19462 from open-webui/dev
  • 6f1486f: Merge pull request #19466 from open-webui/dev
  • d95f533: Merge pull request #19729 from open-webui/dev
  • a727153: 0.6.43 (#20093)
  • 6adde20: Merge pull request #20394 from open-webui/dev
  • f9b0534: Merge pull request #20522 from open-webui/dev

📊 Changes

2 files changed (+81 additions, -0 deletions)

📝 backend/open_webui/config.py (+7 -0)
📝 backend/open_webui/routers/ollama.py (+74 -0)

📄 Description

Pull Request Checklist

  • Target branch: Verify that the pull request targets the dev branch.
  • Description: Provided below.
  • Changelog: Provided below.
  • Documentation: Environment variable OLLAMA_PRELOAD_ON_SWITCH documented in description.
  • Dependencies: No new dependencies.
  • Testing: Manually tested across multiple model switches on single-GPU setup. Details below.
  • Agentic AI Code: Human-reviewed and manually tested.
  • Code review: Self-reviewed.
  • Design & Architecture: Single env var config, smart default (enabled by default).
  • Git Hygiene: Single atomic commit, rebased on latest upstream.
  • Title Prefix: feat:

Changelog Entry

Description

When a user switches between Ollama models in Open WebUI, chat requests are sent immediately without checking whether the target model is loaded into memory. On single-GPU setups, model loading takes 20-60 seconds, during which requests queue inside Ollama and return out of order: responses are lost or attributed to the wrong model, and users see hangs with no feedback.

This adds a preload check in generate_chat_completion() that ensures the requested model is fully loaded before sending the chat request.

How it works (a sketch of the helpers follows this list):

  1. Tracks the last-used model per Ollama instance via request.app.state
  2. On model change, checks /api/ps to see if the target model is resident in memory
  3. If not loaded, triggers a preload via /api/generate with no prompt (Ollama's official preload mechanism) in a background thread
  4. Polls /api/ps every 2 seconds until the model appears (180s timeout)
  5. Proceeds with the chat request only after the model is confirmed ready
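
The PR names two helpers, is_model_loaded() and wait_for_model_loaded(). Below is a minimal sketch of what they could look like, assuming aiohttp and illustrative module-level constants for the 2 s poll interval and 180 s timeout; the actual implementation lives in backend/open_webui/routers/ollama.py and may differ:

```python
import asyncio
import aiohttp

POLL_INTERVAL_S = 2      # from the PR: poll /api/ps every 2 seconds
PRELOAD_TIMEOUT_S = 180  # from the PR: give up after 180 seconds


async def is_model_loaded(url: str, model: str) -> bool:
    """Check /api/ps for model residency (step 2)."""
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{url}/api/ps") as resp:
            data = await resp.json()
    # /api/ps lists the models currently resident in memory
    return any(model in (m.get("name"), m.get("model"))
               for m in data.get("models", []))


async def trigger_preload(url: str, model: str) -> None:
    """Ask Ollama to load a model: /api/generate with no prompt (step 3)."""
    async with aiohttp.ClientSession() as session:
        # Omitting "prompt" signals a pure load; nothing is generated.
        async with session.post(f"{url}/api/generate", json={"model": model}):
            pass


async def wait_for_model_loaded(url: str, model: str) -> bool:
    """Poll /api/ps until the model appears or the timeout expires (step 4)."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + PRELOAD_TIMEOUT_S
    while loop.time() < deadline:
        if await is_model_loaded(url, model):
            return True
        await asyncio.sleep(POLL_INTERVAL_S)
    return False
```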

Design decisions (see the guard sketch after this list):

  • Only fires on model switch, not every request — zero overhead for same-model conversations
  • Polling is lightweight — /api/ps calls return in ~25μs
  • No prompt field in preload payload — signals pure load to Ollama, no wasted generation
  • Background thread for trigger — non-blocking preload initiation
  • Configurable via OLLAMA_PRELOAD_ON_SWITCH env var (default: true)
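
A hypothetical sketch of how the guard might slot into generate_chat_completion(), using the helpers above. The attribute name last_ollama_model is an assumption, and asyncio.create_task stands in for the PR's background thread; this is not the PR's exact code:

```python
# Hypothetical placement inside generate_chat_completion() in
# backend/open_webui/routers/ollama.py, before the chat request is
# forwarded. "url" is the Ollama base URL selected for this request.
model = payload["model"]
last = getattr(request.app.state, "last_ollama_model", None)

if OLLAMA_PRELOAD_ON_SWITCH and model != last:        # step 1: only on switch
    if not await is_model_loaded(url, model):         # step 2: check /api/ps
        asyncio.create_task(trigger_preload(url, model))  # step 3: fire preload
        await wait_for_model_loaded(url, model)       # step 4: poll until resident
    request.app.state.last_ollama_model = model
# step 5: existing code proceeds to forward the chat request
```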

Related issues:

  • ollama/ollama#8779 (model switching hangs, causes response problems)
  • #3987 (feature request for model preload)

Added

  • OLLAMA_PRELOAD_ON_SWITCH environment variable to enable/disable model preloading (default: true)
  • is_model_loaded() helper — checks /api/ps for model residency
  • wait_for_model_loaded() helper — polls /api/ps every 2s with 180s timeout
  • Preload guard in generate_chat_completion() that triggers model load on model switch

Changed

  • generate_chat_completion() now waits for model to be loaded before sending chat request (when enabled)

Deprecated

  • N/A

Removed

  • N/A

Fixed

  • Queued/lost responses when switching between Ollama models on single-GPU setups
  • Wrong-model response attribution during model load transitions
  • 499 errors from Ollama when requests arrive before model is ready

Security

  • N/A

Breaking Changes

  • None. Feature is additive and can be disabled via OLLAMA_PRELOAD_ON_SWITCH=false (config sketch below).
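
For illustration, a plausible shape for the config.py addition; the default-true parsing below is an assumption, since open-webui often wraps settings differently (e.g. in PersistentConfig):

```python
# backend/open_webui/config.py (sketch; exact wiring in the PR may differ)
import os

# Treat any value other than "true" (case-insensitive) as disabled,
# matching the PR's enabled-by-default behavior.
OLLAMA_PRELOAD_ON_SWITCH = (
    os.environ.get("OLLAMA_PRELOAD_ON_SWITCH", "true").lower() == "true"
)
```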

Additional Information

Tested with RTX 4070 Ti (12GB VRAM), switching between 7B models:

| Scenario | Before | After |
| --- | --- | --- |
| Model switch (cold) | 499 errors, lost responses | 20-35s wait, then immediate response |
| Model switch (warm) | Occasional wrong-model response | Instant (already loaded) |
| Same model repeat | N/A | No preload triggered (correctly skipped) |
| Flag disabled | N/A | Stock behavior, no preload |

Zero 499 errors observed across all test sessions with the feature enabled.

Screenshots or Videos

N/A — backend-only change, no UI modifications.

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA) (https://github.com/open-webui/open-webui/blob/main/CONTRIBUTOR_LICENSE_AGREEMENT), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 13:54:33 -05:00
Reference: github-starred/open-webui#41758