[PR #21535] [CLOSED] feat: preload Ollama models on initial model load and model switch to prevent queued/lost requests #41758

Closed
opened 2026-04-25 13:54:33 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/21535
Author: @blboyko
Created: 2/17/2026
Status: Closed

Base: dev ← Head: feature/ollama-preload-on-switch


📝 Commits (10+)

  • fe6783c: Merge pull request #19030 from open-webui/dev
  • fc05e0a: Merge pull request #19405 from open-webui/dev
  • e3faec6: Merge pull request #19416 from open-webui/dev
  • 9899293: Merge pull request #19448 from open-webui/dev
  • 140605e: Merge pull request #19462 from open-webui/dev
  • 6f1486f: Merge pull request #19466 from open-webui/dev
  • d95f533: Merge pull request #19729 from open-webui/dev
  • a727153: 0.6.43 (#20093)
  • 6adde20: Merge pull request #20394 from open-webui/dev
  • f9b0534: Merge pull request #20522 from open-webui/dev

📊 Changes

2 files changed (+81 additions, -0 deletions)

📝 backend/open_webui/config.py (+7 -0)
📝 backend/open_webui/routers/ollama.py (+74 -0)

📄 Description

Pull Request Checklist

  • Target branch: Verify that the pull request targets the dev branch.
  • Description: Provided below.
  • Changelog: Provided below.
  • Documentation: Environment variable OLLAMA_PRELOAD_ON_SWITCH documented in description.
  • Dependencies: No new dependencies.
  • Testing: Manually tested across multiple model switches on single-GPU setup. Details below.
  • Agentic AI Code: Human-reviewed and manually tested.
  • Code review: Self-reviewed.
  • Design & Architecture: Single env var config, smart default (enabled by default).
  • Git Hygiene: Single atomic commit, rebased on latest upstream.
  • Title Prefix: feat:

Changelog Entry

Description

When a user switches between Ollama models in Open WebUI, chat requests are sent immediately without checking whether the target model is loaded into memory. On single-GPU setups, model loading takes 20-60 seconds, during which requests queue inside Ollama and return out of order: responses are lost or attributed to the wrong model, and users see hangs with no feedback.

This adds a preload check in generate_chat_completion() that ensures the requested model is fully loaded before sending the chat request.

How it works (a sketch of the helpers follows this list):

  1. Tracks the last-used model per Ollama instance via request.app.state
  2. On model change, checks /api/ps to see if the target model is resident in memory
  3. If not loaded, triggers a preload via /api/generate with no prompt (Ollama's official preload mechanism) in a background thread
  4. Polls /api/ps every 2 seconds until the model appears (180s timeout)
  5. Proceeds with the chat request only after the model is confirmed ready
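
The PR names two helpers, is_model_loaded() and wait_for_model_loaded(). Below is a minimal sketch of what they could look like, assuming aiohttp and illustrative module-level constants for the 2 s poll interval and 180 s timeout; the actual implementation lives in backend/open_webui/routers/ollama.py and may differ:

```python
import asyncio
import aiohttp

POLL_INTERVAL_S = 2      # from the PR: poll /api/ps every 2 seconds
PRELOAD_TIMEOUT_S = 180  # from the PR: give up after 180 seconds


async def is_model_loaded(url: str, model: str) -> bool:
    """Check /api/ps for model residency (step 2)."""
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{url}/api/ps") as resp:
            data = await resp.json()
    # /api/ps lists the models currently resident in memory
    return any(model in (m.get("name"), m.get("model"))
               for m in data.get("models", []))


async def trigger_preload(url: str, model: str) -> None:
    """Ask Ollama to load a model: /api/generate with no prompt (step 3)."""
    async with aiohttp.ClientSession() as session:
        # Omitting "prompt" signals a pure load; nothing is generated.
        async with session.post(f"{url}/api/generate", json={"model": model}):
            pass


async def wait_for_model_loaded(url: str, model: str) -> bool:
    """Poll /api/ps until the model appears or the timeout expires (step 4)."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + PRELOAD_TIMEOUT_S
    while loop.time() < deadline:
        if await is_model_loaded(url, model):
            return True
        await asyncio.sleep(POLL_INTERVAL_S)
    return False
```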

Design decisions (see the guard sketch after this list):

  • Only fires on model switch, not every request — zero overhead for same-model conversations
  • Polling is lightweight — /api/ps calls return in ~25μs
  • No prompt field in preload payload — signals pure load to Ollama, no wasted generation
  • Background thread for trigger — non-blocking preload initiation
  • Configurable via OLLAMA_PRELOAD_ON_SWITCH env var (default: true)
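
A hypothetical sketch of how the guard might slot into generate_chat_completion(), using the helpers above. The attribute name last_ollama_model is an assumption, and asyncio.create_task stands in for the PR's background thread; this is not the PR's exact code:

```python
# Hypothetical placement inside generate_chat_completion() in
# backend/open_webui/routers/ollama.py, before the chat request is
# forwarded. "url" is the Ollama base URL selected for this request.
model = payload["model"]
last = getattr(request.app.state, "last_ollama_model", None)

if OLLAMA_PRELOAD_ON_SWITCH and model != last:        # step 1: only on switch
    if not await is_model_loaded(url, model):         # step 2: check /api/ps
        asyncio.create_task(trigger_preload(url, model))  # step 3: fire preload
        await wait_for_model_loaded(url, model)       # step 4: poll until resident
    request.app.state.last_ollama_model = model
# step 5: existing code proceeds to forward the chat request
```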

Related issues:

  • ollama/ollama#8779 (model switching hangs, causes response problems)
  • #3987 (feature request for model preload)

Added

  • OLLAMA_PRELOAD_ON_SWITCH environment variable to enable/disable model preloading (default: true)
  • is_model_loaded() helper — checks /api/ps for model residency
  • wait_for_model_loaded() helper — polls /api/ps every 2s with 180s timeout
  • Preload guard in generate_chat_completion() that triggers model load on model switch

Changed

  • generate_chat_completion() now waits for model to be loaded before sending chat request (when enabled)

Deprecated

  • N/A

Removed

  • N/A

Fixed

  • Queued/lost responses when switching between Ollama models on single-GPU setups
  • Wrong-model response attribution during model load transitions
  • 499 errors from Ollama when requests arrive before model is ready

Security

  • N/A

Breaking Changes

  • None. Feature is additive and can be disabled via OLLAMA_PRELOAD_ON_SWITCH=false (config sketch below).
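
For illustration, a plausible shape for the config.py addition; the default-true parsing below is an assumption, since open-webui often wraps settings differently (e.g. in PersistentConfig):

```python
# backend/open_webui/config.py (sketch; exact wiring in the PR may differ)
import os

# Treat any value other than "true" (case-insensitive) as disabled,
# matching the PR's enabled-by-default behavior.
OLLAMA_PRELOAD_ON_SWITCH = (
    os.environ.get("OLLAMA_PRELOAD_ON_SWITCH", "true").lower() == "true"
)
```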

Additional Information

Tested with RTX 4070 Ti (12GB VRAM), switching between 7B models:

| Scenario | Before | After |
| --- | --- | --- |
| Model switch (cold) | 499 errors, lost responses | 20-35s wait, then immediate response |
| Model switch (warm) | Occasional wrong-model response | Instant (already loaded) |
| Same model repeat | N/A | No preload triggered (correctly skipped) |
| Flag disabled | N/A | Stock behavior, no preload |

Zero 499 errors observed across all test sessions with the feature enabled.

Screenshots or Videos

N/A — backend-only change, no UI modifications.

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA) (https://github.com/open-webui/open-webui/blob/main/CONTRIBUTOR_LICENSE_AGREEMENT), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 13:54:33 -05:00
Reference: github-starred/open-webui#41758