Feature Request: Preemptive Model Loading Before Input Submission #1805

Closed
opened 2025-11-11 14:53:39 -06:00 by GiteaMirror · 0 comments
Owner

Originally created by @cybersholt on GitHub (Aug 17, 2024).

Is your feature request related to a problem? Please describe.
Hi, I have a feature request related to optimizing the response time when interacting with models through the Open Web UI interface.

When a user initially loads the Open Web UI interface and types a question or command, the model isn't loaded into memory until after the input is submitted. This causes a delay, as the model loading process takes some time.

Describe the solution you'd like
Introduce a toggleable option called "Preemptive Model Loading." When enabled, the interface would send a signal to Ollama (or the respective backend) as soon as the user loads it, prompting the backend to begin loading the model into memory in anticipation of a question or command being submitted. This would reduce the perceived response time, since the model would already be loaded by the time the input is sent.

This feature would be particularly useful for users who interact with the interface frequently and need quick responses. To accommodate users who prefer to manage memory more conservatively, the option could default to off.
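If the backend is Ollama, the preload itself is simple: per Ollama's API documentation, a `/api/generate` request with only a `model` field (no prompt) loads the model into memory and returns immediately, and `keep_alive` controls how long it stays resident. A minimal sketch of the call the UI could fire on page load (the URL and model name below are placeholder assumptions, not part of this request):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address


def build_preload_request(model: str, keep_alive: str = "5m") -> dict:
    """Payload asking Ollama to load `model` without generating any text."""
    # No "prompt" key: Ollama treats this as a load-only request.
    return {"model": model, "keep_alive": keep_alive}


def preload_model(model: str) -> None:
    """Fire-and-forget preload the UI could trigger as soon as it opens."""
    payload = json.dumps(build_preload_request(model)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # The response only confirms the load; its body is not needed here.
    urllib.request.urlopen(req, timeout=10)
```

Setting `keep_alive` to `"0"` would unload the model immediately after, so a conservative-memory mode could simply skip the preload or shorten this value.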

Describe alternatives you've considered

Additional context


Reference: github-starred/open-webui#1805