mirror of
https://github.com/open-webui/open-webui.git
synced 2026-03-10 15:54:15 -05:00
Model size nearly doubles in 0.5.3 #3174
Originally created by @yourmomdatestedcruz on GitHub (Jan 3, 2025).
Installation Method
Docker
Environment
Open WebUI Version: 0.5.3
Ollama: 0.5.4
Operating System: macOS 15.2
Browser (if applicable): Safari 18.2
Expected Behavior:
When loading a model in RAM through Open WebUI, the model loads with its reported size.
Actual Behavior:
When loading a model in RAM through Open WebUI, the model loads at roughly double its reported size, from 36 GB to 65 GB.
Description
Bug Summary:
When I run a model (for example, dolphin-mixtral8x7b:q5_k_m) directly from the command line, the model loads into RAM, and running ollama ps afterwards shows it taking 36 GB of RAM and running 100% on the GPU.
When I run the same model in Open WebUI and then run ollama ps from the command line, the model is reported at 65 GB and must therefore be loaded partly on the CPU (roughly a 50/50 split).
Reproduction Details
Steps to Reproduce:
Running the model in Open WebUI
Additional Information
I've confirmed that this does not occur in Open WebUI 0.5.2: I re-pulled the 0.5.2 Docker image and ran the model again, and I get the same model size as when I run it directly from the command line.
Is there a feature that is on by default that I should disable in 0.5.3?
@pressdarling commented on GitHub (Jan 3, 2025):
What's your context size in each environment? If you haven't set it in the modelfile for Ollama, it will default to 2048 there. You can also see RAM usage for each case in the Ollama server logs if you test and check against timestamps; if you've set a larger context size in Open WebUI, you can see how much RAM is reserved for the context when the model loads.
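To see why the context size matters so much, here is a rough back-of-envelope estimate of KV-cache memory for a Mixtral-8x7B-like model. The architecture numbers (32 layers, 8 KV heads, head dim 128) and the fp16 cache are assumptions for illustration, not figures taken from this issue:

```python
# Rough KV-cache size estimate. Architecture numbers are assumptions
# for a Mixtral-8x7B-like model, not confirmed values from the issue.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # fp16 cache (assumed)

def kv_cache_bytes(ctx_len: int) -> int:
    # 2x for the separate K and V tensors, per layer, per KV head.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * ctx_len

for ctx in (2048, 131072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"num_ctx={ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions, Ollama's default 2048 context reserves only ~0.25 GiB, while a 128k context reserves ~16 GiB on top of the weights, which is the kind of gap that can push part of the model off the GPU.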
@yourmomdatestedcruz commented on GitHub (Jan 3, 2025):
Thanks @pressdarling for the quick reply!
I had a 128k context length set, so I went ahead and ran some tests to try to narrow it down.
Using dolphin-mixtral:8x7b-v2.7-q5_K_M as the model for this test:
In Open WebUI 0.5.2:
Now with Open WebUI 0.5.3:
Just to be safe, I set the context length both in the chat I was using for the test (using the controls menu in the top right of the screen), and in Settings > General > Advanced Parameters > Context Length on both 0.5.2 and 0.5.3.
Is the behavior I'm seeing in 0.5.3 actually the expected behavior, and is it 0.5.2 that has an "issue" here?
@yourmomdatestedcruz commented on GitHub (Jan 3, 2025):
Update: I tried more context lengths in 0.5.3 (32k, 24k, and 16k) and the scaling continues: the lower the context length, the lower the size of the model loaded in RAM.
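That linear relationship is consistent with the extra RAM being KV cache, which grows proportionally with num_ctx. A quick sketch of the expected extra memory at the context lengths tried above, assuming (hypothetically) ~128 KiB of cache per token for this model:

```python
# If the extra RAM is KV cache, it should scale linearly with num_ctx.
# ~128 KiB per token is an assumed figure for a Mixtral-8x7B-like
# architecture with an fp16 cache, used purely for illustration.
BYTES_PER_TOKEN = 128 * 1024  # assumption

def extra_gib(ctx_len: int) -> float:
    return ctx_len * BYTES_PER_TOKEN / 2**30

for ctx in (16_384, 24_576, 32_768, 131_072):
    print(f"num_ctx={ctx:>6}: ~{extra_gib(ctx):.1f} GiB on top of the weights")
```

Halving the context length halves the cache, matching the "lower context, lower loaded size" pattern observed in the tests.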