Model size nearly doubles in 0.5.3 #3174

Closed
opened 2025-11-11 15:24:44 -06:00 by GiteaMirror · 3 comments
Owner

Originally created by @yourmomdatestedcruz on GitHub (Jan 3, 2025).

Installation Method

Docker

Environment

  • Open WebUI Version: 0.5.3

  • Ollama: 0.5.4

  • Operating System: macOS 15.2

  • Browser (if applicable): Safari 18.2

Confirmation:

  • I have read and followed all the instructions provided in the README.md.
  • I am on the latest version of both Open WebUI and Ollama.
  • I have provided the exact steps to reproduce the bug in the "Steps to Reproduce" section below.

Expected Behavior:

When loading a model in RAM through Open WebUI, the model loads with its reported size.

Actual Behavior:

When loading a model in RAM through Open WebUI, the model loads at roughly double its reported size, from 36GB to 65GB.

Description

Bug Summary:
When I run a model (for example dolphin-mixtral8x7b:q5_k_m) directly from the command line, the model loads into RAM, and subsequently running ollama ps shows the model occupying 36GB of RAM and running 100% on GPU.

When running the same model through Open WebUI and then running ollama ps from the command line, the model is reported at 65GB and must therefore be partly loaded on the CPU (roughly a 50/50 split).

Reproduction Details

Steps to Reproduce:
Run the model in Open WebUI, then compare the ollama ps output against a direct command-line run.

Additional Information

I've confirmed that this does not occur in Open WebUI 0.5.2. I re-pulled the 0.5.2 Docker image and ran the model again, and I get the same model size as I do when I run it from the command line directly.

Is there a feature that is on by default that I should disable in 0.5.3?

Author
Owner

@pressdarling commented on GitHub (Jan 3, 2025):

What's your context size in each environment? If you haven't set it in the modelfile for Ollama, it will default to 2048 there. You can also see RAM usage for each case in the Ollama server logs: if you run a test and check against the timestamps, and you've set a larger context size in Open WebUI, you can see how much RAM is reserved for your context when the model loads.
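For reference, context length can be pinned on the Ollama side with a custom Modelfile, so CLI runs and Open WebUI runs use the same value. A minimal sketch — the model tag matches the one tested later in this thread, and 2048 is the default mentioned above:

```
FROM dolphin-mixtral:8x7b-v2.7-q5_K_M
PARAMETER num_ctx 2048
```

Build it with `ollama create <name> -f Modelfile` and run that name instead of the base tag.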

Author
Owner

@yourmomdatestedcruz commented on GitHub (Jan 3, 2025):

Thanks @pressdarling for the quick reply!
I had a 128k context length set, so I went ahead and ran some tests to try to narrow it down.

Using dolphin-mixtral:8x7b-v2.7-q5_K_M as the model for this test:

In Open WebUI 0.5.2:

  • with 128k context length, the model loads at 36GB (the actual model size);
  • with 64k context length, the model also loads at 36GB;
  • with the default context length, the model loads at 36GB.

Now with Open WebUI 0.5.3:

  • with 128k context length, the model loads at 61GB;
  • with 64k context length, the model loads at 47GB;
  • with the default context length, the model loads at 36GB.

Just to be safe, I set the context length both in the chat I was using for the test (via the controls menu in the top right of the screen) and in Settings > General > Advanced Parameters > Context Length, on both 0.5.2 and 0.5.3.

Is the behavior I'm seeing in 0.5.3 actually what should have been happening all along, and is it 0.5.2 that has the "issue" here?

Author
Owner

@yourmomdatestedcruz commented on GitHub (Jan 3, 2025):

Update: I tried more context lengths in 0.5.3 (32k, 24k, and 16k) and the scaling continues: the lower the context length, the smaller the footprint the model loads into RAM.
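For what it's worth, the scaling observed here is consistent with the KV cache that a runtime allocates alongside the model weights. A back-of-the-envelope sketch, assuming Mixtral 8x7B's published shape (32 layers, 8 KV heads of dimension 128) and an fp16 cache — these figures are assumptions about the runtime, not taken from Ollama's logs:

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Rough lower bound on KV cache size for a given context length.

    K and V each store n_kv_heads * head_dim values per layer per token,
    hence the factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

for ctx in (2048, 16_384, 65_536, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:5.2f} GiB")
```

At 128k context this lower bound is 16 GiB, in the same ballpark as the 25GB jump reported above (36GB to 61GB); the remainder would be compute buffers and other per-sequence overhead.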

Reference: github-starred/open-webui#3174