Mirror of https://github.com/open-webui/open-webui.git (synced 2026-03-25 04:24:30 -05:00)
feat: add mlock parameter from ollama (use ram to load models) #390
Originally created by @IGLADI on GitHub (Mar 1, 2024).
Is your feature request related to a problem? Please describe.
It seems like models from Ollama barely go into RAM, or not at all, which slows things down a lot (see the iowait in screenshot #1).
Describe the solution you'd like
Per the Ollama API docs and the llama.cpp docs, you could just send `use_mlock = true` in the API request. I don't know the code base, so I couldn't change it myself, but that should be an easy fix (good first commit?).
Describe alternatives you've considered
Make it togglable in the UI, put it in the Docker Compose file, or do both.
Maybe an argument already exists in Ollama's Docker Compose setup to do this?
Maybe another fix exists to load the models into RAM?
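The suggestion above can be sketched as a client-side request. This is a minimal illustration, not the Open WebUI code: it assumes a local Ollama server at the default `http://localhost:11434`, and passes `use_mlock` in the per-request `options` field of `/api/generate` (a documented Ollama model option that overrides the Modelfile).

```python
# Sketch: send use_mlock in the "options" of an Ollama /api/generate call.
# Assumes a local Ollama server; endpoint and option names per Ollama's API docs.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_generate_payload(model: str, prompt: str, use_mlock: bool = True) -> dict:
    """Build the request body; per-request options override the Modelfile."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"use_mlock": use_mlock},  # ask Ollama to lock weights in RAM
    }

def generate(model: str, prompt: str) -> str:
    """POST the payload to a running Ollama server and return the response text."""
    data = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("mistral", "Say hello.")` requires a running Ollama instance; the payload builder alone shows where the flag goes.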
Additional context


[Screenshot 1: nearly all RAM in use comes from other services.]
[Screenshot 2: the model uses a lot of memory, but none of it is loaded into RAM.]
@Ankk98 commented on GitHub (Mar 2, 2024):
I'm seeing the same thing. I'd like to contribute a fix for this.
@NiLon commented on GitHub (Mar 14, 2024):
Isn't mlock just reserving (locking) the maximum amount of memory for the model, rather than growing dynamically as the model needs it? I don't think it actually helps the model stay in memory. If Ollama releases the model from memory after each prompt, then mlock would not solve it. It's still a useful option to have, but does it actually solve the problem?
@IGLADI commented on GitHub (Mar 14, 2024):
Well, I think it would, but indeed it would do more than that and force the model to stay in memory constantly. I might be wrong, though, and it may not fix this issue; I haven't had time to test it in practice myself.
Do you have another fix in mind?
@NiLon commented on GitHub (Mar 16, 2024):
In my testing the model stays in memory; it only gets loaded again if you swap models. If you are using GPU offloading, keep in mind that it uses GPU memory (VRAM), not system RAM. This is at least the case with the Docker version of Ollama on a Linux host.
@IGLADI commented on GitHub (Mar 16, 2024):
I run it on CPU; the screenshot was taken using Mixtral (while prompting).
When using Mistral (a lighter model) it seems to load part of it into RAM, though still not all of it.
Really weird; it might be an Ollama issue?
@NiLon commented on GitHub (Mar 16, 2024):
I'm fairly positive this is an Ollama thing. I'd suggest running a bare Ollama instance and confirming the findings there; the web UI can't do much to fix Ollama's behavior. You could try the mlock thing as well.
With mlock, though, make sure you are able to lock enough memory. `ulimit -l` shows the current limit; you might need to increase it. On systemd systems, set this in `/etc/systemd/user.conf`:
DefaultLimitMEMLOCK=infinity
Do note that it's not recommended to edit this file directly; rather, create a new config file in the corresponding drop-in directory if you plan to keep the setting.
Also, for the setting to stick, you need to reload the daemon:
sudo systemctl daemon-reload
Then log in to a new session (open a new shell, for example) and run `ulimit -l` to confirm; it should say "unlimited" if you used the infinity value. You could also set it to an actual value like 16G. For older systems, or ones not using systemd, edit `/etc/security/limits.conf` (the memlock entries) instead.
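The `ulimit -l` check above can also be done programmatically. A small sketch (Unix only) using Python's standard `resource` module; note that `ulimit -l` reports KiB while `getrlimit` returns bytes:

```python
# Inspect the memlock limit from Python, mirroring what `ulimit -l` shows.
# resource.RLIM_INFINITY corresponds to the "unlimited"/"infinity" value.
import resource

def memlock_limits() -> tuple[int, int]:
    """Return the (soft, hard) RLIMIT_MEMLOCK limits in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)

soft, hard = memlock_limits()
print("unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes lockable")
```

If the soft limit is smaller than the model's weights, `mlock` will fail and the weights can still be paged out.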
@ZenDarva commented on GitHub (Apr 28, 2024):
The web UI could specify
`use_mmap: false, use_mlock: true` in the generation call that creates the chat, or make it user-configurable.
I believe this would resolve the issue.
I can't find anywhere where it is already user-configurable.
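The option pair suggested here can be sketched as a small helper. `use_mmap` and `use_mlock` are documented Ollama model options; the helper itself is hypothetical and not Open WebUI code:

```python
# Hypothetical helper (not actual Open WebUI code): merge the memory
# options suggested above into an Ollama generation payload.
def apply_memory_options(payload: dict, use_mlock: bool = True) -> dict:
    """Return a copy of the payload with use_mmap/use_mlock set in options."""
    opts = dict(payload.get("options", {}))  # copy so the caller's dict is untouched
    # Disable mmap and lock the weights so they are read fully into RAM
    # and cannot be paged out.
    opts.update({"use_mmap": False, "use_mlock": use_mlock})
    return {**payload, "options": opts}

request_body = apply_memory_options(
    {"model": "mixtral", "prompt": "hi", "stream": False}
)
```

Making it user-configurable would then just mean wiring `use_mlock` to a UI toggle instead of a constant.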
@tjbck commented on GitHub (Jun 2, 2024):
Added with v0.2.1!