Mirror of https://github.com/open-webui/open-webui.git (synced 2026-03-25 04:24:30 -05:00)
feat: add mlock parameter from ollama (use ram to load models) #390
Originally created by @IGLADI on GitHub (Mar 1, 2024).
Is your feature request related to a problem? Please describe.
It seems like models from Ollama barely go into RAM, or not at all, which slows things down a lot (see the iowait in screenshot #1).
Describe the solution you'd like
Per the Ollama API docs and the llama.cpp docs, you could just send `use_mlock = true` in the API request. I don't know the code base, so I couldn't change it myself, but that should be an easy fix (good first commit?).
Describe alternatives you've considered
Make it togglable in the UI, put it in the Docker Compose file, or do both.
Maybe an argument already exists in Ollama's Docker Compose setup to do this?
Maybe another fix exists to load the models into RAM?
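The suggestion above can be sketched as a client-side request. This is a minimal illustration, not the Open WebUI code: it assumes a local Ollama server at the default `http://localhost:11434`, and passes `use_mlock` in the per-request `options` field of `/api/generate` (a documented Ollama model option that overrides the Modelfile).

```python
# Sketch: send use_mlock in the "options" of an Ollama /api/generate call.
# Assumes a local Ollama server; endpoint and option names per Ollama's API docs.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_generate_payload(model: str, prompt: str, use_mlock: bool = True) -> dict:
    """Build the request body; per-request options override the Modelfile."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"use_mlock": use_mlock},  # ask Ollama to lock weights in RAM
    }

def generate(model: str, prompt: str) -> str:
    """POST the payload to a running Ollama server and return the response text."""
    data = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("mistral", "Say hello.")` requires a running Ollama instance; the payload builder alone shows where the flag goes.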
Additional context


[Screenshot 1: nearly all RAM in use comes from other services.]
[Screenshot 2: the model uses a lot of memory, but none of it is loaded into RAM.]
@Ankk98 commented on GitHub (Mar 2, 2024):
I'm seeing the same thing. I'd like to contribute a fix for this.
@NiLon commented on GitHub (Mar 14, 2024):
Isn't mlock just reserving (locking) the maximum amount of memory for the model, rather than growing dynamically as the model needs it? I don't think it actually helps the model stay in memory. If Ollama releases the model from memory after each prompt, then mlock would not solve it. It's still a useful option to have, but does it actually solve the problem?
@IGLADI commented on GitHub (Mar 14, 2024):
Well, I think it would, but indeed it would do more than that and force the model to stay in memory constantly. I might be wrong, though, and it may not fix this issue; I haven't had time to test it in practice myself.
Do you have another fix in mind?
@NiLon commented on GitHub (Mar 16, 2024):
In my testing the model stays in memory; it only gets loaded again if you swap models. If you are using GPU offloading, keep in mind that it uses GPU memory (VRAM), not system RAM. This is at least the case with the Docker version of Ollama on a Linux host.
@IGLADI commented on GitHub (Mar 16, 2024):
I run it on CPU; the screenshot was taken using Mixtral (while prompting).
When using Mistral (a lighter model) it seems to load part of it into RAM, though still not all of it.
Really weird; it might be an Ollama issue?
@NiLon commented on GitHub (Mar 16, 2024):
I'm fairly positive this is an Ollama thing. I'd suggest running a bare Ollama instance and confirming the findings there; the web UI can't do much to fix Ollama's behavior. You could try the mlock thing as well.
With mlock, though, make sure you are able to lock enough memory. `ulimit -l` shows the current limit; you might need to increase it. On systemd systems, set this in `/etc/systemd/user.conf`:
DefaultLimitMEMLOCK=infinity
Do note that it's not recommended to edit this file directly; rather, create a new config file in the corresponding drop-in directory if you plan to keep the setting.
Also, for the setting to stick, you need to reload the daemon:
sudo systemctl daemon-reload
Then log in to a new session (open a new shell, for example) and run `ulimit -l` to confirm; it should say "unlimited" if you used the infinity value. You could also set it to an actual value like 16G. For older systems, or ones not using systemd, edit `/etc/security/limits.conf` (the memlock entries) instead.
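The `ulimit -l` check above can also be done programmatically. A small sketch (Unix only) using Python's standard `resource` module; note that `ulimit -l` reports KiB while `getrlimit` returns bytes:

```python
# Inspect the memlock limit from Python, mirroring what `ulimit -l` shows.
# resource.RLIM_INFINITY corresponds to the "unlimited"/"infinity" value.
import resource

def memlock_limits() -> tuple[int, int]:
    """Return the (soft, hard) RLIMIT_MEMLOCK limits in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)

soft, hard = memlock_limits()
print("unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes lockable")
```

If the soft limit is smaller than the model's weights, `mlock` will fail and the weights can still be paged out.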
@ZenDarva commented on GitHub (Apr 28, 2024):
The web UI could specify
`use_mmap: false, use_mlock: true` in the generation call that creates the chat, or make it user-configurable.
I believe this would resolve the issue.
I can't find anywhere where it is already user-configurable.
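The option pair suggested here can be sketched as a small helper. `use_mmap` and `use_mlock` are documented Ollama model options; the helper itself is hypothetical and not Open WebUI code:

```python
# Hypothetical helper (not actual Open WebUI code): merge the memory
# options suggested above into an Ollama generation payload.
def apply_memory_options(payload: dict, use_mlock: bool = True) -> dict:
    """Return a copy of the payload with use_mmap/use_mlock set in options."""
    opts = dict(payload.get("options", {}))  # copy so the caller's dict is untouched
    # Disable mmap and lock the weights so they are read fully into RAM
    # and cannot be paged out.
    opts.update({"use_mmap": False, "use_mlock": use_mlock})
    return {**payload, "options": opts}

request_body = apply_memory_options(
    {"model": "mixtral", "prompt": "hi", "stream": False}
)
```

Making it user-configurable would then just mean wiring `use_mlock` to a UI toggle instead of a constant.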
@tjbck commented on GitHub (Jun 2, 2024):
Added with v0.2.1!