[GH-ISSUE #4955] Ollama should error with insufficient system memory and VRAM #28890

Closed
opened 2026-04-22 07:26:42 -05:00 by GiteaMirror · 10 comments

Originally created by @jmorganca on GitHub (Jun 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4955

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Currently, Ollama will allow loading massive models even on small amounts of VRAM and system memory, leading to paging to disk and eventually errors. It should limit the size of models to avoid errors.
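
As a rough sketch of what such a guard could look like (hypothetical names and types, not Ollama's actual implementation): estimate the model's memory footprint before loading and fail fast when it exceeds free VRAM plus available system memory, so the error surfaces at load time instead of after the system has started paging.

```go
// Illustrative sketch only: memoryInfo and checkFit are hypothetical names,
// not Ollama's actual code.
package main

import (
	"fmt"
)

// memoryInfo holds the resources discovered at load time.
type memoryInfo struct {
	FreeVRAM  uint64 // bytes of free GPU memory
	Available uint64 // bytes of available system RAM
	FreeSwap  uint64 // bytes of free swap
}

// checkFit returns an error up front instead of letting the load proceed
// and eventually OOM or thrash.
func checkFit(required uint64, mem memoryInfo, allowSwap bool) error {
	budget := mem.FreeVRAM + mem.Available
	if allowSwap {
		budget += mem.FreeSwap
	}
	if required > budget {
		return fmt.Errorf("insufficient memory: model requires %d bytes but only %d are available",
			required, budget)
	}
	return nil
}

func main() {
	mem := memoryInfo{FreeVRAM: 8 << 30, Available: 4 << 30, FreeSwap: 2 << 30}
	if err := checkFit(20<<30, mem, false); err != nil {
		fmt.Println("refusing load:", err) // error immediately instead of paging to disk
	}
}
```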

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 07:26:42 -05:00

@huanbd commented on GitHub (Jun 13, 2024):

Hi, I have 24GB of GPU VRAM (RTX 3090). The first time I run aya:35b it works perfectly, but after some amount of time (about 10 minutes), calling the generate API with aya:35b gives me the error unexpected server status: llm server loading model.
I don't know why; please help.


@dhiltgen commented on GitHub (Jun 13, 2024):

This is partially addressed in #4517, although the system memory logic currently kicks in only for concurrency, so a little refactoring will be required to also block a single model load that won't fit.


@denisidoro commented on GitHub (Aug 11, 2024):

Could you please consider adding a flag to optionally skip this check?

Context: I used to run a model just fine on my phone (for my non-prod purposes), but now I'm unable to do so
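
A hypothetical opt-out could be a simple environment-variable gate around the check; the variable name below is purely illustrative, not a real Ollama setting:

```go
// Hypothetical sketch: OLLAMA_SKIP_MEMORY_CHECK is NOT a real Ollama flag,
// just an illustration of what an opt-out could look like.
package main

import (
	"fmt"
	"os"
)

func memoryCheckEnabled() bool {
	// An explicit opt-out for users who accept the risk of OOM kills,
	// e.g. constrained environments like Termux where the reported
	// available memory may understate what the OS would actually grant.
	return os.Getenv("OLLAMA_SKIP_MEMORY_CHECK") == ""
}

func main() {
	if !memoryCheckEnabled() {
		fmt.Println("memory check skipped by user request")
		return
	}
	fmt.Println("running memory check before load")
}
```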


@dhiltgen commented on GitHub (Aug 12, 2024):

@denisidoro can you share more information about your scenario? How much physical memory and how much swap space do you have? Is there a GPU involved, or are you doing CPU-only inference? What model are you trying to load? The goal of this check is only to block model loads that would hit OOM crash scenarios, so if you have a scenario where a model that could load successfully is being blocked, that's not the intent and something we'd like to fix.


@denisidoro commented on GitHub (Aug 13, 2024):

Thanks for the reply

Model: gemma2:2b

Device: a Xiaomi 9T Pro running Termux

Memory: the device has 5.7GB total RAM; the OS tells me I have around 3.5GB free, but when I try to run the model, Ollama says it needs 3.3GB and there's only 2.1GB available (I'm not really sure why there's this gap between what the OS sees and what Termux sees)


@dhiltgen commented on GitHub (Aug 14, 2024):

@denisidoro we use `/proc/meminfo` to discover the available resources on the system.
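
For reference, a minimal Go sketch of reading `MemAvailable` from `/proc/meminfo`, the mechanism described above (a simplified stand-in, not Ollama's actual code):

```go
// Minimal sketch: read MemAvailable from /proc/meminfo on Linux.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// memAvailable returns MemAvailable in bytes. /proc/meminfo reports kB.
func memAvailable() (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		// Lines look like: "MemAvailable:    3822120 kB"
		fields := strings.Fields(s.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("MemAvailable not found in /proc/meminfo")
}

func main() {
	b, err := memAvailable()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("available: %.1f GiB\n", float64(b)/(1<<30))
}
```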


@RecentlyRezzed commented on GitHub (Sep 7, 2024):

@dhiltgen I think the problem with Termux/Android is the different memory management in comparison to Linux.

Since phones didn't have a lot of memory, apps have to save their state, because they will be killed whenever another app needs the memory, and they have to be OK with that. On the other hand, starting apps consumes CPU cycles, which consumes power, so Android wants to keep as many apps in memory as possible.

I think that leads to a situation where OOM-killing is kind of the normal state when running Android. On Android, Ollama should just take the memory it needs; other apps have to move out of the way. And if Ollama itself gets killed, the model was too big.

Maybe someone with more knowledge can give a better answer.


@dhiltgen commented on GitHub (Sep 7, 2024):

@RecentlyRezzed mobile devices aren't officially supported, so the code doesn't currently have any special-case logic for dealing with their memory model. We're tracking mobile support in issue #1006.


@ShayBox commented on GitHub (Oct 25, 2024):

I didn't know this was already a feature, because it has never once shown me the error. Every time I try to load a 72b model, it reserves all my VRAM, RAM, and swap, and then my computer freezes. If I manually kill it before the freeze, my RAM is still in use and does not free until I restart; there's a memory leak in Ollama, and this error message never works.

To be clear, I have 64GB of RAM, and when I kill -9 ollama it does not free the ~50GB that it used; nothing will free it until I restart my computer.

I have since switched to llama.cpp instead of Ollama, because it does not have this issue, and also because Ollama does not have qwen2.5-coder:32b in its library yet.


@dhiltgen commented on GitHub (Oct 30, 2024):

@ShayBox some users want to be able to overflow into swap to load very large models and are OK with their system thrashing to achieve very slow inference, so we consider available swap space as part of our upper bound of model size.

Ollama uses a subprocess for running inference, and by killing just the parent with a -9, we don't have a chance to clean up our child process. If you look for `ollama_llama_server` and kill that process, your system should recover.
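
From a shell, `pkill -9 -f ollama_llama_server` will find and kill the runner by name. Below is a minimal Go sketch of the general Unix pattern for avoiding the orphaned-child problem in the first place: start the subprocess in its own process group and signal the whole group at once. This illustrates the mechanism, not Ollama's actual code.

```go
// General Unix pattern (Linux/macOS): run a child in its own process group
// so parent and child can be killed together with one signal.
package main

import (
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sleep", "60") // stand-in for an inference subprocess
	// Place the child in a new process group whose PGID equals its PID.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// Signaling the negative PGID delivers SIGKILL to every process in the
	// group, child included, rather than just one process.
	_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	_ = cmd.Wait() // reap the child so it doesn't linger as a zombie
}
```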
