[GH-ISSUE #11497] Ollama counts cached memory as used, not allowing models to run even though there is enough memory available. #7594

Closed
opened 2026-04-12 19:40:51 -05:00 by GiteaMirror · 9 comments

Originally created by @LynxesExe on GitHub (Jul 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11497

What is the issue?

Hello,

In my setup I have Ollama 0.9.6 running inside a Docker container, which has access to the entirety of the host system's memory (32 GB). I'm trying to run the deepseek-r1:14b model on the CPU.
Initially the model ran fine, but after a while I started getting an error saying that more system memory was required than was available.

model requires more system memory (10.2 GiB) than is available (2.9 GiB)

However, the server that Ollama is running on has 32 GB of memory, and only about 7 GB are being used by other processes. At the time of checking, roughly 28 GB (the number fluctuates) appeared as used, but only if you also count cached memory.

Essentially, what's portrayed in the image below:

[Image: screenshot of system memory usage (https://github.com/user-attachments/assets/87165880-5173-496d-9973-fd9cb1b14e9a)]

This is greatly limiting: I do not have direct control over what Linux does with cached memory, and it shouldn't matter, since that memory is still available to user space (and therefore to Ollama) anyway.

If the model ran anyway, without the available-RAM check, it would work fine, since well over 10 GB are available to allocate and the kernel would provide them.

I'm assuming there might be an issue with how Ollama checks the amount of free memory, counting cached memory as used even though it is available.

I'm not sure if memory checking can be disabled via some configuration.
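
For reference, here is a generic Linux check (nothing Ollama-specific) that shows the difference between memory reported as "used" and what the kernel can actually make available, since cached pages inflate the "used" figure even though they are reclaimable:

```shell
# Generic Linux check: the "available" column of free(1) and MemAvailable in
# /proc/meminfo already account for cache the kernel can drop on demand.
free -h
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached):' /proc/meminfo
```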

Relevant log output

time=2025-07-22T20:44:06.685Z level=WARN source=types.go:573 msg="invalid option provided" option=mirostat_tau
time=2025-07-22T20:44:06.685Z level=WARN source=types.go:573 msg="invalid option provided" option=mirostat_eta
time=2025-07-22T20:44:06.685Z level=WARN source=types.go:573 msg="invalid option provided" option=tfs_z
time=2025-07-22T20:44:06.685Z level=WARN source=types.go:573 msg="invalid option provided" option=mirostat
time=2025-07-22T20:44:06.771Z level=INFO source=server.go:135 msg="system memory" total="31.1 GiB" free="5.0 GiB" free_swap="0 B"
time=2025-07-22T20:44:06.771Z level=WARN source=server.go:170 msg="model request too large for system" requested="9.8 GiB" available=5397721088 total="31.1 GiB" free="5.0 GiB" swap="0 B"
time=2025-07-22T20:44:06.771Z level=INFO source=sched.go:455 msg="NewLlamaServer failed" model=/root/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e error="model requires more system memory (9.8 GiB) than is available (5.0 GiB)"

time=2025-07-23T21:49:11.236Z level=WARN source=types.go:573 msg="invalid option provided" option=tfs_z
time=2025-07-23T21:49:11.236Z level=WARN source=types.go:573 msg="invalid option provided" option=mirostat
time=2025-07-23T21:49:11.236Z level=WARN source=types.go:573 msg="invalid option provided" option=mirostat_eta
time=2025-07-23T21:49:11.236Z level=WARN source=types.go:573 msg="invalid option provided" option=mirostat_tau
time=2025-07-23T21:49:11.247Z level=INFO source=server.go:135 msg="system memory" total="31.1 GiB" free="20.2 GiB" free_swap="0 B"
time=2025-07-23T21:49:11.247Z level=WARN source=server.go:145 msg="requested context size too large for model" num_ctx=8192 num_parallel=1 n_ctx_train=2048
time=2025-07-23T21:49:11.247Z level=INFO source=server.go:175 msg=offload library=cpu layers.requested=-1 layers.model=13 layers.offload=0 layers.split="" memory.available="[20.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="297.4 MiB" memory.required.partial="0 B" memory.required.kv="6.0 MiB" memory.required.allocations="[297.4 MiB]" memory.weights.total="260.9 MiB" memory.weights.repeating="216.1 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="12.0 MiB" memory.graph.partial="12.0 MiB"

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 19:40:51 -05:00

@LynxesExe commented on GitHub (Jul 22, 2025):

I apologize, I somehow failed to provide the other information that was left as "no response".

OS: (uname -a):

  • Host Linux archnemo 6.12.10-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 18 Jan 2025 02:26:57 +0000 x86_64 GNU/Linux

  • Container Linux 2ff4fddab33a 6.12.10-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 18 Jan 2025 02:26:57 +0000 x86_64 x86_64 x86_64 GNU/Linux

  • GPU: N/A (Intel iGPU is available to the container but not used by Ollama)

  • CPU: 13th Gen Intel(R) Core(TM) i5-13500

  • Ollama Version: ollama version is 0.9.6, Docker container used is ollama/ollama:latest as of 22/07/2025

Thanks in advance


@rick-github commented on GitHub (Jul 23, 2025):

ollama relies on the operating system to tell it how much RAM is available for model loading. Run this:

docker exec -it ollama cat /proc/meminfo | egrep "^(MemAvailable|MemFree|Buffers|Cached):"

If MemAvailable is set, that's what ollama uses as the amount of RAM available for loading models, otherwise it uses the sum of MemFree, Buffers and Cached. If you set OLLAMA_DEBUG=1 in the server environment, ollama will log RAM details.
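
As an illustration of that fallback (a minimal sketch of the logic described above, not Ollama's actual code):

```shell
# Rough sketch of the selection described above: prefer MemAvailable,
# otherwise fall back to MemFree + Buffers + Cached. Values in /proc/meminfo
# are in kB, so dividing by 1048576 gives GiB.
awk '/^MemAvailable:/ {avail=$2}
     /^MemFree:/      {free=$2}
     /^Buffers:/      {buf=$2}
     /^Cached:/       {cache=$2}
     END {
       if (avail) printf "%.1f GiB (MemAvailable)\n", avail/1048576
       else       printf "%.1f GiB (MemFree+Buffers+Cached)\n", (free+buf+cache)/1048576
     }' /proc/meminfo
```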


@LynxesExe commented on GitHub (Jul 23, 2025):

I see, so I'm guessing the logic is: if MemAvailable is set (memory that can actually be allocated), use that; otherwise sum unused memory, buffers and cache and treat that as the available memory. Correct?

That would make sense, but there is still something that doesn't add up: both of those sources indicated enough free memory, yet Ollama refused to run, claiming there wasn't enough, even though I can assure you there was.

I had even checked /proc/meminfo and noticed that MemAvailable was set and reported around 25 GB of available memory, both inside the container and on the host directly, so this would not explain Ollama refusing to start.

I'll play around with Ollama and check again if it happens; at the moment I cannot reproduce the issue.


@rick-github commented on GitHub (Jul 23, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@LynxesExe commented on GitHub (Jul 23, 2025):

Apologies, I have extracted what was logged to the container output and added it above. Unfortunately, the container from the first occurrence has been restarted, but this one had the same issue.

The logs report two calls very close to each other, one that failed and one that executed correctly.

Out of curiosity (and if this is the case, I'm sorry for not thinking of it earlier): does Ollama allocate memory for the entire model every time it gets a request?

For example, if I have a model that requires 9 GB of memory and I make two concurrent calls, do I need 18 GB of memory?
In all honesty, at the time I didn't have any ollama runner threads doing any work (as somewhat visible from the htop core monitors), but I'm asking just out of curiosity.


@rick-github commented on GitHub (Jul 23, 2025):

Memory is allocated at startup for the model weights and for each context buffer specified by OLLAMA_NUM_PARALLEL. So, say you want to handle up to 4 concurrent requests using the default context size; total RAM usage would be len(model_weights) + (OLLAMA_NUM_PARALLEL * OLLAMA_CONTEXT_LENGTH * sizeof(token)) = 9G + (4 * 4096 * 278k) = 9G + 5G = 14G.

The size of a token as stored in the model graph varies between models, 278k is for deepseek-r1:14b.

In the case of handling two concurrent requests, you need 11G. Note that if you don't have any concurrency (OLLAMA_NUM_PARALLEL=1) the model will still handle both queries, just in a serial fashion.
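
Plugging those numbers in as a rough check (the 278 KiB/token figure is the one quoted above for deepseek-r1:14b; this is just an illustration of the formula, not Ollama's accounting code):

```shell
# Back-of-the-envelope check of the estimate above.
awk 'BEGIN {
  weights = 9;              # GiB of model weights
  tok     = 278 * 1024;     # assumed bytes of KV cache per token for deepseek-r1:14b
  ctx     = 4096;           # default context length
  for (n = 1; n <= 4; n++)  # OLLAMA_NUM_PARALLEL = 1..4
    printf "parallel=%d -> %.1f GiB\n", n, weights + n * ctx * tok / 1024^3
}'
```

This gives roughly 11.2 GiB for two parallel slots and about 13.3 GiB for four, in line with the ~11G and ~14G estimates above.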


@LynxesExe commented on GitHub (Jul 24, 2025):

I see.

I have checked, and the env variable is not specified at all in my environment, which according to both the docs and the code I saw means it defaults to 1.

I'm testing now, and the container needs around 10 GB of memory no matter how many queries I throw at it, so I'm not sure where the previous behavior came from.
But I also wouldn't exclude some mistake on my part in how I'm calculating things.

Before wasting more of your time, it's probably best I dive deeper into the memory requirements, which are far more complex than I thought.
Regardless, thank you for your clarifications! I think this issue can be closed, at least for now, unless I manage to replicate it and find out where the high per-request usage comes from.

Have a nice day/weekend!


@LynxesExe commented on GitHub (Jul 26, 2025):

I'm going to close the issue for now as there is no point keeping it open.


@woliver99 commented on GitHub (Nov 16, 2025):

I'm having the same problem right now.
Edit: Found the culprit, it was the ZFS ARC. meminfo counts the ZFS ARC as used RAM. I just limited it to 1 GB and the problem is solved.
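
For reference, capping the ARC on OpenZFS for Linux is typically done through the zfs_arc_max module parameter; a sketch of what "limited it to 1GB" could look like (not taken from the original comment, and the exact tuning depends on the setup):

```shell
# Example of capping the ZFS ARC at 1 GiB on OpenZFS for Linux.
echo 1073741824 | sudo tee /sys/module/zfs/parameters/zfs_arc_max            # apply immediately
echo "options zfs zfs_arc_max=1073741824" | sudo tee /etc/modprobe.d/zfs.conf  # persist across reboots
```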
