[GH-ISSUE #6918] Unreliable free memory resulting in models not running #50889

Open
opened 2026-04-28 17:20:16 -05:00 by GiteaMirror · 30 comments

Originally created by @ddpasa on GitHub (Sep 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6918

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

From what I understand, new versions of ollama compare the expected memory requirements of a model with the amount of free memory seen by ollama, and print an error message if the model's memory requirements are larger. This makes a lot of sense.

However, from what I understand, the free memory reported on Linux is not a very reliable estimate. For the same model on the same machine, I have had cases where ollama ran successfully, and others where it reported insufficient memory.

Is it possible to disable this feature entirely?

OS

Linux

GPU

No response

CPU

No response

Ollama version

latest mainline

GiteaMirror added the feature request and linux labels 2026-04-28 17:20:16 -05:00

@rick-github commented on GitHub (Sep 23, 2024):

Broadly speaking, ollama wants the sum of unallocated RAM and unallocated swap to be more than the memory required for loading the model plus the context space that doesn't fit on the GPU. The [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will show the relevant values. If you are finding that a model loads sometimes and not others, then ollama thinks that your system is close to over-committing RAM and doesn't want to get into a situation where the OOM-killer starts sniping processes. You can check the figures in the logs, and if you find that the data is inconsistent then that should be followed up. You can mitigate the problems with model loading by using a smaller model, setting a smaller context size, or adding swap.
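
For reference, a minimal sketch (in Python, for illustration only; ollama's actual check is in Go and also accounts for GPU offload) of the kind of comparison being described:

```python
#!/usr/bin/env python3
# Illustrative sketch: compare a model's estimated memory requirement
# against MemAvailable + SwapFree from /proc/meminfo, roughly the
# headroom check described above. Not ollama's actual code.

def meminfo_kib(field: str) -> int:
    """Return a /proc/meminfo field (e.g. 'MemAvailable') in KiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

def model_would_fit(required_bytes: int) -> bool:
    headroom = (meminfo_kib("MemAvailable") + meminfo_kib("SwapFree")) * 1024
    return required_bytes <= headroom

if __name__ == "__main__":
    print(model_would_fit(8 * 1024**3))  # would an 8 GiB model fit?
```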


@ddpasa commented on GitHub (Sep 23, 2024):

> If you are finding that a model loads sometimes and not others, then ollama thinks that your system is close to over-committing RAM and doesn't want to get into a situation where the OOM-killer starts sniping processes.

I think this is exactly what is happening.

I have 16GB of RAM on my laptop, and the free memory ollama sees fluctuates between 9.5GB and 13GB. This is a huge range.

This is CPU-only inference, no GPU involved.


@ddpasa commented on GitHub (Sep 23, 2024):

I think the current logic is a safe, conservative choice that works most of the time. However, I know my own system very well, and would like to override the available-memory figure with a larger value to avoid the blocking behaviour.


@dhiltgen commented on GitHub (Sep 25, 2024):

We look at available memory, which should be buffer-cache aware, along with swap free space, to establish a threshold so we can block model loads that exceed it. Can you describe your scenario a bit more? Are you loading different models in rapid fire, where we unload one to make room for the next but still think there isn't room due to stale memory information? For GPUs we wait up to 5s for the VRAM reporting to converge, but we don't currently have code in place to do that for system memory. If that's the scenario you're running into, maybe that enhancement would help address the problem.
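
For what it's worth, a rough sketch of what such a convergence wait could look like for system memory (hypothetical; `read_available` stands in for whatever function reports available bytes, and the timeout/tolerance values are illustrative):

```python
import time

def wait_for_stable_memory(read_available, timeout=5.0, interval=0.5,
                           tolerance=256 * 1024**2):
    # Hypothetical sketch of the enhancement suggested above: poll the
    # available-memory reading until two consecutive samples agree within
    # `tolerance` bytes, or give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    prev = read_available()
    while time.monotonic() < deadline:
        time.sleep(interval)
        cur = read_available()
        if abs(cur - prev) <= tolerance:
            return cur
        prev = cur
    return prev
```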


@ddpasa commented on GitHub (Sep 27, 2024):

> We look at available memory which should be buffer cache aware, along with swap free space to establish a threshold so we can block model loads that exceed that. Can you describe your scenario a bit more? Are you loading different models in rapid fire where we unload one to make room for the next, but still think there isn't room due to stale memory information? For GPUs we wait up to 5s for the VRAM reporting to converge, but we don't currently have code in place to do that for system memory. If that's the scenario you're running into, maybe that enhancement would help address the problem.

I don't think there is a major issue with the current logic, but it does not work well in cases where I really want to push my system and run the largest model I can get away with. It seems a little too conservative.

I have a suspicion that other programs on my laptop using memory are confusing ollama.

A very simple solution is to allow users to override this value with an environment variable. It keeps the default safe behaviour, while allowing us to run large models right around the memory threshold.
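
To make the proposal concrete, a sketch of how such an override could slot in ahead of the existing check (the variable name `OLLAMA_AVAILABLE_MEMORY_OVERRIDE` is made up for illustration; ollama does not implement it):

```python
import os

def effective_available(measured_bytes: int) -> int:
    # Hypothetical: OLLAMA_AVAILABLE_MEMORY_OVERRIDE is not a real ollama
    # variable. If set, trust the user's figure (in bytes) instead of the
    # measured one; otherwise keep the safe default behaviour.
    override = os.environ.get("OLLAMA_AVAILABLE_MEMORY_OVERRIDE")
    return int(override) if override else measured_bytes
```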


@xgdgsc commented on GitHub (Nov 20, 2024):

I also face this issue on an X Elite ARM Windows laptop and an M2 MacBook, both with 32GB RAM. I often have tons of background Edge/VSCode windows that I don't close, which could be moved to swap safely. And I think the Windows swap size is dynamic? So I need an option to manually specify the maximum memory ollama asks from the system, so I could run the models I need more easily. Currently I'm restricted to smaller models by this.


@ddpasa commented on GitHub (Nov 20, 2024):

I think overriding with an environment variable is the cleanest solution.


@rick-github commented on GitHub (Nov 20, 2024):

Does adding swap not solve the problem?


@xgdgsc commented on GitHub (Nov 20, 2024):

https://discussions.apple.com/thread/7417584?answerId=7417584021&sortBy=rank#7417584021 — it seems macOS has no option to add swap manually. So considering that both macOS and Windows manage swap dynamically by default, adding an env variable should work.


@rick-github commented on GitHub (Nov 20, 2024):

```
# Allocate OLLAMA_MEM bytes in perl to force the system to add swap /
# reclaim caches, then exec ollama into the space that frees up.
OLLAMA_MEM=30000000000 perl -e '$x="a" x $ENV{OLLAMA_MEM}; exec ("ollama","run","some-big-model","");'
```

@rick-github commented on GitHub (Nov 20, 2024):

Python is more likely to be installed on a Windows system than perl, so for better cross-platform support:

```python
#!/usr/bin/env python3

# For systems with dynamic swap, alloc a large buffer to force the
# system to add swap, then exec ollama into that free space.
# Install this in a path before the actual ollama binary, and
# adjust `ollama_binary` below to point to the real ollama.
#
# use:  OLLAMA_MEMORY=10000000000 ollama run some-large-model

import os
import platform
import sys

ollama_binary = "ollama"
if platform.system() == "Windows":
  ollama_binary = "ollama.exe"

_ = "a"*int(os.environ.get("OLLAMA_MEMORY",0))
os.execvp(ollama_binary, [ollama_binary]+sys.argv[1:])
```

@xgdgsc commented on GitHub (Nov 21, 2024):

Thanks. Works for me.


@unicorn667 commented on GitHub (Mar 6, 2025):

It did not work for me.


@rick-github commented on GitHub (Mar 6, 2025):

You'll have to be more specific about what's not working.


@thojo0 commented on GitHub (Mar 17, 2026):

I also can't load models even when enough memory is available.

> We look at available memory which should be buffer cache aware, along with swap free space to establish a threshold so we can block model loads that exceed that.

I don't think this is working correctly under Linux (ollama v0.18.0). `free -h`:

```
               total        used        free      shared  buff/cache   available
Mem:            32Gi        37Mi        10Gi        64Ki        21Gi        31Gi
Swap:          512Mi       4.4Mi       507Mi
```

In this case ollama refuses to load models bigger than 10G, but it should be able to load models up to 31G.


@aldem commented on GitHub (Mar 30, 2026):

Just hit a similar issue - it reports that I have "not enough" RAM.

Checking `MemFree` is not reliable (even without containers), because `MemFree` does not reflect available memory, which is reported as `MemAvailable`:

```
MemAvailable %lu (since Linux 3.14)
  An estimate of how much memory is available for starting new applications, without swapping.
```

`MemFree` only shows unused memory; in particular, this means that if most of the RAM is used for caches then it will be low, as in my case:

```
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            64Gi        30Gi       557Mi       4.7Gi        37Gi        33Gi
Swap:             0B          0B          0B
```

I have 33G available, but ollama refuses to run with the message `Error: model requires more system memory (539.6 MiB) than is available (486.9 MiB)`.
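
A quick way to see the gap being described (illustrative snippet; /proc/meminfo values are in kB):

```python
#!/usr/bin/env python3
# Print MemTotal, MemFree and MemAvailable side by side: on a cache-heavy
# box MemFree is tiny while MemAvailable (what new programs can actually
# use) stays large.

with open("/proc/meminfo") as f:
    fields = {line.split(":")[0]: int(line.split()[1]) for line in f}

for key in ("MemTotal", "MemFree", "MemAvailable"):
    print(f"{key:>12}: {fields[key] / 1024**2:6.1f} GiB")
```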


@rick-github commented on GitHub (Mar 30, 2026):

> Checking MemFree is not reliable (even without containers), because MemFree does not reflect available memory, which is reported as MemAvailable:

ollama uses MemAvailable. [Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@aldem commented on GitHub (Mar 30, 2026):

I am not sure it does, because (v0.19.0):

```
time=2026-03-31T01:37:08.450+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="2.4 GiB" free_swap="0 B"
```

This shows free as 2.4 GiB, which matches meminfo:

```
MemTotal:       67108864 kB
MemFree:         2283532 kB
MemAvailable:   34857276 kB
```

Besides, as I mentioned above, it refused to run with 33 GiB available.


@rick-github commented on GitHub (Mar 30, 2026):

https://github.com/ollama/ollama/blob/31f968fe1f0f774fe20ee0c64f749e90d54147fd/discover/cpu_linux.go#L43

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@aldem commented on GitHub (Mar 31, 2026):

I saw the code, but... debugging, OK. Let's see:

```
time=2026-03-31T02:04:14.212+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="131.0 MiB" free_swap="0 B"
...
time=2026-03-31T02:04:14.212+02:00 level=WARN source=server.go:1046 msg="model request too large for system" requested="330.0 MiB" available="131.0 MiB" total="64.0 GiB" free="131.0 MiB" swap="0 B"
time=2026-03-31T02:04:14.212+02:00 level=INFO source=sched.go:511 msg="Load failed" model=/home/ollama/.ollama/models/blobs/sha256-33a8a1b6a1cbba662f292d32bb55f8d109c0e6cb02de2d243a1b70705ea20986 error="model requires more system memory (330.0 MiB) than is available (131.0 MiB)"
```

My cache is full, and meminfo is:

```
$ egrep Mem /proc/meminfo
MemTotal:       67108864 kB
MemFree:          131010 kB
MemAvailable:   35081704 kB
```

Does it look like it actually uses `MemAvailable`?


@rick-github commented on GitHub (Mar 31, 2026):

> Does it look like it actually uses MemAvailable?

Hard to say. If only there were [server logs](https://docs.ollama.com/troubleshooting) to look at.


@aldem commented on GitHub (Mar 31, 2026):

Sorry, which other server logs should I post? Or do you mean that I have to post everything from the log, even parts unrelated to memory?


@rick-github commented on GitHub (Mar 31, 2026):

Server logs contain information about the environment, device selection, model parameters, layer allocation, etc., which may or may not be useful in debugging. Better to provide a full log that may have too much detail than 3 lines of log which may exclude relevant information.


@aldem commented on GitHub (Mar 31, 2026):

OK, the full log: https://gist.github.com/aldem/f3a3313d89fe83f8dbeb05564bcfec87

And a failed attempt:

```
$ ./ollama run ordis/jina-embeddings-v2-base-code "some text here"; egrep Mem /proc/meminfo
Error: model requires more system memory (330.0 MiB) than is available (170.3 MiB)
MemTotal:       67108864 kB
MemFree:          154768 kB
MemAvailable:   35162983 kB
```

While the code should work, it doesn't... the free memory in the log matches `MemFree` (with a slight difference).

I am running it in an LXC container (Proxmox 9, Debian 13) - not sure if this matters.


@rick-github commented on GitHub (Mar 31, 2026):

What does the `egrep` return when it's run inside the LXC container?


@aldem commented on GitHub (Mar 31, 2026):

`ollama serve` and `ollama run ...` with the `egrep` are running in the same container, so it returns the actual container data.


@JordanLoehr commented on GitHub (Apr 7, 2026):

I noticed this too, running ollama on k3s on a Raspberry Pi 5 (8GB) for testing.

I would get:

```
model requires more system memory (7.3 GiB) than is available (2.4 GiB)
```

despite /proc/meminfo showing:

```
MemAvailable: 7434880 kB
```

Looking at https://github.com/ollama/ollama/blob/main/discover/cpu_linux.go though, it's not always just looking at MemAvailable: it also does a pass that checks whether it is in a cgroup (`getCPUMemByCgroups(mem)`) and overrides the MemAvailable value with the cgroup figures if present.

https://github.com/ollama/ollama/blob/8c8f8f3450d39735355fc6cd7f2e436c8aa42ab1/discover/cpu_linux.go#L69-L79

The problem with this is that `/sys/fs/cgroup/memory.current` includes things such as the page cache, which MemAvailable doesn't.

On my system running in k3s this explains the discrepancy:

`/sys/fs/cgroup/memory.max` returns `max`, which causes `getUint64ValueFromFile` to error and keep the total from /proc/meminfo instead (8256640 kB), but `/sys/fs/cgroup/memory.current` returns `5825003520` (bytes, or 5825004 kB).

So 8256640 - 5825004 = 2431636 kB, or about 2.4 GiB, which is what the original error shows as free, even though MemAvailable shows 7.4 GiB free.

To get the cgroup value closer to MemAvailable, you have to subtract all the reclaimable memory from memory.current, most of which you can get from memory.stat.

e.g.:

```
available = total - (memory.current - (memory.stat.anon + memory.stat.kernel + memory.stat.slab_unreclaimable + memory.stat.kernel_stack + memory.stat.pagetables + memory.stat.sec_pagetables + memory.stat.sock + memory.stat.vmalloc))
```

However, this isn't exactly the same as MemAvailable, because that also factors in some of the per-zone low watermark values from /proc/zoneinfo. Or just copy what Kubernetes does (https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#memory-signals) and use `total - (memory.current - memory.stat.inactive_file)` to get close enough; a sketch of that approach follows below.

tl;dr: if you are running in a cgroup (v2), ollama uses the value from memory.current, not MemAvailable from /proc/meminfo, to calculate the free memory, and the former includes the page cache and other reclaimable values.
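
For illustration, a sketch of that Kubernetes-style "close enough" estimate under cgroup v2 (assumes the paths named above; this is not what ollama currently does):

```python
#!/usr/bin/env python3
# Sketch of the working-set estimate described above for cgroup v2:
#   available = total - (memory.current - inactive_file)
# i.e. treat inactive file-cache pages as reclaimable. Illustration only.

def cgroup_available(total_bytes: int) -> int:
    with open("/sys/fs/cgroup/memory.current") as f:
        current = int(f.read())
    inactive_file = 0
    with open("/sys/fs/cgroup/memory.stat") as f:
        for line in f:
            key, value = line.split()
            if key == "inactive_file":
                inactive_file = int(value)
                break
    working_set = current - inactive_file
    return total_bytes - working_set
```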


@aldem commented on GitHub (Apr 7, 2026):

@JordanLoehr Exactly, you have nailed it! 👍


@mirceanis commented on GitHub (Apr 17, 2026):

I'm facing a similar issue.
Running ollama in a container.
On 32GB RAM + 8GB VRAM I can run gemma4:26b-a4b @ 4-bit quantization with a 192k context window.
BUT, as soon as the model is shut down, it won't restart.


@markasoftware-tc commented on GitHub (Apr 17, 2026):

Yep, I believe my PR #13782 fixes this exact issue.
