[GH-ISSUE #7686] Swap Disk Safeguard #51418

Closed
opened 2026-04-28 19:58:14 -05:00 by GiteaMirror · 13 comments

Originally created by @unclemusclez on GitHub (Nov 15, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7686

What is the issue?

If Ollama and a model are bound as a startup process, there is a potential for Ollama to use swap memory on start and cause an incredibly slow system or a system hang.

If you compile Ollama with CPU capabilities, and the GPU driver does not load for some reason or was uninstalled without Ollama's startup process being disabled, AND you have a SWAP DISK that is large enough to hold a model that is bound to a startup process, Ollama will automatically load the model into the swap disk, on CPU.

I think the only remedy would be to physically go to the server and reboot into safe mode, or revert to a previous snapshot. This could be system-breaking.
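
For reference, a quick way to confirm that a sluggish host is actually swapping, assuming you can still reach a shell:

```console
$ free -h         # compare used RAM against used swap
$ swapon --show   # list active swap devices and how full they are
$ vmstat 1 5      # non-zero si/so columns mean pages are moving to/from swap
```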

OS

Linux

GPU

Nvidia, AMD

CPU

Intel, AMD

Ollama version

No response

GiteaMirror added the bug label 2026-04-28 19:58:14 -05:00

@rick-github commented on GitHub (Nov 15, 2024):

More of a sysadmin issue than an ollama issue. There are multiple ways to manage resources on a Linux system, e.g. add swap, or modify the ollama service file and add resource limits:

```
ExecStart=bash -c 'exec prlimit --data=$[500 * 1024 * 1024] /usr/local/bin/ollama serve'
```

Windows systems can use [process-governor](https://github.com/lowleveldesign/process-governor); I'm sure similar utilities are available for macOS.
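
A minimal sketch of applying that override as a systemd drop-in (`sudo systemctl edit ollama`), assuming the service is named `ollama.service`; note that `ExecStart=` must be cleared before it can be replaced:

```
[Service]
ExecStart=
ExecStart=bash -c 'exec prlimit --data=$[500 * 1024 * 1024] /usr/local/bin/ollama serve'
```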


@unclemusclez commented on GitHub (Nov 15, 2024):

> More of a sysadmin issue than an ollama issue. There are multiple ways to manage resources on a Linux system, e.g. add swap, or modify the ollama service file and add resource limits:
>
> ```
> ExecStart=bash -c 'exec prlimit --data=$[500 * 1024 * 1024] /usr/local/bin/ollama serve'
> ```
>
> Windows systems can use [process-governor](https://github.com/lowleveldesign/process-governor); I'm sure similar utilities are available for macOS.

I propose a `--swap` flag to enable swap memory utilization. This isn't the same thing as shared memory. For the overwhelming majority of users, I highly doubt swap memory is being utilized for model hosting.

The reason I came across this was that I had to reinstall the ROCm drivers. The systemctl service I had created triggered the model load. Because I was trying to remove the old driver and install fresh drivers on a rebooted system, the systemctl process launched without being able to use ROCm, causing a serious system hang.

I agree this is a very specific situation, but I was lucky I knew what to look for.

There is no real way to SSH into a system that is running this slowly unless you let the entire model load first.

And yes, I am familiar with this behavior only because of my personal system specifications. If I didn't know that the reason my system was hanging was Ollama trying to load 32GB onto a swap disk with the CPU, I would probably think my machine was melting.


@rick-github commented on GitHub (Nov 15, 2024):

Perhaps [MemorySwapMax](https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html#MemorySwapMax=bytes) might be better for your use case.
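
A minimal sketch of that approach as a systemd drop-in (`sudo systemctl edit ollama`); `MemorySwapMax=0` denies the unit any swap at all (cgroup v2 required), while a byte value caps it instead:

```
[Service]
MemorySwapMax=0
```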


@unclemusclez commented on GitHub (Nov 15, 2024):

> Perhaps [MemorySwapMax](https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html#MemorySwapMax=bytes) might be better for your use case.

I think you are misunderstanding the severity. My machine is working as expected.

If, for any reason, a GPU driver that is linked to the runner becomes unavailable, Ollama will automatically use available swap memory if it is large enough to hold the model (and if dedicated system memory isn't enough). Ollama will load a model with the CPU from your SSD/HDD into **ANOTHER** SSD/HDD. This can potentially be a very long and undesired process.

Ollama, for most purposes, should never utilize swap memory.

For example:

- My GPU is 32GB
- My system memory is 16GB
- My swap memory is 140GB

This is a very unique case, but it exposes the CPU functionality of Ollama when resources are mismanaged. This can happen due to a simple `apt update` given the right conditions.


@rick-github commented on GitHub (Nov 15, 2024):

If you limit the data segment, swap will not grow larger than that size. This will prevent the "very long and undesired process".

```console
$ systemctl cat ollama | grep "^ExecStart"
ExecStart=bash -c 'exec prlimit --data=$[500 * 1024 * 1024] /usr/local/bin/ollama serve'
$ ollama list dolphin-2.7-mixtral-8x7b:f16
NAME                            ID              SIZE     MODIFIED
dolphin-2.7-mixtral-8x7b:f16    f8df7041a6a8    93 GB    8 months ago
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
$ time curl localhost:11343/api/generate -d '{"model":"dolphin-2.7-mixtral-8x7b:f16","options":{"num_gpu":0}}'
{"error":"llama runner process has terminated: error loading model: unable to allocate backend buffer\nllama_load_model_from_file: failed to load model"}
real	0m0.421s
user	0m0.005s
sys	0m0.017s
$ time curl localhost:11343/api/generate -d '{"model":"qwen2.5:14b","options":{"num_gpu":0}}'
{"error":"llama runner process has terminated: error loading model: unable to allocate backend buffer\nllama_load_model_from_file: failed to load model"}
real	0m0.679s
user	0m0.008s
sys	0m0.005s
$ time curl localhost:11343/api/generate -d '{"model":"qwen2.5:14b"}'
{"model":"qwen2.5:14b","created_at":"2024-11-15T16:47:49.760112712Z","response":"","done":true,"done_reason":"load"}
real	1m19.054s
user	0m0.006s
sys	0m0.009s
```

There are solutions to your problem that don't require an extra flag that will very rarely be used.


@unclemusclez commented on GitHub (Nov 15, 2024):

None of these commands would even be possible if the system is halted.


@rick-github commented on GitHub (Nov 15, 2024):

Neither would `ollama serve --no-swap`. That's why this is a sysadmin issue, not an ollama issue.


@unclemusclez commented on GitHub (Nov 15, 2024):

> Neither would `ollama serve --no-swap`. That's why this is a sysadmin issue, not an ollama issue.

No, `--swap` would be to enable swap partitions for use. It should be disabled by default.

On top of this, yes, it would get triggered during a systemctl call; that's how the system gets hung up in the first place. If ollama is run with a systemctl script, it will run essentially exactly how you just stated it wouldn't.

If there is a secondary process in which a model is activated on start, whether by API or by direct client commands/scripts, this will effectively halt the system.


@rick-github commented on GitHub (Nov 15, 2024):

I think the premise that swap should be off by default is where we are not seeing eye-to-eye. Many ollama users like to experiment with models that do not fit in a combination of VRAM and RAM, which is why ollama considers free swap when it's computing memory requirements for loading a model.

In the case where a system absolutely must not use swap, options have been presented. So it then comes down to configuration. In the swap-off-by-default case, anybody who wants to experiment with large models needs to modify their configuration to add `--swap`. In the swap-on-by-default case, anybody who wants to limit swap needs to modify their configuration to add `prlimit`. So it's a sysadmin issue.

If ollama has been configured in the swap-on-by-default case with `prlimit` and a secondary process activates a model on start, the system will not be halted.


@unclemusclez commented on GitHub (Nov 15, 2024):

> I think the premise that swap should be off by default is where we are not seeing eye-to-eye. Many ollama users like to experiment with models that do not fit in a combination of VRAM and RAM, which is why ollama considers free swap when it's computing memory requirements for loading a model.
>
> In the case where a system absolutely must not use swap, options have been presented. So it then comes down to configuration. In the swap-off-by-default case, anybody who wants to experiment with large models needs to modify their configuration to add `--swap`. In the swap-on-by-default case, anybody who wants to limit swap needs to modify their configuration to add `prlimit`. So it's a sysadmin issue.
>
> If ollama has been configured in the swap-on-by-default case with `prlimit` and a secondary process activates a model on start, the system will not be halted.

I'm not married to the idea of a flag; however, I absolutely triggered this scenario. It would be a rare case, but it is definitely possible. I did it.

Luckily I knew what to look for. However, I can imagine someone not realizing what they did, ending up having to reinstall the operating system, and potentially triggering it again because they never understood what the issue was in the first place.

Just to reiterate: if Ollama and the model being loaded are startup processes (with a `systemctl` service script, for example) + **there is not enough system RAM** + **the GPU driver does not load** + **the swap memory is sufficient**, then **the swap memory will be utilized**, potentially **causing a system hang until the model is fully loaded**.

Fully loading the model could take a considerable amount of time depending on model size and swap disk speed. For example, my system has a low-RPM drive loading 27GB. This takes about 20-40 minutes.

Because memory is maxed out during this window, the system will most likely be inaccessible until the model is unloaded from system memory, if the system ever gets the chance to do that. This means you can't SSH into the machine or kill a process. The machine will be too slow for practical use.


@rick-github commented on GitHub (Nov 16, 2024):

I'm not anti-flag; my position is that there are already mechanisms available that can protect a system from this event. If you would like to add additional mechanisms, file a PR and let the developers choose whether to integrate it. The typical configuration method in ollama is via [environment variables](https://github.com/ollama/ollama/blob/d875e99e4639dc07af90b2e3ea0d175e2e692efb/envconfig/config.go#L235).
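
For reference, a minimal sketch of setting such a variable for the service via a systemd drop-in (`sudo systemctl edit ollama`); the existing `OLLAMA_MAX_LOADED_MODELS` variable is used purely to illustrate the pattern:

```
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
```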


@unclemusclez commented on GitHub (May 21, 2025):

I think this actually existed as a flag for llama.cpp:
`--mlock    force system to keep model in RAM rather than swapping or compressing`

I believe if there is not enough memory when using this flag, it just won't start.
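
For anyone testing this with llama.cpp directly, a minimal sketch, assuming a recent build with the `llama-cli` binary on the PATH and a local GGUF file:

```console
$ llama-cli -m ./model.gguf --mlock -p "hello"
```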

best wishes!


@rick-github commented on GitHub (May 21, 2025):

`mlock` is not currently supported. Even when it was, it was advisory - if the server failed to mlock the entirety of the model, it would emit a warning to the log and continue.
