[GH-ISSUE #5464] Ollama fails to work with CUDA after Linux suspend/resume, unlike other CUDA services #29180

Open
opened 2026-04-22 07:52:42 -05:00 by GiteaMirror · 19 comments

Originally created by @bwnjnOEI on GitHub (Jul 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5464

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Every time Linux resumes from suspend, CUDA fails to reload correctly. That general problem is well known and is resolved by reloading the kernel module with sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm. After that, every CUDA-dependent service except Ollama can use CUDA again (e.g., torch.randn((2,2)).cuda(0) works). GPU mode for Ollama can only be restored by restarting the Ollama service: systemctl daemon-reload followed by systemctl restart ollama. I'm not sure whether I've missed something, such as a specific Ollama setting, so I'm reporting this as a bug.
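
For reference, here is the full manual recovery sequence described above, collected in one place (these are the exact commands from this report; run with sudo):

```bash
# Manual recovery after resume, as described above:
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm   # restores CUDA for other services
sudo systemctl daemon-reload                        # Ollama itself still needs a restart
sudo systemctl restart ollama
```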

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.48

GiteaMirror added the nvidia, bug labels 2026-04-22 07:52:42 -05:00

@dhiltgen commented on GitHub (Jul 3, 2024):

Can you share a server log showing the failure after resuming when the GPU doesn't work?
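
(For anyone else hitting this: on a default systemd-based install, one way to capture such a log is with journalctl; standard usage, assuming the service is named ollama:)

```bash
# Capture the recent server log from the systemd journal:
journalctl -u ollama --no-pager --since "10 minutes ago" > ollama-server.log
```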


@bwnjnOEI commented on GitHub (Jul 4, 2024):

> Can you share a server log showing the failure after resuming when the GPU doesn't work?

2024/07/04 22:08:23 routes.go:1064: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11435 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/home/bwnjnoei/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-04T22:08:23.991+08:00 level=INFO source=images.go:730 msg="total blobs: 0"
time=2024-07-04T22:08:23.991+08:00 level=INFO source=images.go:737 msg="total unused blobs removed: 0"
time=2024-07-04T22:08:23.991+08:00 level=INFO source=routes.go:1111 msg="Listening on 127.0.0.1:11435 (version 0.1.48)"
time=2024-07-04T22:08:23.991+08:00 level=WARN source=assets.go:100 msg="unable to cleanup stale tmpdir" path=/tmp/ollama1961662252 error="remove /tmp/ollama1961662252: directory not empty"
time=2024-07-04T22:08:23.991+08:00 level=WARN source=assets.go:81 msg="failed to read ollama.pid" path=/tmp/ollama259058401 error="open /tmp/ollama259058401/ollama.pid: permission denied"
time=2024-07-04T22:08:23.991+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama558219018/runners
time=2024-07-04T22:08:23.991+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-07-04T22:08:23.991+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-07-04T22:08:23.991+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-07-04T22:08:23.991+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60101 file=build/linux/x86_64/rocm_v60101/bin/deps.txt.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60101 file=build/linux/x86_64/rocm_v60101/bin/ollama_llama_server.gz
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/cpu/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/cpu_avx/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/cpu_avx2/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/cuda_v11/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/rocm_v60101/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60101 cpu]"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=sched.go:94 msg="starting llm scheduler"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=gpu.go:205 msg="Detecting GPUs"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=gpu.go:435 msg="Searching for GPU library" name=libcuda.so*
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=gpu.go:454 msg="gpu library search" globs="[/home/bwnjnoei/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-07-04T22:08:25.350+08:00 level=DEBUG source=gpu.go:488 msg="discovered GPU libraries" paths="[/usr/lib/i386-linux-gnu/libcuda.so.550.67 /usr/lib/x86_64-linux-gnu/libcuda.so.550.67]"
library /usr/lib/i386-linux-gnu/libcuda.so.550.67 load err: /usr/lib/i386-linux-gnu/libcuda.so.550.67: wrong ELF class: ELFCLASS32
time=2024-07-04T22:08:25.350+08:00 level=DEBUG source=gpu.go:517 msg="Unable to load nvcuda" library=/usr/lib/i386-linux-gnu/libcuda.so.550.67 error="Unable to load /usr/lib/i386-linux-gnu/libcuda.so.550.67 library to query for Nvidia GPUs: /usr/lib/i386-linux-gnu/libcuda.so.550.67: wrong ELF class: ELFCLASS32"
cuInit err: 999
time=2024-07-04T22:08:25.352+08:00 level=DEBUG source=gpu.go:517 msg="Unable to load nvcuda" library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.67 error="nvcuda init failure: 999"
time=2024-07-04T22:08:25.352+08:00 level=DEBUG source=gpu.go:435 msg="Searching for GPU library" name=libcudart.so*
time=2024-07-04T22:08:25.352+08:00 level=DEBUG source=gpu.go:454 msg="gpu library search" globs="[/home/bwnjnoei/libcudart.so** /tmp/ollama558219018/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2024-07-04T22:08:25.353+08:00 level=DEBUG source=gpu.go:488 msg="discovered GPU libraries" paths="[/tmp/ollama558219018/runners/cuda_v11/libcudart.so.11.0 /usr/lib/x86_64-linux-gnu/libcudart.so.11.5.117]"
cudaSetDevice err: 999
time=2024-07-04T22:08:25.354+08:00 level=DEBUG source=gpu.go:500 msg="Unable to load cudart" library=/tmp/ollama558219018/runners/cuda_v11/libcudart.so.11.0 error="cudart init failure: 999"
cudaSetDevice err: 999
time=2024-07-04T22:08:25.356+08:00 level=DEBUG source=gpu.go:500 msg="Unable to load cudart" library=/usr/lib/x86_64-linux-gnu/libcudart.so.11.5.117 error="cudart init failure: 999"
time=2024-07-04T22:08:25.356+08:00 level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-07-04T22:08:25.356+08:00 level=INFO source=types.go:98 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.5 GiB" available="41.4 GiB"


@tcs-christian-ulrich commented on GitHub (Jul 24, 2024):

Same here, any news?


@marklysze commented on GitHub (Jul 30, 2024):

Same as well :(


@dej4vu commented on GitHub (Aug 7, 2024):

This issue causes my WSL disk usage to grow quickly; eventually there is no free space on the C: drive.
![image](https://github.com/user-attachments/assets/f84d21c3-e04a-48b9-ab98-c6cca1070f0c)
![image](https://github.com/user-attachments/assets/949c8bdc-d39d-4f14-9c93-e64477dac89a)


@dhiltgen commented on GitHub (Aug 8, 2024):

@dej4vu #6171 should fix the tmp cleaning issue.


@dhiltgen commented on GitHub (Sep 5, 2024):

Community contributions to improve our [systemd setup](https://github.com/ollama/ollama/blob/main/scripts/install.sh#L105) to better integrate with suspend/resume would be welcome.


@betz0r commented on GitHub (Dec 15, 2024):

I found a solution: in combination with the newest NVIDIA 550 driver for Linux, add a file /etc/modprobe.d/nvidia-suspend.conf with:
options nvidia NVreg_PreserveVideoMemoryAllocations=1

Now resuming after suspend no longer interrupts Ollama's CUDA access, and it works as intended.
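
A minimal sketch of applying this fix, assuming the file name from this comment; the update-initramfs step comes from later comments in this thread and may not be needed on every distro:

```bash
# Enable VRAM preservation across suspend, then reboot:
echo 'options nvidia NVreg_PreserveVideoMemoryAllocations=1' | \
    sudo tee /etc/modprobe.d/nvidia-suspend.conf
sudo update-initramfs -u   # per later comments: rebuild the initramfs so the option applies
sudo reboot
```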


@jasondunsmore commented on GitHub (Jan 21, 2025):

@betz0r That didn't work for me. What is the full nvidia driver version? I'm running 550.127.05.


@Quantumm2 commented on GitHub (Jan 21, 2025):

@jasondunsmore
On my system, I solved this issue with a solution similar to the one described by @betz0r.

GPU: NVIDIA RTX 4060 Notebook (Driver: 550.120, CUDA: 12.4)
OS: Ubuntu flavor (24.04), KDE Desktop with X11.
Application: Open-WebUI (bundled with Ollama) GPU container using Podman

On my system, the often-suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed: sudo rmmod nvidia_uvm failed with an error that nvidia_uvm is in use, so I had to look for alternatives.

The [Arch Wiki](https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Preserve_video_memory_after_suspend) explains what happens to the GPU when a Linux system suspends. To overcome those drawbacks, it suggests enabling the option to preserve video memory allocations across suspend, and gives hints on how to check whether this option is enabled on a given system. I followed that hint (basically running one command and checking the output), found that it was not enabled on my system, and decided to enable it.

I then followed an [AskUbuntu answer](https://askubuntu.com/a/1503961) which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with the following contents:

options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/tmp

After that, I restarted the system and made sure the modifications persisted (by re-running the checks from the Arch Wiki). I then started the Open-WebUI (bundled with Ollama) container and began a simple chat. With the container running (and also stopped, in further tests), I manually suspended the system, waited about a minute, logged back in, and continued the chat. Previously it would be slow after waking (CPU inference), and the logs showed that Ollama could not detect the GPU. Now it was as fast as in the last active session (which used the GPU after a fresh start or restart), and the Ollama logs confirmed the GPU was being used for inference.

By the time of writing this comment, I had verified that this solution works (at least on my system) by repeatedly putting the system to sleep, both manually and automatically (per the power profile settings), in combination with many restarts. In all my tests, Ollama recognized the GPU after waking from sleep/suspend.
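
The check hinted at above can be done from a shell. This is a sketch assuming the driver exposes its parameters under /proc/driver/nvidia/params and ships the suspend/resume units, as recent NVIDIA driver packages do:

```bash
# Confirm the module option is active (expect "PreserveVideoMemoryAllocations: 1"):
grep PreserveVideoMemoryAllocations /proc/driver/nvidia/params

# The VRAM save/restore across suspend is handled by driver-shipped systemd units:
systemctl status nvidia-suspend.service nvidia-resume.service
```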


@betz0r commented on GitHub (Jan 22, 2025):

> @betz0r That didn't work for me. What is the full nvidia driver version? I'm running 550.127.05.

Nvidia GeForce GTX 1060 6GB on Ubuntu 22.04
NVIDIA-SMI 550.120
Driver Version: 550.120 (this is my installed version, no additional version number after 550.120)
CUDA Version: 12.4

It is indeed still working as intended, and the GPU is always recognized by Ollama even after resume from suspend (suspend to RAM, not suspend to disk).


@wgong commented on GitHub (Jan 23, 2025):

@Quantumm2 Thank you for the tip. I took the approach mentioned in https://askubuntu.com/questions/1228423/how-do-i-fix-cuda-breaking-after-suspend/1503961#1503961

by creating a shell script:

```bash
#!/bin/bash

echo "Stopping ollama"
systemctl stop ollama

echo "Calling daemon reload"
systemctl daemon-reload

echo "Removing nvidia_uvm"
rmmod nvidia_uvm

echo "Loading nvidia_uvm"
modprobe nvidia_uvm

echo "Starting ollama again"
systemctl start ollama
```

Then it works
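
To avoid running the script by hand after every resume, it could be hooked into systemd's sleep path. The sketch below assumes the standard /usr/lib/systemd/system-sleep/ hook directory, where systemd-sleep(8) runs executables with a "pre"/"post" argument around suspend; treat the path and filename as illustrative:

```bash
# Hypothetical hook: restart Ollama and reload nvidia_uvm automatically on resume.
sudo tee /usr/lib/systemd/system-sleep/ollama-resume <<'EOF'
#!/bin/bash
if [ "$1" = "post" ]; then          # "post" = the system is waking up
    systemctl stop ollama
    rmmod nvidia_uvm && modprobe nvidia_uvm
    systemctl start ollama
fi
EOF
sudo chmod +x /usr/lib/systemd/system-sleep/ollama-resume
```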


@zupermann commented on GitHub (Feb 7, 2025):

> @jasondunsmore
> In my system, I solved this issue with the help of a solution which is similar to the one described by @betz0r .
>
> GPU: NVIDIA RTX 4060 Notebook (Driver: 550.120, CUDA:12.4)
> OS: Ubuntu flavor (24.04), KDE Desktop with X11.
> Application: Open-WebUI (bundled with Ollama) GPU Container using Podman
>
> In my system, the often suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed. sudo rmmod nvidia_uvm resulted in an error that the nvidia_uvm is in use. So, I had to search for other alternative solutions.
>
> This Arch Wiki explains background about what happens to GPU when the Linux systems suspend. To overcome those drawbacks, it suggests to enable the option to preserve memory allocations from the last active sessions. It also provides some hints on how one can check whether this option is enabled in their systems. I followed that hint (basically running one command and checking the output), found out that it was not enabled in my system and decided to give it a try by enabling it.
>
> So, I followed an answer from AskUbuntu which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with following contents:
>
> options nvidia NVreg_PreserveVideoMemoryAllocations=1
> options nvidia NVreg_TemporaryFilePath=/tmp
>
> After that, I simply restarted the system and made sure that the above modifications persisted (by once again executing the checks mentioned in the above Arch Wiki). Following that, I started the Open-WebUI (bundled with Ollama) container and started a simple chat. Then, with the container running (and also stopped in further tests), I manually suspended (or put into sleep) the system, waited for about a minute, logged in back and continued the chat. Earlier, it used to be slow (CPU inference) after waking up and logs showed that Ollama could not detect the GPU. Now, it was super fast like in the last active session (which used GPU after fresh start or a restart). Ollama logs also showed that the GPU is being used for inference.
>
> By the time of writing this comment, I had ensured that this solution works (atleast in my system) by repeatedly putting my system into sleep both manually and automatically (as per power profile settings) several times in combination with many restarts. In all my tests, GPU was recognized by Ollama after waking from sleep/suspend.

Creating the conf file did not work for me initially; I also needed to run
sudo update-initramfs -u
before rebooting. Everything works as expected now, thank you @jasondunsmore @betz0r


@betz0r commented on GitHub (Feb 7, 2025):

> > @jasondunsmore
> > In my system, I solved this issue with the help of a solution which is similar to the one described by @betz0r .
> > GPU: NVIDIA RTX 4060 Notebook (Driver: 550.120, CUDA:12.4)
> > OS: Ubuntu flavor (24.04), KDE Desktop with X11.
> > Application: Open-WebUI (bundled with Ollama) GPU Container using Podman
> > In my system, the often suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed. sudo rmmod nvidia_uvm resulted in an error that the nvidia_uvm is in use. So, I had to search for other alternative solutions.
> > This Arch Wiki explains background about what happens to GPU when the Linux systems suspend. To overcome those drawbacks, it suggests to enable the option to preserve memory allocations from the last active sessions. It also provides some hints on how one can check whether this option is enabled in their systems. I followed that hint (basically running one command and checking the output), found out that it was not enabled in my system and decided to give it a try by enabling it.
> > So, I followed an answer from AskUbuntu which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with following contents:
> > options nvidia NVreg_PreserveVideoMemoryAllocations=1
> > options nvidia NVreg_TemporaryFilePath=/tmp
> >
> > After that, I simply restarted the system and made sure that the above modifications persisted (by once again executing the checks mentioned in the above Arch Wiki). Following that, I started the Open-WebUI (bundled with Ollama) container and started a simple chat. Then, with the container running (and also stopped in further tests), I manually suspended (or put into sleep) the system, waited for about a minute, logged in back and continued the chat. Earlier, it used to be slow (CPU inference) after waking up and logs showed that Ollama could not detect the GPU. Now, it was super fast like in the last active session (which used GPU after fresh start or a restart). Ollama logs also showed that the GPU is being used for inference.
> > By the time of writing this comment, I had ensured that this solution works (atleast in my system) by repeatedly putting my system into sleep both manually and automatically (as per power profile settings) several times in combination with many restarts. In all my tests, GPU was recognized by Ollama after waking from sleep/suspend.
>
> Creating the conf file did not work for me initially, but I needed to do a sudo update-initramfs -u before reboot. Everything works as expected, thank you @jasondunsmore @betz0r

Oh, I might have upgraded my kernel close to when I applied that fix via the conf file. That's maybe why it worked instantly for me, without realizing you need to run

sudo update-initramfs -u

Thanks @zupermann


@jasondunsmore commented on GitHub (Feb 7, 2025):

I used this procedure on Debian 12: https://askubuntu.com/a/1309807


@austonpramodh commented on GitHub (May 24, 2025):

> @jasondunsmore In my system, I solved this issue with the help of a solution which is similar to the one described by @betz0r .
>
> GPU: NVIDIA RTX 4060 Notebook (Driver: 550.120, CUDA:12.4) OS: Ubuntu flavor (24.04), KDE Desktop with X11. Application: Open-WebUI (bundled with Ollama) GPU Container using Podman
>
> In my system, the often suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed. sudo rmmod nvidia_uvm resulted in an error that the nvidia_uvm is in use. So, I had to search for other alternative solutions.
>
> This Arch Wiki explains background about what happens to GPU when the Linux systems suspend. To overcome those drawbacks, it suggests to enable the option to preserve memory allocations from the last active sessions. It also provides some hints on how one can check whether this option is enabled in their systems. I followed that hint (basically running one command and checking the output), found out that it was not enabled in my system and decided to give it a try by enabling it.
>
> So, I followed an answer from AskUbuntu which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with following contents:
>
> options nvidia NVreg_PreserveVideoMemoryAllocations=1
> options nvidia NVreg_TemporaryFilePath=/tmp
>
> After that, I simply restarted the system and made sure that the above modifications persisted (by once again executing the checks mentioned in the above Arch Wiki). Following that, I started the Open-WebUI (bundled with Ollama) container and started a simple chat. Then, with the container running (and also stopped in further tests), I manually suspended (or put into sleep) the system, waited for about a minute, logged in back and continued the chat. Earlier, it used to be slow (CPU inference) after waking up and logs showed that Ollama could not detect the GPU. Now, it was super fast like in the last active session (which used GPU after fresh start or a restart). Ollama logs also showed that the GPU is being used for inference.
>
> By the time of writing this comment, I had ensured that this solution works (atleast in my system) by repeatedly putting my system into sleep both manually and automatically (as per power profile settings) several times in combination with many restarts. In all my tests, GPU was recognized by Ollama after waking from sleep/suspend.

This worked for me. I had tried many solutions that didn't work; I tested this one with suspend and wake, and it worked.

I'll keep an eye on it and post here if it still works after a long sleep.

Thanks

[Update]

It looks like it does still fail to run on the GPU after wake-up sometimes, though rarely. But restarting the container works; I don't have to do the modprobe steps. I'll test more; for now I've disabled sleep.


@cesarb commented on GitHub (Jun 20, 2025):

Let me suggest a different approach, which might help with this issue on Linux: taking a systemd sleep delay inhibitor lock (https://systemd.io/INHIBITOR_LOCKS/), and using it to stop idle models which are using the GPU. This might help not only with CUDA, but also with ROCm in case there's not enough system RAM to preserve the VRAM contents (see https://nyanpasu64.gitlab.io/blog/amdgpu-sleep-wake-hang/).

Taking a sleep delay inhibitor lock would give ollama a small amount of time (5 seconds by default, see https://www.freedesktop.org/software/systemd/man/latest/logind.conf.html#InhibitDelayMaxSec=) to finish any running tasks and unload models from the VRAM. The algorithm I'd suggest to implement within "ollama serve" would be as follows:

  1. Before loading any model which uses the GPU, take a sleep delay lock;
  2. Whenever a "PrepareForSleep(true)" signal is received, take a lock which blocks any new chat requests until it's released, and force the keep_alive for all GPU-using models to 0 (unloading them immediately as soon as they're idle);
  3. Once all models which use the GPU are unloaded, release the sleep delay lock;
  4. When the "PrepareForSleep(false)" signal is received (which means a resume or failed suspend), release that lock which stops any new chat requests, letting any queued requests execute.

This algorithm obviously wouldn't help if a model is being used at that moment and takes more than a very few seconds to finish, but other than that, it should ensure that no ollama process is using the GPU when the computer actually suspends (since the lock is released only after all models are no longer using the GPU).

I don't know whether the same approach should also be used to also release models which use only the CPU. It could help by freeing more memory to save anything still on the VRAM, but could also lead to an annoying extra delay on the next chat after resuming, and could be especially annoying when their keep_alive was set to a high number on purpose.
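
Until something like this lands in ollama serve, the unload half of the idea can be roughly approximated from outside the server with a sleep hook that forces keep_alive to 0 for every loaded model before suspend. The sketch below uses the documented /api/ps and /api/generate endpoints plus jq; the hook path, default port, and the overall approach are assumptions, not the proposed in-server implementation (and it shares the same limitation: a request already in flight won't finish in time):

```bash
#!/bin/bash
# Hypothetical /usr/lib/systemd/system-sleep/ollama-unload:
# before suspending, ask the server to unload all running models from VRAM.
if [ "$1" = "pre" ]; then
    for m in $(curl -s http://127.0.0.1:11434/api/ps | jq -r '.models[].name'); do
        # A generate request with keep_alive 0 unloads the model once it goes idle.
        curl -s http://127.0.0.1:11434/api/generate \
             -d "{\"model\":\"$m\",\"keep_alive\":0}" >/dev/null
    done
fi
```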


@wgong commented on GitHub (Jun 21, 2025):

My Ubuntu won't suspend; I have to power off every time, which is an undesirable workaround to the issue discussed here.


@hs-ye commented on GitHub (Jun 28, 2025):

> @jasondunsmore In my system, I solved this issue with the help of a solution which is similar to the one described by @betz0r .
>
> In my system, the often suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed. sudo rmmod nvidia_uvm resulted in an error that the nvidia_uvm is in use. So, I had to search for other alternative solutions.
>
> This Arch Wiki explains background about what happens to GPU when the Linux systems suspend. To overcome those drawbacks, it suggests to enable the option to preserve memory allocations from the last active sessions. It also provides some hints on how one can check whether this option is enabled in their systems. I followed that hint (basically running one command and checking the output), found out that it was not enabled in my system and decided to give it a try by enabling it.
>
> So, I followed an answer from AskUbuntu which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with following contents:
>
> options nvidia NVreg_PreserveVideoMemoryAllocations=1
> options nvidia NVreg_TemporaryFilePath=/tmp

Adding the /etc/modprobe.d/nvidia-power-management.conf file appears to also work for me (OS: Pop!_OS 20.04 LTS x86_64, RTX 3080, Driver Version: 560.35.03).

I think the sudo rmmod nvidia_uvm method fails if you are using the GUI environment directly (since this is my desktop, not a remote machine) because the GPU is in use.

I'm not confident enough in my bash scripting to raise a PR, but it does feel like the install script should create this file when we are on Linux.
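
For concreteness, the install-script change being suggested might look roughly like this. The $SUDO variable follows the install script's own convention; everything here is a hypothetical sketch, not a tested patch:

```bash
# Hypothetical addition to scripts/install.sh: preserve VRAM across suspend
# on NVIDIA systems so CUDA survives resume (see the discussion above).
if lspci 2>/dev/null | grep -qi nvidia; then
    cat <<'EOF' | $SUDO tee /etc/modprobe.d/nvidia-power-management.conf >/dev/null
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/tmp
EOF
    # Rebuild the initramfs where the tool exists (Debian/Ubuntu family).
    command -v update-initramfs >/dev/null 2>&1 && $SUDO update-initramfs -u
fi
```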

Reference: github-starred/ollama#29180