[GH-ISSUE #5464] Ollama fails to work with CUDA after Linux suspend/resume, unlike other CUDA services #29180

Open
opened 2026-04-22 07:52:42 -05:00 by GiteaMirror · 19 comments

Originally created by @bwnjnOEI on GitHub (Jul 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5464

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Every time Linux resumes from suspend, CUDA fails to reload correctly. That general problem is well known and is resolved by reloading the kernel module with sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm. After that, every CUDA-dependent service except Ollama can use CUDA again (e.g., torch.randn((2,2)).cuda(0) works). GPU mode for Ollama can only be restored by restarting the Ollama service: systemctl daemon-reload followed by systemctl restart ollama. I'm not sure whether I've missed something, such as a specific Ollama setting, so I'm reporting this as a bug.
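
For reference, here is the full manual recovery sequence described above, collected in one place (these are the exact commands from this report; run with sudo):

```bash
# Manual recovery after resume, as described above:
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm   # restores CUDA for other services
sudo systemctl daemon-reload                        # Ollama itself still needs a restart
sudo systemctl restart ollama
```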

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.48

GiteaMirror added the nvidia, bug labels 2026-04-22 07:52:42 -05:00

@dhiltgen commented on GitHub (Jul 3, 2024):

Can you share a server log showing the failure after resuming when the GPU doesn't work?
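
(For anyone else hitting this: on a default systemd-based install, one way to capture such a log is with journalctl; standard usage, assuming the service is named ollama:)

```bash
# Capture the recent server log from the systemd journal:
journalctl -u ollama --no-pager --since "10 minutes ago" > ollama-server.log
```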


@bwnjnOEI commented on GitHub (Jul 4, 2024):

> Can you share a server log showing the failure after resuming when the GPU doesn't work?

2024/07/04 22:08:23 routes.go:1064: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11435 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/home/bwnjnoei/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-04T22:08:23.991+08:00 level=INFO source=images.go:730 msg="total blobs: 0"
time=2024-07-04T22:08:23.991+08:00 level=INFO source=images.go:737 msg="total unused blobs removed: 0"
time=2024-07-04T22:08:23.991+08:00 level=INFO source=routes.go:1111 msg="Listening on 127.0.0.1:11435 (version 0.1.48)"
time=2024-07-04T22:08:23.991+08:00 level=WARN source=assets.go:100 msg="unable to cleanup stale tmpdir" path=/tmp/ollama1961662252 error="remove /tmp/ollama1961662252: directory not empty"
time=2024-07-04T22:08:23.991+08:00 level=WARN source=assets.go:81 msg="failed to read ollama.pid" path=/tmp/ollama259058401 error="open /tmp/ollama259058401/ollama.pid: permission denied"
time=2024-07-04T22:08:23.991+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama558219018/runners
time=2024-07-04T22:08:23.991+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-07-04T22:08:23.991+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-07-04T22:08:23.991+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-07-04T22:08:23.991+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60101 file=build/linux/x86_64/rocm_v60101/bin/deps.txt.gz
time=2024-07-04T22:08:23.992+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60101 file=build/linux/x86_64/rocm_v60101/bin/ollama_llama_server.gz
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/cpu/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/cpu_avx/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/cpu_avx2/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/cuda_v11/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama558219018/runners/rocm_v60101/ollama_llama_server
time=2024-07-04T22:08:25.348+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60101 cpu]"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=sched.go:94 msg="starting llm scheduler"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=gpu.go:205 msg="Detecting GPUs"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=gpu.go:435 msg="Searching for GPU library" name=libcuda.so*
time=2024-07-04T22:08:25.348+08:00 level=DEBUG source=gpu.go:454 msg="gpu library search" globs="[/home/bwnjnoei/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-07-04T22:08:25.350+08:00 level=DEBUG source=gpu.go:488 msg="discovered GPU libraries" paths="[/usr/lib/i386-linux-gnu/libcuda.so.550.67 /usr/lib/x86_64-linux-gnu/libcuda.so.550.67]"
library /usr/lib/i386-linux-gnu/libcuda.so.550.67 load err: /usr/lib/i386-linux-gnu/libcuda.so.550.67: wrong ELF class: ELFCLASS32
time=2024-07-04T22:08:25.350+08:00 level=DEBUG source=gpu.go:517 msg="Unable to load nvcuda" library=/usr/lib/i386-linux-gnu/libcuda.so.550.67 error="Unable to load /usr/lib/i386-linux-gnu/libcuda.so.550.67 library to query for Nvidia GPUs: /usr/lib/i386-linux-gnu/libcuda.so.550.67: wrong ELF class: ELFCLASS32"
cuInit err: 999
time=2024-07-04T22:08:25.352+08:00 level=DEBUG source=gpu.go:517 msg="Unable to load nvcuda" library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.67 error="nvcuda init failure: 999"
time=2024-07-04T22:08:25.352+08:00 level=DEBUG source=gpu.go:435 msg="Searching for GPU library" name=libcudart.so*
time=2024-07-04T22:08:25.352+08:00 level=DEBUG source=gpu.go:454 msg="gpu library search" globs="[/home/bwnjnoei/libcudart.so** /tmp/ollama558219018/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2024-07-04T22:08:25.353+08:00 level=DEBUG source=gpu.go:488 msg="discovered GPU libraries" paths="[/tmp/ollama558219018/runners/cuda_v11/libcudart.so.11.0 /usr/lib/x86_64-linux-gnu/libcudart.so.11.5.117]"
cudaSetDevice err: 999
time=2024-07-04T22:08:25.354+08:00 level=DEBUG source=gpu.go:500 msg="Unable to load cudart" library=/tmp/ollama558219018/runners/cuda_v11/libcudart.so.11.0 error="cudart init failure: 999"
cudaSetDevice err: 999
time=2024-07-04T22:08:25.356+08:00 level=DEBUG source=gpu.go:500 msg="Unable to load cudart" library=/usr/lib/x86_64-linux-gnu/libcudart.so.11.5.117 error="cudart init failure: 999"
time=2024-07-04T22:08:25.356+08:00 level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-07-04T22:08:25.356+08:00 level=INFO source=types.go:98 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.5 GiB" available="41.4 GiB"


@tcs-christian-ulrich commented on GitHub (Jul 24, 2024):

Same here, any news?


@marklysze commented on GitHub (Jul 30, 2024):

Same as well :(


@dej4vu commented on GitHub (Aug 7, 2024):

This issue causes my WSL disk usage to grow quickly; eventually there is no free space on the C: drive.
![image](https://github.com/user-attachments/assets/f84d21c3-e04a-48b9-ab98-c6cca1070f0c)
![image](https://github.com/user-attachments/assets/949c8bdc-d39d-4f14-9c93-e64477dac89a)


@dhiltgen commented on GitHub (Aug 8, 2024):

@dej4vu #6171 should fix the tmp cleaning issue.


@dhiltgen commented on GitHub (Sep 5, 2024):

Community contributions to improve our [systemd setup](https://github.com/ollama/ollama/blob/main/scripts/install.sh#L105) to better integrate with suspend/resume would be welcome.


@betz0r commented on GitHub (Dec 15, 2024):

I found a solution: in combination with the newest NVIDIA 550 driver for Linux, add a file /etc/modprobe.d/nvidia-suspend.conf with:
options nvidia NVreg_PreserveVideoMemoryAllocations=1

Now resuming after suspend no longer interrupts Ollama's CUDA access, and it works as intended.
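
A minimal sketch of applying this fix, assuming the file name from this comment; the update-initramfs step comes from later comments in this thread and may not be needed on every distro:

```bash
# Enable VRAM preservation across suspend, then reboot:
echo 'options nvidia NVreg_PreserveVideoMemoryAllocations=1' | \
    sudo tee /etc/modprobe.d/nvidia-suspend.conf
sudo update-initramfs -u   # per later comments: rebuild the initramfs so the option applies
sudo reboot
```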


@jasondunsmore commented on GitHub (Jan 21, 2025):

@betz0r That didn't work for me. What is the full nvidia driver version? I'm running 550.127.05.


@Quantumm2 commented on GitHub (Jan 21, 2025):

@jasondunsmore
On my system, I solved this issue with a solution similar to the one described by @betz0r.

GPU: NVIDIA RTX 4060 Notebook (Driver: 550.120, CUDA: 12.4)
OS: Ubuntu flavor (24.04), KDE Desktop with X11.
Application: Open-WebUI (bundled with Ollama) GPU container using Podman

On my system, the often-suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed: sudo rmmod nvidia_uvm failed with an error that nvidia_uvm is in use, so I had to look for alternatives.

The [Arch Wiki](https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Preserve_video_memory_after_suspend) explains what happens to the GPU when a Linux system suspends. To overcome those drawbacks, it suggests enabling the option to preserve video memory allocations across suspend, and gives hints on how to check whether this option is enabled on a given system. I followed that hint (basically running one command and checking the output), found that it was not enabled on my system, and decided to enable it.

I then followed an [AskUbuntu answer](https://askubuntu.com/a/1503961) which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with the following contents:

options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/tmp

After that, I restarted the system and made sure the modifications persisted (by re-running the checks from the Arch Wiki). I then started the Open-WebUI (bundled with Ollama) container and began a simple chat. With the container running (and also stopped, in further tests), I manually suspended the system, waited about a minute, logged back in, and continued the chat. Previously it would be slow after waking (CPU inference), and the logs showed that Ollama could not detect the GPU. Now it was as fast as in the last active session (which used the GPU after a fresh start or restart), and the Ollama logs confirmed the GPU was being used for inference.

By the time of writing this comment, I had verified that this solution works (at least on my system) by repeatedly putting the system to sleep, both manually and automatically (per the power profile settings), in combination with many restarts. In all my tests, Ollama recognized the GPU after waking from sleep/suspend.
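
The check hinted at above can be done from a shell. This is a sketch assuming the driver exposes its parameters under /proc/driver/nvidia/params and ships the suspend/resume units, as recent NVIDIA driver packages do:

```bash
# Confirm the module option is active (expect "PreserveVideoMemoryAllocations: 1"):
grep PreserveVideoMemoryAllocations /proc/driver/nvidia/params

# The VRAM save/restore across suspend is handled by driver-shipped systemd units:
systemctl status nvidia-suspend.service nvidia-resume.service
```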


@betz0r commented on GitHub (Jan 22, 2025):

> @betz0r That didn't work for me. What is the full nvidia driver version? I'm running 550.127.05.

Nvidia GeForce GTX 1060 6GB on Ubuntu 22.04
NVIDIA-SMI 550.120
Driver Version: 550.120 (this is my installed version, no additional version number after 550.120)
CUDA Version: 12.4

It is indeed still working as intended, and the GPU is always recognized by Ollama even after resume from suspend (suspend to RAM, not suspend to disk).


@wgong commented on GitHub (Jan 23, 2025):

@Quantumm2 Thank you for the tip. I took the approach mentioned in https://askubuntu.com/questions/1228423/how-do-i-fix-cuda-breaking-after-suspend/1503961#1503961

by creating a shell script:

```bash
#!/bin/bash

echo "Stopping ollama"
systemctl stop ollama

echo "Calling daemon reload"
systemctl daemon-reload

echo "Removing nvidia_uvm"
rmmod nvidia_uvm

echo "Loading nvidia_uvm"
modprobe nvidia_uvm

echo "Starting ollama again"
systemctl start ollama
```

Then it works
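
To avoid running the script by hand after every resume, it could be hooked into systemd's sleep path. The sketch below assumes the standard /usr/lib/systemd/system-sleep/ hook directory, where systemd-sleep(8) runs executables with a "pre"/"post" argument around suspend; treat the path and filename as illustrative:

```bash
# Hypothetical hook: restart Ollama and reload nvidia_uvm automatically on resume.
sudo tee /usr/lib/systemd/system-sleep/ollama-resume <<'EOF'
#!/bin/bash
if [ "$1" = "post" ]; then          # "post" = the system is waking up
    systemctl stop ollama
    rmmod nvidia_uvm && modprobe nvidia_uvm
    systemctl start ollama
fi
EOF
sudo chmod +x /usr/lib/systemd/system-sleep/ollama-resume
```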


@zupermann commented on GitHub (Feb 7, 2025):

> @jasondunsmore
> In my system, I solved this issue with the help of a solution which is similar to the one described by @betz0r .
>
> GPU: NVIDIA RTX 4060 Notebook (Driver: 550.120, CUDA:12.4)
> OS: Ubuntu flavor (24.04), KDE Desktop with X11.
> Application: Open-WebUI (bundled with Ollama) GPU Container using Podman
>
> In my system, the often suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed. sudo rmmod nvidia_uvm resulted in an error that the nvidia_uvm is in use. So, I had to search for other alternative solutions.
>
> This Arch Wiki explains background about what happens to GPU when the Linux systems suspend. To overcome those drawbacks, it suggests to enable the option to preserve memory allocations from the last active sessions. It also provides some hints on how one can check whether this option is enabled in their systems. I followed that hint (basically running one command and checking the output), found out that it was not enabled in my system and decided to give it a try by enabling it.
>
> So, I followed an answer from AskUbuntu which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with following contents:
>
> options nvidia NVreg_PreserveVideoMemoryAllocations=1
> options nvidia NVreg_TemporaryFilePath=/tmp
>
> After that, I simply restarted the system and made sure that the above modifications persisted (by once again executing the checks mentioned in the above Arch Wiki). Following that, I started the Open-WebUI (bundled with Ollama) container and started a simple chat. Then, with the container running (and also stopped in further tests), I manually suspended (or put into sleep) the system, waited for about a minute, logged in back and continued the chat. Earlier, it used to be slow (CPU inference) after waking up and logs showed that Ollama could not detect the GPU. Now, it was super fast like in the last active session (which used GPU after fresh start or a restart). Ollama logs also showed that the GPU is being used for inference.
>
> By the time of writing this comment, I had ensured that this solution works (atleast in my system) by repeatedly putting my system into sleep both manually and automatically (as per power profile settings) several times in combination with many restarts. In all my tests, GPU was recognized by Ollama after waking from sleep/suspend.

Creating the conf file did not work for me initially; I also needed to run
sudo update-initramfs -u
before rebooting. Everything works as expected now, thank you @jasondunsmore @betz0r


@betz0r commented on GitHub (Feb 7, 2025):

> > @jasondunsmore
> > In my system, I solved this issue with the help of a solution which is similar to the one described by @betz0r .
> > GPU: NVIDIA RTX 4060 Notebook (Driver: 550.120, CUDA:12.4)
> > OS: Ubuntu flavor (24.04), KDE Desktop with X11.
> > Application: Open-WebUI (bundled with Ollama) GPU Container using Podman
> > In my system, the often suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed. sudo rmmod nvidia_uvm resulted in an error that the nvidia_uvm is in use. So, I had to search for other alternative solutions.
> > This Arch Wiki explains background about what happens to GPU when the Linux systems suspend. To overcome those drawbacks, it suggests to enable the option to preserve memory allocations from the last active sessions. It also provides some hints on how one can check whether this option is enabled in their systems. I followed that hint (basically running one command and checking the output), found out that it was not enabled in my system and decided to give it a try by enabling it.
> > So, I followed an answer from AskUbuntu which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with following contents:
> > options nvidia NVreg_PreserveVideoMemoryAllocations=1
> > options nvidia NVreg_TemporaryFilePath=/tmp
> >
> > After that, I simply restarted the system and made sure that the above modifications persisted (by once again executing the checks mentioned in the above Arch Wiki). Following that, I started the Open-WebUI (bundled with Ollama) container and started a simple chat. Then, with the container running (and also stopped in further tests), I manually suspended (or put into sleep) the system, waited for about a minute, logged in back and continued the chat. Earlier, it used to be slow (CPU inference) after waking up and logs showed that Ollama could not detect the GPU. Now, it was super fast like in the last active session (which used GPU after fresh start or a restart). Ollama logs also showed that the GPU is being used for inference.
> > By the time of writing this comment, I had ensured that this solution works (atleast in my system) by repeatedly putting my system into sleep both manually and automatically (as per power profile settings) several times in combination with many restarts. In all my tests, GPU was recognized by Ollama after waking from sleep/suspend.
>
> Creating the conf file did not work for me initially, but I needed to do a sudo update-initramfs -u before reboot. Everything works as expected, thank you @jasondunsmore @betz0r

Oh, I might have upgraded my kernel close to when I applied that fix via the conf file. That's maybe why it worked instantly for me, without realizing you need to run

sudo update-initramfs -u

Thanks @zupermann


@jasondunsmore commented on GitHub (Feb 7, 2025):

I used this procedure on Debian 12: https://askubuntu.com/a/1309807


@austonpramodh commented on GitHub (May 24, 2025):

> @jasondunsmore In my system, I solved this issue with the help of a solution which is similar to the one described by @betz0r .
>
> GPU: NVIDIA RTX 4060 Notebook (Driver: 550.120, CUDA:12.4) OS: Ubuntu flavor (24.04), KDE Desktop with X11. Application: Open-WebUI (bundled with Ollama) GPU Container using Podman
>
> In my system, the often suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed. sudo rmmod nvidia_uvm resulted in an error that the nvidia_uvm is in use. So, I had to search for other alternative solutions.
>
> This Arch Wiki explains background about what happens to GPU when the Linux systems suspend. To overcome those drawbacks, it suggests to enable the option to preserve memory allocations from the last active sessions. It also provides some hints on how one can check whether this option is enabled in their systems. I followed that hint (basically running one command and checking the output), found out that it was not enabled in my system and decided to give it a try by enabling it.
>
> So, I followed an answer from AskUbuntu which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with following contents:
>
> options nvidia NVreg_PreserveVideoMemoryAllocations=1
> options nvidia NVreg_TemporaryFilePath=/tmp
>
> After that, I simply restarted the system and made sure that the above modifications persisted (by once again executing the checks mentioned in the above Arch Wiki). Following that, I started the Open-WebUI (bundled with Ollama) container and started a simple chat. Then, with the container running (and also stopped in further tests), I manually suspended (or put into sleep) the system, waited for about a minute, logged in back and continued the chat. Earlier, it used to be slow (CPU inference) after waking up and logs showed that Ollama could not detect the GPU. Now, it was super fast like in the last active session (which used GPU after fresh start or a restart). Ollama logs also showed that the GPU is being used for inference.
>
> By the time of writing this comment, I had ensured that this solution works (atleast in my system) by repeatedly putting my system into sleep both manually and automatically (as per power profile settings) several times in combination with many restarts. In all my tests, GPU was recognized by Ollama after waking from sleep/suspend.

This worked for me. I had tried many solutions that didn't work; I tested this one with suspend and wake, and it worked.

I'll keep an eye on it and post here if it still works after a long sleep.

Thanks

[Update]

It looks like it does still fail to run on the GPU after wake-up sometimes, though rarely. But restarting the container works; I don't have to do the modprobe steps. I'll test more; for now I've disabled sleep.


@cesarb commented on GitHub (Jun 20, 2025):

Let me suggest a different approach, which might help with this issue on Linux: taking a systemd sleep delay inhibitor lock (https://systemd.io/INHIBITOR_LOCKS/), and using it to stop idle models which are using the GPU. This might help not only with CUDA, but also with ROCm in case there's not enough system RAM to preserve the VRAM contents (see https://nyanpasu64.gitlab.io/blog/amdgpu-sleep-wake-hang/).

Taking a sleep delay inhibitor lock would give ollama a small amount of time (5 seconds by default, see https://www.freedesktop.org/software/systemd/man/latest/logind.conf.html#InhibitDelayMaxSec=) to finish any running tasks and unload models from the VRAM. The algorithm I'd suggest to implement within "ollama serve" would be as follows:

  1. Before loading any model which uses the GPU, take a sleep delay lock;
  2. Whenever a "PrepareForSleep(true)" signal is received, take a lock which blocks any new chat requests until it's released, and force the keep_alive for all GPU-using models to 0 (unloading them immediately as soon as they're idle);
  3. Once all models which use the GPU are unloaded, release the sleep delay lock;
  4. When the "PrepareForSleep(false)" signal is received (which means a resume or failed suspend), release that lock which stops any new chat requests, letting any queued requests execute.

This algorithm obviously wouldn't help if a model is being used at that moment and takes more than a very few seconds to finish, but other than that, it should ensure that no ollama process is using the GPU when the computer actually suspends (since the lock is released only after all models are no longer using the GPU).

I don't know whether the same approach should also be used to also release models which use only the CPU. It could help by freeing more memory to save anything still on the VRAM, but could also lead to an annoying extra delay on the next chat after resuming, and could be especially annoying when their keep_alive was set to a high number on purpose.
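
Until something like this lands in ollama serve, the unload half of the idea can be roughly approximated from outside the server with a sleep hook that forces keep_alive to 0 for every loaded model before suspend. The sketch below uses the documented /api/ps and /api/generate endpoints plus jq; the hook path, default port, and the overall approach are assumptions, not the proposed in-server implementation (and it shares the same limitation: a request already in flight won't finish in time):

```bash
#!/bin/bash
# Hypothetical /usr/lib/systemd/system-sleep/ollama-unload:
# before suspending, ask the server to unload all running models from VRAM.
if [ "$1" = "pre" ]; then
    for m in $(curl -s http://127.0.0.1:11434/api/ps | jq -r '.models[].name'); do
        # A generate request with keep_alive 0 unloads the model once it goes idle.
        curl -s http://127.0.0.1:11434/api/generate \
             -d "{\"model\":\"$m\",\"keep_alive\":0}" >/dev/null
    done
fi
```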


@wgong commented on GitHub (Jun 21, 2025):

My Ubuntu won't suspend; I have to power off every time, which is an undesirable workaround to the issue discussed here.


@hs-ye commented on GitHub (Jun 28, 2025):

> @jasondunsmore In my system, I solved this issue with the help of a solution which is similar to the one described by @betz0r .
>
> In my system, the often suggested method of executing sudo rmmod nvidia_uvm and sudo modprobe nvidia_uvm did not succeed. sudo rmmod nvidia_uvm resulted in an error that the nvidia_uvm is in use. So, I had to search for other alternative solutions.
>
> This Arch Wiki explains background about what happens to GPU when the Linux systems suspend. To overcome those drawbacks, it suggests to enable the option to preserve memory allocations from the last active sessions. It also provides some hints on how one can check whether this option is enabled in their systems. I followed that hint (basically running one command and checking the output), found out that it was not enabled in my system and decided to give it a try by enabling it.
>
> So, I followed an answer from AskUbuntu which briefly explains how to enable it. It involves creating a file /etc/modprobe.d/nvidia-power-management.conf with following contents:
>
> options nvidia NVreg_PreserveVideoMemoryAllocations=1
> options nvidia NVreg_TemporaryFilePath=/tmp

Adding the /etc/modprobe.d/nvidia-power-management.conf file appears to also work for me (OS: Pop!_OS 20.04 LTS x86_64, RTX 3080, Driver Version: 560.35.03).

I think the sudo rmmod nvidia_uvm method fails if you are using the GUI environment directly (since this is my desktop, not a remote machine) because the GPU is in use.

I'm not confident enough in my bash scripting to raise a PR, but it does feel like the install script should create this file when we are on Linux.
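
For concreteness, the install-script change being suggested might look roughly like this. The $SUDO variable follows the install script's own convention; everything here is a hypothetical sketch, not a tested patch:

```bash
# Hypothetical addition to scripts/install.sh: preserve VRAM across suspend
# on NVIDIA systems so CUDA survives resume (see the discussion above).
if lspci 2>/dev/null | grep -qi nvidia; then
    cat <<'EOF' | $SUDO tee /etc/modprobe.d/nvidia-power-management.conf >/dev/null
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/tmp
EOF
    # Rebuild the initramfs where the tool exists (Debian/Ubuntu family).
    command -v update-initramfs >/dev/null 2>&1 && $SUDO update-initramfs -u
fi
```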

Reference: github-starred/ollama#29180