[GH-ISSUE #8426] Ollama does not use GPU after hibernation #5415

Closed
opened 2026-04-12 16:39:11 -05:00 by GiteaMirror · 11 comments

Originally created by @mighty-services on GitHub (Jan 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8426

What is the issue?

First of all I want to say thanks to all the devs and ops who are contributing to this repo!
I really like your service, and it's given me an easy way to step into the land of AI.

I've set up ollama on my Tuxedo OS laptop, which has a dedicated and an integrated GPU, via the normal script given in the description. It worked right away and much better than I thought it would (since this is a laptop and not a full-fledged GPU workstation). I downloaded several models like llava, phi4, mistral, llama3 etc., and all work well out of the box when I trigger them through OpenWebUI (Docker container).

The only thing I'm seeing is that after waking the laptop up from hibernation, prompts take about 10 times longer. Before, every answer came after a few seconds; after hibernation it takes 3 to 8 minutes.

I checked existing issue reports and found some similar ones:

- [Ollama goes into uninterruptible sleep mode and cannot be shutdown #3489](https://github.com/ollama/ollama/issues/3489)
- [#8023 Ollama is very slow after running for a while](https://github.com/ollama/ollama/issues/8023)

They all seem to relate to my problem, but I still can't find the fix for my issue there. That's why I'm opening a new one.

When I use Ollama right after bootup (before hibernation), the server shows up in the nvidia-smi stats:

Tue Jan 14 20:39:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   51C    P8              3W /   27W |    5683MiB /   8188MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1887      G   /usr/lib/xorg/Xorg                             14MiB |
|    0   N/A  N/A     23141      G   /usr/bin/kwin_wayland                           2MiB |
|    0   N/A  N/A     84756      C   ...rs/cuda_v12_avx/ollama_llama_server       5618MiB |
+-----------------------------------------------------------------------------------------+

After hibernation, the ollama_llama_server entry at the bottom is not showing up when I start a prompt.
I am running ollama version 0.5.4 and my Tuxedo OS is on the latest version:

Distributor ID: Tuxedo
Description:    TUXEDO OS
Release:        24.04
Codename:       noble
Kernel:    6.11.0-108013-tuxedo
RAM: 78 GB
CPU: Intel Core i9-14900HX, 32 cores
Platform: Wayland 
Qt-Version: 6.8.1
KDE Frameworks-Version: 6.8.0

The model is loaded successfully:

NAME              ID              SIZE      PROCESSOR    UNTIL              
mistral:latest    f974a74358d6    6.3 GB    100% GPU     4 minutes from now  

and I used this systemd unit to force Ollama to use GPU 0, which is the dedicated one:

### /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="CUDA_VISIBLE_DEVICES=0"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"

[Install]
WantedBy=default.target
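
For reference, the same environment setting can also be applied as a drop-in instead of editing the installed unit file directly; a minimal sketch, assuming the service is managed by systemd under the name `ollama`:

```
# Hypothetical drop-in: sets CUDA_VISIBLE_DEVICES without touching the installed unit.
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0"
# then save, reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```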

Any thoughts on how to solve my issue? Feel free to link me to existing tips and pages.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.4

GiteaMirror added the bug label 2026-04-12 16:39:11 -05:00

@rick-github commented on GitHub (Jan 15, 2025):

https://github.com/ollama/ollama/blob/main/docs/gpu.md#laptop-suspend-resume
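
The short version of that section (paraphrased; see the link for the authoritative wording) is to reload the UVM kernel module after resume:

```
# Reload the NVIDIA UVM kernel module so CUDA initialization works again after resume.
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
```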


@mighty-services commented on GitHub (Jan 15, 2025):

One thing I noticed today is that the model is loaded in both cases and it says 100% GPU, but it's taking minutes to reply:

$ ollama ps
NAME            ID              SIZE      PROCESSOR    UNTIL              
llava:latest    8dd30f6b0cb1    7.0 GB    100% GPU     4 minutes from now    

The ollama server doesn't show up after hibernation:

Wed Jan 15 07:47:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   49C    P8              9W /   27W |     169MiB /   8188MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1887      G   /usr/lib/xorg/Xorg                             14MiB |
|    0   N/A  N/A     23141      G   /usr/bin/kwin_wayland                           2MiB |
+-----------------------------------------------------------------------------------------+

@rick-github commented on GitHub (Jan 15, 2025):

Ollama has two pieces - the server and the runners. The server identifies that a GPU is available, and fully expects that it can use it. This is the basis of the PROCESSOR column - ollama has calculated that x% (in this case 100) of the model will fit in the GPU. It then launches the runner, whose job it is to actually load the model in the GPU. It's at this point that the runner finds it can't communicate with the GPU, and instead loads the model into system RAM. If you reload the nvidia_uvm module you should find that the runner can once again use the GPU.
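
A quick way to see which of the two situations you're in (a rough sketch, assuming a systemd-managed install; not an official diagnostic):

```
# What the scheduler calculated (PROCESSOR column):
ollama ps
# What is actually resident on the GPU (the runner process should appear here when offload worked):
nvidia-smi
# On a systemd-managed install, the runner's CUDA errors usually end up in the service log:
journalctl -u ollama --since "10 minutes ago" | grep -i cuda
```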


@mighty-services commented on GitHub (Jan 15, 2025):

> If you reload the `nvidia_uvm` module you should find that the runner can once again use the GPU.

Thanks @rick-github for your quick answer. I assume that's what `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm` does in the article you posted before.

I did that and was unable to reload the module because it was in use. I looked up the processes using it with `sudo lsof /dev/nvidia*` and killed the running ollama service.
After that, I was able to reload the module and it showed up in `nvidia-smi`. The model was loaded and working as before the hibernation.

So what's your advice for long-term usage? Should I simply call a script that stops the ollama service, reloads the module and starts the service again, or is there a better way at hand?


@rick-github commented on GitHub (Jan 15, 2025):

It's not ideal, but the script is the only solution at the moment. The ollama team are aware of the issue and will work to resolve it, but in the meantime the workaround is all that is available.
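
If you want the workaround to run automatically on every resume, a system-sleep hook is one option; a rough sketch, assuming a systemd-managed ollama service (the script path and name below are hypothetical):

```
#!/bin/sh
# Hypothetical hook: /usr/lib/systemd/system-sleep/ollama-gpu-reset (must be executable).
# systemd runs scripts in this directory with $1 = "pre" or "post" around suspend/hibernate.
case "$1" in
  post)
    systemctl stop ollama                      # release /dev/nvidia* so the module can unload
    rmmod nvidia_uvm && modprobe nvidia_uvm    # reset the UVM module
    systemctl start ollama                     # bring the server back
    ;;
esac
```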


@mighty-services commented on GitHub (Jan 15, 2025):

No problem! I'm happy for your help and that's totally fine on my side. Thank you for your help!


@SCRIER-org commented on GitHub (Feb 26, 2025):

Same problem with a desktop Mint Linux, running `ollama serve` 0.5.0 from the terminal. Apparently the memory allocation for the GPU gets frozen, or lost, upon Linux system sleep, and if the GPU is full the system can't reuse or reallocate GPU memory the next time ollama tries to run--even from inside the same run process as before sleep--and it defaults to CPU memory, which is about 50x slower but still works. I have to fix this by rebooting the machine every time I sleep it after running ollama. This bug has been around for almost half a year. I'm not going to be able to run a production-level system with the current bug unless I turn off sleep. This one seems to require deep juju on the GPU, so I'm hoping a GPU wizard can please fix this. Loving ollama except for this bug, thank you so much.


@rick-github commented on GitHub (Feb 26, 2025):

Does reloading the nvidia_uvm module not restore access to the GPU?


@SCRIER-org commented on GitHub (Feb 27, 2025):

You want me to...unload an nVidia driver...so as to fix a handshaking problem, without wiping out monitor usage. What could go wrong.

Further research reveals this to be a basic nVidia bug in their CUDA interface, which has been around unfixed since at least 2016.

Regrettably, no, this was not operable.

Sequence:
terminal 1: `ollama serve`
terminal 2: `ollama run qwen2.5-coder:14b` (on a 16GB GPU with heavy web usage)

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.51 MiB
llm_load_tensors: offloading 47 repeating layers to GPU
llm_load_tensors: offloaded 47/49 layers to GPU
llm_load_tensors: CPU buffer size = 8566.04 MiB
llm_load_tensors: CUDA0 buffer size = 7372.87 MiB

Mint Linux Power->Session->Suspend [computer powers down]
[computer powers back up, no screens lost]

terminal 3: sudo rmmod nvidia_uvm
rmmod: ERROR: Module nvidia_uvm is in use
terminals 1 and 2 still running

terminal 2: >>> write a quicksort algorithm
Error: POST predict: Post "http://127.0.0.1:39149/completion": EOF
process crashes to shell prompt $

terminal 1:

...llama_model_load: vocab only - skipping tensors
[ GIN] 2025/02/27 - 15:22:36 | 200 | 29.526492083s | 127.0.0.1 | POST "/api/chat"
CUDA error: unspecified launch failure
current device: 0, in function ggml_backend_cuda_buffer_set_tensor at ggml-cuda.cu:542
cudaStreamSynchronize(((cudaStream_t)0x2))
ggml-cuda.cu:132: CUDA error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
SIGABRT: abort
PC=0x7f3a4029eb1c m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 7 gp=0xc0000fe000 m=0 mp=0x57ba577d0f20 [syscall]:
runtime.cgocall(0x57ba572b4a90, 0xc00007db48)
runtime/cgocall.go:157 +0x4b fp=0xc00007db20 sp=0xc00007dae8 pc=0x57ba570358ab
github.com/ollama/ollama/llama._Cfunc_llama_decode(0x7f39ec006460, {0xf, 0x7f39ec00f980, 0x0, 0x0, 0x7f39ec010190, 0x7f39ec0109a0, 0x7f39ec022c10, 0x7f39efd1a5e0, 0x0, ...})
_cgo_gotypes.go:548 +0x52 fp=0xc00007db48 sp=0xc00007db20 pc=0x57ba57132e32
github.com/ollama/ollama/llama.(*Context).Decode.func1(0x57ba572b04eb?, 0x7f39ec006460?)
github.com/ollama/ollama/llama/llama.go:189 +0xd8 fp=0xc00007dc68 sp=0xc00007db48 pc=0x57ba57135518
github.com/ollama/ollama/llama.(*Context).Decode(0xc00007dd58?, 0x0?)
github.com/ollama/ollama/llama/llama.go:189 +0x13 fp=0xc00007dcb0 sp=0xc00007dc68 pc=0x57ba571353b3...
barfs like this for pages
then, later,
...[GIN] 2025/02/27 - 15:25:05 | 200 | 795.792582ms | 127.0.0.1 | POST "/api/chat"
cuda driver library failed to get device context 999time=2025-02-27T15:30:05.591-05:00 level=WARN source=gpu.go:441 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2025-02-27T15:30:05.843-05:00 level=WARN source=gpu.go:441 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2025-02-27T15:30:06.093-05:00 level=WARN source=gpu.go:441 msg="error looking up nvidia GPU memory" more barf pages...

terminal 3, for good luck:
`sudo modprobe nvidia_uvm` - no reply.

terminal 2, restarting; it runs, but quite slowly, due to CPU usage.

terminal 1, upon term 2 trying to run again:

ggml_cuda_init: failed to initialize CUDA: unknown error
llm_load_tensors: ggml ctx size = 0.25 MiB
llm_load_tensors: offloading 47 repeating layers to GPU
llm_load_tensors: offloaded 47/49 layers to GPU
llm_load_tensors: CPU buffer size = 8566.04 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 384.00 MiB of pinned memory: unknown error
llama_kv_cache_init: CPU KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
ggml_cuda_host_malloc: failed to allocate 0.60 MiB of pinned memory: unknown error
llama_new_context_with_model: CPU output buffer size = 0.60 MiB
ggml_cuda_host_malloc: failed to allocate 307.00 MiB of pinned memory: unknown error
llama_new_context_with_model: CUDA_Host compute buffer size = 307.00 MiB

This is pretty much what it does: "failed to initialize CUDA: unknown error", plus the "failed to allocate ... pinned memory" errors.

Linux Mint 22, Cinnamon 6.2.9, nvidia-driver 560.35.03. Good luck.


@SCRIER-org commented on GitHub (Feb 27, 2025):

One has to take `ollama serve` completely down in order to successfully drop and reload nvidia_uvm with `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm`. At this point it seems to work, so far.

Presumably this loses any context that was held by the server. This is slightly better than rebooting the entire system, but not by much. I wish everyone who uses ollama would lean on nVidia to fix this memory-pin bug.
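
For a server started from a terminal rather than via systemd, the full recovery sequence looks roughly like this (a sketch of the steps above, not an official procedure):

```
# Find anything still pinning the GPU, stop it, then reload the module and restart the server.
sudo lsof /dev/nvidia*                 # list processes holding the device
pkill -f ollama                        # stop the foreground "ollama serve" and its runners
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
ollama serve
```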


@Wladastic commented on GitHub (May 5, 2025):

Or at least do this (I chose sudo on each command so you do not have to enter the password 3 times):
`sudo systemctl stop ollama && sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm && sudo systemctl start ollama`

What is weird, though, is that on my machines this behavior is not consistent.
My Ubuntu 24 machine (RTX 3060 12GB and an AMD APU with ROCm enabled as well) does not have to do this, although the other machines do lose the CUDA capability: the GPU is still recognized but returns different types of exceptions, from the NVIDIA GPU not responding to a random VRAM out-of-memory error even though the GPU is empty.

Reference: github-starred/ollama#5415