[GH-ISSUE #8426] Ollama does not use GPU after hibernation #5415

Closed
opened 2026-04-12 16:39:11 -05:00 by GiteaMirror · 11 comments

Originally created by @mighty-services on GitHub (Jan 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8426

What is the issue?

First of all I want to say thanks to all the devs and ops who are contributing to this repo!
I really like your service, and it's given me an easy way to step into the land of AI.

I've set up ollama on my Tuxedo OS laptop, which has a dedicated and an integrated GPU, via the normal script given in the description. It worked right away and much better than I thought it would (since this is a laptop and not a full-fledged GPU workstation). I downloaded several models like llava, phi4, mistral, llama3 etc., and all work well out of the box when I trigger them through OpenWebUI (Docker container).

The only thing I'm seeing is that after waking the laptop up from hibernation, prompts take about 10 times longer. Before, every answer came after a few seconds; after hibernation it takes 3 to 8 minutes.

I checked existing issue reports and found some similar ones:

- [Ollama goes into uninterruptible sleep mode and cannot be shutdown #3489](https://github.com/ollama/ollama/issues/3489)
- [#8023 Ollama is very slow after running for a while](https://github.com/ollama/ollama/issues/8023)

They all seem to relate to my problem, but I still can't find the fix for my issue there. That's why I'm opening a new one.

When I use Ollama right after bootup (before hibernation), the server shows up in the nvidia-smi stats:

Tue Jan 14 20:39:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   51C    P8              3W /   27W |    5683MiB /   8188MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1887      G   /usr/lib/xorg/Xorg                             14MiB |
|    0   N/A  N/A     23141      G   /usr/bin/kwin_wayland                           2MiB |
|    0   N/A  N/A     84756      C   ...rs/cuda_v12_avx/ollama_llama_server       5618MiB |
+-----------------------------------------------------------------------------------------+

After hibernation, the ollama_llama_server entry at the bottom is not showing up when I start a prompt.
I am running ollama version 0.5.4 and my Tuxedo OS is on the latest version:

Distributor ID: Tuxedo
Description:    TUXEDO OS
Release:        24.04
Codename:       noble
Kernel:    6.11.0-108013-tuxedo
RAM: 78 GB
CPU: Intel Core i9-14900HX, 32 cores
Platform: Wayland 
Qt-Version: 6.8.1
KDE Frameworks-Version: 6.8.0

The model is loaded successfully:

NAME              ID              SIZE      PROCESSOR    UNTIL              
mistral:latest    f974a74358d6    6.3 GB    100% GPU     4 minutes from now  

and I used this systemd unit to force Ollama to use GPU 0, which is the dedicated one:

### /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="CUDA_VISIBLE_DEVICES=0"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"

[Install]
WantedBy=default.target
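
For reference, the same environment setting can also be applied as a drop-in instead of editing the installed unit file directly; a minimal sketch, assuming the service is managed by systemd under the name `ollama`:

```
# Hypothetical drop-in: sets CUDA_VISIBLE_DEVICES without touching the installed unit.
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0"
# then save, reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```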

Any thoughts on how to solve my issue? Feel free to link me to existing tips and pages.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.4

GiteaMirror added the bug label 2026-04-12 16:39:11 -05:00

@rick-github commented on GitHub (Jan 15, 2025):

https://github.com/ollama/ollama/blob/main/docs/gpu.md#laptop-suspend-resume
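
The short version of that section (paraphrased; see the link for the authoritative wording) is to reload the UVM kernel module after resume:

```
# Reload the NVIDIA UVM kernel module so CUDA initialization works again after resume.
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
```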


@mighty-services commented on GitHub (Jan 15, 2025):

One thing I noticed today is that the model is loaded in both cases and it says 100% GPU, but it's taking minutes to reply:

$ ollama ps
NAME            ID              SIZE      PROCESSOR    UNTIL              
llava:latest    8dd30f6b0cb1    7.0 GB    100% GPU     4 minutes from now    

The ollama server doesn't show up after hibernation:

Wed Jan 15 07:47:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   49C    P8              9W /   27W |     169MiB /   8188MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1887      G   /usr/lib/xorg/Xorg                             14MiB |
|    0   N/A  N/A     23141      G   /usr/bin/kwin_wayland                           2MiB |
+-----------------------------------------------------------------------------------------+

@rick-github commented on GitHub (Jan 15, 2025):

Ollama has two pieces - the server and the runners. The server identifies that a GPU is available, and fully expects that it can use it. This is the basis of the PROCESSOR column - ollama has calculated that x% (in this case 100) of the model will fit in the GPU. It then launches the runner, whose job it is to actually load the model in the GPU. It's at this point that the runner finds it can't communicate with the GPU, and instead loads the model into system RAM. If you reload the nvidia_uvm module you should find that the runner can once again use the GPU.
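
A quick way to see which of the two situations you're in (a rough sketch, assuming a systemd-managed install; not an official diagnostic):

```
# What the scheduler calculated (PROCESSOR column):
ollama ps
# What is actually resident on the GPU (the runner process should appear here when offload worked):
nvidia-smi
# On a systemd-managed install, the runner's CUDA errors usually end up in the service log:
journalctl -u ollama --since "10 minutes ago" | grep -i cuda
```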


@mighty-services commented on GitHub (Jan 15, 2025):

> If you reload the `nvidia_uvm` module you should find that the runner can once again use the GPU.

Thanks @rick-github for your quick answer. I assume that's what `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm` does in the article you posted before.

I did that and was unable to reload the module because it was in use. I looked up the processes using it with `sudo lsof /dev/nvidia*` and killed the running ollama service.
After that, I was able to reload the module and it showed up in `nvidia-smi`. The model was loaded and working as before the hibernation.

So what's your advice for long-term usage? Should I simply call a script that stops the ollama service, reloads the module and starts the service again, or is there a better way at hand?


@rick-github commented on GitHub (Jan 15, 2025):

It's not ideal, but the script is the only solution at the moment. The ollama team are aware of the issue and will work to resolve it, but in the meantime the workaround is all that is available.
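
If you want the workaround to run automatically on every resume, a system-sleep hook is one option; a rough sketch, assuming a systemd-managed ollama service (the script path and name below are hypothetical):

```
#!/bin/sh
# Hypothetical hook: /usr/lib/systemd/system-sleep/ollama-gpu-reset (must be executable).
# systemd runs scripts in this directory with $1 = "pre" or "post" around suspend/hibernate.
case "$1" in
  post)
    systemctl stop ollama                      # release /dev/nvidia* so the module can unload
    rmmod nvidia_uvm && modprobe nvidia_uvm    # reset the UVM module
    systemctl start ollama                     # bring the server back
    ;;
esac
```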


@mighty-services commented on GitHub (Jan 15, 2025):

No problem! I'm happy for your help and that's totally fine on my side. Thank you for your help!


@SCRIER-org commented on GitHub (Feb 26, 2025):

Same problem with a desktop Mint Linux, running `ollama serve` 0.5.0 from the terminal. Apparently the memory allocation for the GPU gets frozen, or lost, upon Linux system sleep, and if the GPU is full the system can't reuse or reallocate GPU memory the next time ollama tries to run--even from inside the same run process as before sleep--and it defaults to CPU memory, which is about 50x slower but still works. I have to fix this by rebooting the machine every time I sleep it after running ollama. This bug has been around for almost half a year. I'm not going to be able to run a production-level system with the current bug unless I turn off sleep. This one seems to require deep juju on the GPU, so I'm hoping a GPU wizard can please fix this. Loving ollama except for this bug, thank you so much.


@rick-github commented on GitHub (Feb 26, 2025):

Does reloading the nvidia_uvm module not restore access to the GPU?


@SCRIER-org commented on GitHub (Feb 27, 2025):

You want me to...unload an nVidia driver...so as to fix a handshaking problem, without wiping out monitor usage. What could go wrong.

Further research reveals this to be a basic nVidia bug in their CUDA interface, which has been around unfixed since at least 2016.

Regrettably, no, this was not operable.

Sequence:
terminal 1: `ollama serve`
terminal 2: `ollama run qwen2.5-coder:14b` (on a 16GB GPU with heavy web usage)

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.51 MiB
llm_load_tensors: offloading 47 repeating layers to GPU
llm_load_tensors: offloaded 47/49 layers to GPU
llm_load_tensors: CPU buffer size = 8566.04 MiB
llm_load_tensors: CUDA0 buffer size = 7372.87 MiB

Mint Linux Power->Session->Suspend [computer powers down]
[computer powers back up, no screens lost]

terminal 3: sudo rmmod nvidia_uvm
rmmod: ERROR: Module nvidia_uvm is in use
terminals 1 and 2 still running

terminal 2: >>> write a quicksort algorithm
Error: POST predict: Post "http://127.0.0.1:39149/completion": EOF
process crashes to shell prompt $

terminal 1:

...llama_model_load: vocab only - skipping tensors
[ GIN] 2025/02/27 - 15:22:36 | 200 | 29.526492083s | 127.0.0.1 | POST "/api/chat"
CUDA error: unspecified launch failure
current device: 0, in function ggml_backend_cuda_buffer_set_tensor at ggml-cuda.cu:542
cudaStreamSynchronize(((cudaStream_t)0x2))
ggml-cuda.cu:132: CUDA error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
SIGABRT: abort
PC=0x7f3a4029eb1c m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 7 gp=0xc0000fe000 m=0 mp=0x57ba577d0f20 [syscall]:
runtime.cgocall(0x57ba572b4a90, 0xc00007db48)
runtime/cgocall.go:157 +0x4b fp=0xc00007db20 sp=0xc00007dae8 pc=0x57ba570358ab
github.com/ollama/ollama/llama._Cfunc_llama_decode(0x7f39ec006460, {0xf, 0x7f39ec00f980, 0x0, 0x0, 0x7f39ec010190, 0x7f39ec0109a0, 0x7f39ec022c10, 0x7f39efd1a5e0, 0x0, ...})
_cgo_gotypes.go:548 +0x52 fp=0xc00007db48 sp=0xc00007db20 pc=0x57ba57132e32
github.com/ollama/ollama/llama.(*Context).Decode.func1(0x57ba572b04eb?, 0x7f39ec006460?)
github.com/ollama/ollama/llama/llama.go:189 +0xd8 fp=0xc00007dc68 sp=0xc00007db48 pc=0x57ba57135518
github.com/ollama/ollama/llama.(*Context).Decode(0xc00007dd58?, 0x0?)
github.com/ollama/ollama/llama/llama.go:189 +0x13 fp=0xc00007dcb0 sp=0xc00007dc68 pc=0x57ba571353b3...
barfs like this for pages
then, later,
...[GIN] 2025/02/27 - 15:25:05 | 200 | 795.792582ms | 127.0.0.1 | POST "/api/chat"
cuda driver library failed to get device context 999time=2025-02-27T15:30:05.591-05:00 level=WARN source=gpu.go:441 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2025-02-27T15:30:05.843-05:00 level=WARN source=gpu.go:441 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2025-02-27T15:30:06.093-05:00 level=WARN source=gpu.go:441 msg="error looking up nvidia GPU memory" more barf pages...

terminal 3, for good luck:
`sudo modprobe nvidia_uvm` - no reply.

terminal 2, restarting; it runs, but quite slowly, due to CPU usage.

terminal 1, upon term 2 trying to run again:

ggml_cuda_init: failed to initialize CUDA: unknown error
llm_load_tensors: ggml ctx size = 0.25 MiB
llm_load_tensors: offloading 47 repeating layers to GPU
llm_load_tensors: offloaded 47/49 layers to GPU
llm_load_tensors: CPU buffer size = 8566.04 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 384.00 MiB of pinned memory: unknown error
llama_kv_cache_init: CPU KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
ggml_cuda_host_malloc: failed to allocate 0.60 MiB of pinned memory: unknown error
llama_new_context_with_model: CPU output buffer size = 0.60 MiB
ggml_cuda_host_malloc: failed to allocate 307.00 MiB of pinned memory: unknown error
llama_new_context_with_model: CUDA_Host compute buffer size = 307.00 MiB

This is pretty much what it does: "failed to initialize CUDA: unknown error", plus the "failed to allocate ... pinned memory" errors.

Linux Mint 22, Cinnamon 6.2.9, nvidia-driver 560.35.03. Good luck.


@SCRIER-org commented on GitHub (Feb 27, 2025):

One has to take `ollama serve` completely down in order to successfully drop and reload nvidia_uvm with `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm`. At this point it seems to work, so far.

Presumably this loses any context that was held by the server. This is slightly better than rebooting the entire system, but not by much. I wish everyone who uses ollama would lean on nVidia to fix this memory-pin bug.
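
For a server started from a terminal rather than via systemd, the full recovery sequence looks roughly like this (a sketch of the steps above, not an official procedure):

```
# Find anything still pinning the GPU, stop it, then reload the module and restart the server.
sudo lsof /dev/nvidia*                 # list processes holding the device
pkill -f ollama                        # stop the foreground "ollama serve" and its runners
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
ollama serve
```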


@Wladastic commented on GitHub (May 5, 2025):

Or at least do this (I chose sudo on each command so you do not have to enter the password 3 times):
`sudo systemctl stop ollama && sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm && sudo systemctl start ollama`

What is weird, though, is that on my machines this behavior is not consistent.
My Ubuntu 24 machine (RTX 3060 12GB and an AMD APU with ROCm enabled as well) does not have to do this, although the other machines do lose the CUDA capability: the GPU is still recognized but returns different types of exceptions, from the NVIDIA GPU not responding to a random VRAM out-of-memory error even though the GPU is empty.

Reference: github-starred/ollama#5415