[GH-ISSUE #12674] Since v0.12.4 gpt-oss:20b does not run on GPU (CUDA) #54919

Open
opened 2026-04-29 07:58:40 -05:00 by GiteaMirror · 17 comments

Originally created by @vt-alt on GitHub (Oct 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12674

What is the issue?

Since v0.12.4 and up to v0.12.6, gpt-oss:20b does not run on CUDA (the test is on RTX 4090), while v0.12.3 worked OK.

$ ollama run --verbose gpt-oss:20b hi
Thinking...
We need to respond to "hi". Simple greeting.
...done thinking.

Hello! How can I help you today?

total duration:       7.427786912s
load duration:        2.364431973s
prompt eval count:    68 token(s)
prompt eval duration: 1.315199253s
prompt eval rate:     51.70 tokens/s
eval count:           30 token(s)
eval duration:        3.745768569s
eval rate:            8.01 tokens/s

Previously it was saying:

source=ggml.go:487 msg="offloading 24 repeating layers to GPU"
source=ggml.go:493 msg="offloading output layer to GPU"
source=ggml.go:498 msg="offloaded 25/25 layers to GPU"

Now

source=ggml.go:477 msg="offloading 0 repeating layers to GPU"
source=ggml.go:481 msg="offloading output layer to CPU"
source=ggml.go:488 msg="offloaded 0/25 layers to GPU"
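For reference, on a systemd install the offload lines above can be pulled straight from the service log with a generic journalctl/grep filter (nothing ollama-specific here):

journalctl -u ollama.service --no-pager | grep -E 'offload(ing|ed)'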

Relevant log output

Oct 17 13:25:54 pony ollama[4149997]: [GIN] 2025/10/17 - 13:25:54 | 200 |      28.695µs |       127.0.0.1 | HEAD     "/"
Oct 17 13:25:54 pony ollama[4149997]: [GIN] 2025/10/17 - 13:25:54 | 200 |   77.590598ms |       127.0.0.1 | POST     "/api/show"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.202+03:00 level=INFO source=server.go:216 msg="enabling flash attention"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.203+03:00 level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /var/lib/ollama/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb --port 45589"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.203+03:00 level=INFO source=server.go:675 msg="loading model" "model layers"=25 requested=-1
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.203+03:00 level=INFO source=server.go:681 msg="system memory" total="62.7 GiB" free="49.3 GiB" free_swap="0 B"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.210+03:00 level=INFO source=runner.go:1299 msg="starting ollama engine"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.210+03:00 level=INFO source=runner.go:1335 msg="Server listening on 127.0.0.1:45589"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.215+03:00 level=INFO source=runner.go:1172 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType: NumThreads:10 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.262+03:00 level=INFO source=ggml.go:133 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=459 num_key_values=32
Oct 17 13:25:55 pony ollama[4149997]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Oct 17 13:25:55 pony ollama[4149997]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Oct 17 13:25:55 pony ollama[4149997]: ggml_cuda_init: found 1 CUDA devices:
Oct 17 13:25:55 pony ollama[4149997]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-ff90d373-e63e-427f-2f30-73348e89e4bd
Oct 17 13:25:55 pony ollama[4149997]: load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
Oct 17 13:25:55 pony ollama[4149997]: load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.358+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=520,800 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.362+03:00 level=INFO source=runner.go:1172 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType: NumThreads:10 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=runner.go:1172 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType: NumThreads:10 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=ggml.go:477 msg="offloading 0 repeating layers to GPU"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=ggml.go:481 msg="offloading output layer to CPU"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=ggml.go:488 msg="offloaded 0/25 layers to GPU"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="12.8 GiB"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=device.go:222 msg="kv cache" device=CPU size="3.1 GiB"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="86.8 MiB"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=device.go:238 msg="total memory" size="16.0 GiB"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=server.go:1305 msg="waiting for server to become available" status="llm server loading model"
Oct 17 13:25:57 pony ollama[4149997]: time=2025-10-17T13:25:57.287+03:00 level=INFO source=server.go:1309 msg="llama runner started in 2.08 seconds"
Oct 17 13:26:02 pony ollama[4149997]: [GIN] 2025/10/17 - 13:26:02 | 200 |  7.533357265s |       127.0.0.1 | POST     "/api/generate"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

Manual builds of v0.12.4 through v0.12.6, intended for packaging, but because of the regression they cannot be packaged.

GiteaMirror added the build, nvidia, feature request labels 2026-04-29 07:58:41 -05:00

@sucream commented on GitHub (Oct 17, 2025):

I had the same problem.

When you upgrade Ollama on Linux, you must remove the previous version of Ollama first.

The Ollama team says: "If you are upgrading from a prior version, you MUST remove the old libraries with sudo rm -rf /usr/lib/ollama first" (https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install).
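
For reference, the manual-install upgrade sequence from docs/linux.md looks roughly like this (reproduced from memory of the docs, so treat it as a sketch rather than the exact commands):

sudo rm -rf /usr/lib/ollama                      # remove the previous version's libraries first
curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
sudo tar -C /usr -xzf ollama-linux-amd64.tgz     # unpack the new release into /usr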

@rick-github commented on GitHub (Oct 17, 2025):

sucream is correct, the CUDA library is being loaded from the wrong location.

Oct 17 13:25:55 pony ollama[4149997]: load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so

The recommendation is to use the official install.

@vt-alt commented on GitHub (Oct 17, 2025):

We build it into an rpm package, so all the files are replaced automatically on upgrade. Also, it's built with -DGGML_BACKEND_DIR=%_libexecdir/ollama, so the location is intended.

It consistently worked on CUDA before v0.12.4 (with all the previous versions we compiled; we started building for CUDA at v0.5.13), but since v0.12.4/.5/.6 it uses only the CPU.

We cannot use the official binaries since for our distro (ALT Linux) we only build from sources.
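
For context, a distro-style source build along these lines produces the flat /usr/lib/ollama layout shown later in this thread; apart from the GGML_BACKEND_DIR flag, the commands below are illustrative, not the actual ALT Linux spec:

# hypothetical configure/build/install sketch, not the real rpm spec
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_BACKEND_DIR=/usr/lib/ollama
cmake --build build --parallel
cmake --install build
go build -o ollama .    # the Go server/CLI binary is built separately from the ggml backends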

@rick-github commented on GitHub (Oct 17, 2025):

> Also, it's built with -DGGML_BACKEND_DIR=%_libexecdir/ollama so the location is intended.

Perhaps, but the library usually lives in a cuda_v12 directory. If your build process is not preserving the directory structure, it's no wonder it stopped working.

@vt-alt commented on GitHub (Oct 17, 2025):

Thanks for the suggestion! I will try to investigate OLLAMA_RUNNER_DIR usage.
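
If it helps anyone else, one way to experiment with that variable on a systemd install is an environment override for the service (a generic systemd mechanism; whether and how OLLAMA_RUNNER_DIR is honored by this release is exactly what I still need to verify):

sudo systemctl edit ollama.service
# in the drop-in, add for example:
#   [Service]
#   Environment="OLLAMA_RUNNER_DIR=/usr/lib/ollama"
sudo systemctl restart ollama.service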

@kiliansinger commented on GitHub (Oct 30, 2025):

I had similar problems with other model and could fix it with this PR: https://github.com/ollama/ollama/pull/12856

@vt-alt commented on GitHub (Oct 31, 2025):

Well, in our case this is not a crash; ollama is simply unable to calculate the VRAM size. From the log you can see that ggml finds the CUDA libs (see the load_backend lines); also, moving them into a "cuda_v12" dir does not help, and that is only a build option anyway, we don't move files after cmake installs them. This behavior was introduced in the huge commit https://github.com/ollama/ollama/commit/bc8909fb38525c89dda842d4ecfc86a933089a99.

I tried strace and there are no obvious errors (the cuda and nvidia files are loaded). Perhaps the difference from the mainline build is slight, but I'm unable to pin it down yet.

Everything builds and installs the same as before this commit, so other downstream builders may not even notice the regression.

@rick-github commented on GitHub (Oct 31, 2025):

What's the output of the following commands:

command -v ollama
find /usr/lib/ollama
find $(dirname $(dirname $(command -v ollama)))/lib/ollama

@vt-alt commented on GitHub (Oct 31, 2025):

First I want to note that, since this is for a distribution, we package into standard (LSB) directories; the quality check will not accept a package that puts ELF objects in the wrong trees. But I can perfectly well package libggml-cuda.so into a cuda_v12 subdir if needed, ❶ I only removed the infix since we don't have CUDA versions other than v12. Also, ❷ since we have CUDA v12 in the repository we don't need to bundle the CUDA libs like you do; the correct CUDA libs are in the system and are maintained correctly and automatically by the package system.

Therefore, native building for a particular distribution is a different thing from universal builds for any distribution, and it is not incorrect just because it's somehow different. And we can avoid packaging a lot of redundant files (avoiding a ~2 GB package).

小马:~$ command -v ollama
/usr/bin/ollama
小马:~$ find /usr/lib/ollama
/usr/lib/ollama
/usr/lib/ollama/libggml-cpu-icelake.so
/usr/lib/ollama/libggml-cpu-sandybridge.so
/usr/lib/ollama/libggml-cpu-alderlake.so
/usr/lib/ollama/libggml-cpu-haswell.so
/usr/lib/ollama/libggml-base.so
/usr/lib/ollama/libggml-cuda.so
/usr/lib/ollama/libggml-cpu-sse42.so
/usr/lib/ollama/libggml-cpu-skylakex.so
/usr/lib/ollama/libggml-cpu-x64.so
小马:~$ find $(dirname $(dirname $(command -v ollama)))/lib/ollama
/usr/lib/ollama
/usr/lib/ollama/libggml-cpu-icelake.so
/usr/lib/ollama/libggml-cpu-sandybridge.so
/usr/lib/ollama/libggml-cpu-alderlake.so
/usr/lib/ollama/libggml-cpu-haswell.so
/usr/lib/ollama/libggml-base.so
/usr/lib/ollama/libggml-cuda.so
/usr/lib/ollama/libggml-cpu-sse42.so
/usr/lib/ollama/libggml-cpu-skylakex.so
/usr/lib/ollama/libggml-cpu-x64.so

But note that the cuda module is correctly linked against the system libraries and libggml-base.so:

小马:~$ ldd /usr/lib/ollama/libggml-cuda.so
        linux-vdso.so.1 (0x00007f2ea65c1000)
        libggml-base.so => /usr/lib/ollama/libggml-base.so (0x00007f2ea653c000)
        libcudart.so.12 => /lib64/libcudart.so.12 (0x00007f2ea1000000)
        libcublas.so.12 => /lib64/libcublas.so.12 (0x00007f2e99e00000)
        libcuda.so.1 => /lib64/libcuda.so.1 (0x00007f2e94000000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f2e93c00000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f2ea6436000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f2ea13d2000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f2e93a05000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2ea65c3000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f2ea6431000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f2ea642c000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f2ea6427000)
        libcublasLt.so.12 => /lib64/libcublasLt.so.12 (0x00007f2e61600000)

An unsynchronized upgrade of the libraries or binaries is impossible due to strict package dependencies. It is installed against exactly the same packages (i.e. libs) it was built for.

(The same build process worked well for v0.12.3, and cuda acceleration was working.)

As an experiment I created correct symlinks:

/usr/lib/ollama# mkdir cuda_v12
/usr/lib/ollama# cd cuda_v12
/usr/lib/ollama/cuda_v12# ln -s /lib64/libcudart.so.12
/usr/lib/ollama/cuda_v12# ln -s /lib64/libcublas.so.12
/usr/lib/ollama/cuda_v12# ln -s /lib64/libcublasLt.so.12
/usr/lib/ollama/cuda_v12# ln -s /usr/lib/ollama/libggml-cuda.so

After restarting ollama.service it still didn't find the VRAM and ran only on the CPU.

P.S. I noticed in the strace logs that some logging goes to /dev/null, and will try to enable it.
441676 write(2</dev/null>, "ggml_backend_cuda_device_get_memory utilizing NVML memory reporting free: 25174212608 total: 25757220864\n", 105) = 105
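
A generic way to capture that kind of output is to attach strace to the running runner process; the filter patterns here are just an example:

sudo strace -f -e trace=write,openat -p "$(pgrep -of 'ollama runner')" 2>&1 | grep -Ei 'ggml|cuda|nvml'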

@rick-github commented on GitHub (Oct 31, 2025):

> incorrect just because it's somehow different

It's incorrect because it doesn't work, not because it's different. Different is fine. Arch, for example, has a different build and it works fine. We need to figure out why yours doesn't. What's the simplest way to do an ALT Linux install in a VM? Iso? Or docker image?

@vt-alt commented on GitHub (Oct 31, 2025):

Yeah, I don't blame the failure on you; I just want to make it work again, and maybe get some suggestions or debugging hints. This may be useful for other builders too.

We haven't committed the regressed package into the repository, it's still in the testing queue, so I will prepare test install instructions for Docker. Thanks.

It's useful to know that Arch has working builds; I will investigate how they do it (https://gitlab.archlinux.org/archlinux/packaging/packages/ollama/-/blob/main/PKGBUILD), thanks again.

@kiliansinger commented on GitHub (Oct 31, 2025):

I posted a PR that might fix your issue which started with v0.12.3 in my case: https://github.com/ollama/ollama/pull/12856

@rick-github commented on GitHub (Oct 31, 2025):

Would you please stop spamming issues with your PR. It has no relevance to this one or the others you posted it to.

@vt-alt commented on GitHub (Oct 31, 2025):

First, I tried to reproduce the success of Arch in Docker, but it seems the same problem is there too. My steps for Arch:

$ docker run --gpus=all --rm -it archlinux
[root@90be03c146dd /]# pacman -Syu nvidia nvidia-utils
[root@90be03c146dd /]# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
|  0%   49C    P8              3W /  450W |      72MiB /  24564MiB |      6%      Default |
...
[root@90be03c146dd /]# pacman -S ollama
[root@90be03c146dd /]# ollama serve &
[root@90be03c146dd /]# ollama --version
ollama version is 0.12.7
[root@90be03c146dd /]# ollama pull gpt-oss:20b
[root@90be03c146dd /]# ollama run --verbose gpt-oss:20b hi
...
time=2025-10-31T17:08:34.064Z level=INFO source=ggml.go:493 msg="offloaded 0/25 layers to GPU"
...
total duration:       11.543593045s
load duration:        163.15578ms
prompt eval count:    68 token(s)
prompt eval duration: 252.647953ms
prompt eval rate:     269.15 tokens/s
eval count:           41 token(s)
eval duration:        11.104702456s
eval rate:            3.69 tokens/s

But there are no similar bug reports in the Arch bug tracker for ollama:
https://gitlab.archlinux.org/archlinux/packaging/packages/ollama/-/issues?sort=created_date&state=all
(There is only a report that it stopped working on CUDA v13 for the very old NVIDIA Pascal architecture. But we have CUDA v12 and the GPU is a 4090, so that is a different thing.) I wonder whether it works on GPU for other people.

I then tested on Arch with ollama installed from the Archive (dated 2025/09/27) to get version 0.12.3 (the last one that worked on GPU for us), in order to validate my gpu+docker setup, and it runs on the GPU! JFYI, my steps:

[root@90be03c146dd /]# pacman -R ollama-cuda
[root@90be03c146dd /]# pacman -R ollama
[root@90be03c146dd /]# vim /etc/pacman.d/mirrorlist
Server = https://archive.archlinux.org/repos/2025/09/27/$repo/os/$arch
[root@90be03c146dd /]# pacman -Syyu
[root@90be03c146dd /]# pacman -S ollama-cuda
[root@90be03c146dd /]# ollama serve &
[root@90be03c146dd /]# ollama --version
ollama version is 0.12.3
[root@90be03c146dd /]# ollama run --verbose gpt-oss:20b hi
time=2025-10-31T17:48:52.210Z level=INFO source=ggml.go:498 msg="offloaded 25/25 layers to GPU"
total duration:       2.711742531s
load duration:        2.294663797s
prompt eval count:    68 token(s)
prompt eval duration: 159.899295ms
prompt eval rate:     425.27 tokens/s
eval count:           40 token(s)
eval duration:        256.697256ms
eval rate:            155.83 tokens/s

So with the same box, GPU, and docker run, ollama can run on the GPU and then not, even with a supposedly correct Arch build; it becomes even more puzzling to me what is different.

@vt-alt commented on GitHub (Oct 31, 2025):

JFYI, this is unrelated, but since it now runs on the CPU for me, I noticed a roughly 50% performance degradation in v0.12.7 (built by me in the same way as v0.12.6). The test is two consecutive runs of ollama run --verbose gpt-oss:20b hi on an i9-10900:

ollama-0.12.6

  • 1st run:
    total duration: 9.131255307s
    load duration: 3.185033924s
    prompt eval count: 68 token(s)
    prompt eval duration: 1.481592551s
    prompt eval rate: 45.90 tokens/s
    eval count: 33 token(s)
    eval duration: 4.413564763s
    eval rate: 7.48 tokens/s
  • 2nd run:
    total duration: 5.732011048s
    load duration: 166.657341ms
    prompt eval count: 68 token(s)
    prompt eval duration: 138.470444ms
    prompt eval rate: 491.08 tokens/s
    eval count: 40 token(s)
    eval duration: 5.41032075s
    eval rate: 7.39 tokens/s

ollama-0.12.7

  • 1st run:
    total duration: 16.626052217s
    load duration: 184.671881ms
    prompt eval count: 68 token(s)
    prompt eval duration: 286.664223ms
    prompt eval rate: 237.21 tokens/s
    eval count: 48 token(s)
    eval duration: 16.132840691s
    eval rate: 2.98 tokens/s
  • 2nd run:
    total duration: 10.061348294s
    load duration: 195.0805ms
    prompt eval count: 68 token(s)
    prompt eval duration: 288.919604ms
    prompt eval rate: 235.36 tokens/s
    eval count: 34 token(s)
    eval duration: 9.56369189s
    eval rate: 3.56 tokens/s

I can create another bug report if you wish.

@rick-github commented on GitHub (Oct 31, 2025):

The CPU-only performance decrease is likely https://github.com/ollama/ollama/issues/12886.

@vt-alt commented on GitHub (Nov 2, 2025):

We figured out the problem with our build. Since we have a PTX JIT compiler, we compile for only 2 virtual CUDA architectures (-DCMAKE_CUDA_ARCHITECTURES='52-virtual;80-virtual') to save space. This makes nvcc define __CUDA_ARCH_LIST__ as 520,800. The RTX 4090 we use for tests is compute capability 8.9; as a consequence, this ollama-specific code triggers an assertion failure, because it looks for an exact match of 890 in __CUDA_ARCH_LIST__:

#ifdef __CUDA_ARCH_LIST__
        if (std::getenv("GGML_CUDA_INIT") != NULL) {
            GGML_ASSERT(ggml_cuda_has_arch(info.devices[id].cc) && "ggml was not compiled with support for this arch");
        }
#endif // defined(__CUDA_ARCH_LIST__)

I'd suggest this code be reworked (or removed?); for example, the ggml_cuda_has_arch() call could be replaced with ggml_cuda_highest_compiled_arch() > 0.

AFAIK, an exact architecture-code comparison does not fully reflect the CUDA compatibility model even for real architectures, because CUDA has some backward compatibility within a major compute capability. So sm_80 cubin code should be runnable on an sm_89 GPU (even though slower), but not vice versa. PTX (which you compile too) provides even greater "forward" compatibility, at the small cost of a first-time startup delay (seconds, to compile the PTX to native code). You can see more details here: https://docs.nvidia.com/cuda/ampere-compatibility-guide/
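
As a downstream workaround until the check is relaxed (my assumption, not something verified upstream), adding the tested card's real architecture to the compile list should make the exact-match lookup succeed, at the cost of a somewhat larger binary:

# hypothetical rebuild that also includes the RTX 4090's compute capability 8.9
cmake -B build -DCMAKE_CUDA_ARCHITECTURES='52-virtual;80-virtual;89-real'
cmake --build build --parallel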

What helped was setting OLLAMA_DEBUG=2; with that, the error message appears: ollama[991926]: /usr/src/RPM/BUILD/ollama-0.12.7/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:329: GGML_ASSERT(ggml_cuda_has_arch(info.devices[id].cc) && "ggml was not compiled with support for this arch") failed, and we can see that the runner is aborted; from there it was a matter of reading the code.

Reference: github-starred/ollama#54919