[GH-ISSUE #12674] Since v0.12.4 gpt-oss:20b does not run on GPU (CUDA) #54919

Open
opened 2026-04-29 07:58:40 -05:00 by GiteaMirror · 17 comments

Originally created by @vt-alt on GitHub (Oct 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12674

What is the issue?

Since v0.12.4 and up to v0.12.6, gpt-oss:20b does not run on CUDA (the test is on RTX 4090), while v0.12.3 worked OK.

$ ollama run --verbose gpt-oss:20b hi
Thinking...
We need to respond to "hi". Simple greeting.
...done thinking.

Hello! How can I help you today?

total duration:       7.427786912s
load duration:        2.364431973s
prompt eval count:    68 token(s)
prompt eval duration: 1.315199253s
prompt eval rate:     51.70 tokens/s
eval count:           30 token(s)
eval duration:        3.745768569s
eval rate:            8.01 tokens/s

Previously it was saying:

source=ggml.go:487 msg="offloading 24 repeating layers to GPU"
source=ggml.go:493 msg="offloading output layer to GPU"
source=ggml.go:498 msg="offloaded 25/25 layers to GPU"

Now

source=ggml.go:477 msg="offloading 0 repeating layers to GPU"
source=ggml.go:481 msg="offloading output layer to CPU"
source=ggml.go:488 msg="offloaded 0/25 layers to GPU"
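For reference, on a systemd install the offload lines above can be pulled straight from the service log with a generic journalctl/grep filter (nothing ollama-specific here):

journalctl -u ollama.service --no-pager | grep -E 'offload(ing|ed)'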

Relevant log output

Oct 17 13:25:54 pony ollama[4149997]: [GIN] 2025/10/17 - 13:25:54 | 200 |      28.695µs |       127.0.0.1 | HEAD     "/"
Oct 17 13:25:54 pony ollama[4149997]: [GIN] 2025/10/17 - 13:25:54 | 200 |   77.590598ms |       127.0.0.1 | POST     "/api/show"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.202+03:00 level=INFO source=server.go:216 msg="enabling flash attention"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.203+03:00 level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /var/lib/ollama/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb --port 45589"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.203+03:00 level=INFO source=server.go:675 msg="loading model" "model layers"=25 requested=-1
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.203+03:00 level=INFO source=server.go:681 msg="system memory" total="62.7 GiB" free="49.3 GiB" free_swap="0 B"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.210+03:00 level=INFO source=runner.go:1299 msg="starting ollama engine"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.210+03:00 level=INFO source=runner.go:1335 msg="Server listening on 127.0.0.1:45589"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.215+03:00 level=INFO source=runner.go:1172 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType: NumThreads:10 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.262+03:00 level=INFO source=ggml.go:133 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=459 num_key_values=32
Oct 17 13:25:55 pony ollama[4149997]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Oct 17 13:25:55 pony ollama[4149997]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Oct 17 13:25:55 pony ollama[4149997]: ggml_cuda_init: found 1 CUDA devices:
Oct 17 13:25:55 pony ollama[4149997]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-ff90d373-e63e-427f-2f30-73348e89e4bd
Oct 17 13:25:55 pony ollama[4149997]: load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
Oct 17 13:25:55 pony ollama[4149997]: load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.358+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=520,800 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.362+03:00 level=INFO source=runner.go:1172 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType: NumThreads:10 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=runner.go:1172 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType: NumThreads:10 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=ggml.go:477 msg="offloading 0 repeating layers to GPU"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=ggml.go:481 msg="offloading output layer to CPU"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=ggml.go:488 msg="offloaded 0/25 layers to GPU"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="12.8 GiB"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=device.go:222 msg="kv cache" device=CPU size="3.1 GiB"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="86.8 MiB"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=device.go:238 msg="total memory" size="16.0 GiB"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
Oct 17 13:25:55 pony ollama[4149997]: time=2025-10-17T13:25:55.780+03:00 level=INFO source=server.go:1305 msg="waiting for server to become available" status="llm server loading model"
Oct 17 13:25:57 pony ollama[4149997]: time=2025-10-17T13:25:57.287+03:00 level=INFO source=server.go:1309 msg="llama runner started in 2.08 seconds"
Oct 17 13:26:02 pony ollama[4149997]: [GIN] 2025/10/17 - 13:26:02 | 200 |  7.533357265s |       127.0.0.1 | POST     "/api/generate"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

Manual builds of v0.12.4 through v0.12.6, intended for packaging, but because of the regression they cannot be packaged.

GiteaMirror added the build, nvidia, feature request labels 2026-04-29 07:58:41 -05:00

@sucream commented on GitHub (Oct 17, 2025):

I had the same problem.

When you upgrade Ollama on Linux, you must remove the previous version of Ollama first.

The Ollama team says: "If you are upgrading from a prior version, you MUST remove the old libraries with sudo rm -rf /usr/lib/ollama first" (https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install).
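
For reference, the manual-install upgrade sequence from docs/linux.md looks roughly like this (reproduced from memory of the docs, so treat it as a sketch rather than the exact commands):

sudo rm -rf /usr/lib/ollama                      # remove the previous version's libraries first
curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
sudo tar -C /usr -xzf ollama-linux-amd64.tgz     # unpack the new release into /usr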

@rick-github commented on GitHub (Oct 17, 2025):

sucream is correct, the CUDA library is being loaded from the wrong location.

Oct 17 13:25:55 pony ollama[4149997]: load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so

The recommendation is to use the official install.

@vt-alt commented on GitHub (Oct 17, 2025):

We build it into an rpm package, so all the files are replaced automatically on upgrade. Also, it's built with -DGGML_BACKEND_DIR=%_libexecdir/ollama, so the location is intended.

It consistently worked on CUDA before v0.12.4 (with all the previous versions we compiled; we started building for CUDA at v0.5.13), but since v0.12.4/.5/.6 it uses only the CPU.

We cannot use the official binaries since for our distro (ALT Linux) we only build from sources.
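
For context, a distro-style source build along these lines produces the flat /usr/lib/ollama layout shown later in this thread; apart from the GGML_BACKEND_DIR flag, the commands below are illustrative, not the actual ALT Linux spec:

# hypothetical configure/build/install sketch, not the real rpm spec
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_BACKEND_DIR=/usr/lib/ollama
cmake --build build --parallel
cmake --install build
go build -o ollama .    # the Go server/CLI binary is built separately from the ggml backends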

@rick-github commented on GitHub (Oct 17, 2025):

> Also, it's built with -DGGML_BACKEND_DIR=%_libexecdir/ollama so the location is intended.

Perhaps, but the library usually lives in a cuda_v12 directory. If your build process is not preserving the directory structure, it's no wonder it stopped working.

@vt-alt commented on GitHub (Oct 17, 2025):

Thanks for the suggestion! I will try to investigate OLLAMA_RUNNER_DIR usage.
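
If it helps anyone else, one way to experiment with that variable on a systemd install is an environment override for the service (a generic systemd mechanism; whether and how OLLAMA_RUNNER_DIR is honored by this release is exactly what I still need to verify):

sudo systemctl edit ollama.service
# in the drop-in, add for example:
#   [Service]
#   Environment="OLLAMA_RUNNER_DIR=/usr/lib/ollama"
sudo systemctl restart ollama.service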

@kiliansinger commented on GitHub (Oct 30, 2025):

I had similar problems with other model and could fix it with this PR: https://github.com/ollama/ollama/pull/12856

@vt-alt commented on GitHub (Oct 31, 2025):

Well, in our case this is not a crash; ollama is simply unable to calculate the VRAM size. From the log you can see that ggml finds the CUDA libs (see the load_backend lines); also, moving them into a "cuda_v12" dir does not help, and that is only a build option anyway, we don't move files after cmake installs them. This behavior was introduced in the huge commit https://github.com/ollama/ollama/commit/bc8909fb38525c89dda842d4ecfc86a933089a99.

I tried strace and there are no obvious errors (the cuda and nvidia files are loaded). Perhaps the difference from the mainline build is slight, but I'm unable to pin it down yet.

Everything builds and installs the same as before this commit, so other downstream builders may not even notice the regression.

@rick-github commented on GitHub (Oct 31, 2025):

What's the output of the following commands:

command -v ollama
find /usr/lib/ollama
find $(dirname $(dirname $(command -v ollama)))/lib/ollama

@vt-alt commented on GitHub (Oct 31, 2025):

First I want to note that, since this is for a distribution, we package into standard (LSB) directories; the quality check will not accept a package that puts ELF objects in the wrong trees. But I can perfectly well package libggml-cuda.so into a cuda_v12 subdir if needed, ❶ I only removed the infix since we don't have CUDA versions other than v12. Also, ❷ since we have CUDA v12 in the repository we don't need to bundle the CUDA libs like you do; the correct CUDA libs are in the system and are maintained correctly and automatically by the package system.

Therefore, native building for a particular distribution is a different thing from universal builds for any distribution, and it is not incorrect just because it's somehow different. And we can avoid packaging a lot of redundant files (avoiding a ~2 GB package).

小马:~$ command -v ollama
/usr/bin/ollama
小马:~$ find /usr/lib/ollama
/usr/lib/ollama
/usr/lib/ollama/libggml-cpu-icelake.so
/usr/lib/ollama/libggml-cpu-sandybridge.so
/usr/lib/ollama/libggml-cpu-alderlake.so
/usr/lib/ollama/libggml-cpu-haswell.so
/usr/lib/ollama/libggml-base.so
/usr/lib/ollama/libggml-cuda.so
/usr/lib/ollama/libggml-cpu-sse42.so
/usr/lib/ollama/libggml-cpu-skylakex.so
/usr/lib/ollama/libggml-cpu-x64.so
小马:~$ find $(dirname $(dirname $(command -v ollama)))/lib/ollama
/usr/lib/ollama
/usr/lib/ollama/libggml-cpu-icelake.so
/usr/lib/ollama/libggml-cpu-sandybridge.so
/usr/lib/ollama/libggml-cpu-alderlake.so
/usr/lib/ollama/libggml-cpu-haswell.so
/usr/lib/ollama/libggml-base.so
/usr/lib/ollama/libggml-cuda.so
/usr/lib/ollama/libggml-cpu-sse42.so
/usr/lib/ollama/libggml-cpu-skylakex.so
/usr/lib/ollama/libggml-cpu-x64.so

But note that the cuda module is correctly linked against the system libraries and libggml-base.so:

小马:~$ ldd /usr/lib/ollama/libggml-cuda.so
        linux-vdso.so.1 (0x00007f2ea65c1000)
        libggml-base.so => /usr/lib/ollama/libggml-base.so (0x00007f2ea653c000)
        libcudart.so.12 => /lib64/libcudart.so.12 (0x00007f2ea1000000)
        libcublas.so.12 => /lib64/libcublas.so.12 (0x00007f2e99e00000)
        libcuda.so.1 => /lib64/libcuda.so.1 (0x00007f2e94000000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f2e93c00000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f2ea6436000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f2ea13d2000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f2e93a05000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2ea65c3000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f2ea6431000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f2ea642c000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f2ea6427000)
        libcublasLt.so.12 => /lib64/libcublasLt.so.12 (0x00007f2e61600000)

An unsynchronized upgrade of the libraries or binaries is impossible due to strict package dependencies. It is installed against exactly the same packages (i.e. libs) it was built for.

(The same build process worked well for v0.12.3, and cuda acceleration was working.)

As an experiment I created correct symlinks:

/usr/lib/ollama# mkdir cuda_v12
/usr/lib/ollama# cd cuda_v12
/usr/lib/ollama/cuda_v12# ln -s /lib64/libcudart.so.12
/usr/lib/ollama/cuda_v12# ln -s /lib64/libcublas.so.12
/usr/lib/ollama/cuda_v12# ln -s /lib64/libcublasLt.so.12
/usr/lib/ollama/cuda_v12# ln -s /usr/lib/ollama/libggml-cuda.so

After restarting ollama.service it still didn't find the VRAM and ran only on the CPU.

P.S. I noticed in the strace logs that some logging goes to /dev/null, and will try to enable it.
441676 write(2</dev/null>, "ggml_backend_cuda_device_get_memory utilizing NVML memory reporting free: 25174212608 total: 25757220864\n", 105) = 105
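
A generic way to capture that kind of output is to attach strace to the running runner process; the filter patterns here are just an example:

sudo strace -f -e trace=write,openat -p "$(pgrep -of 'ollama runner')" 2>&1 | grep -Ei 'ggml|cuda|nvml'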

@rick-github commented on GitHub (Oct 31, 2025):

> incorrect just because it's somehow different

It's incorrect because it doesn't work, not because it's different. Different is fine. Arch, for example, has a different build and it works fine. We need to figure out why yours doesn't. What's the simplest way to do an ALT Linux install in a VM? Iso? Or docker image?

@vt-alt commented on GitHub (Oct 31, 2025):

Yeah, I don't blame the failure on you; I just want to make it work again, and maybe get some suggestions or debugging hints. This may be useful for other builders too.

We haven't committed the regressed package into the repository, it's still in the testing queue, so I will prepare test install instructions for Docker. Thanks.

It's useful to know that Arch has working builds; I will investigate how they do it (https://gitlab.archlinux.org/archlinux/packaging/packages/ollama/-/blob/main/PKGBUILD), thanks again.

@kiliansinger commented on GitHub (Oct 31, 2025):

I posted a PR that might fix your issue which started with v0.12.3 in my case: https://github.com/ollama/ollama/pull/12856

@rick-github commented on GitHub (Oct 31, 2025):

Would you please stop spamming issues with your PR. It has no relevance to this one or the others you posted it to.

@vt-alt commented on GitHub (Oct 31, 2025):

First, I tried to reproduce the success of Arch in Docker, but it seems the same problem is there too. My steps for Arch:

$ docker run --gpus=all --rm -it archlinux
[root@90be03c146dd /]# pacman -Syu nvidia nvidia-utils
[root@90be03c146dd /]# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
|  0%   49C    P8              3W /  450W |      72MiB /  24564MiB |      6%      Default |
...
[root@90be03c146dd /]# pacman -S ollama
[root@90be03c146dd /]# ollama serve &
[root@90be03c146dd /]# ollama --version
ollama version is 0.12.7
[root@90be03c146dd /]# ollama pull gpt-oss:20b
[root@90be03c146dd /]# ollama run --verbose gpt-oss:20b hi
...
time=2025-10-31T17:08:34.064Z level=INFO source=ggml.go:493 msg="offloaded 0/25 layers to GPU"
...
total duration:       11.543593045s
load duration:        163.15578ms
prompt eval count:    68 token(s)
prompt eval duration: 252.647953ms
prompt eval rate:     269.15 tokens/s
eval count:           41 token(s)
eval duration:        11.104702456s
eval rate:            3.69 tokens/s

But there are no similar bug reports in the Arch bug tracker for ollama:
https://gitlab.archlinux.org/archlinux/packaging/packages/ollama/-/issues?sort=created_date&state=all
(There is only a report that it stopped working on CUDA v13 for the very old NVIDIA Pascal architecture. But we have CUDA v12 and the GPU is a 4090, so that is a different thing.) I wonder whether it works on GPU for other people.

I then tested on Arch with ollama installed from the Archive (dated 2025/09/27) to get version 0.12.3 (the last one that worked on GPU for us), in order to validate my gpu+docker setup, and it runs on the GPU! JFYI, my steps:

[root@90be03c146dd /]# pacman -R ollama-cuda
[root@90be03c146dd /]# pacman -R ollama
[root@90be03c146dd /]# vim /etc/pacman.d/mirrorlist
Server = https://archive.archlinux.org/repos/2025/09/27/$repo/os/$arch
[root@90be03c146dd /]# pacman -Syyu
[root@90be03c146dd /]# pacman -S ollama-cuda
[root@90be03c146dd /]# ollama serve &
[root@90be03c146dd /]# ollama --version
ollama version is 0.12.3
[root@90be03c146dd /]# ollama run --verbose gpt-oss:20b hi
time=2025-10-31T17:48:52.210Z level=INFO source=ggml.go:498 msg="offloaded 25/25 layers to GPU"
total duration:       2.711742531s
load duration:        2.294663797s
prompt eval count:    68 token(s)
prompt eval duration: 159.899295ms
prompt eval rate:     425.27 tokens/s
eval count:           40 token(s)
eval duration:        256.697256ms
eval rate:            155.83 tokens/s

So with the same box, GPU, and docker run, ollama can run on the GPU and then not, even with a supposedly correct Arch build; it becomes even more puzzling to me what is different.

@vt-alt commented on GitHub (Oct 31, 2025):

JFYI, this is unrelated, but since it now runs on the CPU for me, I noticed a roughly 50% performance degradation in v0.12.7 (built by me in the same way as v0.12.6). The test is two consecutive runs of ollama run --verbose gpt-oss:20b hi on an i9-10900:

ollama-0.12.6

  • 1st run:
    total duration: 9.131255307s
    load duration: 3.185033924s
    prompt eval count: 68 token(s)
    prompt eval duration: 1.481592551s
    prompt eval rate: 45.90 tokens/s
    eval count: 33 token(s)
    eval duration: 4.413564763s
    eval rate: 7.48 tokens/s
  • 2nd run:
    total duration: 5.732011048s
    load duration: 166.657341ms
    prompt eval count: 68 token(s)
    prompt eval duration: 138.470444ms
    prompt eval rate: 491.08 tokens/s
    eval count: 40 token(s)
    eval duration: 5.41032075s
    eval rate: 7.39 tokens/s

ollama-0.12.7

  • 1st run:
    total duration: 16.626052217s
    load duration: 184.671881ms
    prompt eval count: 68 token(s)
    prompt eval duration: 286.664223ms
    prompt eval rate: 237.21 tokens/s
    eval count: 48 token(s)
    eval duration: 16.132840691s
    eval rate: 2.98 tokens/s
  • 2nd run:
    total duration: 10.061348294s
    load duration: 195.0805ms
    prompt eval count: 68 token(s)
    prompt eval duration: 288.919604ms
    prompt eval rate: 235.36 tokens/s
    eval count: 34 token(s)
    eval duration: 9.56369189s
    eval rate: 3.56 tokens/s

I can create another bug report if you wish.

@rick-github commented on GitHub (Oct 31, 2025):

The CPU-only performance decrease is likely https://github.com/ollama/ollama/issues/12886.

@vt-alt commented on GitHub (Nov 2, 2025):

We figured out the problem with our build. Since we have a PTX JIT compiler, we compile for only 2 virtual CUDA architectures (-DCMAKE_CUDA_ARCHITECTURES='52-virtual;80-virtual') to save space. This makes nvcc define __CUDA_ARCH_LIST__ as 520,800. The RTX 4090 we use for tests is compute capability 8.9; as a consequence, this ollama-specific code triggers an assertion failure, because it looks for an exact match of 890 in __CUDA_ARCH_LIST__:

#ifdef __CUDA_ARCH_LIST__
        if (std::getenv("GGML_CUDA_INIT") != NULL) {
            GGML_ASSERT(ggml_cuda_has_arch(info.devices[id].cc) && "ggml was not compiled with support for this arch");
        }
#endif // defined(__CUDA_ARCH_LIST__)

I'd suggest this code be reworked (or removed?); for example, the ggml_cuda_has_arch() call could be replaced with ggml_cuda_highest_compiled_arch() > 0.

AFAIK, an exact architecture-code comparison does not fully reflect the CUDA compatibility model even for real architectures, because CUDA has some backward compatibility within a major compute capability. So sm_80 cubin code should be runnable on an sm_89 GPU (even though slower), but not vice versa. PTX (which you compile too) provides even greater "forward" compatibility, at the small cost of a first-time startup delay (seconds, to compile the PTX to native code). You can see more details here: https://docs.nvidia.com/cuda/ampere-compatibility-guide/
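
As a downstream workaround until the check is relaxed (my assumption, not something verified upstream), adding the tested card's real architecture to the compile list should make the exact-match lookup succeed, at the cost of a somewhat larger binary:

# hypothetical rebuild that also includes the RTX 4090's compute capability 8.9
cmake -B build -DCMAKE_CUDA_ARCHITECTURES='52-virtual;80-virtual;89-real'
cmake --build build --parallel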

What helped was setting OLLAMA_DEBUG=2; with that, the error message appears: ollama[991926]: /usr/src/RPM/BUILD/ollama-0.12.7/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:329: GGML_ASSERT(ggml_cuda_has_arch(info.devices[id].cc) && "ggml was not compiled with support for this arch") failed, and we can see that the runner is aborted; from there it was a matter of reading the code.

Reference: github-starred/ollama#54919