Originally created by @VictorWangwz on GitHub (Aug 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11714
What is the issue?
The original model could run without problem, but the GGUF model fails to run with the errors below.
This may need an update of the ggml dependencies, as in llama.cpp: https://github.com/ggml-org/llama.cpp/pull/15091
Note: the GGUF runs on llama.cpp without problems.
Relevant log output
OS: No response
GPU: No response
CPU: No response
Ollama version: No response
@a-makarov-kaspi commented on GitHub (Aug 6, 2025):
Yep, ollama returns a 500 error.
Ollama 0.11.2 for macOS
@teodorgross commented on GitHub (Aug 6, 2025):
Yes, it's not working.
@ramkumarkb commented on GitHub (Aug 6, 2025):
I can also confirm the same error with the latest version, 0.11.3-rc0. I tried with the following GGUF models -
@maksir commented on GitHub (Aug 6, 2025):
Maybe this will help:
@teodorgross commented on GitHub (Aug 6, 2025):
@rbnhln commented on GitHub (Aug 6, 2025):
Got the same error:
Tried different models and sizes:
ollama version: 0.11.2
The gpt-oss_20b model is running.
@discostur commented on GitHub (Aug 6, 2025):
Same error here:
@kappa8219 commented on GitHub (Aug 6, 2025):
Some more:
Maybe it is the Unsloth folks' problem, since the "original" works fine. Or indeed some GGML lib problem.
@kappa8219 commented on GitHub (Aug 6, 2025):
Quite intriguing what the result of quantizing an already pre-quantized model would be :)
@snowarch commented on GitHub (Aug 6, 2025):
Yes.. Ollama always tends to screw up the GGUF models. I'm using the one from unsloth with llama.cpp and it works great.
@teodorgross commented on GitHub (Aug 6, 2025):
It's working well in other tools, except ollama.
@musica2016 commented on GitHub (Aug 7, 2025):
Same problem
Ollama version is 0.11.3-rc0
System: Ubuntu
@406747925 commented on GitHub (Aug 7, 2025):
same problem
ollama version is 0.11.3
CUDA Version: 12.4
Tesla V100
@niehao100 commented on GitHub (Aug 7, 2025):
same problem
ollama version is 0.11.3
ROCm version 1.15
RX 7700 XT
@billchurch commented on GitHub (Aug 7, 2025):
I'm experiencing the same issue with the unsloth gpt-oss-20b model on my AMD setup. Here are my system details:
Environment:
Model tested:
hf.co/unsloth/gpt-oss-20b-GGUF:F16
Error reproduced:
Relevant logs from journalctl:
The error is identical to what others are reporting: the tensor blk.0.ffn_down_exps.weight has an invalid ggml type 39 (NONE). This happens on an AMD ROCm setup as well, not just CUDA environments. Other models like gpt-oss:20b, llama3.2:1b, phi3:mini, and gemma3n:e4b work perfectly fine with 100% GPU offloading on this same setup.
@srd00 commented on GitHub (Aug 7, 2025):
Apparently the solution is to update llama.cpp: https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/11
@mabhay3420 commented on GitHub (Aug 7, 2025):
On Apple Silicon, the following version installed with brew works fine.
Thanks @srd00 for the suggestion to use a different version.
Previously I was getting the error:
@Kira-PH commented on GitHub (Aug 9, 2025):
I get this on mradermacher/gpt-oss-20b-uncensored-bf16-GGUF:Q8_0 and DavidAU/OpenAi-GPT-oss-20b-abliterated-uncensored-NEO-Imatrix-gguf:Q8_0
@billchurch commented on GitHub (Aug 9, 2025):
@HughPH There's definitely something missing from some ollama distributions; 0.11.3 produces the same errors for me for all gpt-oss variations except the original from OpenAI. I compiled llama.cpp and I'm not having these problems. Ollama 0.11.3 should support this, but it's not working on my Linux distros.
@Teravus commented on GitHub (Aug 10, 2025):
I have this issue too with the unsloth versions on Ollama.
llama_model_loader: - kv 42: quantize.imatrix.entries_count u32 = 193
llama_model_loader: - kv 43: quantize.imatrix.chunks_count u32 = 178
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q5_1: 1 tensors
llama_model_loader: - type q8_0: 169 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 20.55 GiB (8.44 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gpt-oss'
llama_model_load_from_file_impl: failed to load model
time=2025-08-09T17:20:47.030-07:00 level=INFO source=sched.go:453 msg="NewLlamaServer failed" model=
I don't have anything to add except, 'me too'.
@ggerganov commented on GitHub (Aug 10, 2025):
Since none of the maintainers here seem to care enough to explain the actual reason for ollama to not support the HF GGUF models, while the root cause is pretty obvious, I will help explain it:
Before the model was released, the ollama devs decided to fork the ggml inference engine in order to implement gpt-oss support (https://github.com/ollama/ollama/pull/11672). In the process, they did not coordinate the changes with the upstream maintainers of ggml. As a result, the ollama implementation is not only incompatible with the vast majority of gpt-oss GGUFs that everyone else uses, but is also significantly slower and unoptimized. On the bright side, they were able to announce day-1 support for gpt-oss and get featured in the major announcements on the release day.
Now, after the model has been released, the blogs and marketing posts have circled the internet and the dust has settled, it's time for ollama to throw out their ggml fork and copy the upstream implementation (https://github.com/ollama/ollama/pull/11823). For a few days, you will struggle and wonder why none of the GGUFs work, wasting your time trying to figure out what is going on, without any help or even with some wrong information. But none of this matters, because soon the upstream version of ggml will be merged and ollama will once again be fast and compatible.
Hope this helps.
@fitlemon commented on GitHub (Aug 10, 2025):
Absolute legend ;). Because of this problem I migrated from ollama to llama.cpp.
@ericcurtin commented on GitHub (Aug 11, 2025):
Note all this stuff is a one-liner with docker model runner:
docker model runner uses llama.cpp, is open source, and is open to contributions just like llama.cpp. Get involved where appropriate.
Another neat feature is that models are stored as OCI artifacts, so you can push them to any old OCI registry.
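For illustration, a minimal sketch of that one-liner flow, assuming the standard docker model subcommands; the model reference is just an example, not a specific published artifact:
# pull the model as an OCI artifact, then chat with it (model name illustrative)
docker model pull ai/gpt-oss
docker model run ai/gpt-oss "hello"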
@Teravus commented on GitHub (Aug 11, 2025):
While I think a lot of this is true, I did a speed comparison, and I don't agree with you about llama.cpp doing it faster.
I downloaded and experimented with the CUDA version of the latest llama.cpp (b6123) using:
Then loaded the model with
llama-server --port 9001 -hf ggml-org/gpt-oss-20b-GGUF
and then pointed open-webui's OpenAI-compatible connection at the v1 endpoint.
It worked with the ggml-org mxfp4 model.
Also, it returned about 5 tokens per second with a 4096 context.
Ollama's default context length is also 4096, and the official model from the ollama repo (assuming it is the mxfp4 version) runs at about 72 tokens per second on ollama.
There are two 3090s in this machine, 128GB of system memory, and an Intel 11900K, and while it did use a tiny amount of resources on one graphics card, it preferred using the CPU for some reason. It even logged that the GPUs exist:
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
prompt eval time = 894.66 ms / 24 tokens ( 37.28 ms per token, 26.83 tokens per second)
eval time = 93704.28 ms / 556 tokens ( 168.53 ms per token, 5.93 tokens per second)
total time = 94598.94 ms / 580 tokens
srv update_slots: all slots are idle
I appreciate the work that you've done. It's an absolute technical marvel.
And the reason people use things like ollama is that they make it easy to use. They're adding ease of use to the mix. It's a different target audience.
There might be an arcane incantation documented somewhere that will make it work better and more easily on this machine. And Ollama just loads it in a reasonable way without me having to think about it.
I also tried to run some of the unsloth versions of the models and the llama-server would just kind of freeze up between prompts.
Open-WebUi made the request, and some text showed up in the console, but it stopped generating text after the first prompt. Killing the llama-server process and re-running the startup command made the server come back up and it was able to respond one more time, before freezing again.
Just saying, that's my experience.
@ngxson commented on GitHub (Aug 11, 2025):
@Teravus We are actively working on the problems that you mentioned, just give us a bit of time.
Having both the best performance and good UX is not an easy task, given the community-driven nature of llama.cpp. Some of the llama.cpp maintainers even have to work during their vacations, just to have someone else copy their work without giving any credit.
@ericcurtin commented on GitHub (Aug 11, 2025):
Don't have Nvidia hardware, but I think there are two key flags you are missing; try toggling them to see if they help (flash attention and cache-reuse):
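For concreteness, a sketch of what toggling those two flags looks like (the model reference is reused from elsewhere in the thread, and the cache-reuse value is an assumption):
# flash attention plus prompt-cache reuse, as suggested above
llama-server -fa --cache-reuse 256 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF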
@pwilkin commented on GitHub (Aug 11, 2025):
I mean, you didn't tell llama.cpp to use the GPU; can't really complain too much 😄
Try
llama-server -fa -ngl 99 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF
and tell us how that benchmarks.
@ericcurtin commented on GitHub (Aug 11, 2025):
Ha yes, -ngl 999 is pretty crucial 😄
@ericcurtin commented on GitHub (Aug 11, 2025):
I know --cache-reuse 256 has been recommended by @ggerganov in the past; I don't have Nvidia hardware myself, so I don't know how significant it is.
@ITankForCAD commented on GitHub (Aug 11, 2025):
Better yet, use the dedicated benchmark binary provided by the fine folks at llama.cpp: llama-bench
@mudler commented on GitHub (Aug 11, 2025):
The worst part of all of this is that nothing will change in the long run. This has already been the case for a long time; just to cite another case, there are llama.cpp's multimodal capabilities. I have mixed feelings about ollama exactly for this reason: it would have been much better if all the projects that depend on @ggerganov's and the ggml team's work had upstreamed their contributions directly, so anyone in the ecosystem could benefit, avoiding vendor lock-in and duplicated effort everywhere.
For instance, in LocalAI - and likewise in LM Studio and Docker - you will find that everything "just works" because of working as a community, actually giving credit where it belongs (you guys really rock ;) ), consuming llama.cpp, reporting issues, and upstreaming any changes directly there.
It is quite frustrating to see that the open-source scene is really getting derailed lately by this kind of bad attitude.
@Teravus commented on GitHub (Aug 11, 2025):
The fact that this is an inside joke, and that people are laughing about it, is exactly the problem. This wasn't communicated effectively to me on first use. And it didn't decide to do this sensible thing itself, given the situation it was running in.
I think ngxson gets it. I agree that doing both the technical solution and having something that is easy to use is hard.
I see people saying that "X took my work without adding anything... it's just worse". I'm saying that they are adding something. They're working on the problem from an 'it's too hard to get it to work well for most people' angle. My 'first user experience with llama-server' is documented here. Had the program analyzed the hardware and the model and determined which layers to offload to the GPU, the post above probably wouldn't be there, because it would have been, at the very least, somewhat friendly and picked sensible defaults given the situation it is running in. It takes analysis and development effort to write something that inspects the model and the resources of the environment it runs under and makes sensible (maybe non-optimal, but sensible) decisions.
There's only so much time in a day. There are only so many developers working on this, and they have limited time. They work on this during their vacations. That's exactly the point. The target audience is different. If I wanted to spend time optimizing, reading the documentation, and experimenting with which layers are best offloaded to the GPU, or bothering the developers with lots of questions, then ollama might not be the best to use. However, Ollama isn't just marketing itself better; it's focusing on a different kind of user.
@ericcurtin commented on GitHub (Aug 11, 2025):
Just trying to help out @Teravus, not make fun of you! I missed that -ngl 999 was missing from that command line too; it's no big deal, I made the same mistake as you here, I was actually poking fun at myself 😄
One of the things docker model runner, LocalAI and LM Studio try to do is set the correct flags in llama-server under the hood, so users need fewer instructions. They are all users of upstream llama.cpp.
@ericcurtin commented on GitHub (Aug 11, 2025):
@ggerganov @ngxson could we document reasonable defaults somewhere in upstream llama.cpp as a short-term solution? Kinda like how q4_k_m is a reasonable default for GGUF? In my head this is what I have. Even if they are not perfect and don't fit every little use case, it's better than nothing. I volunteer to open a PR, but I need help with the info! I don't have enough experience with the various stacks and hardware to know exactly. One thing I do know from running CPU inferencing on an Ampere machine with a tonne of CPU cores is that "--threads (number of cores/2)" seems like a reasonable default (see the concrete sketch after the list below):
CPU:
llama-server --jinja --cache-reuse 256 --threads (number of cores/2) -hf some-model
CUDA:
llama-server --jinja -fa -ngl 999 --cache-reuse 256 --threads (number of cores/2) -hf some-model
METAL:
llama-server --jinja -fa -ngl 999 --cache-reuse 256 --threads (number of cores/2) -hf some-model
ROCM:
?
VULKAN:
?
OPENCL:
?
MUSA:
?
CANN:
?
BLAS:
?
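As a concrete instance of the "--threads (number of cores/2)" idea on Linux, assuming bash and coreutils (the model reference is a placeholder):
# half the logical core count, computed at launch time
llama-server --jinja --cache-reuse 256 --threads $(( $(nproc) / 2 )) -hf some-model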
@pwilkin commented on GitHub (Aug 11, 2025):
@Teravus I mean - yeah, you're right. It's very often a case of first UX impressions. I have the same problem with vLLM you have with llama.cpp - it's extremely annoying to work with because getting to the sane options for my setup takes a lot of time.
For what it's worth, the llama.cpp guys have already mentioned having a "try reasonable defaults for user setup" default launch - but of course that's easier said than implemented. And yes, having more informative error messages (i.e. instead of "Error X" having "Error X, you might want to try this and this") would probably help as well.
But this thread is about something entirely different. If you build upon a technology, in the OSS world it's a good habit to actually contribute back to the technology you use when you build something new. Another good habit is to work together if your solution would add some new feature to the code, and only fork if it's absolutely obvious that you're not going to be able to work together (that does happen, for entirely valid reasons: the original code owners might have another vision of how to do various changes; or they have another vision of what should be added first; or they are doing their own refactor and don't want your code to mess things up). But Ollama has, again and again, done the opposite of that: made hacky solutions of their own on top of existing llama.cpp / ggml code instead of contributing to the baseline, then taken the fixes that the ggml team has done as "new features" or "bugfixes" of their own platform. That's the thing @ggerganov is pointing out here.
@Teravus commented on GitHub (Aug 11, 2025):
Ollama, for sure, needs to provide something to the user that says they're using code from llama.cpp. Usually this is in an about box. I don't even see an about box, so it doesn't look like they're complying with that. I only know that ollama uses llama.cpp under the hood from having issues with some models and needing to manually prepare them with llama.cpp in order for them to work under ollama. It was at that point that I learned that the underlying technology, the state-of-the-art one that the 'good' implementations relied on, was llama.cpp. The only mention of llama.cpp that I see isn't really a 'we use this software' reference. It's just:
Supported backends
llama.cpp project founded by Georgi Gerganov.
This doesn't seem like enough. It skirts the issue by treating llama.cpp like a back-end.
A deeper look down the rabbit hole: it looks like Ollama documents the "patches" that they make to llama.cpp here:
https://github.com/ollama/ollama/tree/main/llama/patches
The last batch of them has been about gpt-oss.
It also looks like there may be some NDAs involved.
From a comment: "this is exactly how they prepped the 0.11.0 release w/o breaking OpenAI NDAs, it made a cuda crash refer to a nonsense line number."
I'm not sure how that affected the situation with gpt-oss.
@Teravus commented on GitHub (Aug 11, 2025):
I think you've been very reasonable.
I know that 'thumbs down' doesn't really mean anything in terms of effects on accounts or anything.
As a side note, I still think it's funny how many people 'thumbs-downed' the documentation of a first-user experience.
People can 'thumbs down' anything they want. It doesn't affect anything from a comment or account perspective. It doesn't mark it as spam. It just means they don't like it.
I wonder if they realize that they're actually thumbs-downing the result of the first-user experience. By the transitive property, they're thumbs-downing the first-user experience itself.
@ericcurtin commented on GitHub (Aug 11, 2025):
The problem with reactions is that sometimes people are reacting to a portion of the post. It's being given a thumbs down because this was a comparison between GPU-accelerated Ollama and CPU-based llama.cpp. GPU will beat CPU, of course.
If you do GPU vs GPU, llama.cpp wins.
I think more documentation is a good idea.
@Teravus commented on GitHub (Aug 11, 2025):
Yep, after using:
llama-server -fa -ngl 99 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF
prompt eval time = 494.48 ms / 1651 tokens ( 0.30 ms per token, 3338.89 tokens per second)
eval time = 3214.37 ms / 386 tokens ( 8.33 ms per token, 120.09 tokens per second)
total time = 3708.84 ms / 2037 tokens
120 t/s llama.cpp, 72 t/s Ollama
llama.cpp wins.
I'm going to leave the old post, unedited. I'm curious how many thumbs-downs it will get.
@JohannesGaessler commented on GitHub (Aug 12, 2025):
By necessity, downstream projects like ollama use rough estimates to choose the number of GPU layers. And to avoid OOMing these estimates need to be chosen very conservatively, leaving a lot of performance on the table. It is my opinion that in such cases it's better not to set the number of GPU layers automatically since otherwise a lot of users would be unknowingly using bad defaults. One of my current projects is to implement this automation properly in llama.cpp, maybe I'll bump it up in terms of priority.
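As a rough sketch of what such a conservative estimate looks like (all numbers below are assumptions for illustration, not ollama's actual heuristic):
# back-of-the-envelope layer-offload estimate in shell
free_vram_mib=23306                                   # e.g. the CUDA0 figure logged earlier in this thread
model_size_mib=21000                                  # roughly the 20.55 GiB GGUF from the log above
n_layers=24                                           # assumed transformer block count
per_layer_mib=$(( model_size_mib / n_layers ))
ngl=$(( free_vram_mib * 80 / 100 / per_layer_mib ))   # leave ~20% headroom for KV cache and buffers
echo "try: llama-server -ngl ${ngl} ..."              # prints roughly -ngl 21 with these numbers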
@Teravus commented on GitHub (Aug 12, 2025):
I understand the perspective.
I guess it boils down to: "Who is the project serving?"
Who is the target audience?
llama.cpp is targeted at a different, more niche, user than ollama.
That, in and of itself, isn't a bad thing, as long as llama.cpp understands that many more users will flock to ollama first and ollama will be many people's first introduction to LLMs, because it's easier and less intimidating than llama.cpp. It will be more popular, and more people will use it as a result. The more niche users will continue to use llama.cpp directly. There may be some converts from ollama to llama.cpp who try ollama and want to do more, but they'll be the same kind of niche users that llama.cpp attracts with its focus.
llama.cpp
ollama
If llama.cpp doesn't want that kind of user, then they're doing OK already.
What a lot of people in this thread don't want to hear is that, if llama.cpp wants to be 'more popular' or be users' first introduction to running LLMs locally, it has to serve that second type of user as long as something like ollama exists.
Easier = more users.
More detail = fewer, but more invested, users.
A portion of the community that uses llama.cpp (not the core developers) suggests that all of the innovation comes from llama.cpp and ollama is just a thin wrapper around llama.cpp. If that's the case, then it should be simple to 'just make it easier': 'just make a wrapper that does it'.
Again, this isn't to say that Ollama is doing well. They're not, in a few ways, but one specific issue is that they're clearly using code from gguf and llama.cpp without correct MIT license attribution. The only mention of llama.cpp is in the readme.md as 'the one and only backend', and that's not enough to satisfy the license. At the very least, it needs an about box in user space that includes attribution to the software that they use.
@Kira-PH commented on GitHub (Aug 13, 2025):
There are people in the middle of those extremes. I was the AI/ML domain architect at IFS. I've been coding professionally for some 25 years. I read the chat templates and code against the /generate endpoint, because I've been messing with LLMs since before the GPT3 closed beta, and I feel happier getting that bit closer to the model. But fundamentally I just want to offload the effort when I'm tinkering at home. If I was in a professional environment then I'd dig into llama-server and get to know it intimately, but for idly messing about I just want to load a model and go. This is like an F1 mechanic. When he's not at work, he just wants to get in his Audi and drive somewhere, not spend an hour preheating his tyres, tinkering with the engine and reading telemetry data.
@mkultra333 commented on GitHub (Aug 14, 2025):
So I've wasted hours and hours over two nights slowly downloading and re-downloading 15GB models from HF, only to find out the reason they aren't working is because Ollama is using botched, rushed code and didn't warn anyone that it was broken for the gpt oss models apart from their own?
Thanks Ollama.
Their implementation ran so damn slow anyway; getting a response from their own 20B model on my 5060 Ti is like watching paint dry, as all the work seems to be happening on the CPU instead of the GPU even though there's VRAM to spare.
Ugh. I'm trying LM Studio to see if that's any better. Pity I also now have to re-download the models AGAIN because for some reason Ollama has to put all the GGUFs in its own weird blob format. This is so tedious.
@kappa8219 commented on GitHub (Aug 15, 2025):
With 0.11.5-rc2 things changed: GGUFs are running. Still, what I got is that the "original" gpt-oss-20b is almost twice as fast as the Unsloth GGUF one, with comparable sizes (13G and 11G), both 100% on GPU.
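For reference, a hedged example of what that looks like on a fixed build, reusing the hf.co reference mentioned earlier in the thread:
# pull and run the Unsloth GGUF directly through ollama
ollama pull hf.co/unsloth/gpt-oss-20b-GGUF:F16
ollama run hf.co/unsloth/gpt-oss-20b-GGUF:F16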
@expnn commented on GitHub (Sep 4, 2025):
Ollama 0.11.8 runs successfully at first, but it crashes after generating a small amount of text. See my example here: https://github.com/ollama/ollama/issues/10993#issuecomment-3248383362
@shimmyshimmer commented on GitHub (Sep 5, 2025):
This is because we preset our GGUFs in Ollama with top_k = 0, which slows down the GGUFs a lot. In our testing, when we remove the top_k setting, it scores the same results as Ollama.
In the future we will be changing the pre-set top_k = 0 to maybe 64 or 128 instead.
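Until then, a hedged sketch of overriding that preset locally (the model reference and top_k value are illustrative):
# re-create the model with an explicit top_k instead of the preset 0
ollama pull hf.co/unsloth/gpt-oss-20b-GGUF:F16
printf 'FROM hf.co/unsloth/gpt-oss-20b-GGUF:F16\nPARAMETER top_k 64\n' > Modelfile
ollama create gpt-oss-20b-topk64 -f Modelfile
ollama run gpt-oss-20b-topk64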
@JohannesGaessler commented on GitHub (Sep 6, 2025):
I don't know what ollama uses for sampling, but in llama.cpp the issue with top-k 0 was that the fast custom bucket sort was only implemented for top-k, so disabling top-k resulted in a fallback to the slower std::sort for the whole token array. The implementation was generalized in https://github.com/ggml-org/llama.cpp/pull/15665, plus an optimization that first tries sorting only the top 128 tokens (which should be enough for most cases).
@OracleToes commented on GitHub (Sep 27, 2025):
I'm still getting this problem, and it seems from the conversation in this issue that we know how to fix it, so why is it that a month later we still can't run GGUFs of gpt-oss models?
It's worth noting that the GGUF models work in the playground, but not in the regular chat interface.