[GH-ISSUE #11714] gpt-oss 20b gguf model fail to run #33515

Open
opened 2026-04-22 16:17:13 -05:00 by GiteaMirror · 48 comments
Owner

Originally created by @VictorWangwz on GitHub (Aug 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11714

What is the issue?

The original model runs without problems, but the GGUF model fails to run with the errors below.

This may need an update of the ggml dependencies to match upstream llama.cpp: https://github.com/ggml-org/llama.cpp/pull/15091

Note: the same GGUF runs on llama.cpp without problems.

Relevant log output

Aug 06 03:40:33 ml-machine-1 ollama[2649079]: gguf_init_from_file_impl: tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
Aug 06 03:40:33 ml-machine-1 ollama[2649079]: gguf_init_from_file_impl: failed to read tensor info
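
For context on the error: ggml type id 39 appears to correspond to MXFP4, the block format upstream ggml added for gpt-oss, so a loader built against an older ggml type enum decodes id 39 as NONE and rejects the tensor. One way to inspect which tensor types a GGUF actually contains is the gguf-dump tool from llama.cpp's gguf Python package — a minimal sketch, assuming pip is available and a gguf release recent enough to know MXFP4 (the model path is a placeholder):

```shell
# Install the gguf helper package maintained in the llama.cpp repo.
pip install -U gguf

# Dump header metadata only (assumed flag: --no-tensors).
gguf-dump --no-tensors /path/to/gpt-oss-20b.gguf

# List tensor info; the blk.*.ffn_*_exps tensors of the HF gpt-oss
# GGUFs should report the MXFP4 type rather than NONE.
gguf-dump /path/to/gpt-oss-20b.gguf | grep ffn_down_exps | head
```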

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 16:17:13 -05:00
Author
Owner

@a-makarov-kaspi commented on GitHub (Aug 6, 2025):

Yep, Ollama returns a 500 error.

Ollama 0.11.2 for macOS

Image
Author
Owner

@teodorgross commented on GitHub (Aug 6, 2025):

Yes, it's not working.

Author
Owner

@ramkumarkb commented on GitHub (Aug 6, 2025):

I can also confirm the same error with the latest version, 0.11.3-rc0.
I tried the following GGUF models:

  1. https://huggingface.co/unsloth/gpt-oss-20b-GGUF
  2. https://huggingface.co/ggml-org/gpt-oss-20b-GGUF
Author
Owner

@maksir commented on GitHub (Aug 6, 2025):

Maybe this will help:

Image
Author
Owner

@teodorgross commented on GitHub (Aug 6, 2025):

Image
Author
Owner

@rbnhln commented on GitHub (Aug 6, 2025):

Got the same error:

ollama  | gguf_init_from_file_impl: tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
ollama  | gguf_init_from_file_impl: failed to read tensor info
ollama  | llama_model_load: error loading model: llama_model_loader: failed to load model from /root/.ollama/models/blobs/sha256-b6f8d6dec25430529281f42096a0a38f9a73c4007650075047d54a1899f14fa5
ollama  | 
ollama  | llama_model_load_from_file_impl: failed to load model

Tried different models and sizes:

  • unsloth/gpt-oss-20b-GGUF:Q2_K
  • unsloth/gpt-oss-20b-GGUF:F16
  • ggml-org/gpt-oss-20b-GGUF:latest
  • lmstudio-community/gpt-oss-20b-GGUF:latest

ollama version: 0.11.2

The official gpt-oss:20b model itself runs fine.

Author
Owner

@discostur commented on GitHub (Aug 6, 2025):

Same error here:

time=2025-08-06T11:14:32.565Z level=INFO source=sched.go:546 msg="updated VRAM based on existing loaded models" gpu=GPU-60d0c2c0-f953-d178-82f0-754b2f08821b library=cuda total="15.6 GiB" available="158.2 MiB"
time=2025-08-06T11:14:34.149Z level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-eacc290fd6b05f927d98e94c6acb9c315bdacfbcd7290c88ca4d088d77089174 gpu=GPU-60d0c2c0-f953-d178-82f0-754b2f08821b parallel=1 available=16594370560 required="1.9 GiB"
time=2025-08-06T11:14:34.362Z level=INFO source=server.go:135 msg="system memory" total="125.8 GiB" free="114.1 GiB" free_swap="0 B"
time=2025-08-06T11:14:34.363Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[15.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.9 GiB" memory.required.partial="1.9 GiB" memory.required.kv="192.0 MiB" memory.required.allocations="[1.9 GiB]" memory.weights.total="1.0 GiB" memory.weights.repeating="454.3 MiB" memory.weights.nonrepeating="586.8 MiB" memory.graph.full="256.0 MiB" memory.graph.partial="256.0 MiB"
gguf_init_from_file_impl: tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
gguf_init_from_file_impl: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load model from /root/.ollama/models/blobs/sha256-eacc290fd6b05f927d98e94c6acb9c315bdacfbcd7290c88ca4d088d77089174

llama_model_load_from_file_impl: failed to load model
time=2025-08-06T11:14:34.607Z level=INFO source=sched.go:453 msg="NewLlamaServer failed" model=/root/.ollama/models/blobs/sha256-eacc290fd6b05f927d98e94c6acb9c315bdacfbcd7290c88ca4d088d77089174 error="unable to load model: /root/.ollama/models/blobs/sha256-eacc290fd6b05f927d98e94c6acb9c315bdacfbcd7290c88ca4d088d77089174"
Author
Owner

@kappa8219 commented on GitHub (Aug 6, 2025):

Some more:

root@open-webui-ollama-849869d898-dkg82:/# ollama pull hf.co/unsloth/gpt-oss-20b-GGUF:latest
pulling manifest 
pulling 7dd573dc3e0b: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  11 GB                         
pulling 51468a0fd901: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 7.4 KB                         
pulling 264230288548: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  149 B                         
pulling eb1dfc8996e5: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  558 B                         
verifying sha256 digest 
writing manifest 
success 
root@open-webui-ollama-849869d898-dkg82:/# ollama run hf.co/unsloth/gpt-oss-20b-GGUF:latest
Error: 500 Internal Server Error: unable to load model: /root/.ollama/models/blobs/sha256-7dd573dc3e0ba2e7d6bf76e16d400cf69b6afc2ae58f213e4eb1d133c38e938b
root@open-webui-ollama-849869d898-dkg82:/# ollama -v
ollama version is 0.11.2

Maybe it is a problem on the Unsloth side, since the "original" works fine. Or indeed some GGML lib problem.

Author
Owner

@kappa8219 commented on GitHub (Aug 6, 2025):

Quite intriguing what the result of quantizing an already pre-quantized model would be :)

Author
Owner

@snowarch commented on GitHub (Aug 6, 2025):

Yes... Ollama always tends to screw up GGUF models. I'm using the one from Unsloth with llama.cpp and it works great.

Author
Owner

@teodorgross commented on GitHub (Aug 6, 2025):

> Some more: […]
>
> Maybe it is a problem on the Unsloth side, since the "original" works fine. Or indeed some GGML lib problem.

It's working well in other tools, except Ollama.

Author
Owner

@musica2016 commented on GitHub (Aug 7, 2025):

Same problem.
Ollama version is 0.11.3-rc0
System: Ubuntu

Author
Owner

@406747925 commented on GitHub (Aug 7, 2025):

same problem
ollama version is 0.11.3
CUDA Version: 12.4
tesla v100

Author
Owner

@niehao100 commented on GitHub (Aug 7, 2025):

same problem
ollama version is 0.11.3
rocm version 1.15
RX7700xt

Author
Owner

@billchurch commented on GitHub (Aug 7, 2025):

I'm experiencing the same issue with the unsloth gpt-oss-20b model on my AMD setup. Here are my system details:

Environment:

  • Ollama version: 0.11.3
  • OS: Ubuntu 24.04 LTS (running in LXC container on Proxmox)
  • Kernel: 6.8.12-11-pve
  • CPU: AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  • GPU: AMD Radeon 890M (gfx1150) with 32GB VRAM allocated
  • ROCm/HSA: Using HSA Runtime 5.7.1 (libhsa-runtime64-1)

Model tested:
hf.co/unsloth/gpt-oss-20b-GGUF:F16

Error reproduced:

$ ollama run hf.co/unsloth/gpt-oss-20b-GGUF:F16 "test"
Error: 500 Internal Server Error: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-fcbc7ec4c2d1527c3da84b7049e59dc5af065876169216ec518bceab841e73f7

Relevant logs from journalctl:

gguf_init_from_file_impl: tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
gguf_init_from_file_impl: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load model from /usr/share/ollama/.ollama/models/blobs/sha256-fcbc7ec4c2d1527c3da84b7049e59dc5af065876169216ec518bceab841e73f7
llama_model_load_from_file_impl: failed to load model

The error is identical to what others are reporting - the tensor blk.0.ffn_down_exps.weight has an invalid ggml type 39 (NONE). This happens on AMD ROCm setup as well, not just CUDA environments.

Other models like gpt-oss:20b, llama3.2:1b, phi3:mini, and gemma3n:e4b work perfectly fine with 100% GPU offloading on this same setup.

Author
Owner

@srd00 commented on GitHub (Aug 7, 2025):

Apparently the solution is to update llama.cpp: https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/11
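
For anyone who wants to verify that a current llama.cpp loads these files, here is a minimal build-and-check sketch (assumes git, cmake, and a working compiler; see the llama.cpp build docs for backend flags such as -DGGML_CUDA=ON):

```shell
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build
cmake --build llama.cpp/build --config Release -j
# A build from Aug 2025 or later should load the gpt-oss GGUFs.
llama.cpp/build/bin/llama-cli --version
```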

Author
Owner

@mabhay3420 commented on GitHub (Aug 7, 2025):

On Apple silicon, the following version installed with brew works fine.
Thanks @srd00 for the suggestion to use a different version.

$ llama-cli --version
version: 6100 (65c797c4)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin23.6.0

Previously I was getting the error:

gguf_init_from_file_impl: tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
gguf_init_from_file_impl: failed to read tensor info
Author
Owner

@Kira-PH commented on GitHub (Aug 9, 2025):

I get this on mradermacher/gpt-oss-20b-uncensored-bf16-GGUF:Q8_0 and DavidAU/OpenAi-GPT-oss-20b-abliterated-uncensored-NEO-Imatrix-gguf:Q8_0

Image
Author
Owner

@billchurch commented on GitHub (Aug 9, 2025):

@HughPH There's definitely something missing from some Ollama distributions; 0.11.3 produces the same errors for me for all gpt-oss variations except the original from OpenAI. I compiled llama.cpp and I'm not having these problems. Ollama 0.11.3 should support this, but it's not working on my Linux distros.

Author
Owner

@Teravus commented on GitHub (Aug 10, 2025):

I have this issue too with the unsloth versions on Ollama.

ollama --version
ollama version is 0.11.4

llama_model_loader: - kv 42: quantize.imatrix.entries_count u32 = 193
llama_model_loader: - kv 43: quantize.imatrix.chunks_count u32 = 178
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q5_1: 1 tensors
llama_model_loader: - type q8_0: 169 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 20.55 GiB (8.44 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gpt-oss'
llama_model_load_from_file_impl: failed to load model
time=2025-08-09T17:20:47.030-07:00 level=INFO source=sched.go:453 msg="NewLlamaServer failed" model=

I don't have anything to add except, 'me too'.

Author
Owner

@ggerganov commented on GitHub (Aug 10, 2025):

Since none of the maintainers here seem to care enough to explain the actual reason for ollama to not support the HF GGUF models, while the root cause is pretty obvious, I will help explain it:

Before the model was released, the ollama devs decided to fork the ggml inference engine in order to implement gpt-oss support (https://github.com/ollama/ollama/pull/11672). In the process, they did not coordinate the changes with the upstream maintainers of ggml. As a result, the ollama implementation is not only incompatible with the vast majority of gpt-oss GGUFs that everyone else uses, but is also significantly slower and unoptimized. On the bright side, they were able to announce day-1 support for gpt-oss and get featured in the major announcements on the release day.

Now after the model has been released, the blogs and marketing posts have circled the internet and the dust has settled, it's time for ollama to throw out their ggml fork and copy the upstream implementation (https://github.com/ollama/ollama/pull/11823). For a few days, you will struggle and wonder why none of the GGUFs work, wasting your time to figure out what is going on, without any help or even with some wrong information. But none of this matters, because soon the upstream version of ggml will be merged and ollama will once again be fast and compatible.

Hope this helps.

Author
Owner

@fitlemon commented on GitHub (Aug 10, 2025):

> Since none of the maintainers here seem to care enough to explain the actual reason for ollama...
> ...
> Hope this helps.

Absolute legend ;). Because of this problem I migrated from Ollama to llama.cpp.

Author
Owner

@ericcurtin commented on GitHub (Aug 11, 2025):

Note all this stuff is a one-liner with Docker Model Runner:

docker model run ai/gpt-oss

Docker Model Runner uses llama.cpp, is open source, and is open to contributions just like llama.cpp. Get involved where appropriate.

Another neat feature is that the models are stored as OCI artifacts, so you can push them to any old OCI registry.
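
If I read the Docker Model Runner docs right, pushing to your own registry looks roughly like this (a sketch; the tag/push subcommands and the registry name are assumptions — check `docker model --help`):

```shell
docker model pull ai/gpt-oss
# Hypothetical registry; retag and push as an OCI artifact.
docker model tag ai/gpt-oss registry.example.com/models/gpt-oss
docker model push registry.example.com/models/gpt-oss
```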

Author
Owner

@Teravus commented on GitHub (Aug 11, 2025):

> Since none of the maintainers here seem to care enough to explain the actual reason for ollama to not support the HF GGUF models, while the root cause is pretty obvious, I will help explain it: […] But none of this matters, because soon the upstream version of ggml will be merged and ollama will once again be fast and compatible.
>
> Hope this helps.

While I think a lot of this is true, I did a speed comparison, and I don't agree that llama.cpp does it faster.

I downloaded and experimented with the CUDA build of the latest llama.cpp (b6123). I loaded the model with

llama-server --port 9001 -hf ggml-org/gpt-oss-20b-GGUF

and then pointed Open WebUI's OpenAI-compatible connection at the v1 endpoint.

It worked with the ggml-org MXFP4 model.

However, it returned about 5 tokens per second with a 4096 context. Ollama's default context length is also 4096, and the official model from the Ollama repo (assuming it is the MXFP4 version) runs at about 72 tokens per second on Ollama.

There are two 3090s in this machine, 128 GB of system memory, and an Intel 11900K. While it did use a tiny amount of resources on one graphics card, it preferred using the CPU for some reason. It even logged that the GPUs exist:

llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free

prompt eval time =   894.66 ms /  24 tokens ( 37.28 ms per token, 26.83 tokens per second)
       eval time = 93704.28 ms / 556 tokens (168.53 ms per token,  5.93 tokens per second)
      total time = 94598.94 ms / 580 tokens
srv  update_slots: all slots are idle

I appreciate the work that you've done; it's an absolute technical marvel.

And the reason people use things like Ollama is that they make it easy. They're adding ease of use to the mix. It's a different target audience.

There might be an arcane incantation documented somewhere that would make it work better on this machine. Ollama just loads it in a reasonable way without me having to think about it.

I also tried to run some of the Unsloth versions of the models, and llama-server would just freeze up between prompts. Open WebUI made the request and some text showed up in the console, but it stopped generating after the first prompt. Killing the llama-server process and re-running the startup command brought the server back up, and it was able to respond one more time before freezing again.

Just saying, that's my experience.

Author
Owner

@ngxson commented on GitHub (Aug 11, 2025):

@Teravus We are actively working on the problems that you mentioned, just give us a bit of time.

Having both best performance and good UX is not an easy task, given the community-driven nature of llama.cpp. Some llama.cpp maintainers even have to work during their vacations, only to have someone else copy their work without giving any credit.

Author
Owner

@ericcurtin commented on GitHub (Aug 11, 2025):

> While I think a lot of this is true, I did a speed comparison, and I don't agree that llama.cpp does it faster.
>
> I downloaded and experimented with the CUDA build of the latest llama.cpp (b6123). I loaded the model with llama-server --port 9001 -hf ggml-org/gpt-oss-20b-GGUF and then pointed Open WebUI's OpenAI-compatible connection at the v1 endpoint. […]

Don't have Nvidia hardware, but I think there are two key flags you are missing; try toggling them to see if they help (flash attention and cache reuse):

llama-server --flash-attn --cache-reuse 256 --port 9001 -hf ggml-org/gpt-oss-20b-GGUF

Author
Owner

@pwilkin commented on GitHub (Aug 11, 2025):

> There are two 3090s in this machine, 128 GB of system memory, and an Intel 11900K. While it did use a tiny amount of resources on one graphics card, it preferred using the CPU for some reason. It even logged that the GPUs exist: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free […]

I mean, you didn't tell llama.cpp to use the GPU, can't really complain too much 😄

Try llama-server -fa -ngl 99 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF and tell us how that benchmarks.
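
To check that the offload actually took effect, one option is to grep the startup log for the offload line (a sketch; the exact log wording varies across llama.cpp versions):

```shell
llama-server -fa -ngl 99 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF 2>&1 \
  | grep -i offloaded
# expect something like: load_tensors: offloaded 25/25 layers to GPU
```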

Author
Owner

@ericcurtin commented on GitHub (Aug 11, 2025):

Ha yes -ngl 999 is pretty crucial 😄

Author
Owner

@ericcurtin commented on GitHub (Aug 11, 2025):

I know --cache-reuse 256 has been recommended by @ggerganov in the past; I don't have Nvidia hardware myself, so I don't know how significant it is.

Author
Owner

@ITankForCAD commented on GitHub (Aug 11, 2025):

Better yet, use the dedicated benchmark binary provided by the fine folks at llama.cpp: llama-bench
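
For example, something like this (a sketch; the model path is a placeholder, and per the llama-bench docs -p is prompt tokens, -n is generated tokens, -ngl is offloaded layers, -fa toggles flash attention):

```shell
llama-bench -m /path/to/gpt-oss-20b.gguf -ngl 99 -fa 1 -p 512 -n 128
```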

Author
Owner

@mudler commented on GitHub (Aug 11, 2025):

> Since none of the maintainers here seem to care enough to explain the actual reason for ollama to not support the HF GGUF models, while the root cause is pretty obvious, I will help explain it:
>
> Before the model was released, the ollama devs decided to fork the ggml inference engine in order to implement gpt-oss support (#11672). […]

The worst part of all of this is that nothing will change in the long run. This has already been the case for a long time; another example is llama.cpp's multimodal capabilities. I have mixed feelings about Ollama exactly for this reason: it would have been much better if all the projects that depend on @ggerganov's and the ggml team's work had upstreamed their contributions directly, so that anyone in the ecosystem could benefit, avoiding vendor lock-in and duplicated effort everywhere.

For instance, in LocalAI, as in LM Studio and Docker, you will find that everything "just works" because of working as a community and actually giving credit where it belongs (you guys really rock ;)): consuming llama.cpp, reporting issues, and upstreaming any changes directly.

It is quite frustrating to see the open-source scene getting derailed lately by this kind of bad attitude.

Author
Owner

@Teravus commented on GitHub (Aug 11, 2025):

> > There are two 3090s in this machine, 128 GB of system memory, and an Intel 11900K. While it did use a tiny amount of resources on one graphics card, it preferred using the CPU for some reason. […]
>
> I mean, you didn't tell llama.cpp to use the GPU, can't really complain too much 😄
>
> Try llama-server -fa -ngl 99 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF and tell us how that benchmarks.

> Ha yes -ngl 999 is pretty crucial 😄

The fact that this is an inside joke that people are laughing about is exactly the problem. It wasn't communicated effectively to me on first use, and the tool didn't do the sensible thing itself given the situation it was running in.

I think ngxson gets it. I agree that delivering both the technical solution and something that is easy to use is hard.

I see people saying "X took my work without adding anything... it's just worse." I'm saying that they are adding something: they're working on the problem from the "it's too hard to get it to work well for most people" angle. My first user experience with llama-server is documented here. Had the program analyzed the hardware and the model and determined which layers to offload to the GPU, the post above probably wouldn't exist, because the tool would have been at least somewhat friendly and picked sensible defaults for the situation it was running in. That takes analysis and development effort: something has to examine the model and the resources in the environment and make sensible (maybe non-optimal, but sensible) decisions.

There's only so much time in a day. There are only so many developers working on this, and they have limited time; they even work on it during their vacations. That's exactly the point: the target audience is different. If I wanted to spend time reading the documentation and experimenting with which layers are best offloaded to the GPU, or bothering the developers with lots of questions, then Ollama might not be the best fit. But Ollama isn't just marketing itself better; it's focusing on a different kind of user.

Author
Owner

@ericcurtin commented on GitHub (Aug 11, 2025):

> > > There are two 3090s in this machine, and 128GB of system memory and an Intel 11900K, and... while it did use a tiny amount of resources on one graphics card, it preferred using the CPU for some reason. It even logged that the GPUs exist: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
>
> > I mean, you didn't tell llama.cpp to use the GPU, can't really complain too much 😄
> > Try `llama-server -fa -ngl 99 --jinja --port 9001 -hf # ggml-org/gpt-oss-20b-GGUF` and tell us how that benchmarks.
>
> > Ha yes `-ngl 999` is pretty crucial 😄
>
> The fact that this is an inside joke, and that people are laughing about it, is exactly the problem. This wasn't communicated effectively to me on first use. And it didn't decide to do this sensible thing itself given the situation it was running in.
>
> I think ngxson gets it. I agree that doing *both* the technical solution *and* having something that is easy to use is hard.
>
> I see people saying that "X took my work without adding anything... it's just worse". I'm saying that they *are* adding something. They're working on the problem from an 'it's too hard to get it to work well for most people' angle. My 'first user experience with llama-server' is documented here. Had the program analyzed the hardware and the model and determined which layers to offload to the GPU, the post above probably wouldn't be there... because it would have been, at the very least, somewhat friendly and picked sensible defaults given the situation it is running in. That takes analysis and development effort: something has to analyze the model and the resources in the environment it is running under and make sensible (maybe non-optimal, but sensible) decisions.
>
> There's only so much time in a day. There are only so many developers working on this, and they have limited time. They work on this during their vacations. That's exactly the point. The target audience is different. If I wanted to spend time optimizing, reading the documentation, and experimenting with which layers are best offloaded to the GPU, or bothering the developers with lots of questions... then ollama might not be the best to use. However, Ollama isn't just marketing itself better, it's focusing on a different kind of user.

Just trying to help out @Teravus, not make fun of you! I missed that `-ngl 999` was missing from that command line too; it's no big deal, I made the same mistake as you here, I was actually poking fun at myself 😄

One of the things Docker Model Runner, LocalAI and LM Studio try to do is set the correct flags in llama-server under the hood, so users require fewer instructions. They are all users of upstream llama.cpp.
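
For illustration, here is a minimal sketch of the kind of flag-picking such wrappers do, assuming bash and NVIDIA hardware with `nvidia-smi` on the PATH; the model name and flag choices simply mirror the suggestions in this thread, not any tool's actual implementation:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper sketch (not how any of the named tools actually do it):
# pick llama-server flags based on what hardware is visible.
MODEL="ggml-org/gpt-oss-20b-GGUF"        # example model from this thread
ARGS=(--jinja --port 9001 -hf "$MODEL")

if command -v nvidia-smi >/dev/null 2>&1; then
  # A CUDA GPU is present: enable flash attention and offload all layers.
  ARGS+=(-fa -ngl 999)
fi

exec llama-server "${ARGS[@]}"
```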

Author
Owner

@ericcurtin commented on GitHub (Aug 11, 2025):

@ggerganov @ngxson could we document reasonable defaults somewhere in upstream llama.cpp as a short-term solution? Kinda like how q4_k_m is a reasonable default for gguf? In my head this is what I have. Even if they are not perfect and don't fit every little use case, it's better than nothing. I volunteer to open a PR, but I need help with the info! I don't have enough experience with the various stacks and hardware to know exactly. One thing I do know from running CPU inferencing on an Ampere machine with a tonne of CPU cores is that "--threads (number of cores/2)" seems like a reasonable default (see the sketch after the list):

CPU:

`llama-server --jinja --cache-reuse 256 --threads (number of cores/2) -hf some-model`

CUDA:

`llama-server --jinja -fa -ngl 999 --cache-reuse 256 --threads (number of cores/2) -hf some-model`

METAL:

`llama-server --jinja -fa -ngl 999 --cache-reuse 256 --threads (number of cores/2) -hf some-model`

ROCM: ?
VULKAN: ?
OPENCL: ?
MUSA: ?
CANN: ?
BLAS: ?
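
A minimal sketch of how that `--threads (number of cores/2)` default could be derived on Linux, assuming `nproc` is available (halving the logical CPU count roughly approximates the physical core count on SMT machines):

```bash
# Hedged sketch: launch llama-server with half the logical CPU count.
THREADS=$(( $(nproc) / 2 ))
llama-server --jinja --cache-reuse 256 --threads "$THREADS" -hf some-model
```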

Author
Owner

@pwilkin commented on GitHub (Aug 11, 2025):

@Teravus I mean - yeah, you're right. It's very often a case of first UX impressions. I have the same problem with vLLM that you have with llama.cpp - it's extremely annoying to work with because getting to the sane options for my setup takes a lot of time.

For what it's worth, the llama.cpp guys have already mentioned having a "try reasonable defaults for user setup" default launch - but of course that's easier said than implemented. And yes, having more informative error messages (i.e. instead of "Error X" having "Error X, you might want to try this and this") would probably help as well.

But this thread is about something entirely different. If you build upon a technology, in the OSS world it's a good habit to actually contribute back to the technology you use if you build something new. Another good habit is to work together if your solution would add some new feature to the code - and only fork if it's absolutely obvious that you're not going to be able to work together (that does happen for entirely valid reasons - the original code owners might have another vision of how to do various changes; or they have another vision of what should be added first; or they are doing their own refactor and don't want your code to mess things up). But Ollama has, again and again, done the opposite of that - made hacky solutions of their own on top of existing llama.cpp / ggml code instead of contributing to the baseline, then taken the fixes that the ggml team has done as "new features" or "bugfixes" of their own platform. That's the thing @ggerganov is pointing out here.

Author
Owner

@Teravus commented on GitHub (Aug 11, 2025):

> @Teravus I mean - yeah, you're right. It's very often a case of first UX impressions. I have the same problem with vLLM that you have with llama.cpp - it's extremely annoying to work with because getting to the sane options for my setup takes a lot of time.
>
> For what it's worth, the llama.cpp guys have already mentioned having a "try reasonable defaults for user setup" default launch - but of course that's easier said than implemented. And yes, having more informative error messages (i.e. instead of "Error X" having "Error X, you might want to try this and this") would probably help as well.
>
> But this thread is about something entirely different. If you build upon a technology, in the OSS world it's a good habit to actually contribute back to the technology you use if you build something new. Another good habit is to work together if your solution would add some new feature to the code - and only fork if it's absolutely obvious that you're not going to be able to work together (that does happen for entirely valid reasons - the original code owners might have another vision of how to do various changes; or they have another vision of what should be added first; or they are doing their own refactor and don't want your code to mess things up). But Ollama has, again and again, done the opposite of that - made hacky solutions of their own on top of existing llama.cpp / ggml code instead of contributing to the baseline, then taken the fixes that the ggml team has done as "new features" or "bugfixes" of their own platform. That's the thing @ggerganov is pointing out here.

Ollama, for sure, needs to provide something to the user that says they're using code from llama.cpp. Usually this is in an about box. I don't even see an about box, so it doesn't look like they're complying with that. I only know that ollama uses llama.cpp under the hood from having issues with some models and needing to manually prepare them with llama.cpp in order for them to work under ollama. It was at that point that I learned that the underlying technology, the one that was state of the art and that the 'good' implementations relied on, was llama.cpp. The only mention of llama.cpp that I see isn't really a 'we use this software' reference. It's just:

> Supported backends
> llama.cpp project founded by Georgi Gerganov.

This doesn't seem like enough. It skirts the issue by treating llama.cpp like a back-end.

A deeper look down the rabbit hole: it looks like Ollama documents the "patches" they make to llama.cpp here:
https://github.com/ollama/ollama/tree/main/llama/patches
The last batch of them have been about gpt-oss.

It also looks like there may be some NDAs involved. From a comment: "this is exactly how they prepped the 0.11.0 release w/o breaking OpenAI NDAs, it made a cuda crash refer to a nonsense line number." I'm not sure how that affected the situation with gpt-oss.

Author
Owner

@Teravus commented on GitHub (Aug 11, 2025):

> > > > There are two 3090s in this machine, and 128GB of system memory and an Intel 11900K, and... while it did use a tiny amount of resources on one graphics card, it preferred using the CPU for some reason. It even logged that the GPUs exist: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
> >
> > > I mean, you didn't tell llama.cpp to use the GPU, can't really complain too much 😄
> > > Try `llama-server -fa -ngl 99 --jinja --port 9001 -hf # ggml-org/gpt-oss-20b-GGUF` and tell us how that benchmarks.
> >
> > > Ha yes `-ngl 999` is pretty crucial 😄
> >
> > The fact that this is an inside joke, and that people are laughing about it, is exactly the problem. This wasn't communicated effectively to me on first use. And it didn't decide to do this sensible thing itself given the situation it was running in.
> > I think ngxson gets it. I agree that doing *both* the technical solution *and* having something that is easy to use is hard.
> > I see people saying that "X took my work without adding anything... it's just worse". I'm saying that they *are* adding something. They're working on the problem from an 'it's too hard to get it to work well for most people' angle. My 'first user experience with llama-server' is documented here. Had the program analyzed the hardware and the model and determined which layers to offload to the GPU, the post above probably wouldn't be there... because it would have been, at the very least, somewhat friendly and picked sensible defaults given the situation it is running in. That takes analysis and development effort: something has to analyze the model and the resources in the environment it is running under and make sensible (maybe non-optimal, but sensible) decisions.
> > There's only so much time in a day. There are only so many developers working on this, and they have limited time. They work on this during their vacations. That's exactly the point. The target audience is different. If I wanted to spend time optimizing, reading the documentation, and experimenting with which layers are best offloaded to the GPU, or bothering the developers with lots of questions... then ollama might not be the best to use. However, Ollama isn't just marketing itself better, it's focusing on a different kind of user.
>
> Just trying to help out @Teravus, not make fun of you! I missed that `-ngl 999` was missing from that command line too; it's no big deal, I made the same mistake as you here, I was actually poking fun at myself 😄
>
> One of the things Docker Model Runner, LocalAI and LM Studio try to do is set the correct flags in llama-server under the hood, so users require fewer instructions. They are all users of upstream llama.cpp.

I think you've been very reasonable. I know that 'thumbs down' doesn't really mean anything in terms of effects on accounts or anything.

> @Teravus I mean - yeah, you're right. It's very often a case of first UX impressions. I have the same problem with vLLM that you have with llama.cpp - it's extremely annoying to work with because getting to the sane options for my setup takes a lot of time.
>
> For what it's worth, the llama.cpp guys have already mentioned having a "try reasonable defaults for user setup" default launch - but of course that's easier said than implemented. And yes, having more informative error messages (i.e. instead of "Error X" having "Error X, you might want to try this and this") would probably help as well.
>
> But this thread is about something entirely different. If you build upon a technology, in the OSS world it's a good habit to actually contribute back to the technology you use if you build something new. Another good habit is to work together if your solution would add some new feature to the code - and only fork if it's absolutely obvious that you're not going to be able to work together (that does happen for entirely valid reasons - the original code owners might have another vision of how to do various changes; or they have another vision of what should be added first; or they are doing their own refactor and don't want your code to mess things up). But Ollama has, again and again, done the opposite of that - made hacky solutions of their own on top of existing llama.cpp / ggml code instead of contributing to the baseline, then taken the fixes that the ggml team has done as "new features" or "bugfixes" of their own platform. That's the thing @ggerganov is pointing out here.

A side note: I still think it's funny how many people 'thumbs downed' the documentation of a first user experience.
People can 'thumbs down' anything they want. It doesn't affect anything from a comment or account perspective. It doesn't mark it as spam. Just that they don't like it.

I wonder if they realize that they're actually thumbs-downing the result of the first-user experience. By transitivity, they're thumbs-downing the first-user experience.

Author
Owner

@ericcurtin commented on GitHub (Aug 11, 2025):

> > > > > There are two 3090s in this machine, and 128GB of system memory and an Intel 11900K, and... while it did use a tiny amount of resources on one graphics card, it preferred using the CPU for some reason. It even logged that the GPUs exist: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
> >
> > > > I mean, you didn't tell llama.cpp to use the GPU, can't really complain too much 😄
> > > > Try `llama-server -fa -ngl 99 --jinja --port 9001 -hf # ggml-org/gpt-oss-20b-GGUF` and tell us how that benchmarks.
> >
> > > > Ha yes `-ngl 999` is pretty crucial 😄
> >
> > > The fact that this is an inside joke, and that people are laughing about it, is exactly the problem. This wasn't communicated effectively to me on first use. And it didn't decide to do this sensible thing itself given the situation it was running in.
> > > I think ngxson gets it. I agree that doing *both* the technical solution *and* having something that is easy to use is hard.
> > > I see people saying that "X took my work without adding anything... it's just worse". I'm saying that they *are* adding something. They're working on the problem from an 'it's too hard to get it to work well for most people' angle. My 'first user experience with llama-server' is documented here. Had the program analyzed the hardware and the model and determined which layers to offload to the GPU, the post above probably wouldn't be there... because it would have been, at the very least, somewhat friendly and picked sensible defaults given the situation it is running in. That takes analysis and development effort: something has to analyze the model and the resources in the environment it is running under and make sensible (maybe non-optimal, but sensible) decisions.
> > > There's only so much time in a day. There are only so many developers working on this, and they have limited time. They work on this during their vacations. That's exactly the point. The target audience is different. If I wanted to spend time optimizing, reading the documentation, and experimenting with which layers are best offloaded to the GPU, or bothering the developers with lots of questions... then ollama might not be the best to use. However, Ollama isn't just marketing itself better, it's focusing on a different kind of user.
> >
> > Just trying to help out @Teravus, not make fun of you! I missed that `-ngl 999` was missing from that command line too; it's no big deal, I made the same mistake as you here, I was actually poking fun at myself 😄
> > One of the things Docker Model Runner, LocalAI and LM Studio try to do is set the correct flags in llama-server under the hood, so users require fewer instructions. They are all users of upstream llama.cpp.
>
> I think you've been very reasonable. I know that 'thumbs down' doesn't really mean anything in terms of effects on accounts or anything.
>
> > @Teravus I mean - yeah, you're right. It's very often a case of first UX impressions. I have the same problem with vLLM that you have with llama.cpp - it's extremely annoying to work with because getting to the sane options for my setup takes a lot of time.
> > For what it's worth, the llama.cpp guys have already mentioned having a "try reasonable defaults for user setup" default launch - but of course that's easier said than implemented. And yes, having more informative error messages (i.e. instead of "Error X" having "Error X, you might want to try this and this") would probably help as well.
> > But this thread is about something entirely different. If you build upon a technology, in the OSS world it's a good habit to actually contribute back to the technology you use if you build something new. Another good habit is to work together if your solution would add some new feature to the code - and only fork if it's absolutely obvious that you're not going to be able to work together (that does happen for entirely valid reasons - the original code owners might have another vision of how to do various changes; or they have another vision of what should be added first; or they are doing their own refactor and don't want your code to mess things up). But Ollama has, again and again, done the opposite of that - made hacky solutions of their own on top of existing llama.cpp / ggml code instead of contributing to the baseline, then taken the fixes that the ggml team has done as "new features" or "bugfixes" of their own platform. That's the thing @ggerganov is pointing out here.
>
> A side note: I still think it's funny how many people 'thumbs downed' the documentation of a first user experience. People can 'thumbs down' anything they want. It doesn't affect anything from a comment or account perspective. It doesn't mark it as spam. Just that they don't like it.
>
> I wonder if they realize that they're actually thumbs-downing the result of the first-user experience. By transitivity, they're thumbs-downing the first-user experience.

The problem with reactions is that sometimes people are reacting to a portion of the post. It's being given a thumbs down because this was a comparison between GPU-accelerated Ollama and CPU-based llama.cpp. GPU will beat CPU, of course.

If you do GPU vs GPU, llama.cpp wins.

I think more documentation is a good idea.

Author
Owner

@Teravus commented on GitHub (Aug 11, 2025):

> `llama-server -fa -ngl 99 --jinja --port 9001 -hf # ggml-org/gpt-oss-20b-GGUF`
>
> > > > > > There are two 3090s in this machine, and 128GB of system memory and an Intel 11900K, and... while it did use a tiny amount of resources on one graphics card, it preferred using the CPU for some reason. It even logged that the GPUs exist: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
> > >
> > > > > I mean, you didn't tell llama.cpp to use the GPU, can't really complain too much 😄
> > > > > Try `llama-server -fa -ngl 99 --jinja --port 9001 -hf # ggml-org/gpt-oss-20b-GGUF` and tell us how that benchmarks.
> > >
> > > > > Ha yes `-ngl 999` is pretty crucial 😄
> > >
> > > > The fact that this is an inside joke, and that people are laughing about it, is exactly the problem. This wasn't communicated effectively to me on first use. And it didn't decide to do this sensible thing itself given the situation it was running in.
> > > > I think ngxson gets it. I agree that doing *both* the technical solution *and* having something that is easy to use is hard.
> > > > I see people saying that "X took my work without adding anything... it's just worse". I'm saying that they *are* adding something. They're working on the problem from an 'it's too hard to get it to work well for most people' angle. My 'first user experience with llama-server' is documented here. Had the program analyzed the hardware and the model and determined which layers to offload to the GPU, the post above probably wouldn't be there... because it would have been, at the very least, somewhat friendly and picked sensible defaults given the situation it is running in. That takes analysis and development effort: something has to analyze the model and the resources in the environment it is running under and make sensible (maybe non-optimal, but sensible) decisions.
> > > > There's only so much time in a day. There are only so many developers working on this, and they have limited time. They work on this during their vacations. That's exactly the point. The target audience is different. If I wanted to spend time optimizing, reading the documentation, and experimenting with which layers are best offloaded to the GPU, or bothering the developers with lots of questions... then ollama might not be the best to use. However, Ollama isn't just marketing itself better, it's focusing on a different kind of user.
> > >
> > > Just trying to help out @Teravus, not make fun of you! I missed that `-ngl 999` was missing from that command line too; it's no big deal, I made the same mistake as you here, I was actually poking fun at myself 😄
> > > One of the things Docker Model Runner, LocalAI and LM Studio try to do is set the correct flags in llama-server under the hood, so users require fewer instructions. They are all users of upstream llama.cpp.
> >
> > I think you've been very reasonable. I know that 'thumbs down' doesn't really mean anything in terms of effects on accounts or anything.
> >
> > > @Teravus I mean - yeah, you're right. It's very often a case of first UX impressions. I have the same problem with vLLM that you have with llama.cpp - it's extremely annoying to work with because getting to the sane options for my setup takes a lot of time.
> > > For what it's worth, the llama.cpp guys have already mentioned having a "try reasonable defaults for user setup" default launch - but of course that's easier said than implemented. And yes, having more informative error messages (i.e. instead of "Error X" having "Error X, you might want to try this and this") would probably help as well.
> > > But this thread is about something entirely different. If you build upon a technology, in the OSS world it's a good habit to actually contribute back to the technology you use if you build something new. Another good habit is to work together if your solution would add some new feature to the code - and only fork if it's absolutely obvious that you're not going to be able to work together (that does happen for entirely valid reasons - the original code owners might have another vision of how to do various changes; or they have another vision of what should be added first; or they are doing their own refactor and don't want your code to mess things up). But Ollama has, again and again, done the opposite of that - made hacky solutions of their own on top of existing llama.cpp / ggml code instead of contributing to the baseline, then taken the fixes that the ggml team has done as "new features" or "bugfixes" of their own platform. That's the thing @ggerganov is pointing out here.
> >
> > A side note: I still think it's funny how many people 'thumbs downed' the documentation of a first user experience. People can 'thumbs down' anything they want. It doesn't affect anything from a comment or account perspective. It doesn't mark it as spam. Just that they don't like it.
> > I wonder if they realize that they're actually thumbs-downing the result of the first-user experience. By transitivity, they're thumbs-downing the first-user experience.
>
> The problem with reactions is that sometimes people are reacting to a portion of the post. It's being given a thumbs down because this was a comparison between GPU-accelerated Ollama and CPU-based llama.cpp. GPU will beat CPU, of course.
>
> If you do GPU vs GPU, llama.cpp wins.
>
> I think more documentation is a good idea.

Yep, after using:
`llama-server -fa -ngl 99 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF`

prompt eval time = 494.48 ms / 1651 tokens ( 0.30 ms per token, 3338.89 tokens per second)
eval time = 3214.37 ms / 386 tokens ( 8.33 ms per token, 120.09 tokens per second)
total time = 3708.84 ms / 2037 tokens

120 t/s llama.cpp, 72 t/s Ollama

llama.cpp wins.

I'm going to leave the old post unedited. I'm curious how many thumbs-downs it will get.

Author
Owner

@JohannesGaessler commented on GitHub (Aug 12, 2025):

> My 'first user experience with llama-server' is documented here. Had the program analyzed the hardware and the model and determined which layers to offload to the GPU, the post above probably wouldn't be there... because it would have been, at the very least, somewhat friendly and picked sensible defaults given the situation it is running in.

By necessity, downstream projects like ollama use rough estimates to choose the number of GPU layers. And to avoid OOMing these estimates need to be chosen very conservatively, leaving a lot of performance on the table. It is my opinion that in such cases it's better not to set the number of GPU layers automatically since otherwise a lot of users would be unknowingly using bad defaults. One of my current projects is to implement this automation properly in llama.cpp, maybe I'll bump it up in terms of priority.
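
To make the trade-off concrete, here is an assumption-laden sketch, not llama.cpp's or ollama's actual logic; the per-layer cost, layer count, and safety margin below are all invented for illustration:

```bash
# Hypothetical -ngl estimate from free VRAM; every constant is an assumption.
FREE_MIB=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
LAYER_MIB=350       # assumed VRAM cost per layer (weights plus KV-cache share)
N_LAYERS=32         # assumed total layer count for the model
SAFETY_MIB=2048     # head-room: the conservative part that costs performance
NGL=$(( (FREE_MIB - SAFETY_MIB) / LAYER_MIB ))
(( NGL > N_LAYERS )) && NGL=$N_LAYERS
(( NGL < 0 )) && NGL=0
echo "would run: llama-server -fa -ngl $NGL ..."
```

The larger the safety margin, the more layers stay on the CPU; that is exactly the performance left on the table described above.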

Author
Owner

@Teravus commented on GitHub (Aug 12, 2025):

> > My 'first user experience with llama-server' is documented here. Had the program analyzed the hardware and the model and determined which layers to offload to the GPU, the post above probably wouldn't be there... because it would have been, at the very least, somewhat friendly and picked sensible defaults given the situation it is running in.
>
> By necessity, downstream projects like ollama use rough estimates to choose the number of GPU layers. And to avoid OOMing these estimates need to be chosen very conservatively, leaving a lot of performance on the table. It is my opinion that in such cases it's better not to set the number of GPU layers automatically since otherwise a lot of users would be unknowingly using bad defaults. One of my current projects is to implement this automation properly in llama.cpp, maybe I'll bump it up in terms of priority.

I understand the perspective.

I guess it boils down to: "Who is the project serving?"

Who is the target audience?

llama.cpp is targeted at a different, more niche user than ollama.

That, in and of itself, isn't a bad thing, as long as llama.cpp understands that many more users will flock to ollama first, and ollama will be many people's first introduction to LLMs... because it's easier and less intimidating than llama.cpp. It will be more popular, and more people will use it as a result. The more niche users will continue to use llama.cpp directly. There may be some converts from ollama to llama.cpp who try ollama and want to do more, but they'll be the same kind of niche users that llama.cpp attracts with its focus.

llama.cpp

  • advanced users who want the latest/greatest/fastest at the cost of having to manage significantly more detail: they know their architecture, have looked at the model, understand the template, and are willing to spend some time on this 'per model'. Better for an actual single hosted production model where you control every aspect.

ollama

  • low skill and medium skill users who just don't want to spend the time in the trenches that people who use llama.cpp have invested, per model. They just want to load a model in their application, which already has an integration with ollama specifically (and *some* support for OpenAI-style v1 endpoints), or they want to use the new 'built-in ui' that arrives in 0.11.4, and are either OK with leaving some performance on the table or don't know any better. Model templates, stop tokens and other minutiae are handled automatically, either by ollama or their model marketplace. Switching models on the fly is supported right from the application. No need to mess with semi-undocumented yaml files like with llama-swap. Docker is not required.

If llama.cpp doesn't want that kind of user, then they're doing OK already.

What a lot of people in this thread don't want to hear is that, if llama.cpp wants to be 'more popular' or be users' first introduction to running LLMs locally, it has to serve that second type of user as long as something like ollama exists.
Easier = more users.
More detail = fewer, but more invested, users.

A portion of the community who uses llama.cpp (not the core developers) suggests that all of the innovation comes from llama.cpp and ollama is just a thin wrapper around llama.cpp. If that's the case, then it should be simple to 'just make it easier': 'just make a wrapper that does it'.

Again, this isn't to say that Ollama is doing well. They're not, in a few ways, but one specific issue is: they're clearly using code from gguf and llama.cpp without correct MIT license attribution. The only mention of llama.cpp is in the readme.md as 'the one and only backend', and that's not enough to satisfy the license. At the very least, it needs an about box in user space that includes attribution to the software they use.


@Kira-PH commented on GitHub (Aug 13, 2025):

> llama.cpp
>
> * advanced users who want the latest/greatest/fastest at the cost of having to manage significantly more detail: they know their architecture, have looked at the model, understand the template, and are willing to spend some time on this per model. Better for an actual single hosted production model where you control every aspect.
>
> ollama
>
> * low-skill and medium-skill users who just don't want to spend the time in the trenches that people who use llama.cpp have invested per model. They just want to load a model in their application, which already has an integration with ollama specifically (and some support for OpenAI-style v1 endpoints), or they want to use the new 'built-in ui' that arrives in 0.11.4, and are either OK with leaving some performance on the table or don't know any better. Model templates, stop tokens and other minutiae are handled automatically, either by ollama or its model marketplace. Switching models on the fly is supported right from the application. No need to mess with semi-undocumented yaml files like with llama-swap. Docker is not required.

There are people in the middle of those extremes. I was the AI/ML domain architect at IFS, and I've been coding professionally for some 25 years. I read the chat templates and code against the /generate endpoint, because I've been messing with LLMs since before the GPT-3 closed beta, and I feel happier getting that bit closer to the model. But fundamentally I just want to offload the effort when I'm tinkering at home. If I were in a professional environment I'd dig into llama-server and get to know it intimately, but for idly messing about I just want to load a model and go. It's like an F1 mechanic: when he's not at work, he just wants to get in his Audi and drive somewhere, not spend an hour preheating his tyres, tinkering with the engine and reading telemetry data.
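A concrete illustration of the /generate endpoint mentioned above, as a minimal sketch: this assumes a local Ollama server on the default port and that the gpt-oss:20b tag is installed; the prompt is just an example.

```shell
# Minimal non-streaming call to Ollama's /api/generate endpoint.
# Model tag and prompt are example values.
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```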

<!-- gh-comment-id:3181804459 -->

@mkultra333 commented on GitHub (Aug 14, 2025):

So I've wasted hours and hours over two nights slowly downloading and re-downloading 15GB models from HF, only to find out the reason they aren't working is that Ollama is using botched, rushed code and didn't warn anyone that it was broken for the gpt-oss models apart from their own?

Thanks Ollama.

Their implementation ran so damn slow anyway; getting a response from their own 20B model on my 5060ti is like watching paint dry, as all the work seems to be happening on the CPU instead of the GPU even though there's VRAM to spare.

Ugh. I'm trying LMStudio to see if that's any better. Pity I also now have to re-download the models AGAIN, because for some reason Ollama has to put every GGUF in its own weird blob format. This is so tedious.
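For what it's worth, the re-download may be avoidable: Ollama's Modelfile `FROM` line accepts a local GGUF path, so an already-downloaded file can be imported directly. A minimal sketch, with hypothetical file and tag names:

```shell
# Import an existing GGUF into Ollama instead of pulling it again.
# The file path and the tag name are hypothetical examples.
cat > Modelfile <<'EOF'
FROM ./gpt-oss-20b-Q4_K_M.gguf
EOF
ollama create gpt-oss-20b-local -f Modelfile
ollama run gpt-oss-20b-local
```

(Note this still copies the weights into Ollama's blob store, but it skips the network transfer.)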

<!-- gh-comment-id:3188159945 -->

@kappa8219 commented on GitHub (Aug 15, 2025):

With 0.11.5-rc2 things changed: GGUFs are running. Still, what I'm seeing is that the "original" gpt-oss-20b is almost twice as fast as the Unsloth GGUF one, at comparable sizes (13G vs 11G), both 100% on GPU.

![Image](https://github.com/user-attachments/assets/f2bed93c-b4fc-4125-bf2f-1efab8f0729d)

<!-- gh-comment-id:3190879938 -->

@expnn commented on GitHub (Sep 4, 2025):

Ollama 0.11.8 runs successfully at first, but it crashes after generating a small amount of text. See my example here: https://github.com/ollama/ollama/issues/10993#issuecomment-3248383362

<!-- gh-comment-id:3251566672 -->

@shimmyshimmer commented on GitHub (Sep 5, 2025):

> With 0.11.5-rc2 things changed: GGUFs are running. Still, what I'm seeing is that the "original" gpt-oss-20b is almost twice as fast as the Unsloth GGUF one, at comparable sizes (13G vs 11G), both 100% on GPU.
>
> ![Image](https://github.com/user-attachments/assets/f2bed93c-b4fc-4125-bf2f-1efab8f0729d)

This is because we preset our GGUFs with `top_k = 0`, which slows them down a lot in Ollama. In our testing, removing the `top_k` setting gives the same results as Ollama's own version.

In the future we will change the preset from `top_k = 0` to maybe 64 or 128 instead.
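Until that change ships, the preset can apparently be overridden locally by deriving a new tag. A minimal sketch, assuming the Unsloth GGUF imports cleanly on your Ollama version; the tag name is hypothetical:

```shell
# Derive a local tag that overrides the preset top_k (names are examples).
cat > Modelfile <<'EOF'
FROM hf.co/unsloth/gpt-oss-20b-GGUF
PARAMETER top_k 64
EOF
ollama create gpt-oss-20b-fast -f Modelfile
ollama run gpt-oss-20b-fast
```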

<!-- gh-comment-id:3259963086 -->

@JohannesGaessler commented on GitHub (Sep 6, 2025):

I don't know what ollama uses for sampling, but in llama.cpp the issue with top-k = 0 was that the fast custom bucket sort was only implemented for top-k, so disabling top-k fell back to the slower `std::sort` over the whole token array. The implementation was generalized in https://github.com/ggml-org/llama.cpp/pull/15665, plus an optimization that first tries sorting only the top 128 tokens (which should be enough for most cases).
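To make the cost difference concrete, here is a toy C++ sketch of the two paths. It is not llama.cpp's actual implementation (that uses a custom bucket sort); `std::partial_sort` merely stands in for the "only order the top k candidates" idea.

```cpp
#include <algorithm>
#include <cstdio>
#include <functional>
#include <random>
#include <vector>

int main() {
    // Roughly vocabulary-sized array of fake logits (gpt-oss has ~200k tokens).
    const size_t vocab = 200000;
    std::vector<float> logits(vocab);
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 4.0f);
    for (float &l : logits) l = dist(rng);

    // top_k = 0 path: sort *every* candidate, O(V log V) per generated token.
    std::vector<float> full = logits;
    std::sort(full.begin(), full.end(), std::greater<float>());

    // top_k = 64 path: order only the k best, roughly O(V + k log k).
    const size_t k = 64;
    std::vector<float> topk = logits;
    std::partial_sort(topk.begin(), topk.begin() + k, topk.end(),
                      std::greater<float>());

    std::printf("best logit (full sort):    %f\n", full[0]);
    std::printf("best logit (partial sort): %f\n", topk[0]);
    return 0;
}
```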

<!-- gh-comment-id:3261518987 -->

@OracleToes commented on GitHub (Sep 27, 2025):

I'm still getting this problem, and from the conversation in this issue it seems we know how to fix it, so why is it that a month later we still can't run GGUFs of gpt-oss models?
It's worth noting that the GGUF models work in the playground, but not in the regular chat interface.

<!-- gh-comment-id:3340963233 -->
Reference: github-starred/ollama#33515