[GH-ISSUE #12637] Wrap llama.cpp #8389

Closed
opened 2026-04-12 21:02:10 -05:00 by GiteaMirror · 2 comments

Originally created by @iplayfast on GitHub (Oct 15, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12637

At one time ollama wrapped llama.cpp; now apparently it doesn't.
This thread on twitter/x, https://x.com/DFinsterwalder/status/1978372050239516989,
shows benchmarks where ollama loses to llama.cpp by a large margin.

I was quite surprised by this, as I had thought that ollama wrapped llama.cpp.
I think maybe ollama needs to go back to its roots.

GiteaMirror added the feature request label 2026-04-12 21:02:10 -05:00

@rick-github commented on GitHub (Oct 15, 2025):

The tweet doesn't indicate where the table came from, and it's not mentioned in the llama.cpp discussion, so I ran some tests to get actual data. Since I don't have access to a DGX Spark, these tests were done on an RTX 3080, using the gpt-oss:20b GGUF published by ggml.org.

ollama:

```console
$ for i in {1..5} ; do ollama run ggml-org/gpt-oss:20b --verbose hello 2>&1 ; done | sed -ne 's/^eval rate: *\([^ ]*\) .*/\1/p' | gnuplot -e 'stats "-" nooutput; print STATS_mean,STATS_stddev'
117.6 1.73046814475159
```
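
The same per-run rate can also be sampled through ollama's HTTP API instead of scraping the CLI output. A minimal sketch, assuming the default server on localhost:11434 and the same model tag; the API reports `eval_count` and `eval_duration` (in nanoseconds) in its response, so tokens/s is `eval_count / eval_duration * 1e9`:

```console
$ # non-streaming generate request; print tokens/s per run, then mean and stddev as above
$ for i in {1..5} ; do curl -s localhost:11434/api/generate -d '{"model":"ggml-org/gpt-oss:20b","prompt":"hello","stream":false}' | jq '.eval_count / .eval_duration * 1e9' ; done | gnuplot -e 'stats "-" nooutput; print STATS_mean,STATS_stddev'
```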

llama.cpp:

```console
$ llama-server -m gpt-oss-20b.gguf -ngl 25 --port 8000 &
$ for i in {1..5} ; do curl -s localhost:8000/v1/chat/completions -d '{"model":"gpt-oss:20b","messages":[{"role":"user","content":"hello"}]}' | jq .timings.predicted_per_second ; done | gnuplot -e 'stats "-" nooutput; print STATS_mean,STATS_stddev'
121.319669439504 1.74871199165077
```

So llama.cpp is about 3% faster.
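
That ~3% figure follows directly from the two means above; a quick check with bc:

```console
$ echo '(121.319669439504 - 117.6) / 117.6 * 100' | bc -l
3.162984...
```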


@pdevine commented on GitHub (Oct 16, 2025):

Here are the 120b full precision (BF16 qkv tensor) numbers I'm seeing on `0.11.6rc1`:

```
total duration:       17.1862498s
load duration:        111.612653ms
prompt eval count:    142 token(s)
prompt eval duration: 167.295603ms
prompt eval rate:     848.80 tokens/s
eval count:           709 token(s)
eval duration:        16.713867189s
eval rate:            42.42 tokens/s
```

That's faster than what Georgi was posting for the half-precision (q8_0 qkv tensor) weights.
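
For reference, the reported eval rate is just eval count divided by eval duration; checking the numbers above with bc:

```console
$ echo '709 / 16.713867189' | bc -l
42.419865...
```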

I'm going to close this as unhelpful.

Reference: github-starred/ollama#8389