[GH-ISSUE #14677] Llamafile tech on ollama #56013

Closed
opened 2026-04-29 10:08:35 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @RevoltPW on GitHub (Mar 6, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14677

Why isn't the [llamafile](https://github.com/mozilla-ai/llamafile) tech incorporated into Ollama? I'm currently using models downloaded from Ollama on llamafile, and they run faster on my CPU and consume less RAM. It would be great to incorporate this into Ollama's functionality. Thanks for the amazing solutions, though!

GiteaMirror added the feature request label 2026-04-29 10:08:35 -05:00
Author
Owner

@rick-github commented on GitHub (Mar 7, 2026):

Ollama uses llama.cpp as a backend so performance should be comparable. What model performs better with llamafile?
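
One thing worth checking before comparing the two servers is that both are serving the same GGUF quantization, since an Ollama tag like `qwen3-coder:latest` resolves to one specific quant while a llamafile may bundle a different one. A minimal sketch against Ollama's `/api/show` endpoint (treat the exact field names as illustrative; they may vary by Ollama version):

```python
# Sketch: check which quantization an Ollama tag resolves to, so the
# llamafile run can be compared against the same GGUF quant.
# Assumes a local Ollama server on the default port.
import requests

resp = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "qwen3-coder:latest"},
    timeout=30,
)
resp.raise_for_status()
details = resp.json().get("details", {})
print(details.get("family"),
      details.get("parameter_size"),
      details.get("quantization_level"))
```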

Author
Owner

@RevoltPW commented on GitHub (Mar 8, 2026):

> Ollama uses llama.cpp as a backend so performance should be comparable. What model performs better with llamafile?

Got me! Ollama is 1.236x faster in my test benchmark. Thanks!

### Llamafile with -m qwen3-coder:latest

```
(scripts) pc@pc:~/PythonCode/ollama-llamafile-bechmark$ python3 benchmark.py --server-url "http://localhost:8080" --model "qwen3-coder:latest" --num-requests 5 --port 8080

Starting streaming benchmark...
Server: http://localhost:8080
Model: qwen3-coder:latest
Prompt: Explain quantum computing in simple terms. Be comp...
Number of requests: 5
--------------------------------------------------
Request 1/5...
Request 1: 6.65 tokens/sec, First token: 8.383s, Memory: 20.10 GB
Request 2/5...
Request 2: 12.30 tokens/sec, First token: 2.468s, Memory: 17.70 GB
Request 3/5...
Request 3: 12.98 tokens/sec, First token: 0.495s, Memory: 16.68 GB
Request 4/5...
Request 4: 14.58 tokens/sec, First token: 0.492s, Memory: 16.55 GB
Request 5/5...
Request 5: 13.99 tokens/sec, First token: 0.515s, Memory: 16.42 GB

============================================================
STREAMING BENCHMARK SUMMARY
============================================================
Total Requests: 5
Successful Requests: 5
Failed Requests: 0
Average Time: 16.507 seconds
Average Tokens per Second: 12.10
Average Time to First Token: 2.471 seconds
Average Memory Usage: 17.49 GB
Max Memory Usage: 20.10 GB
Min Memory Usage: 16.42 GB
Total Processing Time: 82.53 seconds
============================================================
```

### Ollama with -m qwen3-coder:latest

```
(scripts) pc@pc:~/PythonCode/ollama-llamafile-bechmark$ python3 benchmark.py --server-url "http://localhost:11434" --model "qwen3-coder:latest" --num-requests 5 --port 11434

Starting streaming benchmark...
Server: http://localhost:11434
Model: qwen3-coder:latest
Prompt: Explain quantum computing in simple terms. Be comp...
Number of requests: 5
--------------------------------------------------
Request 1/5...
Request 1: 13.87 tokens/sec, First token: 0.924s, Memory: 19.19 GB
Request 2/5...
Request 2: 14.74 tokens/sec, First token: 0.192s, Memory: 19.21 GB
Request 3/5...
Request 3: 15.46 tokens/sec, First token: 0.196s, Memory: 19.24 GB
Request 4/5...
Request 4: 15.05 tokens/sec, First token: 0.189s, Memory: 19.23 GB
Request 5/5...
Request 5: 15.46 tokens/sec, First token: 0.182s, Memory: 19.24 GB

============================================================
STREAMING BENCHMARK SUMMARY
============================================================
Total Requests: 5
Successful Requests: 5
Failed Requests: 0
Average Time: 12.372 seconds
Average Tokens per Second: 14.92
Average Time to First Token: 0.337 seconds
Average Memory Usage: 19.22 GB
Max Memory Usage: 19.24 GB
Min Memory Usage: 19.19 GB
Total Processing Time: 61.86 seconds
============================================================
```
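
The `benchmark.py` script itself isn't included in the thread. A minimal sketch of an equivalent streaming benchmark is below; it assumes both servers expose an OpenAI-compatible `/v1/chat/completions` endpoint (both Ollama and llamafile do), counts one SSE content chunk as roughly one token, and omits the memory sampling shown in the logs. All names and parameters are illustrative.

```python
# Hypothetical reconstruction -- the thread's actual benchmark.py is not
# shown. Streams completions and reports time-to-first-token plus an
# approximate tokens/sec (one SSE content chunk counted as one token).
import json
import time

import requests


def stream_once(server_url: str, model: str, prompt: str) -> tuple[float, float]:
    """Return (time_to_first_token_s, tokens_per_sec) for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    resp = requests.post(
        f"{server_url}/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,
        timeout=600,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style SSE: each event line is "data: {...}" or "data: [DONE]".
        if not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if choices and choices[0].get("delta", {}).get("content"):
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1
    total = time.perf_counter() - start
    ttft = (first_token_at or start) - start
    # Generation rate measured after the first token arrives.
    return ttft, chunks / max(total - ttft, 1e-9)


if __name__ == "__main__":
    # Port 11434 for Ollama, 8080 for a default llamafile server.
    for i in range(5):
        ttft, tps = stream_once(
            "http://localhost:11434",
            "qwen3-coder:latest",
            "Explain quantum computing in simple terms.",
        )
        print(f"Request {i + 1}: {tps:.2f} tokens/sec, first token: {ttft:.3f}s")
```

Pointing the same script at both ports with the same prompt keeps the comparison symmetric; note in the logs above how llamafile's first request pays a visible warm-up cost (8.383s to first token) before settling down, while Ollama's later requests benefit from the model staying resident.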
Reference: github-starred/ollama#56013