[GH-ISSUE #8305] Speed ten times slower than llamafile #67374

Closed
opened 2026-05-04 10:08:09 -05:00 by GiteaMirror · 7 comments

Originally created by @ErfolgreichCharismatisch on GitHub (Jan 4, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8305

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Llamafile is much faster on CPU than ollama: what takes ollama 33 minutes takes llamafile 3 minutes with the same model.
Unfortunately, llamafile crashes after being reused and then spins its wheels at 100% CPU for hours.

I'd rather use a stable ollama, but CPU speed needs work.

OS

Linux

CPU

Intel

GiteaMirror added the bug, performance, and needs more info labels 2026-05-04 10:08:10 -05:00

@rick-github commented on GitHub (Jan 4, 2025):

How did you compare them?

@rick-github commented on GitHub (Jan 4, 2025):

```yaml
services:
  ollama:
    image: ollama/ollama:${OLLAMA_DOCKER_TAG-latest}
    volumes:
      - ${OLLAMA_MODELS-../ollama-data}:/root/.ollama
    ports:
      - 11434:11434

  llamafile:
    image: iverly/llamafile-docker:${LLAMAFILE_DOCKER_TAG-latest}
    volumes:
      - ${LLAMAFILE_MODEL-./llama3.1-8b.gguf}:/model
    ports:
      - 8081:8080
```

```python
#!/usr/bin/env python3
# Benchmark: send the same prompt 10 times to each server and compare
# completion-token throughput via the OpenAI-compatible /v1/chat/completions API.

import requests
import time

prompt = "why is the sky blue"
base_url = "http://localhost:{port}/v1/chat/completions"
model = "llama3.1:8b"

ports = {
  "11434": "ollama-0.5.4",
  "8081": "llamafile-0.8.17",
}

data = {
  "model": model,
  "messages": [{"role": "user", "content": prompt}],
  "temperature": 0,
  "seed": -1,
}

for port in ports.keys():
  tokens = 0
  elapsed = 0
  for _ in range(10):
    start = time.time()
    response = requests.post(base_url.format(port=port), json=data).json()
    end = time.time()
    tokens += response['usage']['completion_tokens']
    elapsed += end - start
  print(f"{ports[port]:<16}: {tokens} tokens in {elapsed:.2f} seconds, {tokens/elapsed:.2f} tokens/s")
```

```console
$ docker compose up -d
[+] Running 2/3
 ⠏ Network 8305_default        Created                                                                                                                          0.9s 
 ✔ Container 8305-llamafile-1  Started                                                                                                                          0.8s 
 ✔ Container 8305-ollama-1     Started                                                                                                                          0.8s 
$ ./8305.py 
ollama-0.5.4    : 4535 tokens in 363.67 seconds, 12.47 tokens/s
llamafile-0.8.17: 4250 tokens in 334.54 seconds, 12.70 tokens/s
```

```console
$ docker compose logs | grep threads
llamafile-1  | {"function":"server_cli","level":"INFO","line":2918,"msg":"system info","n_threads":8,"n_threads_batch":8,"system_info":"AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"12066432","timestamp":1736004445,"total_threads":24}
ollama-1     | time=2025-01-04T15:27:47.600Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --threads 8 --no-mmap --parallel 4 --port 42999"
ollama-1     | time=2025-01-04T15:27:47.604Z level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
```

@rick-github commented on GitHub (Jan 5, 2025):

From https://justine.lol/matmul/, llamafile's improvements speed up prompt processing on CPU but not completion processing. Since llamafile uses llama.cpp as the backend, ollama benefits automatically as the llamafile improvements are upstreamed to llama.cpp (https://github.com/ggerganov/llama.cpp/pull/6414, https://github.com/ggerganov/llama.cpp/pull/6412). Ollama is already receiving this, as shown by the LLAMAFILE setting in the system info.
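
(A quick way to verify this on a given install is the same log grep used in the earlier comment; the ollama system-info line quoted above already reports LLAMAFILE = 1.)

```console
$ docker compose logs ollama | grep LLAMAFILE
```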

@jart commented on GitHub (Jan 5, 2025):

Even if ollama is using my matmul code, it's not going to help unless you tune your build system to target all the various microarchitectures at runtime. See for example:

- https://github.com/Mozilla-Ocho/llamafile/blob/c2933599286e6d58e63916e4494fdd3b30363ce7/llamafile/BUILD.mk#L70-L129
- https://github.com/Mozilla-Ocho/llamafile/blob/c2933599286e6d58e63916e4494fdd3b30363ce7/llamafile/sgemm.cpp#L26-L131

This is what lets llamafile not only have really fast matrix multiplication, but also dispatch to the most appropriate compilation for the specific microarchitecture of your CPU.
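
The dispatch idea itself is simple. As a rough illustration only (not llamafile's or ollama's actual code): probe the CPU's feature flags once at startup, then route work to the most specific kernel variant the machine supports. A minimal Python sketch, assuming a Linux /proc/cpuinfo:

```python
# Minimal sketch of runtime microarchitecture dispatch (illustrative only;
# neither llamafile's nor ollama's actual code). Probe the CPU's feature
# flags once, then pick the most capable precompiled variant available.

def cpu_flags():
    """Return the x86 feature flags reported by the kernel (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def pick_variant(flags):
    """Pick the most specific build variant this CPU can run."""
    # Ordered from most to least capable; the first full match wins.
    candidates = [
        ("avx512", {"avx512f", "avx512vl"}),
        ("avx2",   {"avx2", "fma", "f16c"}),
        ("avx",    {"avx"}),
    ]
    for name, needed in candidates:
        if needed <= flags:
            return name
    return "baseline"

if __name__ == "__main__":
    print("selected variant:", pick_variant(cpu_flags()))
```

The point of the BUILD.mk link above is that several such variants are compiled ahead of time, so the dispatcher always has a kernel matched to the host CPU to choose from.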

@rick-github commented on GitHub (Jan 5, 2025):

Thanks, good to know.

@dhiltgen commented on GitHub (Apr 9, 2025):

@ErfolgreichCharismatisch do you still see the performance slow-down on recent versions? We now generate multiple CPU architectures for x86.

@pdevine commented on GitHub (Oct 3, 2025):

Going to close this as stale. We can reopen if need be.
