[PR #8643] [MERGED] benchmark: performance of running ollama server #12738

Closed
opened 2026-04-13 00:08:29 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/8643
Author: @BruceMacD
Created: 1/28/2025
Status: Merged
Merged: 3/21/2025
Merged by: @BruceMacD

Base: main ← Head: brucemacd/e2e-benchmark


📝 Commits (3)

  • be7c079 benchmark: e2e performance of a running server
  • ac6bc09 Update docs/benchmark.md
  • bc46d0f use benchmark loop

📊 Changes

2 files changed (+237 additions, -0 deletions)

View changed files

benchmark/server_benchmark_test.go (+178 -0)
docs/benchmark.md (+59 -0)

📄 Description

This PR introduces a benchmarking framework for measuring Ollama's inference performance across different models and scenarios. The implementation measures Time To First Token (TTFT), total generation time, and tokens-per-second throughput.

Key Features

  • Measures both cold start and warm start performance
  • Tests varying prompt lengths (short/medium/long)
  • Collects metrics: TTFT, total time, token count, tokens/second

Implementation Notes

  • Uses an external Ollama server (localhost:11434) for testing, since the C dependencies must be packaged into the server binary and cannot be called directly from tests
  • Handles model unloading between cold start tests to ensure accurate measurements
  • Implements smart warm-up for warm start scenarios
  • Aggregates and averages results across test iterations via the Go benchmark framework
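The model unloading between cold start runs can be done over the Ollama HTTP API by sending a generate request with `keep_alive` set to 0, which tells the server to evict the model immediately (`keep_alive` is a documented Ollama API parameter). A minimal sketch of building that request body, assuming the default local endpoint; this is illustrative, not the PR's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unloadRequest builds a /api/generate body that asks the server to evict
// a model right away: no prompt, and keep_alive set to 0 so the loaded
// model is unloaded immediately instead of after the usual idle timeout.
func unloadRequest(model string) ([]byte, error) {
	return json.Marshal(map[string]any{
		"model":      model,
		"keep_alive": 0,
	})
}

func main() {
	body, err := unloadRequest("llama3.1:8b")
	if err != nil {
		panic(err)
	}
	// POST this to http://localhost:11434/api/generate before each cold run.
	fmt.Println(string(body))
}
```

With the model evicted between iterations, each cold start measurement includes the full model load time (the `load_ms` column in the output below).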

Requirements

  • Ollama server must be running locally on the default port
  • Test models must be pre-downloaded

Sample Usage

go test -bench=. -m llama3.1:8b ./...

The output is a standard Go benchmark log with some extra metadata added.
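The extra columns (ttft_ms, gen_tok/s, and so on) are custom per-op metrics, which Go benchmarks emit via `testing.B.ReportMetric`. A minimal sketch of how such metrics are reported, with fixed timings standing in for a real streaming request; `reportRun` and `BenchmarkSketch` are hypothetical names, not the PR's code, and `b.Loop` requires Go 1.24+ (cf. the "use benchmark loop" commit):

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// reportRun emits custom per-op metrics mirroring the log columns below.
func reportRun(b *testing.B, ttft time.Duration, genTokens int, total time.Duration) {
	b.ReportMetric(float64(ttft.Milliseconds()), "ttft_ms")
	b.ReportMetric(float64(genTokens), "gen_tokens")
	b.ReportMetric(float64(genTokens)/total.Seconds(), "gen_tok/s")
}

func BenchmarkSketch(b *testing.B) {
	for b.Loop() { // Go 1.24 benchmark loop
		// Fixed values stand in for timing a real /api/generate call.
		reportRun(b, 31*time.Millisecond, 100, 1792*time.Millisecond)
	}
}

func main() {
	// testing.Benchmark lets the sketch run outside `go test`.
	res := testing.Benchmark(BenchmarkSketch)
	fmt.Printf("gen_tokens=%v ttft_ms=%v\n", res.Extra["gen_tokens"], res.Extra["ttft_ms"])
}
```

Units passed to ReportMetric appear verbatim as column labels, which is how names like `gen_tok/s` and `ttft_ms` end up in the log.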

Sample output:

goos: darwin
goarch: arm64
pkg: github.com/ollama/ollama/benchmark
cpu: Apple M3 Max
BenchmarkColdStart/llama3.1:8b/cold/short_prompt-16     1    2800975666 ns/op     0.00 MB/s    58.62 gen_tok/s    100.0 gen_tokens    578.0 load_ms    52.63 prompt_tok/s    14.00 prompt_tokens    848.0 ttft_ms
BenchmarkColdStart/llama3.1:8b/cold/medium_prompt-16    1   10570117834 ns/op     0.00 MB/s    52.65 gen_tok/s    500.0 gen_tokens    573.0 load_ms    59.52 prompt_tok/s    15.00 prompt_tokens    828.0 ttft_ms
BenchmarkColdStart/llama3.1:8b/cold/long_prompt-16      1   19942159833 ns/op     0.00 MB/s    53.17 gen_tok/s     1000 gen_tokens    573.0 load_ms    58.61 prompt_tok/s    16.00 prompt_tokens    848.0 ttft_ms
BenchmarkWarmStart/llama3.1:8b/warm/short_prompt-16     1    1791833416 ns/op     0.00 MB/s    56.82 gen_tok/s    100.0 gen_tokens    12.00 load_ms    823.5 prompt_tok/s    14.00 prompt_tokens    31.00 ttft_ms
BenchmarkWarmStart/llama3.1:8b/warm/medium_prompt-16    1    9783085500 ns/op     0.00 MB/s    51.28 gen_tok/s    500.0 gen_tokens    13.00 load_ms    882.4 prompt_tok/s    15.00 prompt_tokens    32.00 ttft_ms
BenchmarkWarmStart/llama3.1:8b/warm/long_prompt-16      1   21034040166 ns/op     0.00 MB/s    47.63 gen_tok/s     1000 gen_tokens    13.00 load_ms    727.3 prompt_tok/s    16.00 prompt_tokens    37.00 ttft_ms
PASS
ok      github.com/ollama/ollama/benchmark      72.374s

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference: github-starred/ollama#12738