[PR #11863] [MERGED] perf: build graph for next batch async to keep GPU busy #60335

Closed
opened 2026-04-29 15:15:38 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/11863
Author: @dhiltgen
Created: 8/11/2025
Status: Merged
Merged: 8/29/2025
Merged by: @dhiltgen

Base: main ← Head: async_graph


📝 Commits (2)

  • b89be6e perf: build graph for next batch in parallel to keep GPU busy
  • a5f99c2 tests: tune integration tests for ollama engine

📊 Changes

20 files changed (+589 additions, -233 deletions)


📝 integration/README.md (+5 -2)
📝 integration/api_test.go (+1 -1)
📝 integration/basic_test.go (+22 -7)
📝 integration/concurrency_test.go (+13 -13)
📝 integration/context_test.go (+56 -5)
📝 integration/llm_image_test.go (+12 -5)
➖ integration/llm_test.go (+0 -47)
📝 integration/max_queue_test.go (+20 -9)
📝 integration/utils_test.go (+130 -8)
📝 ml/backend.go (+3 -0)
📝 ml/backend/ggml/ggml.go (+16 -0)
📝 model/model.go (+3 -5)
📝 model/models/gemma3/model.go (+8 -8)
📝 model/models/llama4/model.go (+12 -12)
📝 model/models/mistral3/model.go (+6 -6)
📝 model/models/mllama/model.go (+1 -1)
📝 model/models/qwen25vl/model.go (+7 -7)
📝 runner/ollamarunner/cache.go (+9 -9)
📝 runner/ollamarunner/cache_test.go (+50 -50)
📝 runner/ollamarunner/runner.go (+215 -38)

📄 Description

This refactors the main run loop of the ollama runner to perform the main GPU-intensive tasks (Compute+Floats) in a goroutine, so we can prepare the next batch in parallel and reduce the amount of time the GPU stalls waiting for the next batch of work.

On Metal, I see a 2-3% speedup in token rate. On a single RTX 4090, I see a ~7% speedup.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 15:15:39 -05:00
Reference: github-starred/ollama#60335