[PR #10171] [MERGED] ollamarunner: Preallocate worst case graph at startup #59863

Closed
opened 2026-04-29 14:47:15 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/10171
Author: @jessegross
Created: 4/8/2025
Status: Merged
Merged: 4/8/2025
Merged by: @jessegross

Base: main ← Head: jessegross/worst


📝 Commits (2)

  • 685c839 ggml: Check for OOM and return as Go errors
  • 0a3227b ollamarunner: Preallocate worst case graph at startup

📊 Changes

10 files changed (+181 additions, -55 deletions)


📝 kvcache/cache.go (+3 -2)
📝 kvcache/causal.go (+39 -30)
📝 kvcache/causal_test.go (+5 -3)
📝 kvcache/encoder.go (+12 -3)
📝 kvcache/wrapper.go (+2 -2)
📝 ml/backend.go (+7 -0)
📝 ml/backend/ggml/ggml.go (+61 -13)
📝 model/model.go (+1 -1)
📝 runner/ollamarunner/cache_test.go (+1 -1)
📝 runner/ollamarunner/runner.go (+50 -0)

📄 Description

Currently, the KV cache and graph are lazily allocated as needed. The cache is fully allocated on first use of the corresponding layer whereas the graph grows with the size of the context.

This can be an issue if another application allocates more VRAM after we do our calculations: the memory we planned on is no longer available, and Ollama will crash in the middle of inference. If we instead allocate the maximum needed memory at startup of the runner, we will either succeed or fail at that point rather than at some surprising time in the future.
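
The fail-at-startup idea can be illustrated with a minimal sketch. This is not the code from this PR: `device`, `alloc`, `reserveWorstCase`, and `ErrNoMem` are hypothetical names, and the toy `device` only counts bytes rather than talking to a real backend. It assumes the backend reports out-of-memory as a Go error (as the first commit's title suggests) instead of aborting.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoMem is a hypothetical sentinel error standing in for an
// out-of-memory condition reported by the backend.
var ErrNoMem = errors.New("insufficient device memory")

// device is a toy model of a GPU with a fixed amount of free memory.
type device struct {
	free uint64 // bytes still available
}

// alloc reserves n bytes, or returns an error wrapping ErrNoMem
// instead of crashing when the request cannot be satisfied.
func (d *device) alloc(n uint64) error {
	if n > d.free {
		return fmt.Errorf("alloc %d bytes (free %d): %w", n, d.free, ErrNoMem)
	}
	d.free -= n
	return nil
}

// reserveWorstCase allocates the full KV cache and the largest graph
// up front, so any shortfall surfaces at startup rather than mid-inference.
func reserveWorstCase(d *device, cacheBytes, maxGraphBytes uint64) error {
	if err := d.alloc(cacheBytes); err != nil {
		return fmt.Errorf("reserving KV cache: %w", err)
	}
	if err := d.alloc(maxGraphBytes); err != nil {
		return fmt.Errorf("reserving worst-case graph: %w", err)
	}
	return nil
}

func main() {
	d := &device{free: 1 << 30} // assumed 1 GiB of free VRAM
	if err := reserveWorstCase(d, 512<<20, 768<<20); err != nil {
		fmt.Println("startup failed:", err)
		return
	}
	fmt.Println("startup ok")
}
```

With lazy allocation, the second request would only be attempted once the context had grown large enough to need it; reserving both up front moves that failure to a point where the caller can still handle it.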

Currently, this only generates a worst case batch for text, which means that vision models may get a partial allocation and continue to lazily allocate the rest.
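
As a rough illustration of what a worst-case text batch could look like, the sketch below fills a full batch with placeholder tokens positioned at the end of the context window. The `batch` type, the `worstCaseTextBatch` function, and the choice of positions are assumptions made for this example; the PR description does not specify how the runner builds its batch.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for the runner's input batch.
type batch struct {
	tokens    []int32 // placeholder token IDs (all zero here)
	positions []int32 // position of each token in the context
}

// worstCaseTextBatch builds a batch of batchSize placeholder tokens
// occupying the last positions of a context of numCtx tokens. The
// assumption is that a full batch against a full context produces
// the largest graph for a text-only model.
func worstCaseTextBatch(batchSize, numCtx int) (batch, error) {
	if batchSize <= 0 || numCtx <= 0 {
		return batch{}, fmt.Errorf("invalid sizes: batch=%d ctx=%d", batchSize, numCtx)
	}
	if batchSize > numCtx {
		batchSize = numCtx // a batch cannot exceed the context
	}
	b := batch{
		tokens:    make([]int32, batchSize),
		positions: make([]int32, batchSize),
	}
	start := numCtx - batchSize
	for i := range b.positions {
		b.positions[i] = int32(start + i)
	}
	return b, nil
}

func main() {
	b, err := worstCaseTextBatch(512, 4096)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(b.tokens), b.positions[0], b.positions[len(b.positions)-1])
}
```

A batch like this contains no image inputs, which is consistent with the caveat above that vision models may only be partially preallocated.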


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 14:47:15 -05:00
Reference: github-starred/ollama#59863