[PR #13244] ggml: Use max graph memory allocation when reserving #24671

Open
opened 2026-04-19 17:43:48 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13244
Author: @jessegross
Created: 11/26/2025
Status: 🔄 Open

Base: main ← Head: jessegross/multi_chunk_reserve


📝 Commits (1)

  • f03b8bc ggml: Use max graph memory allocation when reserving

📊 Changes

1 file changed (+6 additions, -6 deletions)


📝 ml/backend/ggml/ggml.go (+6 -6)

📄 Description

When calculating the memory required for a compute graph, we may reserve several graphs, for example a vision encoder and the text model. Since these graphs never run at the same time, we only need the maximum size across them.

Typically, a new graph only reallocates memory if it doesn't fit in the existing space, so the size reported after the last graph reservation is also the max size. However, the Vulkan backend imposes a 1G cap on a single allocation, which means that a graph may require multiple allocations (chunks). This results in a problem if:

  • There is an old graph with one small chunk and one big chunk
  • There is a new graph with one big chunk that is smaller than the total of the old graph

In this case, the big chunk of the new graph will trigger a reallocation, which will free the old second chunk. The total amount of memory reported will then be lower than the max. To avoid this, we should explicitly take the max across all of the graphs.
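The failure mode described above can be reproduced with a small toy model. This is only a sketch under stated assumptions: `toyAllocator`, `reserve`, and `reserveAll` are hypothetical names, the chunk sizes (in MiB) are made up, and the "free everything and reallocate if any chunk doesn't fit" rule is a simplification for illustration, not the actual ggml allocator logic or the code changed in this PR.

```go
package main

import "fmt"

// toyAllocator is a hypothetical stand-in for a chunked graph allocator.
// It is NOT the ggml implementation; it only models the behaviour
// described in the PR text.
type toyAllocator struct {
	chunks []uint64 // sizes of the chunks currently held, in MiB
}

// total returns the sum of all chunks currently held.
func (a *toyAllocator) total() uint64 {
	var sum uint64
	for _, c := range a.chunks {
		sum += c
	}
	return sum
}

// reserve keeps the existing chunks if every chunk of the new graph fits
// into the existing chunk at the same index. Otherwise it frees everything
// and allocates exactly the chunks the new graph needs (assumed rule).
// It returns the total memory held after the reservation.
func (a *toyAllocator) reserve(graph []uint64) uint64 {
	fits := len(graph) <= len(a.chunks)
	for i := 0; fits && i < len(graph); i++ {
		fits = graph[i] <= a.chunks[i]
	}
	if !fits {
		a.chunks = append([]uint64(nil), graph...)
	}
	return a.total()
}

// reserveAll reserves each graph in turn and returns both the size seen
// after the last reservation and the peak size seen after any reservation.
func reserveAll(graphs [][]uint64) (last, peak uint64) {
	var a toyAllocator
	for _, g := range graphs {
		last = a.reserve(g)
		if last > peak {
			peak = last
		}
	}
	return last, peak
}

func main() {
	graphs := [][]uint64{
		{256, 1024}, // old graph: one small chunk and one big chunk
		{1000},      // new graph: one big chunk, smaller than the old total
	}
	last, peak := reserveAll(graphs)
	fmt.Println("after last reservation:", last) // 1000
	fmt.Println("max over all graphs:   ", peak) // 1280
}
```

In this model, reading the size only after the final reservation under-reports by 280 MiB, while tracking the max after each reservation gives the size that is actually needed.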

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 17:43:48 -05:00

Reference: github-starred/ollama#24671