[PR #12400] [MERGED] Preallocate CUDA pool memory #76105

Closed
opened 2026-05-05 08:33:48 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/12400
Author: @jessegross
Created: 9/24/2025
Status: Merged
Merged: 9/30/2025
Merged by: @jessegross

Base: `main` ← Head: `jessegross/memory`


📝 Commits (3)

  • f4aba08 ggml: Remove allocation status reporting
  • cf954fe ggml: Backport scale kernel fixes
  • 0b076e8 ggml: Preallocate CUDA pool memory

📊 Changes

15 files changed (+1077 additions, -333 deletions)


📝 llama/patches/0014-graph-memory-reporting-on-failure.patch (+22 -37)
📝 llama/patches/0022-ggml-No-alloc-mode.patch (+586 -33)
➕ llama/patches/0026-ggml-Backport-scale-kernel-fixes.patch (+57 -0)
📝 llm/server.go (+19 -19)
📝 llm/server_test.go (+5 -5)
📝 ml/backend.go (+22 -73)
📝 ml/backend/ggml/ggml.go (+56 -88)
📝 ml/backend/ggml/ggml/include/ggml-alloc.h (+1 -6)
📝 ml/backend/ggml/ggml/include/ggml-backend.h (+2 -7)
📝 ml/backend/ggml/ggml/src/ggml-alloc.c (+3 -5)
📝 ml/backend/ggml/ggml/src/ggml-backend-impl.h (+14 -0)
📝 ml/backend/ggml/ggml/src/ggml-backend.cpp (+57 -9)
📝 ml/backend/ggml/ggml/src/ggml-cuda/common.cuh (+46 -2)
📝 ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu (+177 -40)
📝 ml/backend/ggml/ggml/src/ggml-cuda/scale.cu (+10 -9)

📄 Description

The GGML CUDA backend allocates additional memory for intermediate results during calculation. This memory isn't currently allocated during worst-case graph reservation and is therefore not included in scheduling. Since these buffers can grow with context length, we could crash mid-computation.
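For illustration only, here is a minimal sketch (all names hypothetical, not the actual ggml-cuda pool API) of the kind of on-demand pool that causes the problem: each allocation during graph compute may grow the pool with a fresh `cudaMalloc`, so peak usage is only discovered at run time, after scheduling has already committed the device's memory budget.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <utility>
#include <vector>

struct demand_pool {
    std::vector<std::pair<void *, size_t>> buffers;

    // Called from host-side launch code whenever an intermediate buffer
    // is needed. Growth happens mid-computation, so an OOM here aborts
    // a request that the scheduler already accepted.
    void * alloc(size_t size) {
        void * ptr = nullptr;
        if (cudaMalloc(&ptr, size) != cudaSuccess) {
            return nullptr;
        }
        buffers.emplace_back(ptr, size);
        return ptr;
    }

    ~demand_pool() {
        for (auto & b : buffers) {
            cudaFree(b.first);
        }
    }
};
```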

This extends the memory allocation system down a layer, from the GGML graph to the CUDA pool, preallocating the worst-case memory there as well.
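A minimal sketch of the preallocation idea, under the assumption (names hypothetical, not the PR's actual code) that the pool supports a measurement pass: the worst-case graph is walked once in a no-alloc mode where the pool only tracks its high-water mark, then a single real allocation backs that peak before any actual compute runs.

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct reserving_pool {
    bool   measuring = true;    // pass 1: record sizes, no device memory
    size_t offset    = 0;       // bump-allocator position
    size_t peak      = 0;       // high-water mark across the worst-case graph
    char * base      = nullptr;

    void * alloc(size_t size) {
        size_t aligned = (size + 255) & ~size_t(255); // 256-byte alignment
        size_t at      = offset;
        offset += aligned;
        peak = std::max(peak, offset);
        // During measurement there is no backing buffer; return a fake,
        // distinct address that callers must not dereference.
        return measuring ? reinterpret_cast<void *>(uintptr_t(at) + 256)
                         : static_cast<void *>(base + at);
    }

    void reset() { offset = 0; }

    // After the worst-case graph has been walked once in measuring mode,
    // back the recorded peak with a single upfront allocation.
    cudaError_t commit() {
        measuring = false;
        offset    = 0;
        return cudaMalloc(reinterpret_cast<void **>(&base), peak);
    }
};
```

In the real backend, the equivalent of `commit()` would run during model load, so the scheduler can account for the pool alongside weights and the KV cache instead of discovering the demand mid-generation.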


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference: github-starred/ollama#76105