[PR #724] [MERGED] improve vram safety with 5% vram memory buffer #15578


📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/724
Author: @BruceMacD
Created: 10/6/2023
Status: Merged
Merged: 10/10/2023
Merged by: @BruceMacD

Base: main ← Head: brucemacd/vram-buffer


📝 Commits (7)

  • 74f5c4a improve vram safety
  • 9b8afa1 check free memory not total
  • 805c7d8 5% buffer rather than 10%
  • 01c04a4 rename variable
  • be707cb wait for subprocess to exit
  • 5048f1b logging format
  • e11454c wait for command to exit, no timeout

📊 Changes

1 file changed (+13 additions, -7 deletions)


📝 llm/llama.go (+13 -7)

📄 Description

In testing how much VRAM should be allocated, we typically used a model that could be loaded entirely into VRAM. This masked an issue: when a model is larger than the available VRAM, it is possible to consume all available VRAM and fail with an error:

Error: llama runner failed: out of memory

This change leaves a buffer of available VRAM unallocated to prevent running out of memory (10% in the initial commit, reduced to 5% in a later commit, as reflected in the PR title).
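For illustration only, here is a minimal Go sketch of the idea; this is not the actual code in llm/llama.go, and the function names, per-layer size, and free-memory figures are all hypothetical assumptions:

```go
package main

import "fmt"

// usableVRAM returns the free VRAM minus a 5% safety buffer
// (the PR started with a 10% buffer and reduced it to 5%).
func usableVRAM(freeBytes uint64) uint64 {
	buffer := freeBytes / 20 // reserve 5% of free memory
	return freeBytes - buffer
}

// layersToOffload estimates how many model layers fit in the usable VRAM,
// given a hypothetical per-layer size in bytes.
func layersToOffload(freeBytes, layerBytes uint64, totalLayers int) int {
	if layerBytes == 0 {
		return 0
	}
	n := int(usableVRAM(freeBytes) / layerBytes)
	if n > totalLayers {
		n = totalLayers
	}
	return n
}

func main() {
	free := uint64(15) << 30   // assume ~15 GiB free, e.g. on a 16 GiB T4
	layer := uint64(500) << 20 // assume ~500 MiB per layer (illustrative only)
	fmt.Printf("usable VRAM: %d bytes\n", usableVRAM(free))
	fmt.Printf("layers offloaded: %d of 80\n", layersToOffload(free, layer, 80))
}
```

The key point is that the offload estimate is derived from the reported free memory (not total), with a slice of it deliberately left unused as headroom.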

Tested on a T4:

  • llama2:7b: easily offloads all layers to GPU
  • llama2:13b: easily offloads all layers to GPU
  • llama2:70b: offloaded 29 layers to GPU, was slow but did not run out of memory on load (as it did before)

Resolves #725


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


Reference: github-starred/ollama#15578