[PR #921] [MERGED] offload 75% of available vram to improve stability #20928

Closed
opened 2026-04-19 15:19:43 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/921
Author: @BruceMacD
Created: 10/26/2023
Status: Merged
Merged: 10/27/2023
Merged by: @BruceMacD

Base: main ← Head: brucemacd/vram-stability


📝 Commits (1)

  • 6aeedb1 offload 75% of available vram to improve stability

📊 Changes

1 file changed (+7 additions, -4 deletions)

View changed files

📝 llm/llama.go (+7 -4)

📄 Description

Reduce the number of layers offloaded to VRAM to prevent out-of-memory errors on larger models. In my tests, 7B and 13B models with 4-bit quantization still maxed out the number of layers and ran fast. Meanwhile, 70B models can now run on a T4 where they would previously crash with out-of-memory errors, though inference is still slow in that case.

Our heuristic that the memory required per layer is roughly the file size divided by the number of layers held up in additional testing. The memory needed for the weights works out to roughly:

size of weights = number of parameters * number of bytes per parameter

This roughly equates to file size.
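To make the heuristic concrete, here is a minimal Go sketch of how a layer count could be derived from it. The function name, arguments, and sample numbers are illustrative assumptions, not the code merged in llm/llama.go; the only pieces taken from this PR are the 75%-of-free-VRAM budget and the file-size-divided-by-layer-count estimate.

```go
package main

import "fmt"

// estimateLayersToOffload is a hedged sketch of the heuristic described above,
// not the actual implementation in llm/llama.go. It budgets 75% of the free
// VRAM and assumes each layer costs roughly fileSizeBytes / numLayers.
func estimateLayersToOffload(freeVRAMBytes, fileSizeBytes int64, numLayers int) int {
	if numLayers <= 0 || fileSizeBytes <= 0 {
		return 0
	}
	bytesPerLayer := fileSizeBytes / int64(numLayers)
	budget := freeVRAMBytes * 3 / 4 // only plan for 75% of free VRAM
	layers := int(budget / bytesPerLayer)
	if layers > numLayers {
		layers = numLayers // small models still offload every layer
	}
	return layers
}

func main() {
	// Illustrative numbers: a GPU with ~15 GiB free and a 4-bit 70B model,
	// i.e. roughly 70e9 parameters * 0.5 bytes ≈ 35 GB of weights across 80 layers.
	free := int64(15) << 30
	fileSize := int64(35_000_000_000)
	fmt.Println(estimateLayersToOffload(free, fileSize, 80)) // only a subset of the 80 layers fits
}
```

Presumably, capping the budget at 75% also leaves headroom for other allocations that live in VRAM, such as the KV cache mentioned below.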

I also clarified the comment about storing the KV cache in VRAM.

resolves #790


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 15:19:43 -05:00