[PR #1850] [MERGED] Offload layers to GPU based on new model size estimates #21245

Closed
opened 2026-04-19 15:32:27 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/1850
Author: @jmorganca
Created: 1/8/2024
Status: Merged
Merged: 1/8/2024
Merged by: @jmorganca

Base: main ← Head: gpu-calc


📝 Commits (10+)

  • a7fd2b5 select layers based on estimated model memory usage
  • 2188377 always account for scratch vram
  • f630032 dont load +1 layers
  • 3cc91b6 better estmation for graph alloc
  • 20a5803 Update gpu/gpu_darwin.go
  • 889aa5b Update llm/llm.go
  • 0e49307 Update llm/llm.go
  • 0b8e9ab add overhead for cuda memory
  • 5c55808 Update llm/llm.go
  • 8ab3b0b fix build error on linux

📊 Changes

10 files changed (+161 additions, -154 deletions)

View changed files

📝 gpu/gpu.go (+7 -26)
📝 gpu/gpu_darwin.go (+17 -17)
📝 llm/ext_server_common.go (+3 -10)
📝 llm/ext_server_default.go (+2 -2)
📝 llm/ggml.go (+5 -1)
📝 llm/gguf.go (+38 -3)
📝 llm/llama.go (+1 -60)
📝 llm/llm.go (+85 -32)
📝 llm/shim_darwin.go (+1 -1)
📝 llm/shim_ext_server.go (+2 -2)

📄 Description

This PR fixes a large number of crashes and "out of memory" errors related to VRAM allocation, by using a more accurate estimation of how much memory is required to run a model with a given context size.

Models such as mixtral will now run on lower-end hardware where they would previously fail, even if falling back to the CPU is required. Also, more layers are now offloaded to Nvidia GPUs, which should result in a speedup on Linux.
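As a rough illustration of what "accounts for the kv cache" means, here is a minimal sketch assuming a llama-style architecture with an f16 cache (grouped-query attention, which shrinks the cache, is ignored). The function name and parameters are illustrative, not Ollama's actual API:

```go
// Rough estimate of an f16 KV cache for a llama-style model: one K and one V
// tensor per layer, each holding numCtx * embeddingDim two-byte elements.
// Illustrative only -- not the identifiers used in llm/llm.go.
func estimateKVCacheBytes(numCtx, numLayers, embeddingDim int64) int64 {
	const bytesPerElement = 2 // f16
	return 2 * numCtx * numLayers * embeddingDim * bytesPerElement
}
```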

Details:

  • VRAM estimation now accounts for the kv cache and tensor graph (which can grow to several GiB for large context sizes)
  • On macOS, Ollama will now run in CPU mode, even on Apple Silicon (arm64), if the GPU doesn't have enough VRAM. Models such as mixtral, llama2:70b, etc. will now work (perhaps slowly) instead of crashing
  • On Linux, the number of layers to be offloaded to the GPU now accounts for the kv cache, which is also partially offloaded (a rough sketch of this calculation follows the list)
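
Since the diff itself isn't shown inline here, the sketch below only illustrates the general shape of such a layer-offload calculation: reserve the graph/scratch buffer up front, then greedily offload layers while each layer, plus its share of the kv cache, still fits in free VRAM. All names (freeVRAM, graphBytes, kvCachePerLayer, layerBytes) are placeholders, not the actual identifiers in llm/llm.go:

```go
// Placeholder sketch of a layer-offload decision. The graph/scratch buffer is
// always reserved, and each offloaded layer is charged its per-layer weight
// size plus a proportional slice of the KV cache.
func layersToOffload(freeVRAM, graphBytes, kvCachePerLayer int64, layerBytes []int64) int {
	available := freeVRAM - graphBytes
	offloaded := 0
	for _, size := range layerBytes {
		cost := size + kvCachePerLayer
		if available < cost {
			break
		}
		available -= cost
		offloaded++
	}
	return offloaded
}
```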

Todo in a follow up:

  • Handle smaller batch sizes as mentioned in #1812
  • Still seeing some errors with very large context sizes (64k, 128k)
  • Limit num_ctx to what the model is trained on (see the sketch after this list)
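
For that last item, the intended fix is presumably a simple clamp against the trained context length read from the model metadata; a hypothetical sketch, not the eventual implementation:

```go
// Hypothetical clamp: never request a larger context than the model was trained on.
func clampNumCtx(requested, trainCtx int) int {
	if trainCtx > 0 && requested > trainCtx {
		return trainCtx
	}
	return requested
}
```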

Fixes #1838
Fixes #1812
Fixes #1516
Fixes #1674
Fixes #1374
Fixes #1534
Fixes #1303
Fixes #1413
Fixes #1636
Fixes #1837
Fixes #1627
Fixes #1566
Fixes #1576
Fixes #1703


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 15:32:27 -05:00
Reference: github-starred/ollama#21245