[PR #5469] [MERGED] Prevent loading models larger than total memory #11791

Closed
opened 2026-04-12 23:39:01 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/5469
Author: @dhiltgen
Created: 7/3/2024
Status: Merged
Merged: 7/5/2024
Merged by: @dhiltgen

Base: main ← Head: prevent_system_oom


📝 Commits (1)

  • 3c75113 Prevent loading models larger than total memory

📊 Changes

2 files changed (+38 additions, -0 deletions)


📝 server/sched.go (+26 -0)
📝 server/sched_test.go (+12 -0)

📄 Description

Users may not realize that the shiny new model they're trying to load fits on their disk but can't fit into system+GPU memory. Today we crash; with this fix, we give them a clear error message before even trying to load it.

Fixes #3837 #4955
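A minimal sketch of the kind of pre-load guard this PR adds to server/sched.go: if the estimated model size exceeds free system memory plus free VRAM across all GPUs, fail fast with a descriptive error instead of attempting the load and crashing. The function and parameter names here are illustrative, not the PR's actual identifiers.

```go
package main

import "fmt"

const (
	MiB uint64 = 1 << 20
	GiB uint64 = 1 << 30
)

// checkSystemFit rejects a model request up front when it cannot fit
// into free system memory plus free VRAM across all detected GPUs.
// Sketch only; the real check lives in the scheduler's load path.
func checkSystemFit(modelSize, freeSystem uint64, gpuFree []uint64) error {
	available := freeSystem
	for _, f := range gpuFree {
		available += f
	}
	if modelSize > available {
		return fmt.Errorf("requested model (%.1f GiB) is too large for this system (%.1f GiB)",
			float64(modelSize)/float64(GiB), float64(available)/float64(GiB))
	}
	return nil
}

func main() {
	// Roughly the situation from the manual test below:
	// a 5.5 GiB model vs ~4 GiB free RAM + ~0.5 GiB free VRAM.
	err := checkSystemFit(5632*MiB, 4*GiB, []uint64{549 * MiB})
	fmt.Println(err)
}
```

With these numbers the guard produces the same shape of error the CLI shows in the test below.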

Verified by using stress-ng to saturate system memory and loading a secondary model on another ollama instance to use up GPU memory, then trying to load a model:

% ollama run gemma:7b
Error: requested model (5.5 GiB) is too large for this system (4.5 GiB)

Debug server logs from the manual test

time=2024-07-03T15:12:11.901-07:00 level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="31.3 GiB" before.free="4.0 GiB" now.total="31.3 GiB" now.free="3.9 GiB"
CUDA driver version: 11.4
time=2024-07-03T15:12:12.000-07:00 level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-1c750365-54dc-7082-7c6b-9dd953a68ab6 name="NVIDIA GeForce GTX 1060 6GB" before.total="5.9 GiB" before.free="548.9 MiB" now.total="5.9 GiB" now.free="548.9 MiB" now.used="5.4 GiB"
releasing cuda driver library
time=2024-07-03T15:12:12.000-07:00 level=DEBUG source=sched.go:186 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2024-07-03T15:12:12.033-07:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[548.9 MiB]"
time=2024-07-03T15:12:12.034-07:00 level=DEBUG source=memory.go:168 msg="gpu has too little memory to allocate any layers" gpu="{memInfo:{TotalMemory:6372196352 FreeMemory:575537152} Library:cuda Variant:no vector extensions MinimumMemory:479199232 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:GPU-1c750365-54dc-7082-7c6b-9dd953a68ab6 Name:NVIDIA GeForce GTX 1060 6GB Compute:6.1 DriverMajor:11 DriverMinor:4}"
time=2024-07-03T15:12:12.034-07:00 level=DEBUG source=memory.go:296 msg="insufficient VRAM to load any model layers"
time=2024-07-03T15:12:12.034-07:00 level=WARN source=sched.go:216 msg="model request too large for system" requested="5.5 GiB" system="4.5 GiB"
[GIN] 2024/07/03 - 15:12:12 | 500 |  200.768652ms |      10.16.0.83 | POST     "/api/chat"

The system under test:

% free -h; nvidia-smi
               total        used        free      shared  buff/cache   available
Mem:            31Gi        26Gi       394Mi       197Mi       4.2Gi       3.9Gi
Swap:             0B          0B          0B
Wed Jul  3 15:15:17 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 33%   33C    P8    13W / 120W |   5467MiB /  6077MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1582218      C   ...a_v11/ollama_llama_server     5465MiB |
+-----------------------------------------------------------------------------+

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
