[GH-ISSUE #11015] detected OS VRAM overhead #69321

Open
opened 2026-05-04 17:47:30 -05:00 by GiteaMirror · 2 comments

Originally created by @pihapi on GitHub (Jun 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11015

Originally assigned to: @jessegross on GitHub.

What is the issue?

I have two RTX 3060 cards, and 2 GB out of the 12 GB on each is not taken into account. The cards are not used by anything else. What is this "detected OS VRAM overhead"? There are three cards in total, with one driving the monitor, yet no card gets loaded beyond 10 GB. It seems to me that in older versions the cards were fully loaded, but I can't say for sure; I didn't pay close attention.

Relevant log output

time=2025-06-08T12:31:46.461+05:00 level=INFO source=routes.go:1234 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY:cuda_v12 OLLAMA_LOAD_TIMEOUT:1h0m0s OLLAMA_MAX_LOADED_MODELS:3 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\u\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:5 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-06-08T12:31:46.606+05:00 level=INFO source=images.go:479 msg="total blobs: 149"
time=2025-06-08T12:31:46.641+05:00 level=INFO source=images.go:486 msg="total unused blobs removed: 0"
time=2025-06-08T12:31:46.671+05:00 level=INFO source=routes.go:1287 msg="Listening on 127.0.0.1:11434 (version 0.9.0)"
time=2025-06-08T12:31:46.671+05:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-06-08T12:31:46.671+05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-06-08T12:31:46.671+05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-06-08T12:31:46.963+05:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-f09d2af5-2fd7-5b36-5a41-5d42518d1539 library=cuda compute=8.6 driver=12.7 name="NVIDIA GeForce RTX 3060" overhead="1.9 GiB"
time=2025-06-08T12:31:47.121+05:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-ae75581b-1c00-dd57-68e3-c5651d19235e library=cuda compute=8.6 driver=12.7 name="NVIDIA GeForce RTX 3060" overhead="1.9 GiB"
time=2025-06-08T12:31:47.124+05:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-f09d2af5-2fd7-5b36-5a41-5d42518d1539 library=cuda variant=v12 compute=8.6 driver=12.7 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="10.0 GiB"
time=2025-06-08T12:31:47.124+05:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ae75581b-1c00-dd57-68e3-c5651d19235e library=cuda variant=v12 compute=8.6 driver=12.7 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="10.0 GiB"
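For context, the per-card numbers in the log are consistent with available ≈ total − overhead. A quick sketch using the (rounded) values logged above:

```python
# Rough check against the log: available VRAM ≈ total VRAM - detected overhead.
# Values are the rounded figures from the log lines above, so the result only
# matches to within rounding.
GIB = 1024 ** 3
total = 12.0 * GIB       # total="12.0 GiB" per RTX 3060
overhead = 1.9 * GIB     # overhead="1.9 GiB" ("detected OS VRAM overhead")
available = total - overhead
print(f"{available / GIB:.1f} GiB")  # ≈ 10.1 GiB; the log rounds to "10.0 GiB"
```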

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.9.0

GiteaMirror added the bug label 2026-05-04 17:47:30 -05:00

@pihapi commented on GitHub (Jun 8, 2025):

As a result, system RAM usage goes above what is necessary, even though the cards could be used to their full capacity.


@rick-github commented on GitHub (Jun 8, 2025):

OS VRAM overhead is the difference between free VRAM [reported](https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g387248458ef4dfe9afe425280f420f41) by the GPU management library and free VRAM [reported](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g376b97f5ab20321ca46f7cfa9511b978) by the card. It's the overhead added by the management layer.
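In other words, the overhead is computed as NVML's free-memory figure (`nvmlDeviceGetMemoryInfo`) minus what the CUDA runtime (`cudaMemGetInfo`) says is actually allocatable. A minimal sketch of that subtraction, using hypothetical readings for one 12 GiB card (the sample numbers are illustrative, not measured):

```python
def os_vram_overhead(nvml_free_bytes: int, cuda_free_bytes: int) -> int:
    """Overhead added by the OS/management layer: the gap between the free
    VRAM reported by the management library (NVML) and the free VRAM the
    CUDA runtime reports as allocatable. Clamped at zero."""
    return max(0, nvml_free_bytes - cuda_free_bytes)

GIB = 1024 ** 3
# Hypothetical sample readings for a single 12 GiB RTX 3060:
nvml_free = 12 * GIB            # e.g. from nvmlDeviceGetMemoryInfo -> free
cuda_free = int(10.1 * GIB)     # e.g. from cudaMemGetInfo -> free
print(f"{os_vram_overhead(nvml_free, cuda_free) / GIB:.1f} GiB overhead")
```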


Reference: github-starred/ollama#69321