[GH-ISSUE #12223] OLLAMA_GPU_OVERHEAD is not respected #54646

Closed
opened 2026-04-29 06:45:31 -05:00 by GiteaMirror · 6 comments

Originally created by @markemus on GitHub (Sep 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12223

What is the issue?

Hey, I'm trying to reserve a certain amount of GPU memory for other applications, and setting this flag before calling `ollama serve` seems to do nothing. Am I misunderstanding what the flag does? I need GPU space available for another application and Ollama is using too much. I want to run the model partially on CPU, and it needs to happen dynamically to support multiple cards.
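For context, `OLLAMA_GPU_OVERHEAD` is documented as the number of bytes of VRAM to reserve per GPU. A minimal sketch of what reserving 8 GiB would look like before starting the server, assuming Windows cmd.exe and a default install:

```
:: 8 GiB = 8 * 1024^3 = 8589934592 bytes; note: no spaces around "="
set OLLAMA_GPU_OVERHEAD=8589934592
ollama serve
```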

Relevant log output

```
C:\Users\Markemus>ollama ps
NAME                      ID              SIZE     PROCESSOR    UNTIL
orieg/gemma3-tools:27b    e9bddb0fafe2    27 GB    100% GPU     4 minutes from now
```

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.7.0

GiteaMirror added the bug label 2026-04-29 06:45:31 -05:00

@rick-github commented on GitHub (Sep 9, 2025):

What value are you setting in `OLLAMA_GPU_OVERHEAD`? [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may help in debugging.
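On Windows, the linked troubleshooting guide points at the server log under `%LOCALAPPDATA%\Ollama`; a quick way to get at it from cmd, assuming a default install:

```
:: open the log directory; server.log holds the startup config and memory estimates
explorer %LOCALAPPDATA%\Ollama
```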


@markemus commented on GitHub (Sep 9, 2025):

20000000.

```
(base) C:\Users\Markemus>set OLLAMA_GPU_OVERHEAD="20000000"

(base) C:\Users\Markemus>ollama serve
time=2025-09-09T14:19:32.637-04:00 level=INFO source=routes.go:1205 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:20000000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\Markemus\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-09-09T14:19:32.639-04:00 level=INFO source=images.go:463 msg="total blobs: 4"
time=2025-09-09T14:19:32.640-04:00 level=INFO source=images.go:470 msg="total unused blobs removed: 0"
time=2025-09-09T14:19:32.641-04:00 level=INFO source=routes.go:1258 msg="Listening on 127.0.0.1:11434 (version 0.7.0)"
time=2025-09-09T14:19:32.641-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-09-09T14:19:32.641-04:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-09-09T14:19:32.641-04:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-09-09T14:19:32.890-04:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-b9023680-1c0d-f981-1ef1-4336683033a4 library=cuda compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5090" overhead="261.4 MiB"
time=2025-09-09T14:19:32.894-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-b9023680-1c0d-f981-1ef1-4336683033a4 library=cuda variant=v12 compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5090" total="31.8 GiB" available="30.1 GiB"
[GIN] 2025/09/09 - 14:19:46 | 200 |       532.5µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/09/09 - 14:19:46 | 200 |      1.0346ms |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/09/09 - 14:19:48 | 200 |       523.5µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/09/09 - 14:19:48 | 200 |      8.6655ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/09/09 - 14:20:10 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/09/09 - 14:20:13 | 200 |    2.3229787s |       127.0.0.1 | POST     "/api/show"
time=2025-09-09T14:20:13.388-04:00 level=INFO source=sched.go:777 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Markemus\.ollama\models\blobs\sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87 gpu=GPU-b9023680-1c0d-f981-1ef1-4336683033a4 parallel=1 available=32316944384 required="25.6 GiB"
time=2025-09-09T14:20:13.402-04:00 level=INFO source=server.go:135 msg="system memory" total="31.9 GiB" free="22.7 GiB" free_swap="44.3 GiB"
time=2025-09-09T14:20:13.405-04:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[30.1 GiB]" memory.gpu_overhead="19.1 MiB" memory.required.full="25.6 GiB" memory.required.partial="25.6 GiB" memory.required.kv="2.8 GiB" memory.required.allocations="[25.6 GiB]" memory.weights.total="16.0 GiB" memory.weights.repeating="13.4 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="1.9 GiB" memory.graph.partial="2.0 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
time=2025-09-09T14:20:13.463-04:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\Markemus\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\Markemus\\.ollama\\models\\blobs\\sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87 --ctx-size 28872 --batch-size 512 --n-gpu-layers 63 --threads 8 --no-mmap --parallel 1 --port 56756"
time=2025-09-09T14:20:13.467-04:00 level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-09-09T14:20:13.467-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-09-09T14:20:13.469-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-09-09T14:20:13.549-04:00 level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-09-09T14:20:13.551-04:00 level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:56756"
time=2025-09-09T14:20:13.591-04:00 level=INFO source=ggml.go:73 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1247 num_key_values=40
time=2025-09-09T14:20:13.721-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
load_backend: loaded CPU backend from C:\Users\Markemus\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Markemus\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-09-09T14:20:14.448-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-09-09T14:20:14.676-04:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CUDA0 size="16.8 GiB"
time=2025-09-09T14:20:14.676-04:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="2.6 GiB"
time=2025-09-09T14:20:26.491-04:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.8 GiB"
time=2025-09-09T14:20:26.491-04:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="10.5 MiB"
time=2025-09-09T14:20:26.507-04:00 level=INFO source=server.go:630 msg="llama runner started in 13.04 seconds"
[GIN] 2025/09/09 - 14:20:26 | 200 |   13.2275041s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/09/09 - 14:20:28 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/09/09 - 14:20:28 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
```

@rick-github commented on GitHub (Sep 9, 2025):

20000000 bytes is about 0.06% of the available VRAM; it's not going to make a difference.
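Spelled out, with the 30.1 GiB of available VRAM from the log above:

$$\frac{2\times10^{7}\ \text{bytes}}{2^{20}\ \text{bytes/MiB}} \approx 19.1\ \text{MiB},\qquad \frac{19.1\ \text{MiB}}{30.1\times1024\ \text{MiB}} \approx 0.062\%$$

which matches the `memory.gpu_overhead="19.1 MiB"` field in the offload log.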


@markemus commented on GitHub (Sep 9, 2025):

I tried adding more and it's still using the full GPU:

```
(base) C:\Users\Markemus>set OLLAMA_GPU_OVERHEAD = 20000000000000

(base) C:\Users\Markemus>ollama serve
time=2025-09-09T16:12:28.798-04:00 level=INFO source=routes.go:1205 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\Markemus\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-09-09T16:12:28.800-04:00 level=INFO source=images.go:463 msg="total blobs: 4"
time=2025-09-09T16:12:28.801-04:00 level=INFO source=images.go:470 msg="total unused blobs removed: 0"
time=2025-09-09T16:12:28.802-04:00 level=INFO source=routes.go:1258 msg="Listening on 127.0.0.1:11434 (version 0.7.0)"
time=2025-09-09T16:12:28.802-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-09-09T16:12:28.802-04:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-09-09T16:12:28.802-04:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-09-09T16:12:29.052-04:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-b9023680-1c0d-f981-1ef1-4336683033a4 library=cuda compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5090" overhead="304.6 MiB"
time=2025-09-09T16:12:29.055-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-b9023680-1c0d-f981-1ef1-4336683033a4 library=cuda variant=v12 compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5090" total="31.8 GiB" available="30.1 GiB"
[GIN] 2025/09/09 - 16:12:35 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/09/09 - 16:12:35 | 200 |     61.6542ms |       127.0.0.1 | POST     "/api/show"
time=2025-09-09T16:12:35.572-04:00 level=INFO source=sched.go:777 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Markemus\.ollama\models\blobs\sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87 gpu=GPU-b9023680-1c0d-f981-1ef1-4336683033a4 parallel=1 available=32339984384 required="25.6 GiB"
time=2025-09-09T16:12:35.589-04:00 level=INFO source=server.go:135 msg="system memory" total="31.9 GiB" free="23.1 GiB" free_swap="44.7 GiB"
time=2025-09-09T16:12:35.592-04:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[30.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="25.6 GiB" memory.required.partial="25.6 GiB" memory.required.kv="2.8 GiB" memory.required.allocations="[25.6 GiB]" memory.weights.total="16.0 GiB" memory.weights.repeating="13.4 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="1.9 GiB" memory.graph.partial="2.0 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
time=2025-09-09T16:12:35.647-04:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\Markemus\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\Markemus\\.ollama\\models\\blobs\\sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87 --ctx-size 28872 --batch-size 512 --n-gpu-layers 63 --threads 8 --no-mmap --parallel 1 --port 57137"
time=2025-09-09T16:12:35.651-04:00 level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-09-09T16:12:35.651-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-09-09T16:12:35.651-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-09-09T16:12:35.728-04:00 level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-09-09T16:12:35.730-04:00 level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:57137"
time=2025-09-09T16:12:35.770-04:00 level=INFO source=ggml.go:73 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1247 num_key_values=40
load_backend: loaded CPU backend from C:\Users\Markemus\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2025-09-09T16:12:35.903-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Markemus\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-09-09T16:12:35.908-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-09-09T16:12:36.093-04:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="2.6 GiB"
time=2025-09-09T16:12:36.093-04:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CUDA0 size="16.8 GiB"
[GIN] 2025/09/09 - 16:12:38 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/09/09 - 16:12:38 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-09-09T16:12:43.419-04:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.8 GiB"
time=2025-09-09T16:12:43.419-04:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="10.5 MiB"
time=2025-09-09T16:12:43.426-04:00 level=INFO source=server.go:630 msg="llama runner started in 7.78 seconds"
[GIN] 2025/09/09 - 16:12:43 | 200 |    7.9597308s |       127.0.0.1 | POST     "/api/generate"
```

```
(base) C:\Users\Markemus>ollama ps
NAME                      ID              SIZE     PROCESSOR    UNTIL
orieg/gemma3-tools:27b    e9bddb0fafe2    27 GB    100% GPU     4 minutes from now
```

@rick-github commented on GitHub (Sep 9, 2025):

```
OLLAMA_GPU_OVERHEAD:0
```
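That `OLLAMA_GPU_OVERHEAD:0` in the server config comes from how cmd.exe parses `set`: with spaces around `=`, `set OLLAMA_GPU_OVERHEAD = 20000000000000` defines a variable literally named `OLLAMA_GPU_OVERHEAD ` with a trailing space, so the real variable stays unset and defaults to 0. A minimal sketch of the difference, assuming cmd.exe rather than PowerShell:

```
:: WRONG: cmd takes everything after "set " literally, so the variable
:: name gets a trailing space and ollama never sees it (defaults to 0)
set OLLAMA_GPU_OVERHEAD = 20000000000000

:: RIGHT: no spaces around "=", value in bytes, then restart the server
set OLLAMA_GPU_OVERHEAD=20000000000000
ollama serve
```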

@markemus commented on GitHub (Sep 9, 2025):

Thanks, once I set the parameter correctly it is working. Sorry for the trouble.
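A quick sanity check before restarting the server, assuming cmd.exe: `set NAME` with no `=` prints every variable whose name starts with `NAME`, so a mangled name with a trailing space shows up immediately.

```
:: list matching variables; the name must print with no trailing space
set OLLAMA_GPU_OVERHEAD
:: expected:
::   OLLAMA_GPU_OVERHEAD=20000000000000
```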
