[GH-ISSUE #10486] Ollama Windows using CPU instead of 2 GPUs #6897

Closed
opened 2026-04-12 18:46:28 -05:00 by GiteaMirror · 3 comments

Originally created by @mesquitafmr on GitHub (Apr 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10486

What is the issue?

I have 2 GPUs on my machine with a combined VRAM of 28 GB: an RTX 5070 Ti and an RTX 3060. When I try to load a model that needs more memory than one GPU has, it falls back to the CPU and doesn't use the other GPU at all. The server logs detect both devices. Am I missing something?

Relevant log output

2025/04/29 16:17:51 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\Ollama OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-04-29T16:17:51.858-03:00 level=INFO source=images.go:458 msg="total blobs: 11"
time=2025-04-29T16:17:51.858-03:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-04-29T16:17:51.859-03:00 level=INFO source=routes.go:1299 msg="Listening on [::]:11434 (version 0.6.6)"
time=2025-04-29T16:17:51.859-03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-04-29T16:17:51.859-03:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-04-29T16:17:51.859-03:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-04-29T16:17:52.046-03:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-436c21e2-82bf-4549-fb06-2500d63511e5 library=cuda compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3060" overhead="867.5 MiB"
time=2025-04-29T16:17:52.354-03:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="24.9 GiB"
time=2025-04-29T16:17:52.356-03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-69cd42ba-857b-b52e-9696-fa403086fcf1 library=cuda variant=v12 compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5070 Ti" total="15.9 GiB" available="14.6 GiB"
time=2025-04-29T16:17:52.356-03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-436c21e2-82bf-4549-fb06-2500d63511e5 library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
[GIN] 2025/04/29 - 16:17:59 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2025-04-29T16:17:59.543-03:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-29T16:17:59.559-03:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
[GIN] 2025/04/29 - 16:17:59 | 200 |     33.4988ms |       127.0.0.1 | POST     "/api/show"
time=2025-04-29T16:17:59.582-03:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-29T16:17:59.632-03:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-29T16:17:59.647-03:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-29T16:17:59.697-03:00 level=INFO source=server.go:105 msg="system memory" total="61.7 GiB" free="45.4 GiB" free_swap="39.7 GiB"
time=2025-04-29T16:17:59.698-03:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=62 layers.split=31,31 memory.available="[13.2 GiB 11.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="23.1 GiB" memory.required.partial="20.5 GiB" memory.required.kv="784.0 MiB" memory.required.allocations="[11.1 GiB 9.4 GiB]" memory.weights.total="16.0 GiB" memory.weights.repeating="13.4 GiB" memory.weights.nonrepeating="2.6 GiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.6 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
time=2025-04-29T16:17:59.721-03:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-29T16:17:59.724-03:00 level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-29T16:17:59.728-03:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-29T16:17:59.728-03:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-29T16:17:59.728-03:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-29T16:17:59.728-03:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-29T16:17:59.736-03:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\mesqu\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\Ollama\\blobs\\sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87 --ctx-size 2048 --batch-size 512 --n-gpu-layers 62 --threads 6 --no-mmap --parallel 1 --tensor-split 31,31 --port 59575"
time=2025-04-29T16:17:59.740-03:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-29T16:17:59.740-03:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-29T16:17:59.741-03:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-29T16:17:59.764-03:00 level=INFO source=runner.go:866 msg="starting ollama engine"
time=2025-04-29T16:17:59.770-03:00 level=INFO source=runner.go:929 msg="Server listening on 127.0.0.1:59575"
time=2025-04-29T16:17:59.794-03:00 level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-29T16:17:59.796-03:00 level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
time=2025-04-29T16:17:59.796-03:00 level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
time=2025-04-29T16:17:59.796-03:00 level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1247 num_key_values=40
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\mesqu\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\mesqu\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-04-29T16:17:59.927-03:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-04-29T16:17:59.992-03:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
time=2025-04-29T16:18:00.011-03:00 level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="6.0 GiB"
time=2025-04-29T16:18:00.011-03:00 level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="6.7 GiB"
time=2025-04-29T16:18:00.011-03:00 level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="6.7 GiB"
time=2025-04-29T16:18:02.261-03:00 level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-29T16:18:02.275-03:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-29T16:18:02.275-03:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-29T16:18:02.275-03:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-29T16:18:02.275-03:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-29T16:18:02.322-03:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="153.5 MiB"
time=2025-04-29T16:18:02.322-03:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="156.0 MiB"
time=2025-04-29T16:18:02.322-03:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="10.5 MiB"
time=2025-04-29T16:18:02.498-03:00 level=INFO source=server.go:619 msg="llama runner started in 2.76 seconds"
[GIN] 2025/04/29 - 16:18:02 | 200 |    2.9320773s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/04/29 - 16:18:07 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/04/29 - 16:18:07 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-04-29T16:23:07.526-03:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0271288 model=D:\Ollama\blobs\sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87
time=2025-04-29T16:23:07.776-03:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2771421 model=D:\Ollama\blobs\sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87
time=2025-04-29T16:23:08.026-03:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5270175 model=D:\Ollama\blobs\sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87
[GIN] 2025/04/29 - 16:24:24 | 200 |      1.1665ms |   192.168.3.130 | GET      "/api/tags"
[GIN] 2025/04/29 - 16:24:24 | 200 |            0s |   192.168.3.130 | GET      "/api/version"
[GIN] 2025/04/29 - 16:24:36 | 200 |       998.4µs |   192.168.3.130 | GET      "/api/tags"
[GIN] 2025/04/29 - 16:24:40 | 200 |       536.7µs |   192.168.3.130 | GET      "/api/tags"
[GIN] 2025/04/29 - 16:24:51 | 200 |       999.1µs |   192.168.3.130 | GET      "/api/tags"

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.6.6

GiteaMirror added the bug label 2026-04-12 18:46:28 -05:00

@rick-github commented on GitHub (Apr 29, 2025):

time=2025-04-29T16:17:59.698-03:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1
 layers.model=63 layers.offload=62 layers.split=31,31 memory.available="[13.2 GiB 11.0 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="23.1 GiB" memory.required.partial="20.5 GiB" memory.required.kv="784.0 MiB"
 memory.required.allocations="[11.1 GiB 9.4 GiB]" memory.weights.total="16.0 GiB" memory.weights.repeating="13.4 GiB"
 memory.weights.nonrepeating="2.6 GiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.6 GiB"
 projector.weights="806.2 MiB" projector.graph="1.0 GiB"

time=2025-04-29T16:18:00.011-03:00 level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="6.7 GiB"
time=2025-04-29T16:18:00.011-03:00 level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="6.7 GiB"

Logs indicate that ollama thinks it is using both GPUs. Note that OLLAMA_KEEP_ALIVE (see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately) is unset or set to the default of 5 minutes, so 5 minutes after your last generation (16:18) the model was unloaded (16:23).

[GIN] 2025/04/29 - 16:18:02 | 200 |    2.9320773s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/04/29 - 16:18:07 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/04/29 - 16:18:07 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
time=2025-04-29T16:23:07.526-03:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0271288 model=D:\Ollama\blobs\sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87
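
If you want the model to stay resident longer than the 5 minute default, you can raise the keep-alive. A minimal sketch, assuming the Windows app picks up user environment variables after a restart and that the model name matches the one pulled above:

```
# Persist a longer keep-alive for new processes; restart the Ollama app afterwards.
setx OLLAMA_KEEP_ALIVE "1h"

# Or override it per request (keep_alive is a regular field on /api/generate);
# on PowerShell use curl.exe rather than the curl alias for Invoke-WebRequest.
curl.exe http://localhost:11434/api/generate -d "{\"model\":\"gemma3:27b-it-qat\",\"prompt\":\"hello\",\"keep_alive\":\"1h\"}"
```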

@mesquitafmr commented on GitHub (Apr 29, 2025):

Yes, I just ran the model to test whether it was going into GPU memory. OLLAMA_KEEP_ALIVE is at its default. If you are suggesting a fix, I didn't understand it.

PS C:\Users\mesqu> ollama run gemma3:27b-it-qat
>>>
PS C:\Users\mesqu> ollama ps
NAME                 ID              SIZE     PROCESSOR          UNTIL
gemma3:27b-it-qat    29eb0b9aeda3    24 GB    11%/89% CPU/GPU    4 minutes from now

@rick-github commented on GitHub (Apr 29, 2025):

This shows that the model is loaded into both VRAM and system RAM: ollama estimated it could fit only 62 of the model's 63 layers in VRAM, so the remainder spilled to system RAM, and both GPUs and the CPU are involved in inference. Because only one layer is spilling, you can try forcing the runner to load all layers into VRAM by setting num_gpu=63, as shown in https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650.
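
For reference, a minimal sketch of setting num_gpu for this model, either interactively or per API request (assumes the same gemma3:27b-it-qat tag as above):

```
# In an interactive session:
#   ollama run gemma3:27b-it-qat
#   >>> /set parameter num_gpu 63
#
# Or per request through the API options field (use curl.exe on PowerShell):
curl.exe http://localhost:11434/api/generate -d "{\"model\":\"gemma3:27b-it-qat\",\"prompt\":\"hello\",\"options\":{\"num_gpu\":63}}"
```

Note that forcing num_gpu higher than what actually fits can make the load fail with an out-of-memory error instead of spilling to the CPU.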

Reference: github-starred/ollama#6897