[GH-ISSUE #10797] GPU is sufficient but it allocates CPU for work #7089

Closed
opened 2026-04-12 19:01:47 -05:00 by GiteaMirror · 7 comments

Originally created by @peddy-legoo on GitHub (May 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10797

```
2025/05/21 17:21:42 routes.go:1195: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\Administrator\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2025-05-21T17:21:42.042+08:00 level=INFO source=images.go:753 msg="total blobs: 17"
time=2025-05-21T17:21:42.042+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2025-05-21T17:21:42.043+08:00 level=INFO source=routes.go:1246 msg="Listening on 127.0.0.1:11434 (version 0.5.0)"
time=2025-05-21T17:21:42.043+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm]"
time=2025-05-21T17:21:42.043+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2025-05-21T17:21:42.043+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=2
time=2025-05-21T17:21:42.043+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-05-21T17:21:42.043+08:00 level=INFO source=gpu_windows.go:214 msg="" package=1 cores=8 efficiency=0 threads=8
time=2025-05-21T17:21:42.313+08:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-e3ae5033-8e39-4418-42df-16034761d789 library=cuda variant=v11 compute=8.6 driver=11.4 name="NVIDIA RTX A6000" total="48.0 GiB" available="46.5 GiB"
[GIN] 2025/05/21 - 17:22:14 | 200 | 2.0001ms | 127.0.0.1 | GET "/v1/models"
[GIN] 2025/05/21 - 17:22:14 | 200 | 1.0001ms | 127.0.0.1 | GET "/v1/models"
time=2025-05-21T17:22:27.293+08:00 level=INFO source=server.go:105 msg="system memory" total="127.7 GiB" free="121.2 GiB" free_swap="244.0 GiB"
time=2025-05-21T17:22:27.307+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=0 layers.split="" memory.available="[17.8 MiB]" memory.gpu_overhead="0 B" memory.required.full="12.0 GiB" memory.required.partial="0 B" memory.required.kv="384.0 MiB" memory.required.allocations="[0 B]" memory.weights.total="11.0 GiB" memory.weights.repeating="10.0 GiB" memory.weights.nonrepeating="1020.0 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2025-05-21T17:22:27.307+08:00 level=WARN source=server.go:219 msg="flash attention enabled but not supported by gpu"
time=2025-05-21T17:22:27.307+08:00 level=WARN source=server.go:242 msg="quantized kv cache requested but flash attention disabled" type=q8_0
time=2025-05-21T17:22:27.309+08:00 level=INFO source=server.go:397 msg="starting llama server" cmd="C:\Users\Administrator\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe --model C:\Users\Administrator\.ollama\models\blobs\sha256-728e7e4ac6e65cd68bf0d6c3ebf2e9944b19d3ad2da49ab53265457f6de1f02c --ctx-size 2048 --batch-size 512 --threads 16 --no-mmap --parallel 1 --port 50859"
time=2025-05-21T17:22:27.310+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-05-21T17:22:27.310+08:00 level=INFO source=server.go:576 msg="waiting for llama runner to start responding"
time=2025-05-21T17:22:27.319+08:00 level=INFO source=runner.go:941 msg="starting go runner"
time=2025-05-21T17:22:27.321+08:00 level=INFO source=runner.go:942 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(clang)" threads=16
time=2025-05-21T17:22:27.321+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:50859"
```


@rick-github commented on GitHub (May 21, 2025):

```
time=2025-05-21T17:22:27.307+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1
 layers.model=49 layers.offload=0 layers.split="" memory.available="[17.8 MiB]" memory.gpu_overhead="0 B"
 memory.required.full="12.0 GiB" memory.required.partial="0 B" memory.required.kv="384.0 MiB"
 memory.required.allocations="[0 B]" memory.weights.total="11.0 GiB" memory.weights.repeating="10.0 GiB"
 memory.weights.nonrepeating="1020.0 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
```

When Ollama calculated the amount of memory required for GPU offloading, the GPU reported only 17.8 MiB available. Is there anything else using the GPU? What's the output of `nvidia-smi`? You are also using a relatively old version of Ollama; does upgrading help? If you set `OLLAMA_DEBUG=1`, more information will be logged that may help.
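For reference, the checks suggested above can be run as follows (standard `nvidia-smi` query flags; the PowerShell lines assume the server is started with `ollama serve`):

```
# One-shot report of what the driver says is free (the number Ollama plans offload against):
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv

# Enable verbose Ollama logging for the next run (PowerShell), then restart the server:
$env:OLLAMA_DEBUG = "1"
ollama serve
```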


@peddy-legoo commented on GitHub (May 22, 2025):

> ```
> time=2025-05-21T17:22:27.307+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1
>  layers.model=49 layers.offload=0 layers.split="" memory.available="[17.8 MiB]" memory.gpu_overhead="0 B"
>  memory.required.full="12.0 GiB" memory.required.partial="0 B" memory.required.kv="384.0 MiB"
>  memory.required.allocations="[0 B]" memory.weights.total="11.0 GiB" memory.weights.repeating="10.0 GiB"
>  memory.weights.nonrepeating="1020.0 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
> ```
>
> When Ollama calculated the amount of memory required for GPU offloading, the GPU reported only 17.8 MiB available. Is there anything else using the GPU? What's the output of `nvidia-smi`? You are also using a relatively old version of Ollama; does upgrading help? If you set `OLLAMA_DEBUG=1`, more information will be logged that may help.


I am using Windows 7, onto which I force-installed Ollama. The `nvidia-smi` panel is not useful, as it always shows the VRAM as full. Do you know how Ollama reads and allocates VRAM? There are no other processes using the GPU in the background. Before running this Ollama model the VRAM was fine, but once I ran the model there was suddenly only 17 MB of space left. It's perplexing.
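For context: Ollama does not meter VRAM itself; it asks the NVIDIA driver libraries how much memory is free and plans layer offload from that number, so whatever is consuming the 48 GiB should also be visible from the driver's side. A rough way to watch that while a model loads (standard `nvidia-smi` flags; drivers of the Windows 7 era may not support all of them):

```
# Poll used/free VRAM once per second while loading the model in another window:
nvidia-smi --query-gpu=timestamp,memory.used,memory.free --format=csv -l 1

# List the processes currently holding VRAM and how much each one has:
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```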


@rick-github commented on GitHub (May 22, 2025):

Windows 7 is not officially supported, so the program, drivers, and OS could be interacting in weird ways. If `nvidia-smi` reports full VRAM, that aligns with what Ollama is seeing. Is upgrading your OS an option?


@donotsdubba commented on GitHub (Sep 16, 2025):

> Is upgrading your OS an option?

@rick-github Just to clarify: backdooring and bloating a system with spyware and other malware is _**not**_ an **_upgrade_**. Please fix and upgrade Ollama to support more ethical OSes. (And no, sadly, even GNU/Linux is compromised these days.)

Basic privacy and human rights are generally why Windows 7 users never moved past it, rather than any other reason. Not that it was ever a privacy haven, but 8 and later crossed the final line that must never be accepted.


@rick-github commented on GitHub (Sep 16, 2025):

I suggest you take this up with Nvidia then and ask them to provide device support for Windows 7.


@donotsdubba commented on GitHub (Sep 20, 2025):

> I suggest you take this up with Nvidia then and ask them to provide device support for Windows 7.

This is 100% unrelated: Windows 7 has working NVIDIA drivers and GPUs (AMD too), and Ollama v0.1.29 already worked on Windows 7 out of the box with VxKex. [Even v0.1.44 did with the tiniest bit of patching](https://github.com/ollama/ollama/issues/3916#issuecomment-2254333276), as would much later versions.

This block is entirely artificial, with not a single technical merit to it. Please spare us the nonsensical excuses. There is not a single true OS-level dependency of Ollama for Windows that is available in 10 or 11 but not in 7. The VxKex API bridge should **_not_** be needed.

Either you are pro device ownership and basic privacy, or against it.


@rick-github commented on GitHub (Sep 20, 2025):

Source is available, maximizing device ownership and basic privacy.
