[GH-ISSUE #15237] New Gemma 4 models run on CPU but report running on GPU (FA enabled) #56258

Closed
opened 2026-04-29 10:30:09 -05:00 by GiteaMirror · 42 comments
Owner

Originally created by @sammyvoncheese on GitHub (Apr 2, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15237

What is the issue?

Update 4/4: The issue is related to the flash attention (FA) feature being turned on.

Models seem to load onto the GPU, then inference jumps to the CPU. `ollama ps` shows the model running on the GPU.

I tried the 2b and 4b bf16 models and the 26/31b q4 models on a 5090 with context set to 130k.

Example output from ps.
NAME                  ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:e2b-it-bf16    850bc7fea32f    12 GB    100% GPU     130000     57 minutes from now
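
To compare behavior with FA on and off, a quick retest sketch (PowerShell, assuming the server is started manually rather than via the tray app; `gemma4:e2b-it-bf16` is the tag from above):

```powershell
# Stop any running Ollama instance first, then start a server with FA disabled
$env:OLLAMA_FLASH_ATTENTION = "0"
ollama serve

# In a second terminal: load the model, then check the reported placement
ollama run gemma4:e2b-it-bf16 "hello"
ollama ps
```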

From the log:
time=2026-04-02T14:00:19.543-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)

Relevant log output

time=2026-04-02T13:59:15.592-04:00 level=INFO source=routes.go:1744 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:DEBUG OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:d:\\dev\\models\\llm OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES:]"
time=2026-04-02T13:59:15.593-04:00 level=INFO source=routes.go:1746 msg="Ollama cloud disabled: false"
time=2026-04-02T13:59:15.618-04:00 level=INFO source=images.go:499 msg="total blobs: 650"
time=2026-04-02T13:59:15.630-04:00 level=INFO source=images.go:506 msg="total unused blobs removed: 0"
time=2026-04-02T13:59:15.635-04:00 level=INFO source=routes.go:1802 msg="Listening on 127.0.0.1:11434 (version 0.20.0-rc0)"
time=2026-04-02T13:59:15.635-04:00 level=DEBUG source=sched.go:145 msg="starting llm scheduler"
time=2026-04-02T13:59:15.635-04:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-04-02T13:59:15.646-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 52435"
time=2026-04-02T13:59:15.646-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v12;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v12
time=2026-04-02T13:59:15.920-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=281.4062ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v12]" extra_envs=map[]
time=2026-04-02T13:59:15.921-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 52440"
time=2026-04-02T13:59:15.921-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
time=2026-04-02T13:59:16.189-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=268.9693ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13]" extra_envs=map[]
time=2026-04-02T13:59:16.189-04:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
time=2026-04-02T13:59:16.189-04:00 level=DEBUG source=runner.go:124 msg="evaluating which, if any, devices to filter out" initial_count=2
time=2026-04-02T13:59:16.189-04:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=D:\dev\ollama-server\ollama\lib\ollama\cuda_v12 description="NVIDIA GeForce RTX 5090" compute=12.0 id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 pci_id=0000:01:00.0
time=2026-04-02T13:59:16.189-04:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=D:\dev\ollama-server\ollama\lib\ollama\cuda_v13 description="NVIDIA GeForce RTX 5090" compute=12.0 id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 pci_id=0000:01:00.0
time=2026-04-02T13:59:16.190-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 52445"
time=2026-04-02T13:59:16.190-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 52446"
time=2026-04-02T13:59:16.190-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v12;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 GGML_CUDA_INIT=1
time=2026-04-02T13:59:16.190-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v13 CUDA_VISIBLE_DEVICES=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 GGML_CUDA_INIT=1
time=2026-04-02T13:59:16.354-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=164.5379ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13]" extra_envs="map[CUDA_VISIBLE_DEVICES:GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 GGML_CUDA_INIT:1]"
time=2026-04-02T13:59:16.465-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=275.8555ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v12]" extra_envs="map[CUDA_VISIBLE_DEVICES:GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 GGML_CUDA_INIT:1]"
time=2026-04-02T13:59:16.465-04:00 level=DEBUG source=runner.go:401 msg="filtering device with overlapping libraries" id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 library=D:\dev\ollama-server\ollama\lib\ollama\cuda_v12 delete_index=0 kept_library=D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
time=2026-04-02T13:59:16.465-04:00 level=DEBUG source=runner.go:40 msg="GPU bootstrap discovery took" duration=829.9088ms
time=2026-04-02T13:59:16.465-04:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 filter_id="" library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5090" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:01:00.0 type=discrete total="31.8 GiB" available="30.3 GiB"
time=2026-04-02T13:59:16.465-04:00 level=INFO source=routes.go:1852 msg="vram-based default context" total_vram="31.8 GiB" default_num_ctx=32768
[GIN] 2026/04/02 - 14:00:08 | 200 |     26.9792ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     28.1036ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     26.0192ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |      23.369ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     24.2709ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     24.3459ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     22.4509ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     22.9703ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:18 | 200 |     25.8204ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:18 | 200 |     23.8954ms |       127.0.0.1 | GET      "/api/tags"
time=2026-04-02T14:00:19.013-04:00 level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2026-04-02T14:00:19.013-04:00 level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
time=2026-04-02T14:00:19.016-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 58380"
time=2026-04-02T14:00:19.017-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
time=2026-04-02T14:00:19.298-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=284.433ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13]" extra_envs=map[]
time=2026-04-02T14:00:19.298-04:00 level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=284.433ms
time=2026-04-02T14:00:19.298-04:00 level=INFO source=cpu_windows.go:148 msg=packages count=1
time=2026-04-02T14:00:19.298-04:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=16 efficiency=0 threads=32
time=2026-04-02T14:00:19.298-04:00 level=DEBUG source=sched.go:229 msg="loading first model" model=d:\dev\models\llm\blobs\sha256-cbdeb708e2000122364bf1a63b8aa009504201863def6fb69da784681866a6c6
time=2026-04-02T14:00:19.361-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-02T14:00:19.399-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2026-04-02T14:00:19.402-04:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_count_kv default=0
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.head_count_kv default="&{size:0 values:[]}"
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count default=0
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.num_mel_bins default=128
time=2026-04-02T14:00:19.402-04:00 level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-04-02T14:00:19.403-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --model d:\\dev\\models\\llm\\blobs\\sha256-cbdeb708e2000122364bf1a63b8aa009504201863def6fb69da784681866a6c6 --port 58385"
time=2026-04-02T14:00:19.403-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
time=2026-04-02T14:00:19.405-04:00 level=INFO source=sched.go:484 msg="system memory" total="93.6 GiB" free="76.2 GiB" free_swap="76.3 GiB"
time=2026-04-02T14:00:19.405-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 library=CUDA available="29.8 GiB" free="30.3 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-04-02T14:00:19.406-04:00 level=INFO source=server.go:759 msg="loading model" "model layers"=43 requested=-1
time=2026-04-02T14:00:19.435-04:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-02T14:00:19.436-04:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:58385"
time=2026-04-02T14:00:19.438-04:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:130000 KvCacheType: NumThreads:16 GPULayers:43[ID:GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 Layers:43(0..42)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-02T14:00:19.471-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-02T14:00:19.473-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
time=2026-04-02T14:00:19.473-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
time=2026-04-02T14:00:19.473-04:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=F16 name="" description="" num_tensors=2131 num_key_values=55
time=2026-04-02T14:00:19.473-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=D:\dev\ollama-server\ollama\lib\ollama
load_backend: loaded CPU backend from D:\dev\ollama-server\ollama\lib\ollama\ggml-cpu-icelake.dll
time=2026-04-02T14:00:19.485-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, ID: GPU-68a69638-eb9a-ef06-c025-5d8b66415f00
load_backend: loaded CUDA backend from D:\dev\ollama-server\ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-04-02T14:00:19.543-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-04-02T14:00:19.552-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0
time=2026-04-02T14:00:19.552-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2026-04-02T14:00:19.553-04:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_count_kv default=0
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.head_count_kv default="&{size:0 values:[]}"
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count default=0
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.num_mel_bins default=128
time=2026-04-02T14:00:19.564-04:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.5334ms bounds=(0,0)-(2048,2048)
time=2026-04-02T14:00:19.627-04:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=63.4348ms size="[768 768]"
time=2026-04-02T14:00:19.627-04:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-02T14:00:19.627-04:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-02T14:00:19.628-04:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=65.8333ms shape="[2560 256]"
time=2026-04-02T14:00:19.731-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=684 splits=1
time=2026-04-02T14:00:19.984-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1831 splits=16
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1829 splits=16
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="14.9 GiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.3 GiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.2 GiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="628.0 MiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="651.0 MiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
... (64 lines left)

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.20.0-rc1

GiteaMirror added the bug label 2026-04-29 10:30:09 -05:00

@z0n1q commented on GitHub (Apr 2, 2026):

I can confirm that Gemma4 models make very little use of the GPU. It looks like Ollama is offloading some layers to the CPU; it may only need some optimization.

OS:
Ubuntu 24.04
GPU:
3x RTX 6000 Pro Blackwell
CPU:
TR 9955WX
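
One way to confirm where layers and weights actually land (a sketch assuming the default systemd install on Ubuntu) is to watch the server log for the placement messages quoted in this issue:

```shell
# Follow the Ollama service log and filter for layer/memory placement lines
journalctl -u ollama -f | grep -Ei "offload|model weights|inference compute"
```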


@sammyvoncheese commented on GitHub (Apr 2, 2026):

v0.20.0: same issue.


@SingularityMan commented on GitHub (Apr 2, 2026):

Same


@ErikEngerd commented on GitHub (Apr 2, 2026):

Seeing it as well; it ends up using all CPUs on the system.


@craftpip commented on GitHub (Apr 2, 2026):

I see the same thing when using gpt-oss:20b too; the GPU is not used.
I'm trying to run it on an AMD 7900 XTX.
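
On the AMD side, assuming a standard ROCm install, `rocm-smi` can show whether the card is actually doing work during generation:

```shell
# Poll GPU utilization and VRAM use once per second while a prompt runs
watch -n 1 rocm-smi --showuse --showmemuse
```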


@resc863 commented on GitHub (Apr 2, 2026):

Also the same on my RTX 4080 PC:
no GPU usage with E4B, but the 26B MoE works well on the GPU.


@alerque commented on GitHub (Apr 3, 2026):

Cannot reproduce here; the graphics card takes the load.

I just ran the gemma4:2b, 4b, and 26b models, and all of them showed a small spike on both CPU and GPU at the beginning of processing; thereafter the CPU dropped out and just the GPU stayed loaded until the request completed. Ryzen AI 9 HX 370 w/ Radeon 890M.


@PythonLawrence commented on GitHub (Apr 3, 2026):

Sorta the same. The percentages displayed by ollama (below) are accurate, though. Gemma4 (e2b q4) is not using much of the 8.1GB available on the dedicated RTX 4070 Laptop GPU: ~2.4GB used with 16.4K context, ~2.9GB with 32.8K context, and finally ~4GB with 65.5K context. Interestingly, the a4b model had fewer such issues, with 6.6GB on the GPU at a low context length!

```
(base) PS C:\Users\lpano> ollama ps
NAME                    ID              SIZE      PROCESSOR          CONTEXT    UNTIL
gemma4:e2b-it-q4_K_M    7fbdbf8f5e45    8.2 GB    71%/29% CPU/GPU    16384      4 minutes from now

(base) PS C:\Users\lpano> ollama ps
NAME                    ID              SIZE      PROCESSOR          CONTEXT    UNTIL
gemma4:e2b-it-q4_K_M    7fbdbf8f5e45    8.7 GB    66%/34% CPU/GPU    32768      4 minutes from now

(base) PS C:\Users\lpano> ollama ps
NAME                    ID              SIZE      PROCESSOR          CONTEXT    UNTIL
gemma4:e2b-it-q4_K_M    7fbdbf8f5e45    9.8 GB    59%/41% CPU/GPU    65536      4 minutes from now

(base) PS C:\Users\lpano> ollama ps
NAME                        ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gemma4:26b-a4b-it-q4_K_M    5571076f3d70    20 GB    67%/33% CPU/GPU    16384      4 minutes from now
```
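For a machine-readable view of the same CPU/GPU split, the HTTP API exposes the ps data as well; a sketch, assuming the default listen address (the per-model size and size_vram fields show how much of the model actually sits in VRAM):

```
# Query the running-models endpoint; compare "size" vs "size_vram" per model
curl -s http://127.0.0.1:11434/api/ps
```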

@mazphilip commented on GitHub (Apr 3, 2026):

Env: Ubuntu 24.04, NVIDIA driver 580.126.09, CUDA 13.0, dual 3090 + 5090 (54GB VRAM)

1. Initially I had OOM issues; manually setting flash attention fixed this (I can now easily do a 128k context window, and probably >200k).

Fix: flash attention (+ reduced context) in /etc/systemd/system/ollama.service.d/override.conf:

```
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
# Environment="OLLAMA_CONTEXT_LENGTH=128000"
```
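After adding the override, the usual systemd steps apply it (assuming the default service name):

```
# Reload unit files and restart the service so the new environment takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama
```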

Result: 912 tk/s prompt and 45 tk/s eval via ollama run --verbose, ollama ps reports 100% GPU, 61/61 layers offloaded.


2. Remaining issue: still significant CPU compute despite "100% GPU".

Logs show 1.2 GiB of model weights and a 1.2 GiB compute graph remaining on the CPU despite all 61/61 layers reported as offloaded:

```
model weights device=CUDA0 size="10.0 GiB"
model weights device=CUDA1 size="8.4 GiB"
model weights device=CPU   size="1.2 GiB"
compute graph device=CPU   size="1.2 GiB"
```

perf top during inference confirms real work on the CPU, not just sampling:

```
54.46% libggml-cpu-haswell.so ggml_compute_forward_flash_attn_ext
39.58% libggml-cpu-haswell.so ggml_vec_dot_f16
```

Things that did NOT help with this issue:

- OLLAMA_GPU_OVERHEAD=0: no change in allocation; 1.2 GiB of weights remained on CPU
- OLLAMA_KV_CACHE_TYPE=q8_0: collapsed to a single GPU, with different issues

Unsolved: what are the 1.2 GiB of CPU-side weights, and why do flash attention + dot product ops run on the CPU despite full layer offload? If anyone has insight, I'd appreciate it.

Edit: It seems the 1.2 GiB are the vision encoder weights, which are not offloaded to the GPU by Ollama/llama.cpp. Might be related to #11422


@tjwebb commented on GitHub (Apr 3, 2026):

Same problem:

`ollama ps` reports 100% GPU, but the logs show some things getting loaded onto the CPU.

Eyeballing `top` and `nvtop`, it looks like 3/4 of the work is being done by the CPU, and overall performance is much slower than expected. The GPU is only running at ~20% capacity.

```
ollama_think  | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
ollama_think  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ollama_think  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama_think  | ggml_cuda_init: found 1 CUDA devices:
ollama_think  |   Device 0: NVIDIA RTX PRO 6000 Blackwell Server Edition, compute capability 12.0, VMM: yes, ID: GPU-13a1ab75-1e0f-0f52-f1a8-56f99675ff4d
ollama_think  | load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
ollama_think  | time=2026-04-03T02:39:04.225Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
ollama_think  | time=2026-04-03T02:39:04.232Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
ollama_think  | time=2026-04-03T02:39:04.259Z level=INFO source=model.go:138 msg="vision: decode" elapsed=1.847855ms bounds=(0,0)-(2048,2048)
ollama_think  | time=2026-04-03T02:39:04.376Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=117.019105ms size="[768 768]"
ollama_think  | time=2026-04-03T02:39:04.376Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
ollama_think  | time=2026-04-03T02:39:04.376Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
ollama_think  | time=2026-04-03T02:39:04.377Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=119.770553ms shape="[5376 256]"
ollama_think  | time=2026-04-03T02:39:34.481Z level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:96 GPULayers:61[ID:GPU-13a1ab75-1e0f-0f52-f1a8-56f99675ff4d Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ollama_think  | time=2026-04-03T02:39:34.544Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
ollama_think  | time=2026-04-03T02:39:34.579Z level=INFO source=model.go:138 msg="vision: decode" elapsed=4.908968ms bounds=(0,0)-(2048,2048)
ollama_think  | time=2026-04-03T02:39:34.722Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=142.923203ms size="[768 768]"
ollama_think  | time=2026-04-03T02:39:34.725Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
ollama_think  | time=2026-04-03T02:39:34.725Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
ollama_think  | time=2026-04-03T02:39:34.726Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=151.570828ms shape="[5376 256]"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:96 GPULayers:61[ID:GPU-13a1ab75-1e0f-0f52-f1a8-56f99675ff4d Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="18.4 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.2 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="23.5 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="1.0 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="2.3 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:272 msg="total memory" size="46.4 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=sched.go:561 msg="loaded runners" count=1
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=ggml.go:482 msg="offloading 60 repeating layers to GPU"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=ggml.go:494 msg="offloaded 61/61 layers to GPU"
```

CPU: Xeon 6 6747P
GPU: RTX 6000 Pro
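To put a number on that rather than eyeballing nvtop, standard NVIDIA tooling can sample utilization once per second while a prompt runs (nothing here is specific to this setup):

```
# Sample per-second GPU utilization; the "sm" column is compute utilization,
# which should sit near 100% if inference is actually GPU-bound
nvidia-smi dmon -s u
```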


@nickkaltner commented on GitHub (Apr 3, 2026):

AMD RYZEN AI MAX+ 395 w/ Radeon 8060S here.

I see the same behaviour: as a prompt is evaluated, the GPU usage slowly goes down and the CPU usage goes up. I have tried ROCm and Vulkan, and it's the same thing.

It shows 100% GPU with both gemma4:26b and gemma4:31b, but both of them are definitely using the CPU!


@seawindcn commented on GitHub (Apr 3, 2026):

v0.20.0: same issue.


@somera commented on GitHub (Apr 3, 2026):

Same here ... 50% CPU usage

![Image](https://github.com/user-attachments/assets/f9c66b8a-b141-4b9e-ae2d-9f0c781a6f86)
![Image](https://github.com/user-attachments/assets/9c6517d0-816d-4671-97d8-25e43f5e6863)

Ollama v0.20.0 with an RTX PRO 6000 96GB Server Edition, getting 8-11 tokens/s.
Ubuntu 24.04.x, Nvidia Driver 580.126.20


@rabinnh commented on GitHub (Apr 3, 2026):

I have the same issue. I have 2 Nvidia RTX 3090s and I have conky loaded so I can see the memory of each GPU in real time.

The memory ping-pongs between the 2 GPUs until it finally starts executing on the CPU:

```
NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:31b    6316f0629137    63 GB    100% CPU     262144     4 minutes from now
```

All the other models run on the GPUs fine.

Another issue: when I switch to another model and it's running on the GPUs, Ollama never unloads gemma4:31b. My CPU load is maxed out, the temperatures and fans go way up, and I have to run `sudo systemctl restart ollama` to get everything back to normal.

```
NAME                               ID              SIZE     PROCESSOR    CONTEXT    UNTIL
richardyoung/kat-dev-72b:iq4_xs    14bbcc414a53    43 GB    100% GPU     8192       4 minutes from now
gemma4:31b                         6316f0629137    63 GB    100% CPU     262144     4 minutes from now
```
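For the unload problem, `ollama stop` may be a lighter-weight workaround than restarting the whole service (a sketch; the tag matches the stuck model above, and it asks the server to evict just that model):

```
# Unload the stuck model without restarting ollama itself
ollama stop gemma4:31b
```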


@PurpleBanana-ai commented on GitHub (Apr 3, 2026):

My apologies in advance if this is not the appropriate format for posting this info; I generally do not post on these. Running with Open WebUI is worse, especially with any type of tool call like web search, but for the case below this is Ollama straight in the terminal. FYI, I am seeing the same issue with a GGUF model from Unsloth for gemma4, not just the ones downloaded directly from Ollama. I was seeing my CPU package temp touch 60°C+, which is not something I see with my cooling except under intense benchmarks; never for inference or even diffusion.

Setup
Debian 13, Cuda 13.2, Driver 595.58.03
i9-14900k 790 chipset
94GB DDR5 6400
m.2 NVME (CPU Side PCIE Bus)
GPU 0: RTX 5090 32GB (CPU Side PCIE Bus - PCIE5 Slot x8)
GPU 1: RTX 3090 24GB (CPU Side PCIE Bus - PCIE 5 Slot x8)-yes the GPU is at PCIE4
GPU 2: RTX 5070ti 16GB (Chipset Side PCIE Bus - PCIE4 16x slot at x4)
GPU 3: RTX 5070ti 16GB (Chipset Side PCIE Bus - m.2 PCIE4 x4 to Occulink EGPU)
(No need to dog the frankenrig, she is fine; this is the only model I am having issues with. I will try it on llama.cpp and vLLM later as well.)

Same issues as above, just a different config, and only with gemma4: any model version, any quant, any ctx size. I can see the model weights offloaded to the CPU in the logs below. I have tried creating Modelfiles that statically set GPU layers to 999, but no difference. gemma4 also runs very slowly; even if I pin the 31B q4_K_M quant to my 5090 with 8192 ctx, it is no different than running across multiple GPUs: ~13-15 tps for gemma4. Load time is as expected across multiple cards with this config; that is not an issue.

This example and the logs are with the following Modelfile (FA on in env, no Docker):

```
FROM gemma4:31b-it-q8_0
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 65
PARAMETER num_ctx 65536
```
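For reference, a Modelfile like this is built in the usual way (the file name here is assumed), and the "static offload" attempt mentioned above would be one extra PARAMETER line:

```
# Build the custom model from the Modelfile above (file name assumed)
ollama create gemma4-31b-q8_0-custom -f Modelfile

# The static GPU-layer pin mentioned above, as a Modelfile line:
# PARAMETER num_gpu 999
```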

qwen3.5 32B A3B Q8 is fine even at 245,760 ctx.
qwen3-next:80b-a3b-thinking-q4_K_M is fine at 204,800 ctx.

FYI, a comparison on the same basic prompt, with think mode enabled for gemma4 (turning it off doesn't change the issue):

**gemma4**
ollama run gemma4-31b-q8_0-custom:latest --verbose

>>> <|think|> hi gemma, its nice to meet you! would you mind sharing with me a detailed explanation of new capabilities?

Performance:
total duration: 1m56.786501477s
load duration: 131.067798ms
prompt eval count: 39 token(s)
prompt eval duration: 145.114255ms
prompt eval rate: 268.75 tokens/s
eval count: 1612 token(s)
eval duration: 1m55.901022209s
eval rate: 13.91 tokens/s

**qwen3-next:80b-a3b-thinking-q4_K_M**
Same Prompt (minus the think tag token) for qwen3-next:80b-a3b-thinking-q4_K_M at 204,800 ctx:

total duration: 55.166186993s
load duration: 81.913138ms
prompt eval count: 33 token(s)
prompt eval duration: 125.400091ms
prompt eval rate: 263.16 tokens/s
eval count: 4901 token(s)
eval duration: 54.000393967s
eval rate: 90.76 tokens/s

Ollama Logs for gemma4:

```
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:484 msg="system memory" total="94.1 GiB" free="90.3 GiB" free_swap=>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-864499ec-e762-8642-9601-9c125fe6fd64 li>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-43e01948-3b0f-d96f-7efe-1dd1a630e992 li>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-ca86a6a9-1744-35bf-5a7f-7d399406c935 li>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-7c9fd88a-32f8-c5af-65ec-2ee436bd7987 li>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=server.go:759 msg="loading model" "model layers"=61 requested=-1
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.297-04:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.297-04:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:44647"
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.301-04:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 Batch>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q8_0 name="" description="">
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
Apr 03 10:46:31 purplebanana-ai ollama[168950]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-alderlake.so
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.340-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: [GIN] 2026/04/03 - 10:46:31 | 200 |      18.469µs |       127.0.0.1 | HEAD     "/"
Apr 03 10:46:31 purplebanana-ai ollama[168950]: [GIN] 2026/04/03 - 10:46:31 | 200 |       7.005µs |       127.0.0.1 | GET      "/api/ps"
Apr 03 10:46:31 purplebanana-ai ollama[168950]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Apr 03 10:46:31 purplebanana-ai ollama[168950]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 03 10:46:31 purplebanana-ai ollama[168950]: ggml_cuda_init: found 4 CUDA devices:
Apr 03 10:46:31 purplebanana-ai ollama[168950]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-864499ec-e762-8642-9601-9c125fe6fd64
Apr 03 10:46:31 purplebanana-ai ollama[168950]:   Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, ID: GPU-43e01948-3b0f-d96f-7efe-1dd1a630e992
Apr 03 10:46:31 purplebanana-ai ollama[168950]:   Device 2: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-ca86a6a9-1744-35bf-5a7f-7d399406c935
Apr 03 10:46:31 purplebanana-ai ollama[168950]:   Device 3: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-7c9fd88a-32f8-c5af-65ec-2ee436bd7987
Apr 03 10:46:31 purplebanana-ai ollama[168950]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.009-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.embedding_length>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.028-04:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=783.895µs bounds=(0,0)-(2048,2048)
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.094-04:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=66.114575ms size="[768 768]"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.094-04:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.094-04:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchS>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.095-04:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=67.401546ms shape="[5376 256]"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.126-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1137 splits=1
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.224-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2514 splits=23
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2512 splits=23
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="14.0 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="17.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.5 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="3.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="5.1 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="281.5 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="489.8 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="640.0 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:272 msg="total memory" size="42.8 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:784 msg=memory success=true required.InputWeights=1591738368 requ>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-43e01948-3b0f-d96f-7efe-1dd1a630e9>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-864499ec-e762-8642-9601-9c125fe6fd>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-7c9fd88a-32f8-c5af-65ec-2ee436bd79>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-ca86a6a9-1744-35bf-5a7f-7d399406c9>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:795 msg="new layout created" layers="61[ID:GPU-43e01948-3b0f-d96f>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 Bat>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.297-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id d>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=2560>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count defa>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.block_count defa>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.embedding_length>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.320-04:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.045157ms bounds=(0,0)-(2048,2048)
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.388-04:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=67.402195ms size="[768 768]"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.389-04:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.389-04:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchS>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.389-04:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=69.978037ms shape="[5376 256]"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.391-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1137 splits=1
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.562-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2514 splits=23
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2512 splits=23
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="14.0 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="17.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.5 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="3.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="5.1 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="281.5 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="489.8 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="640.0 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:272 msg="total memory" size="42.8 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:784 msg=memory success=true required.InputWeights=1591738368 requ>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-43e01948-3b0f-d96f-7efe-1dd1a630e9>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-864499ec-e762-8642-9601-9c125fe6fd>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-7c9fd88a-32f8-c5af-65ec-2ee436bd79>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-ca86a6a9-1744-35bf-5a7f-7d399406c9>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:795 msg="new layout created" layers="61[ID:GPU-43e01948-3b0f-d96f>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 Ba>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=ggml.go:482 msg="offloading 60 repeating layers to GPU"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=ggml.go:494 msg="offloaded 61/61 layers to GPU"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="14.0 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="17.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="1.5 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="3.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="5.1 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="281.5 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="489.8 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="640.0 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:272 msg="total memory" size="42.8 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm serv>
```

Not sure if it helps; happy to provide more info.


@zestysoft commented on GitHub (Apr 3, 2026):

FWIW, I'm seeing the same behavior on a Mac in ollama 0.20:

gemma4:31b
M3 Max processor with 128 GB of RAM.

ollama ps shows the model loaded at 100% GPU, but mactop shows 600+% CPU utilization with very little GPU.


@Wladastic commented on GitHub (Apr 4, 2026):

Hm, weirdly, with version 0.20.2 I ran it inside ollama with a 32k context: no real CPU usage, only one thread on my CPU being used, and the answer came quickly.
Then I tested the same 32k context via openclaw, and all 32 CPU cores are running now o.O


@homjay commented on GitHub (Apr 4, 2026):

Hm, weirdly, with version 0.20.2 I ran it inside ollama with a 32k context: no real CPU usage, only one thread on my CPU being used, and the answer came quickly. Then I tested the same 32k context via openclaw, and all 32 CPU cores are running now o.O

Based on my observations, the glitch is triggered specifically when sending a second prompt. This behavior is highly unusual.


@sergiosaurio commented on GitHub (Apr 4, 2026):

In my case, using curl or the Python library produces the same results:
roughly 45% CPU usage and 5% GPU per prompt.

Using the CLI or Ollama app works fine.

NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:26b 5571076f3d70 21 GB 100% GPU 16000 Forever
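
For anyone trying to reproduce the API-side behavior, here is a minimal sketch of the kind of request involved (not from the original comment; /api/generate is Ollama's standard generate endpoint, and the model tag matches the one above):

```bash
# Sketch: send the same prompt over the HTTP API and watch CPU/GPU usage
# while it runs; per the reports above, the CLI path behaves differently.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "hi",
  "stream": false
}'
```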


@alerque commented on GitHub (Apr 4, 2026):

In my case, using curl or the Python library produces the same results: roughly 45% CPU usage and 5% GPU per prompt.

My use case involves a separate Rust app that calls the API over the TCP port (via the rig crate). That works fine: the model runs on the GPU when called via API calls from the socket as well as via the ollama CLI. I don't know what your Python calls would be doing differently from that.


@sammyvoncheese commented on GitHub (Apr 4, 2026):

0.20.2, CPU vs GPU. Calling a tool:

gemma4:e4b-it-bf16 d0d10a1b1ddb 21 GB 100% GPU 130000 57 minutes from now

Image

Same model, only generating text:

Image

@somera commented on GitHub (Apr 4, 2026):

Same here ... 50% CPU usage

Image Image
Ollama v0.20.0 with RTX PRO 6000 96GB Server Edition with 8-11 Tokens/s. Ubuntu 24.04.x, Nvidia Driver 580.126.20

Not usable at the moment. v0.20.2

AMD EPYC 9355 32-Core Processor + RTX PRO 6000 96 GB

Mix of CPU and GPU usage:

Image Image Image

And very low tokens/s.

$ ollama --verbose run gemma4:31b-it-q8_0
>>> hi
Thinking...
The user said "hi".
This is a standard greeting.

    *   Acknowledge the greeting.
    *   Offer assistance.
    *   Maintain a helpful and friendly tone.
"Hello! How can I help you today?" or "Hi there! What can I do for you?"
...done thinking.

Hello! How can I help you today?

total duration:       5.833939473s
load duration:        186.580984ms
prompt eval count:    16 token(s)
prompt eval duration: 2.0194886s
prompt eval rate:     7.92 tokens/s
eval count:           78 token(s)
eval duration:        3.590263048s
eval rate:            21.73 tokens/s

Restarting ollama, and then:

$ ollama --verbose run gemma4:26b-a4b-it-q8_0
>>> hi
Thinking...
The user said "hi".
This is a simple greeting.

    *   Acknowledge the greeting.
    *   Offer assistance.
    *   Maintain a polite and friendly tone.
"Hello! How can I help you today?" or "Hi there! What's on your mind?"
...done thinking.

Hello! How can I help you today?

total duration:       1.925485214s
load duration:        207.200064ms
prompt eval count:    16 token(s)
prompt eval duration: 77.994687ms
prompt eval rate:     205.14 tokens/s
eval count:           78 token(s)
eval duration:        1.504752337s
eval rate:            51.84 tokens/s

and now a longer prompt:

>>> Show me a bash snippet
Thinking...
...
total duration:       1m1.961390093s
load duration:        204.835352ms
prompt eval count:    185 token(s)
prompt eval duration: 59.550431ms
prompt eval rate:     3106.61 tokens/s
eval count:           1325 token(s)
eval duration:        1m1.101323633s
eval rate:            21.69 tokens/s

and an even longer prompt:

total duration:       10m8.197019746s
load duration:        172.859408ms
prompt eval count:    584 token(s)
prompt eval duration: 171.277392ms
prompt eval rate:     3409.67 tokens/s
eval count:           8695 token(s)
eval duration:        10m4.288086024s
eval rate:            14.39 tokens/s

For the last prompt:

Image Image Image

@chenav commented on GitHub (Apr 4, 2026):

+1 on WSL2 (latest version) with Docker and a 5090


@SingularityMan commented on GitHub (Apr 4, 2026):

Ubuntu 22.04 is showing the same issues.


@alerque commented on GitHub (Apr 4, 2026):

@somera and others, try the 26b or smaller models instead of 31b. 31b seems to need WILDLY more video RAM than it probably ought to. I have 96G of video memory available too (shared), and it takes north of 60G to run the 26b model. The 31b model starts loading, then runs out of RAM and starts chugging through something on one CPU. I suspect a lot of the issues in this thread have memory issues as their root cause, not CPU/GPU routing.


@SingularityMan commented on GitHub (Apr 4, 2026):

@somera and others, try the 26b or smaller models instead of 31b. 31b seems to need WILDLY more video RAM than it probably ought to. I have 96G of video memory available too (shared), and it takes north of 60G to run the 26b model. The 31b model starts loading, then runs out of RAM and starts chugging through something on one CPU. I suspect a lot of the issues in this thread have memory issues as their root cause, not CPU/GPU routing.

I'm already using the 26b model, and I have 48 GB of VRAM available. It doesn't matter which context length it is set to.


@alerque commented on GitHub (Apr 4, 2026):

I'm already using the 26b model, and I have 48 GB of VRAM available.

As I just mentioned, the 26b model eats through about 60 GB of VRAM when I run it. Try one of the even smaller models.


@Wladastic commented on GitHub (Apr 4, 2026):

I cannot confirm it to be a RAM issue.
The 31b model with a 22k-token prompt just ran through in about 1-2 seconds.
Once a tool call is mentioned, it reverts to CPU.
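
As a hedged sketch of what "mentioning a tool call" looks like on the wire (the get_weather schema here is hypothetical; /api/chat with a tools array is Ollama's standard chat endpoint):

```bash
# Sketch: same model and prompt, but with a tool schema attached. Per the
# report above, requests of this shape are what trigger the CPU fallback.
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:31b",
  "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'
```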


@somera commented on GitHub (Apr 4, 2026):

@somera and others, try the 26b or smaller models instead of 31b. 31b seems to need WILDLY more video RAM than it probably ought to. I have 96G of video memory available too (shared), and it takes north of 60G to run the 26b model. The 31b model starts loading, then runs out of RAM and starts chugging through something on one CPU. I suspect a lot of the issues in this thread have memory issues as their root cause, not CPU/GPU routing.

I don't see a VRAM issue on my system; I see mixed CPU and GPU usage.


@mazphilip commented on GitHub (Apr 4, 2026):

I did more digging; this seems to be a Flash Attention issue with Gemma4 (upstream, either Flash Attention or llama.cpp), somehow only triggered when trying to run coding agents? (ollama launch claude, ollama launch vscode)

  • It does seem true that the vision layers are always CPU-offloaded, but the big issues seem FA-related.

You can force FA usage, which makes ollama allocate the memory on the GPU(s), but once you run it (with a longer context?), something happens and it moves all the computation to the CPU.

I get very good performance (1500 tk/s prompt, 50 tk/s eval) when running with the following (set via /etc/systemd/system/ollama.service.d/override.conf; see the sketch after this list):

  • No FA
  • A smaller-than-typical context: 96k (for 54 GB VRAM), to account for less efficient memory usage due to no FA
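
For readers who want to replicate this, a minimal sketch of such a drop-in (not part of the original comment; it assumes the stock ollama.service unit, and 98304 is 96k expressed in tokens):

```bash
# Sketch: disable flash attention for a systemd-managed ollama server
# and cap the context; adjust the values to your VRAM budget.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_FLASH_ATTENTION=0"
Environment="OLLAMA_CONTEXT_LENGTH=98304"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```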

Resolution steps:

  1. Fix FA usage with llama.cpp (Claude tells me it might be the global attention layers with key_length=512 not being properly mapped in llama.cpp; looking into this right now).
    1. EDIT: it seems this is already fixed in llama.cpp: https://github.com/ggml-org/llama.cpp/releases/tag/b8609
  2. Move the vision layers to the GPU (but I can also see how this is specific to personal use cases).

@slamj1 commented on GitHub (Apr 4, 2026):

I can confirm @mazphilip's findings with respect to FA. Turning off FA in the service seems to solve the CPU-offload issue. Note that the CPU offload only seems to occur when calling via the API; the Ollama CLI works fine.

My example usage is gemma4:31b with a 128K context, which takes about 71 GB of VRAM. With FA disabled this config works well.


@somera commented on GitHub (Apr 4, 2026):

I did more digging; this seems to be a Flash Attention issue with Gemma4 (upstream, either Flash Attention or llama.cpp), somehow only triggered when trying to run coding agents? (ollama launch claude, ollama launch vscode)

I'm not using coding agents with ollama, and I see the issue with ollama run <model>, from Open WebUI, and with a small context (4096).


@sammyvoncheese commented on GitHub (Apr 4, 2026):

I can confirm @mazphilip's findings with respect to FA. Turning off FA in the service seems to solve the CPU-offload issue. Note that the CPU offload only seems to occur when calling via the API; the Ollama CLI works fine.

My example usage is gemma4:31b with a 128K context, which takes about 71 GB of VRAM. With FA disabled this config works well.

I can confirm that disabling FA keeps the model layers on the GPU now.


@SingularityMan commented on GitHub (Apr 4, 2026):

Can confirm, disabling FA on Ollama seems to correctly offload everything to GPU now.


@viba1 commented on GitHub (Apr 4, 2026):

On my side, disabling FA works correctly for models running 100% GPU, but the issue remains for models that split their workload between the CPU and GPU.

For example:
Gemma4:26b: 21% CPU / 79% GPU; ~1.2 tokens/s
Gemma3:27b: 19% CPU / 81% GPU; ~3 tokens/s


@mazphilip commented on GitHub (Apr 5, 2026):

I managed to make this work by migrating this llama.cpp PR over: https://github.com/ggml-org/llama.cpp/pull/20998
Opening a PR.


@Cephei-OpenSource commented on GitHub (Apr 5, 2026):

I can also confirm: setting OLLAMA_FLASH_ATTENTION=false (or 0 as some suggest; both seem to work) immediately and sharply boosts the performance of Gemma 4 (installed: gemma4:31b). Before: 20 t/s; after: 60 t/s.
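
For a quick test without editing any service or system config, the variable can also be set for a single foreground run (a sketch; it assumes no other ollama server is already listening on the default port):

```bash
# Sketch: start the server once with flash attention disabled.
OLLAMA_FLASH_ATTENTION=0 ollama serve
```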


@Hello-World-Traveler commented on GitHub (Apr 6, 2026):

I can also confirm: setting OLLAMA_FLASH_ATTENTION=false (or 0 as some suggest; both seem to work) immediately and sharply boosts the performance of Gemma 4 (installed: gemma4:31b). Before: 20 t/s; after: 60 t/s.

Setting OLLAMA_FLASH_ATTENTION to false makes little difference for me:
gemma4:e4b 10 GB 66%/34% CPU/GPU 4096 4 minutes from now

With OLLAMA_FLASH_ATTENTION set to 0:
gemma4:e4b 10 GB 66%/34% CPU/GPU 4096 4 minutes from now

For comparison:
gemma3:4b 5.4 GB 100% GPU 4096 4 minutes from now


@tjwebb commented on GitHub (Apr 6, 2026):

Yep, disabling FA worked for me.


@m0n5t3r commented on GitHub (Apr 6, 2026):

Another data point: disabling FA works if you have enough VRAM (in my case a Ryzen AI Max 395 with 64 GB allocated to the GPU). Before, I was seeing between 25% and 75% GPU usage with gemma4:26b and 21 GB of VRAM used; now I see close to 100% GPU use and 38 GB of VRAM used (and it is much faster).

ollama ps said 100% GPU in both cases.


@Hello-World-Traveler commented on GitHub (Apr 6, 2026):

Turning off thinking does make it faster, at about 19 t/s:
66%/34% CPU/GPU

It doesn't make much difference for me:

OLLAMA_FLASH_ATTENTION	1
OLLAMA_MAX_LOADED_MODELS	1
OLLAMA_NUM_PARALLEL	1

I am using Docker with gemma4:e4b and OLLAMA_NEW_ENGINE=true.
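
For the Docker case, a sketch of passing the same variables at container start (the volume and port mapping follow the standard ollama/ollama image instructions; the env values are the ones discussed in this thread):

```bash
# Sketch: standard ollama/ollama container with flash attention disabled
# and the new engine enabled, matching the setup described above.
docker run -d --gpus=all \
  -e OLLAMA_FLASH_ATTENTION=0 \
  -e OLLAMA_NEW_ENGINE=true \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```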


@roxlukas commented on GitHub (Apr 7, 2026):

Confirmed: with OLLAMA_FLASH_ATTENTION=1 on Gemma4:26B there is heavy CPU usage (50-60%) and eval token speed hovers around 30 tokens/s, even for the e4b variant!
With OLLAMA_FLASH_ATTENTION=0, token speed on Gemma4:26B jumps to 108 tokens/s on an RTX 3090.

In both cases Ollama reports full GPU inference:
ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:26b 5571076f3d70 21 GB 100% GPU 32768 4 minutes from now

env:
Ollama 0.20.3
Windows 11
i5-11400F
64GB DDR4
RTX 3090
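
Since ollama ps reports 100% GPU either way, a useful cross-check (a sketch, assuming an NVIDIA card such as the RTX 3090 above) is to watch the driver's own utilization counters while a prompt is generating:

```bash
# Sketch: poll GPU utilization once per second during generation; if the
# FA bug is active, utilization.gpu stays low despite "100% GPU" in ps.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```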

Reference: github-starred/ollama#56258