[GH-ISSUE #15237] New Gemma 4 models run on CPU but report running on GPU (FA enabled) #56258

Closed
opened 2026-04-29 10:30:09 -05:00 by GiteaMirror · 42 comments
Owner

Originally created by @sammyvoncheese on GitHub (Apr 2, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15237

What is the issue?

Update 4/4: The issue is related to the flash attention (FA) feature being turned on.

Models seem to load onto the GPU, then inference jumps to the CPU. `ollama ps` shows the model running on the GPU.

I tried the 2b and 4b bf16 models and the 26/31b q4 models on a 5090 with context set to 130k.

Example output from ps.
NAME                  ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:e2b-it-bf16    850bc7fea32f    12 GB    100% GPU     130000     57 minutes from now
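
To compare behavior with FA on and off, a quick retest sketch (PowerShell, assuming the server is started manually rather than via the tray app; `gemma4:e2b-it-bf16` is the tag from above):

```powershell
# Stop any running Ollama instance first, then start a server with FA disabled
$env:OLLAMA_FLASH_ATTENTION = "0"
ollama serve

# In a second terminal: load the model, then check the reported placement
ollama run gemma4:e2b-it-bf16 "hello"
ollama ps
```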

From the log:
time=2026-04-02T14:00:19.543-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)

Relevant log output

time=2026-04-02T13:59:15.592-04:00 level=INFO source=routes.go:1744 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:DEBUG OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:d:\\dev\\models\\llm OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES:]"
time=2026-04-02T13:59:15.593-04:00 level=INFO source=routes.go:1746 msg="Ollama cloud disabled: false"
time=2026-04-02T13:59:15.618-04:00 level=INFO source=images.go:499 msg="total blobs: 650"
time=2026-04-02T13:59:15.630-04:00 level=INFO source=images.go:506 msg="total unused blobs removed: 0"
time=2026-04-02T13:59:15.635-04:00 level=INFO source=routes.go:1802 msg="Listening on 127.0.0.1:11434 (version 0.20.0-rc0)"
time=2026-04-02T13:59:15.635-04:00 level=DEBUG source=sched.go:145 msg="starting llm scheduler"
time=2026-04-02T13:59:15.635-04:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-04-02T13:59:15.646-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 52435"
time=2026-04-02T13:59:15.646-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v12;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v12
time=2026-04-02T13:59:15.920-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=281.4062ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v12]" extra_envs=map[]
time=2026-04-02T13:59:15.921-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 52440"
time=2026-04-02T13:59:15.921-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
time=2026-04-02T13:59:16.189-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=268.9693ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13]" extra_envs=map[]
time=2026-04-02T13:59:16.189-04:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
time=2026-04-02T13:59:16.189-04:00 level=DEBUG source=runner.go:124 msg="evaluating which, if any, devices to filter out" initial_count=2
time=2026-04-02T13:59:16.189-04:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=D:\dev\ollama-server\ollama\lib\ollama\cuda_v12 description="NVIDIA GeForce RTX 5090" compute=12.0 id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 pci_id=0000:01:00.0
time=2026-04-02T13:59:16.189-04:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=D:\dev\ollama-server\ollama\lib\ollama\cuda_v13 description="NVIDIA GeForce RTX 5090" compute=12.0 id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 pci_id=0000:01:00.0
time=2026-04-02T13:59:16.190-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 52445"
time=2026-04-02T13:59:16.190-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 52446"
time=2026-04-02T13:59:16.190-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v12;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v12 CUDA_VISIBLE_DEVICES=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 GGML_CUDA_INIT=1
time=2026-04-02T13:59:16.190-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v13 CUDA_VISIBLE_DEVICES=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 GGML_CUDA_INIT=1
time=2026-04-02T13:59:16.354-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=164.5379ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13]" extra_envs="map[CUDA_VISIBLE_DEVICES:GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 GGML_CUDA_INIT:1]"
time=2026-04-02T13:59:16.465-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=275.8555ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v12]" extra_envs="map[CUDA_VISIBLE_DEVICES:GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 GGML_CUDA_INIT:1]"
time=2026-04-02T13:59:16.465-04:00 level=DEBUG source=runner.go:401 msg="filtering device with overlapping libraries" id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 library=D:\dev\ollama-server\ollama\lib\ollama\cuda_v12 delete_index=0 kept_library=D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
time=2026-04-02T13:59:16.465-04:00 level=DEBUG source=runner.go:40 msg="GPU bootstrap discovery took" duration=829.9088ms
time=2026-04-02T13:59:16.465-04:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 filter_id="" library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5090" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:01:00.0 type=discrete total="31.8 GiB" available="30.3 GiB"
time=2026-04-02T13:59:16.465-04:00 level=INFO source=routes.go:1852 msg="vram-based default context" total_vram="31.8 GiB" default_num_ctx=32768
[GIN] 2026/04/02 - 14:00:08 | 200 |     26.9792ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     28.1036ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     26.0192ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |      23.369ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     24.2709ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     24.3459ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     22.4509ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:09 | 200 |     22.9703ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:18 | 200 |     25.8204ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/04/02 - 14:00:18 | 200 |     23.8954ms |       127.0.0.1 | GET      "/api/tags"
time=2026-04-02T14:00:19.013-04:00 level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2026-04-02T14:00:19.013-04:00 level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
time=2026-04-02T14:00:19.016-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --port 58380"
time=2026-04-02T14:00:19.017-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
time=2026-04-02T14:00:19.298-04:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=284.433ms OLLAMA_LIBRARY_PATH="[D:\\dev\\ollama-server\\ollama\\lib\\ollama D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13]" extra_envs=map[]
time=2026-04-02T14:00:19.298-04:00 level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=284.433ms
time=2026-04-02T14:00:19.298-04:00 level=INFO source=cpu_windows.go:148 msg=packages count=1
time=2026-04-02T14:00:19.298-04:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=16 efficiency=0 threads=32
time=2026-04-02T14:00:19.298-04:00 level=DEBUG source=sched.go:229 msg="loading first model" model=d:\dev\models\llm\blobs\sha256-cbdeb708e2000122364bf1a63b8aa009504201863def6fb69da784681866a6c6
time=2026-04-02T14:00:19.361-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-02T14:00:19.399-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2026-04-02T14:00:19.402-04:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_count_kv default=0
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.head_count_kv default="&{size:0 values:[]}"
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count default=0
time=2026-04-02T14:00:19.402-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.num_mel_bins default=128
time=2026-04-02T14:00:19.402-04:00 level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-04-02T14:00:19.403-04:00 level=INFO source=server.go:432 msg="starting runner" cmd="D:\\dev\\ollama-server\\ollama\\ollama.exe runner --ollama-engine --model d:\\dev\\models\\llm\\blobs\\sha256-cbdeb708e2000122364bf1a63b8aa009504201863def6fb69da784681866a6c6 --port 58385"
time=2026-04-02T14:00:19.403-04:00 level=DEBUG source=server.go:433 msg=subprocess CUDA_PATH="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" CUDA_PATH_V13_0="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0" CUDA_PATH_V13_1="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1" OLLAMA_API_BASE=http://127.0.0.1:11434 OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_MODELS=d:\dev\models\llm OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 OLLAMA_NUM_PARALLEL=1 PATH="D:\\dev\\ollama-server\\ollama\\lib\\ollama;D:\\dev\\ollama-server\\ollama\\lib\\ollama\\cuda_v13;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin\\x64;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v13.0\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA App\\NvDLISR;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\GitHub CLI\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.4.1\\;D:\\dev\\ollama-server\\ollama;D:\\dev\\Python\\Python314\\Scripts\\;D:\\dev\\Python\\Python314\\;C:\\Users\\willi\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\willi\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\willi\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\willi\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\Users\\willi\\.dotnet\\tools;" OLLAMA_LIBRARY_PATH=D:\dev\ollama-server\ollama\lib\ollama;D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
time=2026-04-02T14:00:19.405-04:00 level=INFO source=sched.go:484 msg="system memory" total="93.6 GiB" free="76.2 GiB" free_swap="76.3 GiB"
time=2026-04-02T14:00:19.405-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 library=CUDA available="29.8 GiB" free="30.3 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-04-02T14:00:19.406-04:00 level=INFO source=server.go:759 msg="loading model" "model layers"=43 requested=-1
time=2026-04-02T14:00:19.435-04:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-02T14:00:19.436-04:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:58385"
time=2026-04-02T14:00:19.438-04:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:130000 KvCacheType: NumThreads:16 GPULayers:43[ID:GPU-68a69638-eb9a-ef06-c025-5d8b66415f00 Layers:43(0..42)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-02T14:00:19.471-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
time=2026-04-02T14:00:19.473-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
time=2026-04-02T14:00:19.473-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
time=2026-04-02T14:00:19.473-04:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=F16 name="" description="" num_tensors=2131 num_key_values=55
time=2026-04-02T14:00:19.473-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=D:\dev\ollama-server\ollama\lib\ollama
load_backend: loaded CPU backend from D:\dev\ollama-server\ollama\lib\ollama\ggml-cpu-icelake.dll
time=2026-04-02T14:00:19.485-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=D:\dev\ollama-server\ollama\lib\ollama\cuda_v13
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, ID: GPU-68a69638-eb9a-ef06-c025-5d8b66415f00
load_backend: loaded CUDA backend from D:\dev\ollama-server\ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-04-02T14:00:19.543-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-04-02T14:00:19.552-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0
time=2026-04-02T14:00:19.552-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2026-04-02T14:00:19.553-04:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_count_kv default=0
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.head_count_kv default="&{size:0 values:[]}"
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count default=0
time=2026-04-02T14:00:19.553-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.num_mel_bins default=128
time=2026-04-02T14:00:19.564-04:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.5334ms bounds=(0,0)-(2048,2048)
time=2026-04-02T14:00:19.627-04:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=63.4348ms size="[768 768]"
time=2026-04-02T14:00:19.627-04:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-02T14:00:19.627-04:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-02T14:00:19.628-04:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=65.8333ms shape="[2560 256]"
time=2026-04-02T14:00:19.731-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=684 splits=1
time=2026-04-02T14:00:19.984-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1831 splits=16
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1829 splits=16
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="14.9 GiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.3 GiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.2 GiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="628.0 MiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="651.0 MiB"
time=2026-04-02T14:00:19.996-04:00 level=DEBUG source=device.go:272 msg="total memory" size="19.6 GiB"
... (64 lines left)

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.20.0-rc1

GiteaMirror added the bug label 2026-04-29 10:30:09 -05:00

@z0n1q commented on GitHub (Apr 2, 2026):

I can confirm that Gemma4 models make very little use of the GPU. It looks like Ollama is offloading some layers to the CPU; it may only need some optimization.

OS:
Ubuntu 24.04
GPU:
3x RTX 6000 Pro Blackwell
CPU:
TR 9955WX
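
One way to confirm where layers and weights actually land (a sketch assuming the default systemd install on Ubuntu) is to watch the server log for the placement messages quoted in this issue:

```shell
# Follow the Ollama service log and filter for layer/memory placement lines
journalctl -u ollama -f | grep -Ei "offload|model weights|inference compute"
```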


@sammyvoncheese commented on GitHub (Apr 2, 2026):

v0.20.0: same issue.


@SingularityMan commented on GitHub (Apr 2, 2026):

Same


@ErikEngerd commented on GitHub (Apr 2, 2026):

Seeing it as well; it ends up using all CPUs on the system.


@craftpip commented on GitHub (Apr 2, 2026):

I see the same thing when using gpt-oss:20b too; the GPU is not used.
I'm trying to run it on an AMD 7900 XTX.
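
On the AMD side, assuming a standard ROCm install, `rocm-smi` can show whether the card is actually doing work during generation:

```shell
# Poll GPU utilization and VRAM use once per second while a prompt runs
watch -n 1 rocm-smi --showuse --showmemuse
```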


@resc863 commented on GitHub (Apr 2, 2026):

Also the same on my RTX 4080 PC:
no GPU usage with E4B, but the 26B MoE works well on the GPU.


@alerque commented on GitHub (Apr 3, 2026):

Cannot reproduce here; the graphics card takes the load.

I just ran the gemma4:2b, 4b, and 26b models, and all of them showed a small spike on both CPU and GPU at the beginning of processing; thereafter the CPU dropped out and just the GPU stayed loaded until the request completed. Ryzen AI 9 HX 370 w/ Radeon 890M.


@PythonLawrence commented on GitHub (Apr 3, 2026):

Sorta the same. The percentages displayed by ollama (below) are accurate, though. Gemma4 (e2b q4) is not using much of the 8.1GB available on the dedicated RTX 4070 Laptop GPU: ~2.4GB used with 16.4K context, ~2.9GB with 32.8K context, and finally ~4GB with 65.5K context. Interestingly, the a4b model had fewer such issues, with 6.6GB on the GPU at a low context length!

```
(base) PS C:\Users\lpano> ollama ps
NAME                    ID              SIZE      PROCESSOR          CONTEXT    UNTIL
gemma4:e2b-it-q4_K_M    7fbdbf8f5e45    8.2 GB    71%/29% CPU/GPU    16384      4 minutes from now

(base) PS C:\Users\lpano> ollama ps
NAME                    ID              SIZE      PROCESSOR          CONTEXT    UNTIL
gemma4:e2b-it-q4_K_M    7fbdbf8f5e45    8.7 GB    66%/34% CPU/GPU    32768      4 minutes from now

(base) PS C:\Users\lpano> ollama ps
NAME                    ID              SIZE      PROCESSOR          CONTEXT    UNTIL
gemma4:e2b-it-q4_K_M    7fbdbf8f5e45    9.8 GB    59%/41% CPU/GPU    65536      4 minutes from now

(base) PS C:\Users\lpano> ollama ps
NAME                        ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gemma4:26b-a4b-it-q4_K_M    5571076f3d70    20 GB    67%/33% CPU/GPU    16384      4 minutes from now
```
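For a machine-readable view of the same CPU/GPU split, the HTTP API exposes the ps data as well; a sketch, assuming the default listen address (the per-model size and size_vram fields show how much of the model actually sits in VRAM):

```
# Query the running-models endpoint; compare "size" vs "size_vram" per model
curl -s http://127.0.0.1:11434/api/ps
```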

@mazphilip commented on GitHub (Apr 3, 2026):

Env: Ubuntu 24.04, NVIDIA driver 580.126.09, CUDA 13.0, dual 3090 + 5090 (54GB VRAM)

1. Initially I had OOM issues; manually setting flash attention fixed this (I can now easily do a 128k context window, and probably >200k).

Fix: flash attention (+ reduced context) in /etc/systemd/system/ollama.service.d/override.conf:

```
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
# Environment="OLLAMA_CONTEXT_LENGTH=128000"
```
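After adding the override, the usual systemd steps apply it (assuming the default service name):

```
# Reload unit files and restart the service so the new environment takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama
```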

Result: 912 tk/s prompt and 45 tk/s eval via ollama run --verbose, ollama ps reports 100% GPU, 61/61 layers offloaded.


2. Remaining issue: still significant CPU compute despite "100% GPU".

Logs show 1.2 GiB of model weights and a 1.2 GiB compute graph remaining on the CPU despite all 61/61 layers reported as offloaded:

```
model weights device=CUDA0 size="10.0 GiB"
model weights device=CUDA1 size="8.4 GiB"
model weights device=CPU   size="1.2 GiB"
compute graph device=CPU   size="1.2 GiB"
```

perf top during inference confirms real work on the CPU, not just sampling:

```
54.46% libggml-cpu-haswell.so ggml_compute_forward_flash_attn_ext
39.58% libggml-cpu-haswell.so ggml_vec_dot_f16
```

Things that did NOT help with this issue:

- OLLAMA_GPU_OVERHEAD=0: no change in allocation; 1.2 GiB of weights remained on CPU
- OLLAMA_KV_CACHE_TYPE=q8_0: collapsed to a single GPU, with different issues

Unsolved: what are the 1.2 GiB of CPU-side weights, and why do flash attention + dot product ops run on the CPU despite full layer offload? If anyone has insight, I'd appreciate it.

Edit: It seems the 1.2 GiB are the vision encoder weights, which are not offloaded to the GPU by Ollama/llama.cpp. Might be related to #11422


@tjwebb commented on GitHub (Apr 3, 2026):

Same problem:

`ollama ps` reports 100% GPU, but the logs show some things getting loaded onto the CPU.

Eyeballing `top` and `nvtop`, it looks like 3/4 of the work is being done by the CPU, and overall performance is much slower than expected. The GPU is only running at ~20% capacity.

```
ollama_think  | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
ollama_think  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ollama_think  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama_think  | ggml_cuda_init: found 1 CUDA devices:
ollama_think  |   Device 0: NVIDIA RTX PRO 6000 Blackwell Server Edition, compute capability 12.0, VMM: yes, ID: GPU-13a1ab75-1e0f-0f52-f1a8-56f99675ff4d
ollama_think  | load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
ollama_think  | time=2026-04-03T02:39:04.225Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
ollama_think  | time=2026-04-03T02:39:04.232Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
ollama_think  | time=2026-04-03T02:39:04.259Z level=INFO source=model.go:138 msg="vision: decode" elapsed=1.847855ms bounds=(0,0)-(2048,2048)
ollama_think  | time=2026-04-03T02:39:04.376Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=117.019105ms size="[768 768]"
ollama_think  | time=2026-04-03T02:39:04.376Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
ollama_think  | time=2026-04-03T02:39:04.376Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
ollama_think  | time=2026-04-03T02:39:04.377Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=119.770553ms shape="[5376 256]"
ollama_think  | time=2026-04-03T02:39:34.481Z level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:96 GPULayers:61[ID:GPU-13a1ab75-1e0f-0f52-f1a8-56f99675ff4d Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ollama_think  | time=2026-04-03T02:39:34.544Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
ollama_think  | time=2026-04-03T02:39:34.579Z level=INFO source=model.go:138 msg="vision: decode" elapsed=4.908968ms bounds=(0,0)-(2048,2048)
ollama_think  | time=2026-04-03T02:39:34.722Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=142.923203ms size="[768 768]"
ollama_think  | time=2026-04-03T02:39:34.725Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
ollama_think  | time=2026-04-03T02:39:34.725Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
ollama_think  | time=2026-04-03T02:39:34.726Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=151.570828ms shape="[5376 256]"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:96 GPULayers:61[ID:GPU-13a1ab75-1e0f-0f52-f1a8-56f99675ff4d Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="18.4 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.2 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="23.5 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="1.0 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="2.3 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=device.go:272 msg="total memory" size="46.4 GiB"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=sched.go:561 msg="loaded runners" count=1
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=ggml.go:482 msg="offloading 60 repeating layers to GPU"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
ollama_think  | time=2026-04-03T02:39:36.265Z level=INFO source=ggml.go:494 msg="offloaded 61/61 layers to GPU"
```

CPU: Xeon 6 6747P
GPU: RTX 6000 Pro
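To put a number on that rather than eyeballing nvtop, standard NVIDIA tooling can sample utilization once per second while a prompt runs (nothing here is specific to this setup):

```
# Sample per-second GPU utilization; the "sm" column is compute utilization,
# which should sit near 100% if inference is actually GPU-bound
nvidia-smi dmon -s u
```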


@nickkaltner commented on GitHub (Apr 3, 2026):

AMD RYZEN AI MAX+ 395 w/ Radeon 8060S here.

I see the same behaviour: as a prompt is evaluated, the GPU usage slowly goes down and the CPU usage goes up. I have tried ROCm and Vulkan, and it's the same thing.

It shows 100% GPU with both gemma4:26b and gemma4:31b, but both of them are definitely using the CPU!


@seawindcn commented on GitHub (Apr 3, 2026):

v0.20.0: same issue.


@somera commented on GitHub (Apr 3, 2026):

Same here ... 50% CPU usage

![Image](https://github.com/user-attachments/assets/f9c66b8a-b141-4b9e-ae2d-9f0c781a6f86)
![Image](https://github.com/user-attachments/assets/9c6517d0-816d-4671-97d8-25e43f5e6863)

Ollama v0.20.0 with an RTX PRO 6000 96GB Server Edition, getting 8-11 tokens/s.
Ubuntu 24.04.x, Nvidia Driver 580.126.20


@rabinnh commented on GitHub (Apr 3, 2026):

I have the same issue. I have 2 Nvidia RTX 3090s and I have conky loaded so I can see the memory of each GPU in real time.

The memory ping-pongs between the 2 GPUs until it finally starts executing on the CPU:

```
NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:31b    6316f0629137    63 GB    100% CPU     262144     4 minutes from now
```

All the other models run on the GPUs fine.

Another issue: when I switch to another model and it's running on the GPUs, Ollama never unloads gemma4:31b. My CPU load is maxed out, the temperatures and fans go way up, and I have to run `sudo systemctl restart ollama` to get everything back to normal.

```
NAME                               ID              SIZE     PROCESSOR    CONTEXT    UNTIL
richardyoung/kat-dev-72b:iq4_xs    14bbcc414a53    43 GB    100% GPU     8192       4 minutes from now
gemma4:31b                         6316f0629137    63 GB    100% CPU     262144     4 minutes from now
```
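For the unload problem, `ollama stop` may be a lighter-weight workaround than restarting the whole service (a sketch; the tag matches the stuck model above, and it asks the server to evict just that model):

```
# Unload the stuck model without restarting ollama itself
ollama stop gemma4:31b
```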


@PurpleBanana-ai commented on GitHub (Apr 3, 2026):

My apologies in advance if this is not the appropriate format for posting this info; I generally do not post on these. Running with Open WebUI is worse, especially with any type of tool call like web search, but for the case below this is Ollama straight in the terminal. FYI, I am seeing the same issue with a GGUF model from Unsloth for gemma4, not just the ones downloaded directly from Ollama. I was seeing my CPU package temp touch 60°C+, which is not something I see with my cooling except under intense benchmarks; never for inference or even diffusion.

Setup
Debian 13, Cuda 13.2, Driver 595.58.03
i9-14900k 790 chipset
94GB DDR5 6400
m.2 NVME (CPU Side PCIE Bus)
GPU 0: RTX 5090 32GB (CPU Side PCIE Bus - PCIE5 Slot x8)
GPU 1: RTX 3090 24GB (CPU Side PCIE Bus - PCIE 5 Slot x8)-yes the GPU is at PCIE4
GPU 2: RTX 5070ti 16GB (Chipset Side PCIE Bus - PCIE4 16x slot at x4)
GPU 3: RTX 5070ti 16GB (Chipset Side PCIE Bus - m.2 PCIE4 x4 to Occulink EGPU)
(No need to dog the frankenrig, she is fine; this is the only model I am having issues with. I will try it on llama.cpp and vLLM later as well.)

Same issues as above, just a different config, and only with gemma4: any model version, any quant, any ctx size. I can see the model weights offloaded to the CPU in the logs below. I have tried creating Modelfiles that statically set GPU layers to 999, but no difference. gemma4 also runs very slowly; even if I pin the 31B q4_K_M quant to my 5090 with 8192 ctx, it is no different than running across multiple GPUs: ~13-15 tps for gemma4. Load time is as expected across multiple cards with this config; that is not an issue.

This example and the logs are with the following Modelfile (FA on in env, no Docker):

```
FROM gemma4:31b-it-q8_0
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 65
PARAMETER num_ctx 65536
```
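For reference, a Modelfile like this is built in the usual way (the file name here is assumed), and the "static offload" attempt mentioned above would be one extra PARAMETER line:

```
# Build the custom model from the Modelfile above (file name assumed)
ollama create gemma4-31b-q8_0-custom -f Modelfile

# The static GPU-layer pin mentioned above, as a Modelfile line:
# PARAMETER num_gpu 999
```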

qwen3.5 32B A3B Q8 is fine even at 245,760 ctx.
qwen3-next:80b-a3b-thinking-q4_K_M is fine at 204,800 ctx.

FYI, a comparison on the same basic prompt, with think mode enabled for gemma4 (turning it off doesn't change the issue):

**gemma4**
ollama run gemma4-31b-q8_0-custom:latest --verbose

>>> <|think|> hi gemma, its nice to meet you! would you mind sharing with me a detailed explanation of new capabilities?

Performance:
total duration: 1m56.786501477s
load duration: 131.067798ms
prompt eval count: 39 token(s)
prompt eval duration: 145.114255ms
prompt eval rate: 268.75 tokens/s
eval count: 1612 token(s)
eval duration: 1m55.901022209s
eval rate: 13.91 tokens/s

**qwen3-next:80b-a3b-thinking-q4_K_M**
Same Prompt (minus the think tag token) for qwen3-next:80b-a3b-thinking-q4_K_M at 204,800 ctx:

total duration: 55.166186993s
load duration: 81.913138ms
prompt eval count: 33 token(s)
prompt eval duration: 125.400091ms
prompt eval rate: 263.16 tokens/s
eval count: 4901 token(s)
eval duration: 54.000393967s
eval rate: 90.76 tokens/s

Ollama Logs for gemma4:

```
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:484 msg="system memory" total="94.1 GiB" free="90.3 GiB" free_swap=>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-864499ec-e762-8642-9601-9c125fe6fd64 li>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-43e01948-3b0f-d96f-7efe-1dd1a630e992 li>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-ca86a6a9-1744-35bf-5a7f-7d399406c935 li>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-7c9fd88a-32f8-c5af-65ec-2ee436bd7987 li>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.288-04:00 level=INFO source=server.go:759 msg="loading model" "model layers"=61 requested=-1
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.297-04:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.297-04:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:44647"
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.301-04:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 Batch>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.name default=""
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.description default=""
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q8_0 name="" description="">
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.337-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
Apr 03 10:46:31 purplebanana-ai ollama[168950]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-alderlake.so
Apr 03 10:46:31 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:31.340-04:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama>
Apr 03 10:46:31 purplebanana-ai ollama[168950]: [GIN] 2026/04/03 - 10:46:31 | 200 |      18.469µs |       127.0.0.1 | HEAD     "/"
Apr 03 10:46:31 purplebanana-ai ollama[168950]: [GIN] 2026/04/03 - 10:46:31 | 200 |       7.005µs |       127.0.0.1 | GET      "/api/ps"
Apr 03 10:46:31 purplebanana-ai ollama[168950]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Apr 03 10:46:31 purplebanana-ai ollama[168950]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 03 10:46:31 purplebanana-ai ollama[168950]: ggml_cuda_init: found 4 CUDA devices:
Apr 03 10:46:31 purplebanana-ai ollama[168950]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-864499ec-e762-8642-9601-9c125fe6fd64
Apr 03 10:46:31 purplebanana-ai ollama[168950]:   Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, ID: GPU-43e01948-3b0f-d96f-7efe-1dd1a630e992
Apr 03 10:46:31 purplebanana-ai ollama[168950]:   Device 2: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-ca86a6a9-1744-35bf-5a7f-7d399406c935
Apr 03 10:46:31 purplebanana-ai ollama[168950]:   Device 3: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-7c9fd88a-32f8-c5af-65ec-2ee436bd7987
Apr 03 10:46:31 purplebanana-ai ollama[168950]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.009-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.embedding_length>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.028-04:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=783.895µs bounds=(0,0)-(2048,2048)
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.094-04:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=66.114575ms size="[768 768]"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.094-04:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.094-04:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchS>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.095-04:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=67.401546ms shape="[5376 256]"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.126-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1137 splits=1
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.224-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2514 splits=23
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2512 splits=23
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="14.0 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="17.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.5 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="3.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="5.1 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="281.5 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="489.8 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="640.0 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=device.go:272 msg="total memory" size="42.8 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:784 msg=memory success=true required.InputWeights=1591738368 requ>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-43e01948-3b0f-d96f-7efe-1dd1a630e9>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-864499ec-e762-8642-9601-9c125fe6fd>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-7c9fd88a-32f8-c5af-65ec-2ee436bd79>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-ca86a6a9-1744-35bf-5a7f-7d399406c9>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=DEBUG source=server.go:795 msg="new layout created" layers="61[ID:GPU-43e01948-3b0f-d96f>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.268-04:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 Bat>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.297-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=general.alignment default=32
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.pooling_type default=0
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=tokenizer.ggml.eot_token_id d>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=2560>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.attention.global_head_>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_count default=0
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.expert_used_count defa>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.block_count defa>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.309-04:00 level=DEBUG source=ggml.go:325 msg="key with type not found" key=gemma4.audio.embedding_length>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.320-04:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=1.045157ms bounds=(0,0)-(2048,2048)
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.388-04:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=67.402195ms size="[768 768]"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.389-04:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.389-04:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchS>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.389-04:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=69.978037ms shape="[5376 256]"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.391-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1137 splits=1
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.562-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2514 splits=23
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=ggml.go:852 msg="compute graph" nodes=2512 splits=23
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="14.0 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="17.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.5 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="3.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="5.1 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="281.5 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="489.8 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="640.0 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=device.go:272 msg="total memory" size="42.8 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:784 msg=memory success=true required.InputWeights=1591738368 requ>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-43e01948-3b0f-d96f-7efe-1dd1a630e9>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-864499ec-e762-8642-9601-9c125fe6fd>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-7c9fd88a-32f8-c5af-65ec-2ee436bd79>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:978 msg="available gpu" id=GPU-ca86a6a9-1744-35bf-5a7f-7d399406c9>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.564-04:00 level=DEBUG source=server.go:795 msg="new layout created" layers="61[ID:GPU-43e01948-3b0f-d96f>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 Ba>
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=ggml.go:482 msg="offloading 60 repeating layers to GPU"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=ggml.go:494 msg="offloaded 61/61 layers to GPU"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="14.0 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="17.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="1.5 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="3.4 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="5.1 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="281.5 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="489.8 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="640.0 MiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=device.go:272 msg="total memory" size="42.8 GiB"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
Apr 03 10:46:32 purplebanana-ai ollama[168950]: time=2026-04-03T10:46:32.565-04:00 level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm serv>
```

Not sure if it helps; happy to provide more info.


@zestysoft commented on GitHub (Apr 3, 2026):

FWIW, I'm seeing the same behavior on a Mac in ollama 0.20:

gemma4:31b
M3 Max processor with 128 GB of RAM.

ollama ps shows the model loaded at 100% GPU, but mactop shows 600+% CPU utilization with very little GPU.


@Wladastic commented on GitHub (Apr 4, 2026):

Hm, weirdly, with version 0.20.2 I ran it inside ollama with a 32k context: no real CPU usage, only one thread on my CPU being used, and the answer came quickly.
Then I tested the same 32k context via openclaw, and all 32 CPU cores are running now o.O


@homjay commented on GitHub (Apr 4, 2026):

Hm, weirdly, with version 0.20.2 I ran it inside ollama with a 32k context: no real CPU usage, only one thread on my CPU being used, and the answer came quickly. Then I tested the same 32k context via openclaw, and all 32 CPU cores are running now o.O

Based on my observations, the glitch is triggered specifically when sending a second prompt. This behavior is highly unusual.


@sergiosaurio commented on GitHub (Apr 4, 2026):

In my case, using curl or the Python library produces the same results:
roughly 45% CPU usage and 5% GPU per prompt.

Using the CLI or Ollama app works fine.

NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:26b 5571076f3d70 21 GB 100% GPU 16000 Forever
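
For anyone trying to reproduce the API-side behavior, here is a minimal sketch of the kind of request involved (not from the original comment; /api/generate is Ollama's standard generate endpoint, and the model tag matches the one above):

```bash
# Sketch: send the same prompt over the HTTP API and watch CPU/GPU usage
# while it runs; per the reports above, the CLI path behaves differently.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "hi",
  "stream": false
}'
```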


@alerque commented on GitHub (Apr 4, 2026):

In my case, using curl or the Python library produces the same results: roughly 45% CPU usage and 5% GPU per prompt.

My use case involves a separate Rust app that calls the API over the TCP port (via the rig crate). That works fine: the model runs on the GPU when called via API calls from the socket as well as via the ollama CLI. I don't know what your Python calls would be doing differently from that.


@sammyvoncheese commented on GitHub (Apr 4, 2026):

0.20.2, CPU vs GPU. Calling a tool:

gemma4:e4b-it-bf16 d0d10a1b1ddb 21 GB 100% GPU 130000 57 minutes from now

Image

Same model, only generating text:

Image

@somera commented on GitHub (Apr 4, 2026):

Same here ... 50% CPU usage

Image Image
Ollama v0.20.0 with RTX PRO 6000 96GB Server Edition with 8-11 Tokens/s. Ubuntu 24.04.x, Nvidia Driver 580.126.20

Not usable at the moment. v0.20.2

AMD EPYC 9355 32-Core Processor + RTX PRO 6000 96 GB

Mix of CPU and GPU usage:

Image Image Image

And very low tokens/s.

$ ollama --verbose run gemma4:31b-it-q8_0
>>> hi
Thinking...
The user said "hi".
This is a standard greeting.

    *   Acknowledge the greeting.
    *   Offer assistance.
    *   Maintain a helpful and friendly tone.
"Hello! How can I help you today?" or "Hi there! What can I do for you?"
...done thinking.

Hello! How can I help you today?

total duration:       5.833939473s
load duration:        186.580984ms
prompt eval count:    16 token(s)
prompt eval duration: 2.0194886s
prompt eval rate:     7.92 tokens/s
eval count:           78 token(s)
eval duration:        3.590263048s
eval rate:            21.73 tokens/s

Restarting ollama, and then:

$ ollama --verbose run gemma4:26b-a4b-it-q8_0
>>> hi
Thinking...
The user said "hi".
This is a simple greeting.

    *   Acknowledge the greeting.
    *   Offer assistance.
    *   Maintain a polite and friendly tone.
"Hello! How can I help you today?" or "Hi there! What's on your mind?"
...done thinking.

Hello! How can I help you today?

total duration:       1.925485214s
load duration:        207.200064ms
prompt eval count:    16 token(s)
prompt eval duration: 77.994687ms
prompt eval rate:     205.14 tokens/s
eval count:           78 token(s)
eval duration:        1.504752337s
eval rate:            51.84 tokens/s

and now a longer prompt:

>>> Show me a bash snippet
Thinking...
...
total duration:       1m1.961390093s
load duration:        204.835352ms
prompt eval count:    185 token(s)
prompt eval duration: 59.550431ms
prompt eval rate:     3106.61 tokens/s
eval count:           1325 token(s)
eval duration:        1m1.101323633s
eval rate:            21.69 tokens/s

and an even longer prompt:

total duration:       10m8.197019746s
load duration:        172.859408ms
prompt eval count:    584 token(s)
prompt eval duration: 171.277392ms
prompt eval rate:     3409.67 tokens/s
eval count:           8695 token(s)
eval duration:        10m4.288086024s
eval rate:            14.39 tokens/s

For the last prompt:

Image Image Image

@chenav commented on GitHub (Apr 4, 2026):

+1 on WSL2 (latest version) with Docker and a 5090


@SingularityMan commented on GitHub (Apr 4, 2026):

Ubuntu 22.04 is showing the same issues.


@alerque commented on GitHub (Apr 4, 2026):

@somera and others, try the 26b or smaller models instead of 31b. 31b seems to need WILDLY more video RAM than it probably ought to. I have 96G of video memory available too (shared), and it takes north of 60G to run the 26b model. The 31b model starts loading, then runs out of RAM and starts chugging through something on one CPU. I suspect a lot of the issues in this thread have memory issues as their root cause, not CPU/GPU routing.


@SingularityMan commented on GitHub (Apr 4, 2026):

@somera and others, try the 26b or smaller models instead of 31b. 31b seems to need WILDLY more video RAM than it probably ought to. I have 96G of video memory available too (shared), and it takes north of 60G to run the 26b model. The 31b model starts loading, then runs out of RAM and starts chugging through something on one CPU. I suspect a lot of the issues in this thread have memory issues as their root cause, not CPU/GPU routing.

I'm already using the 26b model, and I have 48 GB of VRAM available. It doesn't matter which context length it is set to.


@alerque commented on GitHub (Apr 4, 2026):

I'm already using the 26b model, and I have 48 GB of VRAM available.

As I just mentioned, the 26b model eats through about 60 GB of VRAM when I run it. Try one of the even smaller models.


@Wladastic commented on GitHub (Apr 4, 2026):

I cannot confirm it to be a RAM issue.
The 31b model with a 22k-token prompt just ran through in about 1-2 seconds.
Once a tool call is mentioned, it reverts to CPU.
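
As a hedged sketch of what "mentioning a tool call" looks like on the wire (the get_weather schema here is hypothetical; /api/chat with a tools array is Ollama's standard chat endpoint):

```bash
# Sketch: same model and prompt, but with a tool schema attached. Per the
# report above, requests of this shape are what trigger the CPU fallback.
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:31b",
  "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'
```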


@somera commented on GitHub (Apr 4, 2026):

@somera and others, try the 26b or smaller models instead of 31b. 31b seems to need WILDLY more video RAM than it probably ought to. I have 96G of video memory available too (shared), and it takes north of 60G to run the 26b model. The 31b model starts loading, then runs out of RAM and starts chugging through something on one CPU. I suspect a lot of the issues in this thread have memory issues as their root cause, not CPU/GPU routing.

I don't see a VRAM issue on my system; I see mixed CPU and GPU usage.


@mazphilip commented on GitHub (Apr 4, 2026):

I did more digging; this seems to be a Flash Attention issue with Gemma4 (upstream, either Flash Attention or llama.cpp), somehow only triggered when trying to run coding agents? (ollama launch claude, ollama launch vscode)

  • It does seem true that the vision layers are always CPU-offloaded, but the big issues seem FA-related.

You can force FA usage, which makes ollama allocate the memory on the GPU(s), but once you run it (with a longer context?), something happens and it moves all the computation to the CPU.

I get very good performance (1500 tk/s prompt, 50 tk/s eval) when running with the following (set via /etc/systemd/system/ollama.service.d/override.conf; see the sketch after this list):

  • No FA
  • A smaller-than-typical context: 96k (for 54 GB VRAM), to account for less efficient memory usage due to no FA
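
For readers who want to replicate this, a minimal sketch of such a drop-in (not part of the original comment; it assumes the stock ollama.service unit, and 98304 is 96k expressed in tokens):

```bash
# Sketch: disable flash attention for a systemd-managed ollama server
# and cap the context; adjust the values to your VRAM budget.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_FLASH_ATTENTION=0"
Environment="OLLAMA_CONTEXT_LENGTH=98304"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```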

Resolution steps:

  1. Fix FA usage with llama.cpp (Claude tells me it might be the global attention layers with key_length=512 not being properly mapped in llama.cpp; looking into this right now).
    1. EDIT: it seems this is already fixed in llama.cpp: https://github.com/ggml-org/llama.cpp/releases/tag/b8609
  2. Move the vision layers to the GPU (but I can also see how this is specific to personal use cases).

@slamj1 commented on GitHub (Apr 4, 2026):

I can confirm @mazphilip's findings with respect to FA. Turning off FA in the service seems to solve the CPU-offload issue. Note that the CPU offload only seems to occur when calling via the API; the Ollama CLI works fine.

My example usage is gemma4:31b with a 128K context, which takes about 71 GB of VRAM. With FA disabled this config works well.


@somera commented on GitHub (Apr 4, 2026):

I did more digging; this seems to be a Flash Attention issue with Gemma4 (upstream, either Flash Attention or llama.cpp), somehow only triggered when trying to run coding agents? (ollama launch claude, ollama launch vscode)

I'm not using coding agents with ollama, and I see the issue with ollama run <model>, from Open WebUI, and with a small context (4096).


@sammyvoncheese commented on GitHub (Apr 4, 2026):

I can confirm @mazphilip's findings with respect to FA. Turning off FA in the service seems to solve the CPU-offload issue. Note that the CPU offload only seems to occur when calling via the API; the Ollama CLI works fine.

My example usage is gemma4:31b with a 128K context, which takes about 71 GB of VRAM. With FA disabled this config works well.

I can confirm that disabling FA keeps the model layers on the GPU now.


@SingularityMan commented on GitHub (Apr 4, 2026):

Can confirm, disabling FA on Ollama seems to correctly offload everything to GPU now.


@viba1 commented on GitHub (Apr 4, 2026):

On my side, disabling FA works correctly for models running 100% GPU, but the issue remains for models that split their workload between the CPU and GPU.

For example:
Gemma4:26b: 21% CPU / 79% GPU; ~1.2 tokens/s
Gemma3:27b: 19% CPU / 81% GPU; ~3 tokens/s


@mazphilip commented on GitHub (Apr 5, 2026):

I managed to make this work by migrating this llama.cpp PR over: https://github.com/ggml-org/llama.cpp/pull/20998
Opening a PR.


@Cephei-OpenSource commented on GitHub (Apr 5, 2026):

I can also confirm: setting OLLAMA_FLASH_ATTENTION=false (or 0 as some suggest; both seem to work) immediately and sharply boosts the performance of Gemma 4 (installed: gemma4:31b). Before: 20 t/s; after: 60 t/s.
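
For a quick test without editing any service or system config, the variable can also be set for a single foreground run (a sketch; it assumes no other ollama server is already listening on the default port):

```bash
# Sketch: start the server once with flash attention disabled.
OLLAMA_FLASH_ATTENTION=0 ollama serve
```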


@Hello-World-Traveler commented on GitHub (Apr 6, 2026):

I can also confirm: setting OLLAMA_FLASH_ATTENTION=false (or 0 as some suggest; both seem to work) immediately and sharply boosts the performance of Gemma 4 (installed: gemma4:31b). Before: 20 t/s; after: 60 t/s.

Setting OLLAMA_FLASH_ATTENTION to false makes little difference for me:
gemma4:e4b 10 GB 66%/34% CPU/GPU 4096 4 minutes from now

With OLLAMA_FLASH_ATTENTION set to 0:
gemma4:e4b 10 GB 66%/34% CPU/GPU 4096 4 minutes from now

For comparison:
gemma3:4b 5.4 GB 100% GPU 4096 4 minutes from now


@tjwebb commented on GitHub (Apr 6, 2026):

Yep, disabling FA worked for me.


@m0n5t3r commented on GitHub (Apr 6, 2026):

Another data point: disabling FA works if you have enough VRAM (in my case a Ryzen AI Max 395 with 64 GB allocated to the GPU). Before, I was seeing between 25% and 75% GPU usage with gemma4:26b and 21 GB of VRAM used; now I see close to 100% GPU use and 38 GB of VRAM used (and it is much faster).

ollama ps said 100% GPU in both cases.


@Hello-World-Traveler commented on GitHub (Apr 6, 2026):

Turning off thinking does make it faster, at about 19 t/s:
66%/34% CPU/GPU

It doesn't make much difference for me:

OLLAMA_FLASH_ATTENTION	1
OLLAMA_MAX_LOADED_MODELS	1
OLLAMA_NUM_PARALLEL	1

I am using Docker with gemma4:e4b and OLLAMA_NEW_ENGINE=true.
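
For the Docker case, a sketch of passing the same variables at container start (the volume and port mapping follow the standard ollama/ollama image instructions; the env values are the ones discussed in this thread):

```bash
# Sketch: standard ollama/ollama container with flash attention disabled
# and the new engine enabled, matching the setup described above.
docker run -d --gpus=all \
  -e OLLAMA_FLASH_ATTENTION=0 \
  -e OLLAMA_NEW_ENGINE=true \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```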


@roxlukas commented on GitHub (Apr 7, 2026):

Confirmed: with OLLAMA_FLASH_ATTENTION=1 on Gemma4:26B there is heavy CPU usage (50-60%) and eval token speed hovers around 30 tokens/s, even for the e4b variant!
With OLLAMA_FLASH_ATTENTION=0, token speed on Gemma4:26B jumps to 108 tokens/s on an RTX 3090.

In both cases Ollama reports full GPU inference:
ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:26b 5571076f3d70 21 GB 100% GPU 32768 4 minutes from now

env:
Ollama 0.20.3
Windows 11
i5-11400F
64GB DDR4
RTX 3090
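
Since ollama ps reports 100% GPU either way, a useful cross-check (a sketch, assuming an NVIDIA card such as the RTX 3090 above) is to watch the driver's own utilization counters while a prompt is generating:

```bash
# Sketch: poll GPU utilization once per second during generation; if the
# FA bug is active, utilization.gpu stays low despite "100% GPU" in ps.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```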

Reference: github-starred/ollama#56258