[GH-ISSUE #12428] Models loading slow since 0.12 version #54767

Closed
opened 2026-04-29 07:15:20 -05:00 by GiteaMirror · 25 comments
Owner

Originally created by @deep1305 on GitHub (Sep 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12428

I want to raise an issue: since the 0.12 Ollama update, models take longer than expected to respond, even though no other processes are running on my device. Answering a query takes more than a minute, whether with a qwen3 model or deepseek-r1.

GiteaMirror added the bug label 2026-04-29 07:15:21 -05:00
Author
Owner

@jmorganca commented on GitHub (Sep 27, 2025):

Hi @deep1305, would it be possible to share which OS you are on, and also the [logs](https://docs.ollama.com/troubleshooting) if possible? Sorry about this.

Author
Owner

@deep1305 commented on GitHub (Sep 27, 2025):

Hi, I am running Ollama on Windows 11.

Below is the log:

time=2025-09-26T21:00:56.425-04:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:131072 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:true OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\smart\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-09-26T21:00:56.535-04:00 level=INFO source=images.go:518 msg="total blobs: 74"
time=2025-09-26T21:00:56.538-04:00 level=INFO source=images.go:525 msg="total unused blobs removed: 0"
time=2025-09-26T21:00:56.546-04:00 level=INFO source=routes.go:1528 msg="Listening on 127.0.0.1:11434 (version 0.12.2)"
time=2025-09-26T21:00:56.547-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-09-26T21:00:56.548-04:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-09-26T21:00:56.548-04:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-09-26T21:00:56.548-04:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=8 threads=20
time=2025-09-26T21:00:57.880-04:00 level=INFO source=gpu.go:311 msg="detected OS VRAM overhead" id=GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 library=cuda compute=8.6 driver=13.0 name="NVIDIA GeForce RTX 3050 Ti Laptop GPU" overhead="674.8 MiB"
time=2025-09-26T21:00:58.795-04:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 library=cuda variant=v13 compute=8.6 driver=13.0 name="NVIDIA GeForce RTX 3050 Ti Laptop GPU" total="4.0 GiB" available="3.2 GiB"
time=2025-09-26T21:00:58.795-04:00 level=INFO source=types.go:131 msg="inference compute" id=0 library=oneapi variant="" compute="" driver=0.0 name="\xc0" total="0 B" available="0 B"
time=2025-09-26T21:00:58.795-04:00 level=INFO source=routes.go:1569 msg="entering low vram mode" "total vram"="4.0 GiB" threshold="20.0 GiB"
[GIN] 2025/09/26 - 21:00:58 | 200 | 642.9µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/26 - 21:00:58 | 200 | 98.3295ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/09/26 - 21:07:07 | 200 | 2.7923ms | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/26 - 21:07:07 | 200 | 662.4702ms | 127.0.0.1 | POST "/api/show"
time=2025-09-26T21:07:09.875-04:00 level=INFO source=server.go:399 msg="starting runner" cmd="C:\Users\smart\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model C:\Users\smart\.ollama\models\blobs\sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --port 51082"
time=2025-09-26T21:07:09.907-04:00 level=INFO source=server.go:672 msg="loading model" "model layers"=49 requested=-1
time=2025-09-26T21:07:09.987-04:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
time=2025-09-26T21:07:09.989-04:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:51082"
time=2025-09-26T21:07:10.059-04:00 level=INFO source=server.go:678 msg="system memory" total="31.7 GiB" free="14.9 GiB" free_swap="27.0 GiB"
time=2025-09-26T21:07:10.059-04:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 available="2.7 GiB" free="3.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-09-26T21:07:10.059-04:00 level=INFO source=server.go:686 msg="gpu memory" id=0 available="0 B" free="0 B" minimum="0 B" overhead="0 B"
time=2025-09-26T21:07:10.073-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:49[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:10.146-04:00 level=INFO source=ggml.go:131 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=37
load_backend: loaded CPU backend from C:\Users\smart\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes, ID: GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34
load_backend: loaded CUDA backend from C:\Users\smart\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2025-09-26T21:07:10.298-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-09-26T21:07:10.650-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:10.909-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:5[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:5(43..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:11.160-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:4[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:4(44..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:11.414-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:11.669-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:11.907-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:12.167-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:12.425-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:15.220-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:5[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:5(43..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:19.513-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:4[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:4(44..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:24.438-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:28.963-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:34.114-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:39.762-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:46.714-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="8.3 GiB"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="8.5 GiB"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="4.0 GiB"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=backend.go:342 msg="total memory" size="20.9 GiB"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
time=2025-09-26T21:07:46.714-04:00 level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=ggml.go:498 msg="offloaded 0/49 layers to GPU"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-09-26T21:07:46.717-04:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-09-26T21:07:58.792-04:00 level=INFO source=server.go:1289 msg="llama runner started in 48.94 seconds"
[GIN] 2025/09/26 - 21:07:58 | 200 | 51.056361s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/09/26 - 21:09:16 | 200 | 21.0617641s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/09/26 - 21:09:22 | 200 | 5.0899348s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/09/26 - 21:19:19 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/26 - 21:19:19 | 200 | 124.9534ms | 127.0.0.1 | POST "/api/show"
time=2025-09-26T21:19:20.396-04:00 level=INFO source=server.go:399 msg="starting runner" cmd="C:\Users\smart\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf --port 52631"
time=2025-09-26T21:19:20.410-04:00 level=INFO source=server.go:672 msg="loading model" "model layers"=49 requested=-1
time=2025-09-26T21:19:20.492-04:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
time=2025-09-26T21:19:20.495-04:00 level=INFO source=server.go:678 msg="system memory" total="31.7 GiB" free="17.7 GiB" free_swap="25.5 GiB"
time=2025-09-26T21:19:20.495-04:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 available="2.5 GiB" free="3.0 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-09-26T21:19:20.495-04:00 level=INFO source=server.go:686 msg="gpu memory" id=0 available="0 B" free="0 B" minimum="0 B" overhead="0 B"
time=2025-09-26T21:19:20.496-04:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:52631"
time=2025-09-26T21:19:20.498-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:49[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:20.533-04:00 level=INFO source=ggml.go:131 msg="" architecture=qwen3moe file_type=Q4_K_M name="Qwen3 30B A3B Thinking 2507" description="" num_tensors=579 num_key_values=33
load_backend: loaded CPU backend from C:\Users\smart\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes, ID: GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34
load_backend: loaded CUDA backend from C:\Users\smart\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2025-09-26T21:19:21.514-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-09-26T21:19:21.651-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.702-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.757-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.812-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.861-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.920-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:27.515-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:33.756-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 8615100416
ggml_gallocr_reserve_n: failed to allocate CPU buffer of size 8615100416
time=2025-09-26T21:19:44.535-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:50.621-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:56.576-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="17.3 GiB"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="12.0 GiB"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="8.0 GiB"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:342 msg="total memory" size="37.3 GiB"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
time=2025-09-26T21:19:56.577-04:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-09-26T21:19:56.576-04:00 level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=ggml.go:498 msg="offloaded 0/49 layers to GPU"
time=2025-09-26T21:19:56.578-04:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-09-26T21:20:17.976-04:00 level=INFO source=server.go:1289 msg="llama runner started in 57.58 seconds"
[GIN] 2025/09/26 - 21:20:18 | 200 | 58.1164394s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/09/26 - 21:20:45 | 200 | 22.0714418s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/09/26 - 21:21:41 | 200 | 32.3475974s | 127.0.0.1 | POST "/api/chat"
time=2025-09-26T21:26:47.005-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0923822 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf
time=2025-09-26T21:26:47.253-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3420008 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf
time=2025-09-26T21:26:47.504-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5932162 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf
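
The log above points at the likely cause of the slowdown: with `OLLAMA_CONTEXT_LENGTH:131072`, the KV cache alone is reported at 12.0 GiB, pushing the total to 37.3 GiB, so all 49 layers land on the CPU instead of the 4 GiB GPU. A rough back-of-envelope check, assuming the KV cache scales roughly linearly with context length (an approximation, not Ollama's exact memory formula):

```python
# Figures taken from the "commit" stage of the qwen3moe load in the log above.
weights_gib = 17.3   # "model weights" size
graph_gib = 8.0      # "compute graph" size
base_ctx = 131072    # OLLAMA_CONTEXT_LENGTH from the server config
base_kv_gib = 12.0   # reported "kv cache" size at that context length

def kv_cache_gib(ctx_tokens: int) -> float:
    """Assumed linear scaling of KV-cache size with the context window."""
    return base_kv_gib * ctx_tokens / base_ctx

# Total at the configured 131072-token context matches the logged 37.3 GiB,
# far beyond the 4 GiB of VRAM, hence "offloaded 0/49 layers to GPU".
total = weights_gib + kv_cache_gib(base_ctx) + graph_gib
print(round(total, 1))        # 37.3

# Shrinking the window (e.g. OLLAMA_CONTEXT_LENGTH=8192) would cut the
# cache roughly 16x under this assumption:
print(kv_cache_gib(8192))     # 0.75
```

Under this assumption, a smaller context window (or enabling flash attention to reduce cache overhead) would let more layers fit in VRAM; whether that fully restores pre-0.12 load times would need testing.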

time=2025-09-26T21:19:21.920-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-09-26T21:19:27.515-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-09-26T21:19:33.756-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 8615100416 ggml_gallocr_reserve_n: failed to allocate CPU buffer of size 8615100416 time=2025-09-26T21:19:44.535-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-09-26T21:19:50.621-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-09-26T21:19:56.576-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 
UseMmap:false}" time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="17.3 GiB" time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="12.0 GiB" time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="8.0 GiB" time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:342 msg="total memory" size="37.3 GiB" time=2025-09-26T21:19:56.577-04:00 level=INFO source=sched.go:470 msg="loaded runners" count=1 time=2025-09-26T21:19:56.577-04:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" time=2025-09-26T21:19:56.576-04:00 level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU" time=2025-09-26T21:19:56.577-04:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU" time=2025-09-26T21:19:56.577-04:00 level=INFO source=ggml.go:498 msg="offloaded 0/49 layers to GPU" time=2025-09-26T21:19:56.578-04:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" time=2025-09-26T21:20:17.976-04:00 level=INFO source=server.go:1289 msg="llama runner started in 57.58 seconds" [GIN] 2025/09/26 - 21:20:18 | 200 | 58.1164394s | 127.0.0.1 | POST "/api/generate" [GIN] 2025/09/26 - 21:20:45 | 200 | 22.0714418s | 127.0.0.1 | POST "/api/chat" [GIN] 2025/09/26 - 21:21:41 | 200 | 32.3475974s | 127.0.0.1 | POST "/api/chat" time=2025-09-26T21:26:47.005-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0923822 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf time=2025-09-26T21:26:47.253-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3420008 runner.size="37.3 GiB" runner.vram="0 B" 
runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf time=2025-09-26T21:26:47.504-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5932162 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf
Author
Owner

@asdnemasd commented on GitHub (Sep 27, 2025):

I'm experiencing the same issue. I think it has something to do with Ollama's new engine. With the Qwen3-Coder-30B-A3B model on Ollama v0.12.1, the model loads at around ~100 MB/s, but with v0.12.2, which switched the Qwen3 architecture to Ollama's new engine, the model loads at only ~30 MB/s. (In both cases the model loads from an HDD, and the model was added through a custom GGUF file.)
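As a rough way to separate raw disk throughput from engine overhead, a sketch like the following could time a plain sequential read of the blob file. This is just an illustrative helper, not part of ollama, and the path in the usage comment is hypothetical:

```python
import time

def read_throughput(path, chunk=8 * 1024 * 1024):
    """Return sequential read throughput of `path` in MB/s."""
    start = time.monotonic()
    total = 0
    with open(path, "rb") as f:
        # Read the file front to back in 8 MiB chunks, like a streaming load.
        while data := f.read(chunk):
            total += len(data)
    elapsed = max(time.monotonic() - start, 1e-9)
    return total / elapsed / 1e6

# Example (path is hypothetical):
# print(read_throughput(r"C:\Users\me\.ollama\models\blobs\sha256-..."))
```

If the raw read already tops out near ~30 MB/s, the HDD is the bottleneck; if it reads much faster than the model loads, the slowdown is in the loader.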

<!-- gh-comment-id:3341176964 -->
Author
Owner

@zxiaomzxm commented on GitHub (Sep 27, 2025):

same issue as here: https://github.com/ollama/ollama/issues/12407

<!-- gh-comment-id:3341190725 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

@deep1305 You are using a model with 8 GB of weights and a context length of 131072 on a GPU that has only 4 GB of VRAM, so the model will not fit on the GPU and will run on the CPU. Is your experience that CPU processing is slower in 0.12.* than in previous versions? Can you run `ollama run gemma3:12b --verbose hello` and post the output from 0.12.2 and the previous version of ollama?
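The memory math from the log above can be checked with some back-of-envelope arithmetic (the 4 GiB VRAM figure is an assumption for the RTX 3050 Ti Laptop GPU; the other sizes are taken straight from the log):

```python
# Sizes reported by the runner at the default 131072-token context length.
GIB = 1024 ** 3

weights = 8.3 * GIB        # "model weights" from the log
kv_cache = 8.5 * GIB       # "kv cache" from the log
compute_graph = 4.0 * GIB  # "compute graph" from the log
vram = 4.0 * GIB           # assumed VRAM of the RTX 3050 Ti Laptop GPU

total = weights + kv_cache + compute_graph
print(f"total: {total / GIB:.1f} GiB vs VRAM: {vram / GIB:.1f} GiB")
print("fits entirely on GPU:", total <= vram)
# total: 20.8 GiB vs VRAM: 4.0 GiB (the log rounds this up to 20.9 GiB)
```

Note how the KV cache alone already exceeds the VRAM, which is why reducing the context length (e.g. `OLLAMA_CONTEXT_LENGTH`) lets layers move back onto the GPU.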

<!-- gh-comment-id:3341507340 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

@asdnemasd This seems like a different problem: you are seeing slower load times while the OP has slower execution. Can you open a new issue, set `OLLAMA_DEBUG=1` and then post logs from 0.12.2 and whatever version of ollama you were running that loaded faster?
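For anyone capturing debug logs, the variable is set before restarting the server, roughly like this (generic examples; `OLLAMA_DEBUG` appears in the server config dump above):

```shell
# Linux / macOS: enable debug logging for one serve session
OLLAMA_DEBUG=1 ollama serve

# Windows PowerShell equivalent (then restart the Ollama app/service):
#   $env:OLLAMA_DEBUG = "1"
```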

<!-- gh-comment-id:3341510200 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

@zxiaomzxm Your problem doesn't appear to be the same.

<!-- gh-comment-id:3341512266 -->
Author
Owner

@asiyouil commented on GitHub (Sep 27, 2025):

I also hit the same problem. I think the reason is that if ollama detects that your VRAM (not including GPU shared memory) is less than the model's total memory, it enters low vram mode, and that mode runs the model on the CPU only. (Example: your VRAM is 4 GiB but the model's total memory is 20.9 GiB, so ollama enters low vram mode.)

<!-- gh-comment-id:3341717931 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

No, if ollama detects that you have low VRAM (less than 20GB), it changes the default context size for gpt-oss models.

<!-- gh-comment-id:3341728234 -->
Author
Owner

@asiyouil commented on GitHub (Sep 27, 2025):

No, if ollama detects that you have low VRAM (less than 20GB), it changes the default context size for gpt-oss models.

But when I run a local model whose total memory is more than my VRAM, ollama doesn't change the default context size and just enters low vram mode. You can also find a message about this in the log. Only when the model's total memory is less than VRAM can ollama use the GPU fully.

<!-- gh-comment-id:3341798441 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

If ollama detects that you have low VRAM (less than 20GB), it changes the default context size **for gpt-oss models**.

<!-- gh-comment-id:3341815575 -->
Author
Owner

@deep1305 commented on GitHub (Sep 27, 2025):

@rick-github It was working perfectly fine, with much faster inference, before I updated to 0.12.*.

<!-- gh-comment-id:3341886289 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

@deep1305 Can you run `ollama run gemma3:12b --verbose hello` and post the output from 0.12.2 and the previous version of ollama?

<!-- gh-comment-id:3341899985 -->
Author
Owner

@tobing commented on GitHub (Sep 29, 2025):

I seem to have the same issue with an AMD GPU. I am using ollama 0.12.3 on CachyOS with an AMD 7800 XT.
I am just running the small model qwen3:0.6b, but it is quite slow. I remember that with ollama 0.11 I could run deepseek-r1-8b smoothly.

[myuser@cachyos-x8664 ~]$ ollama serve
time=2025-09-29T09:31:43.238+07:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16392 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/myuser/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-09-29T09:31:43.238+07:00 level=INFO source=images.go:518 msg="total blobs: 0"
time=2025-09-29T09:31:43.238+07:00 level=INFO source=images.go:525 msg="total unused blobs removed: 0"
time=2025-09-29T09:31:43.238+07:00 level=INFO source=routes.go:1528 msg="Listening on 127.0.0.1:11434 (version 0.12.3)"
time=2025-09-29T09:31:43.239+07:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-09-29T09:31:43.260+07:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/download/linux-drivers.html" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2025-09-29T09:31:43.262+07:00 level=INFO source=amd_linux.go:390 msg="amdgpu is supported" gpu=GPU-f9ee9007b2049d8b gpu_type=gfx1101
time=2025-09-29T09:31:43.262+07:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-f9ee9007b2049d8b library=rocm variant="" compute=gfx1101 driver=0.0 name=1002:747e total="16.0 GiB" available="14.6 GiB"
time=2025-09-29T09:31:43.262+07:00 level=INFO source=routes.go:1569 msg="entering low vram mode" "total vram"="16.0 GiB" threshold="20.0 GiB"
time=2025-09-29T09:33:42.060+07:00 level=INFO source=server.go:217 msg="enabling flash attention"
time=2025-09-29T09:33:42.060+07:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa --port 40549"
time=2025-09-29T09:33:42.061+07:00 level=INFO source=server.go:672 msg="loading model" "model layers"=29 requested=-1
time=2025-09-29T09:33:42.061+07:00 level=INFO source=server.go:678 msg="system memory" total="62.7 GiB" free="53.8 GiB" free_swap="62.7 GiB"
time=2025-09-29T09:33:42.061+07:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-f9ee9007b2049d8b available="14.0 GiB" free="14.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-09-29T09:33:42.068+07:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
time=2025-09-29T09:33:42.068+07:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:40549"
time=2025-09-29T09:33:42.072+07:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:29[ID:GPU-f9ee9007b2049d8b Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-29T09:33:42.092+07:00 level=INFO source=ggml.go:131 msg="" architecture=qwen3 file_type=Q4_K_M name="Qwen3 0.6B" description="" num_tensors=311 num_key_values=29
operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-09-29T09:33:42.123+07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.SSE3=1 CPU.1.SSSE3=1 CPU.1.AVX=1 CPU.1.AVX2=1 CPU.1.F16C=1 CPU.1.FMA=1 CPU.1.BMI2=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-09-29T09:33:42.126+07:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-29T09:33:42.149+07:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=ggml.go:498 msg="offloaded 0/29 layers to GPU"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="492.8 MiB"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="966.9 MiB"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="24.0 MiB"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=backend.go:342 msg="total memory" size="1.4 GiB"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
time=2025-09-29T09:33:42.269+07:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-09-29T09:33:42.270+07:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-09-29T09:33:42.523+07:00 level=INFO source=server.go:1289 msg="llama runner started in 0.46 seconds"
[GIN] 2025/09/29 - 09:33:42 | 200 | 574.588692ms | 127.0.0.1 | POST "/api/generate"

No issue with ollama 0.11.11

[myuser@cachyos-x8664 ~]$ ollama serve
time=2025-09-29T10:32:33.925+07:00 level=INFO source=routes.go:1332 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16392 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/myuser/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-09-29T10:32:33.926+07:00 level=INFO source=images.go:477 msg="total blobs: 5"
time=2025-09-29T10:32:33.926+07:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-09-29T10:32:33.926+07:00 level=INFO source=routes.go:1385 msg="Listening on 127.0.0.1:11434 (version 0.11.11)"
time=2025-09-29T10:32:33.926+07:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-09-29T10:32:33.949+07:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/download/linux-drivers.html" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2025-09-29T10:32:33.950+07:00 level=INFO source=amd_linux.go:390 msg="amdgpu is supported" gpu=GPU-f9ee9007b2049d8b gpu_type=gfx1101
time=2025-09-29T10:32:33.950+07:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-f9ee9007b2049d8b library=rocm variant="" compute=gfx1101 driver=0.0 name=1002:747e total="16.0 GiB" available="14.4 GiB"
time=2025-09-29T10:32:33.950+07:00 level=INFO source=routes.go:1426 msg="entering low vram mode" "total vram"="16.0 GiB" threshold="20.0 GiB"
[GIN] 2025/09/29 - 10:32:48 | 200 | 32.14µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/29 - 10:32:48 | 200 | 285.89µs | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/09/29 - 10:32:58 | 200 | 20.53µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/29 - 10:32:58 | 200 | 36.896425ms | 127.0.0.1 | POST "/api/show"
llama_model_loader: loaded meta data with 28 key-value pairs and 311 tensors from /home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 0.6B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 0.6B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: qwen3.block_count u32 = 28
llama_model_loader: - kv 7: qwen3.context_length u32 = 40960
llama_model_loader: - kv 8: qwen3.embedding_length u32 = 1024
llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 3072
llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 16
llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - kv 27: general.file_type u32 = 15
llama_model_loader: - type f32: 113 tensors
llama_model_loader: - type f16: 28 tensors
llama_model_loader: - type q4_K: 155 tensors
llama_model_loader: - type q6_K: 15 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 492.75 MiB (5.50 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 751.63 M
print_info: general.name = Qwen3 0.6B
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-09-29T10:32:58.869+07:00 level=INFO source=server.go:217 msg="enabling flash attention"
time=2025-09-29T10:32:58.870+07:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/bin/ollama runner --model /home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa --port 37235"
time=2025-09-29T10:32:58.870+07:00 level=INFO source=server.go:504 msg="system memory" total="62.7 GiB" free="54.0 GiB" free_swap="62.7 GiB"
time=2025-09-29T10:32:58.871+07:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa library=rocm parallel=1 required="2.1 GiB" gpus=1
time=2025-09-29T10:32:58.871+07:00 level=INFO source=server.go:544 msg=offload library=rocm layers.requested=-1 layers.model=29 layers.offload=29 layers.split=[29] memory.available="[14.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.1 GiB" memory.required.partial="2.1 GiB" memory.required.kv="896.4 MiB" memory.required.allocations="[2.1 GiB]" memory.weights.total="409.3 MiB" memory.weights.repeating="287.6 MiB" memory.weights.nonrepeating="121.7 MiB" memory.graph.full="298.8 MiB" memory.graph.partial="298.8 MiB"
time=2025-09-29T10:32:58.878+07:00 level=INFO source=runner.go:864 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-09-29T10:32:58.883+07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-09-29T10:32:58.883+07:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:37235"
time=2025-09-29T10:32:58.893+07:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:29[ID:GPU-f9ee9007b2049d8b Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
time=2025-09-29T10:32:58.893+07:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-09-29T10:32:58.893+07:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 28 key-value pairs and 311 tensors from /home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 0.6B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 0.6B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: qwen3.block_count u32 = 28
llama_model_loader: - kv 7: qwen3.context_length u32 = 40960
llama_model_loader: - kv 8: qwen3.embedding_length u32 = 1024
llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 3072
llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 16
llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - kv 27: general.file_type u32 = 15
llama_model_loader: - type f32: 113 tensors
llama_model_loader: - type f16: 28 tensors
llama_model_loader: - type q4_K: 155 tensors
llama_model_loader: - type q6_K: 15 tensors

@rick-github commented on GitHub (Sep 29, 2025):

operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-09-29T09:33:42.123+07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.SSE3=1 CPU.1.SSSE3=1 CPU.1.AVX=1 CPU.1.AVX2=1 CPU.1.F16C=1 CPU.1.FMA=1 CPU.1.BMI2=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)

ollama didn't load the ROCm driver from /usr/lib/ollama/libggml-hip.so. If you installed from the AUR, you also need to install/upgrade the ollama-rocm package. The double registration may indicate a compatibility issue with a previous version of ollama; it might be best to remove the ollama and ollama-rocm packages and re-install.
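A minimal sketch of that check: confirm the HIP library that ollama loads at startup is actually on disk. The helper name is hypothetical; /usr/lib/ollama is the packaged install path from this thread (official installs use /usr/local/lib/ollama), and the journalctl hint assumes a systemd service.

```shell
# Hypothetical helper: is the HIP (ROCm) backend library present where
# the packaged ollama looks for it?
check_hip_lib() {
    if [ -f "$1/libggml-hip.so" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Usage:
#   check_hip_lib /usr/lib/ollama
#
# With a systemd service, the backends actually loaded show up in the journal:
#   journalctl -u ollama --no-pager | grep load_backend
# A GPU-capable start prints a libggml-hip.so line alongside the CPU backend;
# the log above shows only libggml-cpu-haswell.so, hence CPU-only inference.
```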

@tobing commented on GitHub (Sep 29, 2025):

operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-09-29T09:33:42.123+07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.SSE3=1 CPU.1.SSSE3=1 CPU.1.AVX=1 CPU.1.AVX2=1 CPU.1.F16C=1 CPU.1.FMA=1 CPU.1.BMI2=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)

ollama didn't load the ROCm driver from /usr/lib/ollama/libggml-hip.so. If you installed from AUR, you also need to install/upgrade the ollama-rocm package. The double registration may indicate a compatibility issue with a previous version of ollama, it might be best to remove the ollama and ollama-rocm packages and re-install.

Please check the last part of my post. I downgraded to ollama 0.11.11 without any other changes.
The ollama package is installed automatically when I select ollama-rocm.

Everything is working properly now, as checked with the "ollama ps" command:
With ollama 0.12.3 I saw Processor 100% CPU.
With ollama 0.11.11 I saw Processor 100% GPU.
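That check can be scripted. A sketch of a hypothetical helper that pulls the PROCESSOR column out of an `ollama ps` data line; the column layout (NAME, ID, SIZE, PROCESSOR, UNTIL) is assumed from current ollama CLI output and may vary between versions, and mixed placements like "45%/55% CPU/GPU" would need a looser pattern.

```shell
# Hypothetical helper: extract "100% GPU" / "100% CPU" from one data line
# of `ollama ps` output.
processor_of() {
    printf '%s\n' "$1" | grep -oE '[0-9]+%[[:space:]]+(CPU|GPU)'
}

# Against a live server (skip the header line):
#   ollama ps | tail -n +2 | while read -r line; do processor_of "$line"; done
```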

@rick-github commented on GitHub (Sep 29, 2025):

With ollama 0.12.3 I saw Processor 100% CPU

Because the ROCm library wasn't loaded. When running 0.12.3, what's the output of ls -lR /usr/lib/ollama/?

@tobing commented on GitHub (Sep 29, 2025):

This is the output ollama 0.11.11

[myuser@cachyos-x8664 ~]$ ls -lR /usr/lib/ollama/
/usr/lib/ollama/:
total 709476
-rwxr-xr-x 1 root root 665840 Sep 16 02:15 libggml-base.so
-rwxr-xr-x 1 root root 780720 Sep 16 02:15 libggml-cpu-alderlake.so
-rwxr-xr-x 1 root root 784816 Sep 16 02:15 libggml-cpu-haswell.so
-rwxr-xr-x 1 root root 973232 Sep 16 02:15 libggml-cpu-icelake.so
-rwxr-xr-x 1 root root 715192 Sep 16 02:15 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 root root 977328 Sep 16 02:15 libggml-cpu-skylakex.so
-rwxr-xr-x 1 root root 571824 Sep 16 02:15 libggml-cpu-sse42.so
-rwxr-xr-x 1 root root 551344 Sep 16 02:15 libggml-cpu-x64.so
-rwxr-xr-x 1 root root 720468464 Sep 26 19:25 libggml-hip.so
drwxr-xr-x 1 root root 14 Sep 29 07:40 rocm

/usr/lib/ollama/rocm:
total 0
drwxr-xr-x 1 root root 0 Sep 26 19:25 rocblas

/usr/lib/ollama/rocm/rocblas:
total 0
[myuser@cachyos-x8664 ~]$

This is output of ollama 0.12.3

[myuser@cachyos-x8664 ~]$ ollama --version
ollama version is 0.12.3
[myuser@cachyos-x8664 ~]$ ls -lR /usr/lib/ollama/
/usr/lib/ollama/:
total 709928
-rwxr-xr-x 1 root root 686320 Sep 26 19:25 libggml-base.so
-rwxr-xr-x 1 root root 780720 Sep 26 19:25 libggml-cpu-alderlake.so
-rwxr-xr-x 1 root root 780720 Sep 26 19:25 libggml-cpu-haswell.so
-rwxr-xr-x 1 root root 944560 Sep 26 19:25 libggml-cpu-icelake.so
-rwxr-xr-x 1 root root 780728 Sep 26 19:25 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 root root 948656 Sep 26 19:25 libggml-cpu-skylakex.so
-rwxr-xr-x 1 root root 780720 Sep 26 19:25 libggml-cpu-sse42.so
-rwxr-xr-x 1 root root 780720 Sep 26 19:25 libggml-cpu-x64.so
-rwxr-xr-x 1 root root 720468464 Sep 26 19:25 libggml-hip.so
drwxr-xr-x 1 root root 14 Sep 29 07:40 rocm

/usr/lib/ollama/rocm:
total 0
drwxr-xr-x 1 root root 0 Sep 26 19:25 rocblas

/usr/lib/ollama/rocm/rocblas:
total 0
[myuser@cachyos-x8664 ~]$

@rick-github commented on GitHub (Sep 29, 2025):

/usr/lib/ollama/rocm is usually not empty, I am guessing that Arch has them in a separate package which may lead to compatibility issues. I suggest using the official ollama installation.

$ ls -l /usr/local/lib/ollama/rocm
total 1920988
lrwxrwxrwx 1 root root         25 Sep 26 05:55 libamd_comgr.so.2 -> libamd_comgr.so.2.8.60303
-rwxr-xr-x 1 root root  144125696 Feb 10  2025 libamd_comgr.so.2.8.60303
lrwxrwxrwx 1 root root         24 Sep 26 05:55 libamdhip64.so.6 -> libamdhip64.so.6.3.60303
-rwxr-xr-x 1 root root   22294280 Feb 10  2025 libamdhip64.so.6.3.60303
lrwxrwxrwx 1 root root         24 Sep 26 05:55 libdrm_amdgpu.so.1 -> libdrm_amdgpu.so.1.123.0
-rwxr-xr-x 1 root root      58200 Feb  7  2025 libdrm_amdgpu.so.1.123.0
lrwxrwxrwx 1 root root         17 Sep 26 05:55 libdrm.so.2 -> libdrm.so.2.123.0
-rwxr-xr-x 1 root root     106888 Feb  7  2025 libdrm.so.2.123.0
-rwxr-xr-x 1 root root     109000 Apr  6  2024 libelf-0.190.so
lrwxrwxrwx 1 root root         15 Sep 26 05:55 libelf.so.1 -> libelf-0.190.so
lrwxrwxrwx 1 root root         26 Sep 26 05:55 libhipblaslt.so.0 -> libhipblaslt.so.0.10.60303
-rwxr-xr-x 1 root root    7450504 Feb 11  2025 libhipblaslt.so.0.10.60303
lrwxrwxrwx 1 root root         23 Sep 26 05:55 libhipblas.so.2 -> libhipblas.so.2.3.60303
-rwxr-xr-x 1 root root    1052288 Feb 11  2025 libhipblas.so.2.3.60303
lrwxrwxrwx 1 root root         30 Sep 26 05:55 libhsa-runtime64.so.1 -> libhsa-runtime64.so.1.14.60303
-rwxr-xr-x 1 root root    3259872 Feb 10  2025 libhsa-runtime64.so.1.14.60303
lrwxrwxrwx 1 root root         16 Sep 26 05:55 libnuma.so.1 -> libnuma.so.1.0.0
-rwxr-xr-x 1 root root      51400 Apr  6  2024 libnuma.so.1.0.0
lrwxrwxrwx 1 root root         23 Sep 26 05:55 librocblas.so.4 -> librocblas.so.4.3.60303
-rwxr-xr-x 1 root root   74646880 Feb 11  2025 librocblas.so.4.3.60303
lrwxrwxrwx 1 root root         32 Sep 26 05:55 librocprofiler-register.so.0 -> librocprofiler-register.so.0.4.0
-rwxr-xr-x 1 root root     872192 Feb 10  2025 librocprofiler-register.so.0.4.0
lrwxrwxrwx 1 root root         25 Sep 26 05:55 librocsolver.so.0 -> librocsolver.so.0.3.60303
-rwxr-xr-x 1 root root 1713040960 Feb 11  2025 librocsolver.so.0.3.60303
drwxr-xr-x 3 root root       4096 Sep 26 05:55 rocblas
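A quick sanity check on that directory can be sketched as below. The helper name is hypothetical; the point is that an official install ships the shared libraries listed above under the rocm/ subdirectory, while the packaged install in this thread has it empty, which matches the missing GPU offload. The one-line installer is the one documented at ollama.com.

```shell
# Hypothetical helper: count bundled ROCm shared libraries in a directory.
rocm_lib_count() {
    find "$1" -maxdepth 1 -name 'lib*.so*' 2>/dev/null | wc -l
}

# An empty rocm/ dir means GPU offload cannot work with this packaging:
if [ "$(rocm_lib_count /usr/lib/ollama/rocm)" -eq 0 ]; then
    echo "no ROCm libraries bundled; the official installer would provide them:"
    echo "  curl -fsSL https://ollama.com/install.sh | sh"
fi
```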

@tobing commented on GitHub (Sep 29, 2025):

Reinstalled from the AUR and from the repo; the result is still the same. So I need to install the official Ollama manually.

```
[root@cachyos-x8664 myuser]# pacman -Qi ollama
Installed From  : cachyos-extra-v3
Name            : ollama
Version         : 0.12.3-1.1
Description     : Create, run and share large language models (LLMs)
Architecture    : x86_64_v3
URL             : https://github.com/ollama/ollama
Licenses        : MIT
Groups          : None
Provides        : None
Depends On      : gcc-libs  glibc
Optional Deps   : None
Required By     : ollama-rocm
Optional For    : None
Conflicts With  : None
Replaces        : None
Installed Size  : 37.19 MiB
Packager        : CachyOS <admin@cachyos.org>
Build Date      : Fri 26 Sep 2025 07:25:32 PM WIB
Install Date    : Mon 29 Sep 2025 08:04:41 PM WIB
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature

[root@cachyos-x8664 myuser]# ls -lR /usr/lib/ollama/
/usr/lib/ollama/:
total 709928
-rwxr-xr-x 1 root root    686320 Sep 26 19:25 libggml-base.so
-rwxr-xr-x 1 root root    780720 Sep 26 19:25 libggml-cpu-alderlake.so
-rwxr-xr-x 1 root root    780720 Sep 26 19:25 libggml-cpu-haswell.so
-rwxr-xr-x 1 root root    944560 Sep 26 19:25 libggml-cpu-icelake.so
-rwxr-xr-x 1 root root    780728 Sep 26 19:25 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 root root    948656 Sep 26 19:25 libggml-cpu-skylakex.so
-rwxr-xr-x 1 root root    780720 Sep 26 19:25 libggml-cpu-sse42.so
-rwxr-xr-x 1 root root    780720 Sep 26 19:25 libggml-cpu-x64.so
-rwxr-xr-x 1 root root 720468464 Sep 26 19:25 libggml-hip.so
drwxr-xr-x 1 root root        14 Sep 29 20:24 rocm

/usr/lib/ollama/rocm:
total 0
drwxr-xr-x 1 root root 0 Sep 26 19:25 rocblas

/usr/lib/ollama/rocm/rocblas:
total 0
[root@cachyos-x8664 myuser]# ls -l /usr/local/lib/ollama/rocm
ls: cannot access '/usr/local/lib/ollama/rocm': No such file or directory
[root@cachyos-x8664 myuser]#
```


@rick-github commented on GitHub (Sep 29, 2025):

> So I must install official ollama manually right

That's my [recommendation](https://ollama.com/download).
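The empty `rocm` directory in the listings above is the tell: with no ROCm runtime libraries next to `libggml-hip.so`, the HIP backend has nothing to load and inference falls back to the CPU. A minimal diagnostic sketch, assuming the directory layouts shown in this thread (`check_rocm_dir` is a hypothetical helper, not part of Ollama):

```shell
# Count ROCm shared libraries in each candidate directory; an empty
# directory means the HIP backend cannot initialize and ollama runs on CPU.
check_rocm_dir() {
  dir="$1"
  if [ -d "$dir" ]; then
    echo "$dir: $(find "$dir" -name 'lib*.so*' | wc -l | tr -d ' ') ROCm libraries"
  else
    echo "$dir: not present"
  fi
}

check_rocm_dir /usr/lib/ollama/rocm        # distro package location
check_rocm_dir /usr/local/lib/ollama/rocm  # official tarball location
```

A healthy install (as in the `ls -l /usr/local/lib/ollama/rocm` output earlier in the thread) shows `librocblas`, `libhipblas`, `libamdhip64`, and friends; zero libraries points at a packaging problem rather than a driver one.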

<!-- gh-comment-id:3347049131 --> @rick-github commented on GitHub (Sep 29, 2025): > So I must install official ollama manually right That's my [recommendation](https://ollama.com/download).
Author
Owner

@tobing commented on GitHub (Sep 29, 2025):

I just installed the official Ollama. It's working, processor shows 100% GPU.
Thanks


@cyrozap commented on GitHub (Oct 2, 2025):

~~Installation of the `ollama-rocm` package requires the installation of over 10 GB of files including the ROCm libraries. This is a needless waste of disk space on a system with only Nvidia GPUs. Is there any chance the ROCm dependency can be made optional again?~~

Apologies for the noise, I misdiagnosed the problem I was experiencing.


@rick-github commented on GitHub (Oct 2, 2025):

If Arch has a ROCm dependency, that's an Arch packaging issue. Ollama does not require installing ROCm libraries on a non-ROCm system.


@cyrozap commented on GitHub (Oct 2, 2025):

I'm very sorry, I misread some of the thread and misunderstood the issue being described here. And to clarify, on Arch the `ollama` and `ollama-cuda` packages do not depend on `ollama-rocm`, and installing `ollama-rocm` did not fix the issue I was seeing.

I did some more extensive testing and discovered that the issue I was experiencing *was* a packaging issue, but completely unrelated to ROCm. In case anyone is curious, the issue I was having was that `CMAKE_CUDA_ARCHITECTURES` was being set to a value that didn't include my GPU's architecture. Sorry for the noise!
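For anyone debugging a similar distro-build problem: `nvidia-smi` can report the GPU's compute capability (`nvidia-smi --query-gpu=compute_cap --format=csv,noheader`), and `CMAKE_CUDA_ARCHITECTURES` takes the same number with the dot removed (e.g. `8.6` becomes `86`). A purely illustrative helper, not taken from any build script here:

```shell
# Hypothetical: turn a compute capability string into the integer form
# that CMAKE_CUDA_ARCHITECTURES expects, e.g. cmake -DCMAKE_CUDA_ARCHITECTURES=86
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

cap_to_arch 8.6   # → 86 (Ampere, e.g. RTX 30-series)
```

If the value baked into a package's build doesn't include your GPU's architecture, the CUDA backend has no matching kernels and falls back to CPU, which looks exactly like the slowdown reported in this issue.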

Reference: github-starred/ollama#54767