[GH-ISSUE #12428] Models loading slow since 0.12 version #54767

Closed
opened 2026-04-29 07:15:20 -05:00 by GiteaMirror · 25 comments
Owner

Originally created by @deep1305 on GitHub (Sep 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12428

I want to raise an issue: since the 0.12 Ollama update, models take longer than expected to respond, even though no other processes are running on my device. Answering a query takes more than a minute, whether with a qwen3 model or deepseek-r1.

GiteaMirror added the bug label 2026-04-29 07:15:21 -05:00
Author
Owner

@jmorganca commented on GitHub (Sep 27, 2025):

Hi @deep1305, would it be possible to share which OS you are on, and also the [logs](https://docs.ollama.com/troubleshooting) if possible? Sorry about this.

Author
Owner

@deep1305 commented on GitHub (Sep 27, 2025):

Hi, I am running Ollama on Windows 11.

Below is the log:

time=2025-09-26T21:00:56.425-04:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:131072 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:true OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\smart\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-09-26T21:00:56.535-04:00 level=INFO source=images.go:518 msg="total blobs: 74"
time=2025-09-26T21:00:56.538-04:00 level=INFO source=images.go:525 msg="total unused blobs removed: 0"
time=2025-09-26T21:00:56.546-04:00 level=INFO source=routes.go:1528 msg="Listening on 127.0.0.1:11434 (version 0.12.2)"
time=2025-09-26T21:00:56.547-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-09-26T21:00:56.548-04:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-09-26T21:00:56.548-04:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-09-26T21:00:56.548-04:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=8 threads=20
time=2025-09-26T21:00:57.880-04:00 level=INFO source=gpu.go:311 msg="detected OS VRAM overhead" id=GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 library=cuda compute=8.6 driver=13.0 name="NVIDIA GeForce RTX 3050 Ti Laptop GPU" overhead="674.8 MiB"
time=2025-09-26T21:00:58.795-04:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 library=cuda variant=v13 compute=8.6 driver=13.0 name="NVIDIA GeForce RTX 3050 Ti Laptop GPU" total="4.0 GiB" available="3.2 GiB"
time=2025-09-26T21:00:58.795-04:00 level=INFO source=types.go:131 msg="inference compute" id=0 library=oneapi variant="" compute="" driver=0.0 name="\xc0" total="0 B" available="0 B"
time=2025-09-26T21:00:58.795-04:00 level=INFO source=routes.go:1569 msg="entering low vram mode" "total vram"="4.0 GiB" threshold="20.0 GiB"
[GIN] 2025/09/26 - 21:00:58 | 200 | 642.9µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/26 - 21:00:58 | 200 | 98.3295ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/09/26 - 21:07:07 | 200 | 2.7923ms | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/26 - 21:07:07 | 200 | 662.4702ms | 127.0.0.1 | POST "/api/show"
time=2025-09-26T21:07:09.875-04:00 level=INFO source=server.go:399 msg="starting runner" cmd="C:\Users\smart\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model C:\Users\smart\.ollama\models\blobs\sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --port 51082"
time=2025-09-26T21:07:09.907-04:00 level=INFO source=server.go:672 msg="loading model" "model layers"=49 requested=-1
time=2025-09-26T21:07:09.987-04:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
time=2025-09-26T21:07:09.989-04:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:51082"
time=2025-09-26T21:07:10.059-04:00 level=INFO source=server.go:678 msg="system memory" total="31.7 GiB" free="14.9 GiB" free_swap="27.0 GiB"
time=2025-09-26T21:07:10.059-04:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 available="2.7 GiB" free="3.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-09-26T21:07:10.059-04:00 level=INFO source=server.go:686 msg="gpu memory" id=0 available="0 B" free="0 B" minimum="0 B" overhead="0 B"
time=2025-09-26T21:07:10.073-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:49[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:10.146-04:00 level=INFO source=ggml.go:131 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=37
load_backend: loaded CPU backend from C:\Users\smart\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes, ID: GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34
load_backend: loaded CUDA backend from C:\Users\smart\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2025-09-26T21:07:10.298-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-09-26T21:07:10.650-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:10.909-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:5[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:5(43..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:11.160-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:4[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:4(44..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:11.414-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:11.669-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:11.907-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:12.167-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:12.425-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:15.220-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:5[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:5(43..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:19.513-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:4[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:4(44..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:24.438-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:28.963-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:34.114-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:39.762-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:46.714-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="8.3 GiB"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="8.5 GiB"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="4.0 GiB"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=backend.go:342 msg="total memory" size="20.9 GiB"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
time=2025-09-26T21:07:46.714-04:00 level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=ggml.go:498 msg="offloaded 0/49 layers to GPU"
time=2025-09-26T21:07:46.715-04:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-09-26T21:07:46.717-04:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-09-26T21:07:58.792-04:00 level=INFO source=server.go:1289 msg="llama runner started in 48.94 seconds"
[GIN] 2025/09/26 - 21:07:58 | 200 | 51.056361s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/09/26 - 21:09:16 | 200 | 21.0617641s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/09/26 - 21:09:22 | 200 | 5.0899348s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/09/26 - 21:19:19 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/26 - 21:19:19 | 200 | 124.9534ms | 127.0.0.1 | POST "/api/show"
time=2025-09-26T21:19:20.396-04:00 level=INFO source=server.go:399 msg="starting runner" cmd="C:\Users\smart\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf --port 52631"
time=2025-09-26T21:19:20.410-04:00 level=INFO source=server.go:672 msg="loading model" "model layers"=49 requested=-1
time=2025-09-26T21:19:20.492-04:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
time=2025-09-26T21:19:20.495-04:00 level=INFO source=server.go:678 msg="system memory" total="31.7 GiB" free="17.7 GiB" free_swap="25.5 GiB"
time=2025-09-26T21:19:20.495-04:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 available="2.5 GiB" free="3.0 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-09-26T21:19:20.495-04:00 level=INFO source=server.go:686 msg="gpu memory" id=0 available="0 B" free="0 B" minimum="0 B" overhead="0 B"
time=2025-09-26T21:19:20.496-04:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:52631"
time=2025-09-26T21:19:20.498-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:49[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:20.533-04:00 level=INFO source=ggml.go:131 msg="" architecture=qwen3moe file_type=Q4_K_M name="Qwen3 30B A3B Thinking 2507" description="" num_tensors=579 num_key_values=33
load_backend: loaded CPU backend from C:\Users\smart\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes, ID: GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34
load_backend: loaded CUDA backend from C:\Users\smart\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2025-09-26T21:19:21.514-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-09-26T21:19:21.651-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.702-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.757-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.812-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.861-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:21.920-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:27.515-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:33.756-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 8615100416
ggml_gallocr_reserve_n: failed to allocate CPU buffer of size 8615100416
time=2025-09-26T21:19:44.535-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:50.621-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:56.576-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="17.3 GiB"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="12.0 GiB"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="8.0 GiB"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:342 msg="total memory" size="37.3 GiB"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
time=2025-09-26T21:19:56.577-04:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-09-26T21:19:56.576-04:00 level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
time=2025-09-26T21:19:56.577-04:00 level=INFO source=ggml.go:498 msg="offloaded 0/49 layers to GPU"
time=2025-09-26T21:19:56.578-04:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-09-26T21:20:17.976-04:00 level=INFO source=server.go:1289 msg="llama runner started in 57.58 seconds"
[GIN] 2025/09/26 - 21:20:18 | 200 | 58.1164394s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/09/26 - 21:20:45 | 200 | 22.0714418s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/09/26 - 21:21:41 | 200 | 32.3475974s | 127.0.0.1 | POST "/api/chat"
time=2025-09-26T21:26:47.005-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0923822 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf
time=2025-09-26T21:26:47.253-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3420008 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf
time=2025-09-26T21:26:47.504-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5932162 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf
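
The log above points at the likely cause of the slowdown: with `OLLAMA_CONTEXT_LENGTH:131072`, the KV cache alone is reported at 12.0 GiB, pushing the total to 37.3 GiB, so all 49 layers land on the CPU instead of the 4 GiB GPU. A rough back-of-envelope check, assuming the KV cache scales roughly linearly with context length (an approximation, not Ollama's exact memory formula):

```python
# Figures taken from the "commit" stage of the qwen3moe load in the log above.
weights_gib = 17.3   # "model weights" size
graph_gib = 8.0      # "compute graph" size
base_ctx = 131072    # OLLAMA_CONTEXT_LENGTH from the server config
base_kv_gib = 12.0   # reported "kv cache" size at that context length

def kv_cache_gib(ctx_tokens: int) -> float:
    """Assumed linear scaling of KV-cache size with the context window."""
    return base_kv_gib * ctx_tokens / base_ctx

# Total at the configured 131072-token context matches the logged 37.3 GiB,
# far beyond the 4 GiB of VRAM, hence "offloaded 0/49 layers to GPU".
total = weights_gib + kv_cache_gib(base_ctx) + graph_gib
print(round(total, 1))        # 37.3

# Shrinking the window (e.g. OLLAMA_CONTEXT_LENGTH=8192) would cut the
# cache roughly 16x under this assumption:
print(kv_cache_gib(8192))     # 0.75
```

Under this assumption, a smaller context window (or enabling flash attention to reduce cache overhead) would let more layers fit in VRAM; whether that fully restores pre-0.12 load times would need testing.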

time=2025-09-26T21:19:21.920-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-09-26T21:19:27.515-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:3[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:3(45..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-09-26T21:19:33.756-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:2[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:2(46..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 8615100416 ggml_gallocr_reserve_n: failed to allocate CPU buffer of size 8615100416 time=2025-09-26T21:19:44.535-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:1[ID:GPU-57a2f29f-474d-2d57-7dcf-2edda631bd34 Layers:1(47..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-09-26T21:19:50.621-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-09-26T21:19:56.576-04:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 
UseMmap:false}" time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="17.3 GiB" time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="12.0 GiB" time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="8.0 GiB" time=2025-09-26T21:19:56.577-04:00 level=INFO source=backend.go:342 msg="total memory" size="37.3 GiB" time=2025-09-26T21:19:56.577-04:00 level=INFO source=sched.go:470 msg="loaded runners" count=1 time=2025-09-26T21:19:56.577-04:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" time=2025-09-26T21:19:56.576-04:00 level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU" time=2025-09-26T21:19:56.577-04:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU" time=2025-09-26T21:19:56.577-04:00 level=INFO source=ggml.go:498 msg="offloaded 0/49 layers to GPU" time=2025-09-26T21:19:56.578-04:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" time=2025-09-26T21:20:17.976-04:00 level=INFO source=server.go:1289 msg="llama runner started in 57.58 seconds" [GIN] 2025/09/26 - 21:20:18 | 200 | 58.1164394s | 127.0.0.1 | POST "/api/generate" [GIN] 2025/09/26 - 21:20:45 | 200 | 22.0714418s | 127.0.0.1 | POST "/api/chat" [GIN] 2025/09/26 - 21:21:41 | 200 | 32.3475974s | 127.0.0.1 | POST "/api/chat" time=2025-09-26T21:26:47.005-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0923822 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf time=2025-09-26T21:26:47.253-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3420008 runner.size="37.3 GiB" runner.vram="0 B" 
runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf time=2025-09-26T21:26:47.504-04:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5932162 runner.size="37.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=19228 runner.model=C:\Users\smart\.ollama\models\blobs\sha256-58574f2e94b99fb9e4391408b57e5aeaaaec10f6384e9a699fc2cb43a5c8eabf
Author
Owner

@asdnemasd commented on GitHub (Sep 27, 2025):

I'm experiencing the same issue. I think it has something to do with Ollama's new engine. With the Qwen3-Coder-30B-A3B model on Ollama v0.12.1, the model loads at around ~100 MB/s, but with v0.12.2, which switched the Qwen3 architecture to Ollama's new engine, the model loads at only ~30 MB/s. (In both cases the model loads from an HDD, and the model was added through a custom GGUF file.)
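As a rough way to separate raw disk throughput from engine overhead, a sketch like the following could time a plain sequential read of the blob file. This is just an illustrative helper, not part of ollama, and the path in the usage comment is hypothetical:

```python
import time

def read_throughput(path, chunk=8 * 1024 * 1024):
    """Return sequential read throughput of `path` in MB/s."""
    start = time.monotonic()
    total = 0
    with open(path, "rb") as f:
        # Read the file front to back in 8 MiB chunks, like a streaming load.
        while data := f.read(chunk):
            total += len(data)
    elapsed = max(time.monotonic() - start, 1e-9)
    return total / elapsed / 1e6

# Example (path is hypothetical):
# print(read_throughput(r"C:\Users\me\.ollama\models\blobs\sha256-..."))
```

If the raw read already tops out near ~30 MB/s, the HDD is the bottleneck; if it reads much faster than the model loads, the slowdown is in the loader.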

<!-- gh-comment-id:3341176964 -->
Author
Owner

@zxiaomzxm commented on GitHub (Sep 27, 2025):

same issue as here: https://github.com/ollama/ollama/issues/12407

<!-- gh-comment-id:3341190725 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

@deep1305 You are using a model with 8 GB of weights and a context length of 131072 on a GPU that has only 4 GB of VRAM, so the model will not fit on the GPU and will run on the CPU. Is your experience that CPU processing is slower in 0.12.* than in previous versions? Can you run `ollama run gemma3:12b --verbose hello` and post the output from 0.12.2 and the previous version of ollama?
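The memory math from the log above can be checked with some back-of-envelope arithmetic (the 4 GiB VRAM figure is an assumption for the RTX 3050 Ti Laptop GPU; the other sizes are taken straight from the log):

```python
# Sizes reported by the runner at the default 131072-token context length.
GIB = 1024 ** 3

weights = 8.3 * GIB        # "model weights" from the log
kv_cache = 8.5 * GIB       # "kv cache" from the log
compute_graph = 4.0 * GIB  # "compute graph" from the log
vram = 4.0 * GIB           # assumed VRAM of the RTX 3050 Ti Laptop GPU

total = weights + kv_cache + compute_graph
print(f"total: {total / GIB:.1f} GiB vs VRAM: {vram / GIB:.1f} GiB")
print("fits entirely on GPU:", total <= vram)
# total: 20.8 GiB vs VRAM: 4.0 GiB (the log rounds this up to 20.9 GiB)
```

Note how the KV cache alone already exceeds the VRAM, which is why reducing the context length (e.g. `OLLAMA_CONTEXT_LENGTH`) lets layers move back onto the GPU.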

<!-- gh-comment-id:3341507340 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

@asdnemasd This seems like a different problem: you are seeing slower load times while the OP has slower execution. Can you open a new issue, set `OLLAMA_DEBUG=1` and then post logs from 0.12.2 and whatever version of ollama you were running that loaded faster?
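For anyone capturing debug logs, the variable is set before restarting the server, roughly like this (generic examples; `OLLAMA_DEBUG` appears in the server config dump above):

```shell
# Linux / macOS: enable debug logging for one serve session
OLLAMA_DEBUG=1 ollama serve

# Windows PowerShell equivalent (then restart the Ollama app/service):
#   $env:OLLAMA_DEBUG = "1"
```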

<!-- gh-comment-id:3341510200 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

@zxiaomzxm Your problem doesn't appear to be the same.

<!-- gh-comment-id:3341512266 -->
Author
Owner

@asiyouil commented on GitHub (Sep 27, 2025):

I also hit the same problem. I think the reason is that if ollama detects that your VRAM (not including GPU shared memory) is less than the model's total memory, it enters low vram mode, and that mode runs the model on the CPU only. (Example: your VRAM is 4 GiB but the model's total memory is 20.9 GiB, so ollama enters low vram mode.)

<!-- gh-comment-id:3341717931 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

No, if ollama detects that you have low VRAM (less than 20GB), it changes the default context size for gpt-oss models.

<!-- gh-comment-id:3341728234 -->
Author
Owner

@asiyouil commented on GitHub (Sep 27, 2025):

No, if ollama detects that you have low VRAM (less than 20GB), it changes the default context size for gpt-oss models.

But when I run a local model whose total memory is more than my VRAM, ollama doesn't change the default context size and just enters low vram mode. You can also find a message about this in the log. Only when the model's total memory is less than VRAM can ollama use the GPU fully.

<!-- gh-comment-id:3341798441 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

If ollama detects that you have low VRAM (less than 20GB), it changes the default context size **for gpt-oss models**.

<!-- gh-comment-id:3341815575 -->
Author
Owner

@deep1305 commented on GitHub (Sep 27, 2025):

@rick-github It was working perfectly fine, with much faster inference, before I updated to 0.12.*.

<!-- gh-comment-id:3341886289 -->
Author
Owner

@rick-github commented on GitHub (Sep 27, 2025):

@deep1305 Can you run `ollama run gemma3:12b --verbose hello` and post the output from 0.12.2 and the previous version of ollama?

<!-- gh-comment-id:3341899985 -->
Author
Owner

@tobing commented on GitHub (Sep 29, 2025):

I seem to have the same issue with an AMD GPU. I am using ollama 0.12.3 on CachyOS with an AMD 7800 XT.
I am just running the small model qwen3:0.6b, but it is quite slow. I remember that with ollama 0.11 I could run deepseek-r1-8b smoothly.

[myuser@cachyos-x8664 ~]$ ollama serve
time=2025-09-29T09:31:43.238+07:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16392 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/myuser/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-09-29T09:31:43.238+07:00 level=INFO source=images.go:518 msg="total blobs: 0"
time=2025-09-29T09:31:43.238+07:00 level=INFO source=images.go:525 msg="total unused blobs removed: 0"
time=2025-09-29T09:31:43.238+07:00 level=INFO source=routes.go:1528 msg="Listening on 127.0.0.1:11434 (version 0.12.3)"
time=2025-09-29T09:31:43.239+07:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-09-29T09:31:43.260+07:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/download/linux-drivers.html" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2025-09-29T09:31:43.262+07:00 level=INFO source=amd_linux.go:390 msg="amdgpu is supported" gpu=GPU-f9ee9007b2049d8b gpu_type=gfx1101
time=2025-09-29T09:31:43.262+07:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-f9ee9007b2049d8b library=rocm variant="" compute=gfx1101 driver=0.0 name=1002:747e total="16.0 GiB" available="14.6 GiB"
time=2025-09-29T09:31:43.262+07:00 level=INFO source=routes.go:1569 msg="entering low vram mode" "total vram"="16.0 GiB" threshold="20.0 GiB"
time=2025-09-29T09:33:42.060+07:00 level=INFO source=server.go:217 msg="enabling flash attention"
time=2025-09-29T09:33:42.060+07:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa --port 40549"
time=2025-09-29T09:33:42.061+07:00 level=INFO source=server.go:672 msg="loading model" "model layers"=29 requested=-1
time=2025-09-29T09:33:42.061+07:00 level=INFO source=server.go:678 msg="system memory" total="62.7 GiB" free="53.8 GiB" free_swap="62.7 GiB"
time=2025-09-29T09:33:42.061+07:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-f9ee9007b2049d8b available="14.0 GiB" free="14.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-09-29T09:33:42.068+07:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
time=2025-09-29T09:33:42.068+07:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:40549"
time=2025-09-29T09:33:42.072+07:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:29[ID:GPU-f9ee9007b2049d8b Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-29T09:33:42.092+07:00 level=INFO source=ggml.go:131 msg="" architecture=qwen3 file_type=Q4_K_M name="Qwen3 0.6B" description="" num_tensors=311 num_key_values=29
operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-09-29T09:33:42.123+07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.SSE3=1 CPU.1.SSSE3=1 CPU.1.AVX=1 CPU.1.AVX2=1 CPU.1.F16C=1 CPU.1.FMA=1 CPU.1.BMI2=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-09-29T09:33:42.126+07:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-29T09:33:42.149+07:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=ggml.go:498 msg="offloaded 0/29 layers to GPU"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="492.8 MiB"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="966.9 MiB"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="24.0 MiB"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=backend.go:342 msg="total memory" size="1.4 GiB"
time=2025-09-29T09:33:42.269+07:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
time=2025-09-29T09:33:42.269+07:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-09-29T09:33:42.270+07:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-09-29T09:33:42.523+07:00 level=INFO source=server.go:1289 msg="llama runner started in 0.46 seconds"
[GIN] 2025/09/29 - 09:33:42 | 200 | 574.588692ms | 127.0.0.1 | POST "/api/generate"

No issue with ollama 0.11.11

[myuser@cachyos-x8664 ~]$ ollama serve
time=2025-09-29T10:32:33.925+07:00 level=INFO source=routes.go:1332 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:16392 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/myuser/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-09-29T10:32:33.926+07:00 level=INFO source=images.go:477 msg="total blobs: 5"
time=2025-09-29T10:32:33.926+07:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-09-29T10:32:33.926+07:00 level=INFO source=routes.go:1385 msg="Listening on 127.0.0.1:11434 (version 0.11.11)"
time=2025-09-29T10:32:33.926+07:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-09-29T10:32:33.949+07:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/download/linux-drivers.html" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2025-09-29T10:32:33.950+07:00 level=INFO source=amd_linux.go:390 msg="amdgpu is supported" gpu=GPU-f9ee9007b2049d8b gpu_type=gfx1101
time=2025-09-29T10:32:33.950+07:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-f9ee9007b2049d8b library=rocm variant="" compute=gfx1101 driver=0.0 name=1002:747e total="16.0 GiB" available="14.4 GiB"
time=2025-09-29T10:32:33.950+07:00 level=INFO source=routes.go:1426 msg="entering low vram mode" "total vram"="16.0 GiB" threshold="20.0 GiB"
[GIN] 2025/09/29 - 10:32:48 | 200 | 32.14µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/29 - 10:32:48 | 200 | 285.89µs | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/09/29 - 10:32:58 | 200 | 20.53µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/09/29 - 10:32:58 | 200 | 36.896425ms | 127.0.0.1 | POST "/api/show"
llama_model_loader: loaded meta data with 28 key-value pairs and 311 tensors from /home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 0.6B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 0.6B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: qwen3.block_count u32 = 28
llama_model_loader: - kv 7: qwen3.context_length u32 = 40960
llama_model_loader: - kv 8: qwen3.embedding_length u32 = 1024
llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 3072
llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 16
llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - kv 27: general.file_type u32 = 15
llama_model_loader: - type f32: 113 tensors
llama_model_loader: - type f16: 28 tensors
llama_model_loader: - type q4_K: 155 tensors
llama_model_loader: - type q6_K: 15 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 492.75 MiB (5.50 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 751.63 M
print_info: general.name = Qwen3 0.6B
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-09-29T10:32:58.869+07:00 level=INFO source=server.go:217 msg="enabling flash attention"
time=2025-09-29T10:32:58.870+07:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/bin/ollama runner --model /home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa --port 37235"
time=2025-09-29T10:32:58.870+07:00 level=INFO source=server.go:504 msg="system memory" total="62.7 GiB" free="54.0 GiB" free_swap="62.7 GiB"
time=2025-09-29T10:32:58.871+07:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa library=rocm parallel=1 required="2.1 GiB" gpus=1
time=2025-09-29T10:32:58.871+07:00 level=INFO source=server.go:544 msg=offload library=rocm layers.requested=-1 layers.model=29 layers.offload=29 layers.split=[29] memory.available="[14.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.1 GiB" memory.required.partial="2.1 GiB" memory.required.kv="896.4 MiB" memory.required.allocations="[2.1 GiB]" memory.weights.total="409.3 MiB" memory.weights.repeating="287.6 MiB" memory.weights.nonrepeating="121.7 MiB" memory.graph.full="298.8 MiB" memory.graph.partial="298.8 MiB"
time=2025-09-29T10:32:58.878+07:00 level=INFO source=runner.go:864 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-09-29T10:32:58.883+07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-09-29T10:32:58.883+07:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:37235"
time=2025-09-29T10:32:58.893+07:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16392 KvCacheType:q8_0 NumThreads:8 GPULayers:29[ID:GPU-f9ee9007b2049d8b Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
time=2025-09-29T10:32:58.893+07:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-09-29T10:32:58.893+07:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 28 key-value pairs and 311 tensors from /home/myuser/.ollama/models/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 0.6B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 0.6B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: qwen3.block_count u32 = 28
llama_model_loader: - kv 7: qwen3.context_length u32 = 40960
llama_model_loader: - kv 8: qwen3.embedding_length u32 = 1024
llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 3072
llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 16
llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - kv 27: general.file_type u32 = 15
llama_model_loader: - type f32: 113 tensors
llama_model_loader: - type f16: 28 tensors
llama_model_loader: - type q4_K: 155 tensors
llama_model_loader: - type q6_K: 15 tensors

@rick-github commented on GitHub (Sep 29, 2025):

operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-09-29T09:33:42.123+07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.SSE3=1 CPU.1.SSSE3=1 CPU.1.AVX=1 CPU.1.AVX2=1 CPU.1.F16C=1 CPU.1.FMA=1 CPU.1.BMI2=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)

ollama didn't load the ROCm driver from /usr/lib/ollama/libggml-hip.so. If you installed from the AUR, you also need to install/upgrade the ollama-rocm package. The double registration may indicate a compatibility issue with a previous version of ollama; it might be best to remove the ollama and ollama-rocm packages and re-install.
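A minimal sketch of that check: confirm the HIP library that ollama loads at startup is actually on disk. The helper name is hypothetical; /usr/lib/ollama is the packaged install path from this thread (official installs use /usr/local/lib/ollama), and the journalctl hint assumes a systemd service.

```shell
# Hypothetical helper: is the HIP (ROCm) backend library present where
# the packaged ollama looks for it?
check_hip_lib() {
    if [ -f "$1/libggml-hip.so" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Usage:
#   check_hip_lib /usr/lib/ollama
#
# With a systemd service, the backends actually loaded show up in the journal:
#   journalctl -u ollama --no-pager | grep load_backend
# A GPU-capable start prints a libggml-hip.so line alongside the CPU backend;
# the log above shows only libggml-cpu-haswell.so, hence CPU-only inference.
```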

@tobing commented on GitHub (Sep 29, 2025):

operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
operator() double registration of ggml_uncaught_exception
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-09-29T09:33:42.123+07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.SSE3=1 CPU.1.SSSE3=1 CPU.1.AVX=1 CPU.1.AVX2=1 CPU.1.F16C=1 CPU.1.FMA=1 CPU.1.BMI2=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)

ollama didn't load the ROCm driver from /usr/lib/ollama/libggml-hip.so. If you installed from AUR, you also need to install/upgrade the ollama-rocm package. The double registration may indicate a compatibility issue with a previous version of ollama, it might be best to remove the ollama and ollama-rocm packages and re-install.

Please check the last part of my post. I downgraded to ollama 0.11.11 without any other changes.
The ollama package is installed automatically when I select ollama-rocm.

Everything is working properly now, as checked with the "ollama ps" command:
With ollama 0.12.3 I saw Processor 100% CPU.
With ollama 0.11.11 I saw Processor 100% GPU.
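That check can be scripted. A sketch of a hypothetical helper that pulls the PROCESSOR column out of an `ollama ps` data line; the column layout (NAME, ID, SIZE, PROCESSOR, UNTIL) is assumed from current ollama CLI output and may vary between versions, and mixed placements like "45%/55% CPU/GPU" would need a looser pattern.

```shell
# Hypothetical helper: extract "100% GPU" / "100% CPU" from one data line
# of `ollama ps` output.
processor_of() {
    printf '%s\n' "$1" | grep -oE '[0-9]+%[[:space:]]+(CPU|GPU)'
}

# Against a live server (skip the header line):
#   ollama ps | tail -n +2 | while read -r line; do processor_of "$line"; done
```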

@rick-github commented on GitHub (Sep 29, 2025):

With ollama 0.12.3 I saw Processor 100% CPU

Because the ROCm library wasn't loaded. When running 0.12.3, what's the output of ls -lR /usr/lib/ollama/?

@tobing commented on GitHub (Sep 29, 2025):

This is the output ollama 0.11.11

[myuser@cachyos-x8664 ~]$ ls -lR /usr/lib/ollama/
/usr/lib/ollama/:
total 709476
-rwxr-xr-x 1 root root 665840 Sep 16 02:15 libggml-base.so
-rwxr-xr-x 1 root root 780720 Sep 16 02:15 libggml-cpu-alderlake.so
-rwxr-xr-x 1 root root 784816 Sep 16 02:15 libggml-cpu-haswell.so
-rwxr-xr-x 1 root root 973232 Sep 16 02:15 libggml-cpu-icelake.so
-rwxr-xr-x 1 root root 715192 Sep 16 02:15 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 root root 977328 Sep 16 02:15 libggml-cpu-skylakex.so
-rwxr-xr-x 1 root root 571824 Sep 16 02:15 libggml-cpu-sse42.so
-rwxr-xr-x 1 root root 551344 Sep 16 02:15 libggml-cpu-x64.so
-rwxr-xr-x 1 root root 720468464 Sep 26 19:25 libggml-hip.so
drwxr-xr-x 1 root root 14 Sep 29 07:40 rocm

/usr/lib/ollama/rocm:
total 0
drwxr-xr-x 1 root root 0 Sep 26 19:25 rocblas

/usr/lib/ollama/rocm/rocblas:
total 0
[myuser@cachyos-x8664 ~]$

This is output of ollama 0.12.3

[myuser@cachyos-x8664 ~]$ ollama --version
ollama version is 0.12.3
[myuser@cachyos-x8664 ~]$ ls -lR /usr/lib/ollama/
/usr/lib/ollama/:
total 709928
-rwxr-xr-x 1 root root 686320 Sep 26 19:25 libggml-base.so
-rwxr-xr-x 1 root root 780720 Sep 26 19:25 libggml-cpu-alderlake.so
-rwxr-xr-x 1 root root 780720 Sep 26 19:25 libggml-cpu-haswell.so
-rwxr-xr-x 1 root root 944560 Sep 26 19:25 libggml-cpu-icelake.so
-rwxr-xr-x 1 root root 780728 Sep 26 19:25 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 root root 948656 Sep 26 19:25 libggml-cpu-skylakex.so
-rwxr-xr-x 1 root root 780720 Sep 26 19:25 libggml-cpu-sse42.so
-rwxr-xr-x 1 root root 780720 Sep 26 19:25 libggml-cpu-x64.so
-rwxr-xr-x 1 root root 720468464 Sep 26 19:25 libggml-hip.so
drwxr-xr-x 1 root root 14 Sep 29 07:40 rocm

/usr/lib/ollama/rocm:
total 0
drwxr-xr-x 1 root root 0 Sep 26 19:25 rocblas

/usr/lib/ollama/rocm/rocblas:
total 0
[myuser@cachyos-x8664 ~]$

@rick-github commented on GitHub (Sep 29, 2025):

/usr/lib/ollama/rocm is usually not empty, I am guessing that Arch has them in a separate package which may lead to compatibility issues. I suggest using the official ollama installation.

$ ls -l /usr/local/lib/ollama/rocm
total 1920988
lrwxrwxrwx 1 root root         25 Sep 26 05:55 libamd_comgr.so.2 -> libamd_comgr.so.2.8.60303
-rwxr-xr-x 1 root root  144125696 Feb 10  2025 libamd_comgr.so.2.8.60303
lrwxrwxrwx 1 root root         24 Sep 26 05:55 libamdhip64.so.6 -> libamdhip64.so.6.3.60303
-rwxr-xr-x 1 root root   22294280 Feb 10  2025 libamdhip64.so.6.3.60303
lrwxrwxrwx 1 root root         24 Sep 26 05:55 libdrm_amdgpu.so.1 -> libdrm_amdgpu.so.1.123.0
-rwxr-xr-x 1 root root      58200 Feb  7  2025 libdrm_amdgpu.so.1.123.0
lrwxrwxrwx 1 root root         17 Sep 26 05:55 libdrm.so.2 -> libdrm.so.2.123.0
-rwxr-xr-x 1 root root     106888 Feb  7  2025 libdrm.so.2.123.0
-rwxr-xr-x 1 root root     109000 Apr  6  2024 libelf-0.190.so
lrwxrwxrwx 1 root root         15 Sep 26 05:55 libelf.so.1 -> libelf-0.190.so
lrwxrwxrwx 1 root root         26 Sep 26 05:55 libhipblaslt.so.0 -> libhipblaslt.so.0.10.60303
-rwxr-xr-x 1 root root    7450504 Feb 11  2025 libhipblaslt.so.0.10.60303
lrwxrwxrwx 1 root root         23 Sep 26 05:55 libhipblas.so.2 -> libhipblas.so.2.3.60303
-rwxr-xr-x 1 root root    1052288 Feb 11  2025 libhipblas.so.2.3.60303
lrwxrwxrwx 1 root root         30 Sep 26 05:55 libhsa-runtime64.so.1 -> libhsa-runtime64.so.1.14.60303
-rwxr-xr-x 1 root root    3259872 Feb 10  2025 libhsa-runtime64.so.1.14.60303
lrwxrwxrwx 1 root root         16 Sep 26 05:55 libnuma.so.1 -> libnuma.so.1.0.0
-rwxr-xr-x 1 root root      51400 Apr  6  2024 libnuma.so.1.0.0
lrwxrwxrwx 1 root root         23 Sep 26 05:55 librocblas.so.4 -> librocblas.so.4.3.60303
-rwxr-xr-x 1 root root   74646880 Feb 11  2025 librocblas.so.4.3.60303
lrwxrwxrwx 1 root root         32 Sep 26 05:55 librocprofiler-register.so.0 -> librocprofiler-register.so.0.4.0
-rwxr-xr-x 1 root root     872192 Feb 10  2025 librocprofiler-register.so.0.4.0
lrwxrwxrwx 1 root root         25 Sep 26 05:55 librocsolver.so.0 -> librocsolver.so.0.3.60303
-rwxr-xr-x 1 root root 1713040960 Feb 11  2025 librocsolver.so.0.3.60303
drwxr-xr-x 3 root root       4096 Sep 26 05:55 rocblas
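A quick sanity check on that directory can be sketched as below. The helper name is hypothetical; the point is that an official install ships the shared libraries listed above under the rocm/ subdirectory, while the packaged install in this thread has it empty, which matches the missing GPU offload. The one-line installer is the one documented at ollama.com.

```shell
# Hypothetical helper: count bundled ROCm shared libraries in a directory.
rocm_lib_count() {
    find "$1" -maxdepth 1 -name 'lib*.so*' 2>/dev/null | wc -l
}

# An empty rocm/ dir means GPU offload cannot work with this packaging:
if [ "$(rocm_lib_count /usr/lib/ollama/rocm)" -eq 0 ]; then
    echo "no ROCm libraries bundled; the official installer would provide them:"
    echo "  curl -fsSL https://ollama.com/install.sh | sh"
fi
```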

@tobing commented on GitHub (Sep 29, 2025):

Reinstalled from the AUR and from the repo; the result is still the same. So I need to install the official Ollama manually.

```
[root@cachyos-x8664 myuser]# pacman -Qi ollama
Installed From  : cachyos-extra-v3
Name            : ollama
Version         : 0.12.3-1.1
Description     : Create, run and share large language models (LLMs)
Architecture    : x86_64_v3
URL             : https://github.com/ollama/ollama
Licenses        : MIT
Groups          : None
Provides        : None
Depends On      : gcc-libs  glibc
Optional Deps   : None
Required By     : ollama-rocm
Optional For    : None
Conflicts With  : None
Replaces        : None
Installed Size  : 37.19 MiB
Packager        : CachyOS <admin@cachyos.org>
Build Date      : Fri 26 Sep 2025 07:25:32 PM WIB
Install Date    : Mon 29 Sep 2025 08:04:41 PM WIB
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature

[root@cachyos-x8664 myuser]# ls -lR /usr/lib/ollama/
/usr/lib/ollama/:
total 709928
-rwxr-xr-x 1 root root    686320 Sep 26 19:25 libggml-base.so
-rwxr-xr-x 1 root root    780720 Sep 26 19:25 libggml-cpu-alderlake.so
-rwxr-xr-x 1 root root    780720 Sep 26 19:25 libggml-cpu-haswell.so
-rwxr-xr-x 1 root root    944560 Sep 26 19:25 libggml-cpu-icelake.so
-rwxr-xr-x 1 root root    780728 Sep 26 19:25 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 root root    948656 Sep 26 19:25 libggml-cpu-skylakex.so
-rwxr-xr-x 1 root root    780720 Sep 26 19:25 libggml-cpu-sse42.so
-rwxr-xr-x 1 root root    780720 Sep 26 19:25 libggml-cpu-x64.so
-rwxr-xr-x 1 root root 720468464 Sep 26 19:25 libggml-hip.so
drwxr-xr-x 1 root root        14 Sep 29 20:24 rocm

/usr/lib/ollama/rocm:
total 0
drwxr-xr-x 1 root root 0 Sep 26 19:25 rocblas

/usr/lib/ollama/rocm/rocblas:
total 0
[root@cachyos-x8664 myuser]# ls -l /usr/local/lib/ollama/rocm
ls: cannot access '/usr/local/lib/ollama/rocm': No such file or directory
[root@cachyos-x8664 myuser]#
```


@rick-github commented on GitHub (Sep 29, 2025):

> So I must install official ollama manually right

That's my [recommendation](https://ollama.com/download).
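The empty `rocm` directory in the listings above is the tell: with no ROCm runtime libraries next to `libggml-hip.so`, the HIP backend has nothing to load and inference falls back to the CPU. A minimal diagnostic sketch, assuming the directory layouts shown in this thread (`check_rocm_dir` is a hypothetical helper, not part of Ollama):

```shell
# Count ROCm shared libraries in each candidate directory; an empty
# directory means the HIP backend cannot initialize and ollama runs on CPU.
check_rocm_dir() {
  dir="$1"
  if [ -d "$dir" ]; then
    echo "$dir: $(find "$dir" -name 'lib*.so*' | wc -l | tr -d ' ') ROCm libraries"
  else
    echo "$dir: not present"
  fi
}

check_rocm_dir /usr/lib/ollama/rocm        # distro package location
check_rocm_dir /usr/local/lib/ollama/rocm  # official tarball location
```

A healthy install (as in the `ls -l /usr/local/lib/ollama/rocm` output earlier in the thread) shows `librocblas`, `libhipblas`, `libamdhip64`, and friends; zero libraries points at a packaging problem rather than a driver one.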

<!-- gh-comment-id:3347049131 --> @rick-github commented on GitHub (Sep 29, 2025): > So I must install official ollama manually right That's my [recommendation](https://ollama.com/download).
Author
Owner

@tobing commented on GitHub (Sep 29, 2025):

I just installed the official Ollama. It's working, processor shows 100% GPU.
Thanks


@cyrozap commented on GitHub (Oct 2, 2025):

~~Installation of the `ollama-rocm` package requires the installation of over 10 GB of files including the ROCm libraries. This is a needless waste of disk space on a system with only Nvidia GPUs. Is there any chance the ROCm dependency can be made optional again?~~

Apologies for the noise, I misdiagnosed the problem I was experiencing.


@rick-github commented on GitHub (Oct 2, 2025):

If Arch has a ROCm dependency, that's an Arch packaging issue. Ollama does not require installing ROCm libraries on a non-ROCm system.


@cyrozap commented on GitHub (Oct 2, 2025):

I'm very sorry, I misread some of the thread and misunderstood the issue being described here. And to clarify, on Arch the `ollama` and `ollama-cuda` packages do not depend on `ollama-rocm`, and installing `ollama-rocm` did not fix the issue I was seeing.

I did some more extensive testing and discovered that the issue I was experiencing *was* a packaging issue, but completely unrelated to ROCm. In case anyone is curious, the issue I was having was that `CMAKE_CUDA_ARCHITECTURES` was being set to a value that didn't include my GPU's architecture. Sorry for the noise!
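For anyone debugging a similar distro-build problem: `nvidia-smi` can report the GPU's compute capability (`nvidia-smi --query-gpu=compute_cap --format=csv,noheader`), and `CMAKE_CUDA_ARCHITECTURES` takes the same number with the dot removed (e.g. `8.6` becomes `86`). A purely illustrative helper, not taken from any build script here:

```shell
# Hypothetical: turn a compute capability string into the integer form
# that CMAKE_CUDA_ARCHITECTURES expects, e.g. cmake -DCMAKE_CUDA_ARCHITECTURES=86
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

cap_to_arch 8.6   # → 86 (Ampere, e.g. RTX 30-series)
```

If the value baked into a package's build doesn't include your GPU's architecture, the CUDA backend has no matching kernels and falls back to CPU, which looks exactly like the slowdown reported in this issue.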

Reference: github-starred/ollama#54767