[GH-ISSUE #13541] Model load is very slow on Vulkan #70979

Open
opened 2026-05-04 23:37:27 -05:00 by GiteaMirror · 2 comments

Originally created by @0x7CFE on GitHub (Dec 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13541

What is the issue?

I am using a TB4 eGPU enclosure with an AMD Radeon RX 5700 XT (8 GB VRAM). The setup works fine with ROCm, but when I switch to Vulkan, models take forever to load (roughly 30x slower).

That may be a fluke, but changing the iGPU VRAM allocation in the BIOS from 512 MB to 16 GB speeds up the process a bit. Either way, it definitely helps with system UI lag: at 512 MB everything is glitchy, the cursor moves slowly, and the terminal redraws at around 2 FPS.

CPU: AMD Ryzen AI 9 HX 370 w/ Radeon 890M
Kernel: 6.18.1-061801-generic

[ollama-slow-load.log.txt](https://github.com/user-attachments/files/24280458/ollama-slow-load.log.txt)

Relevant log output

```shell
дек 21 23:17:26 fw13 ollama[7179]: time=2025-12-21T23:17:26.787+05:00 level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
дек 21 23:17:27 fw13 ollama[7179]: time=2025-12-21T23:17:27.038+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.02"
дек 21 23:17:27 fw13 ollama[7179]: time=2025-12-21T23:17:27.289+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.04"
дек 21 23:17:27 fw13 ollama[7179]: time=2025-12-21T23:17:27.540+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.06"
дек 21 23:17:27 fw13 ollama[7179]: time=2025-12-21T23:17:27.791+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.08"
дек 21 23:17:28 fw13 ollama[7179]: time=2025-12-21T23:17:28.042+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.99"
...
дек 21 23:18:03 fw13 ollama[7179]: time=2025-12-21T23:18:03.257+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.99"
дек 21 23:18:03 fw13 ollama[7179]: time=2025-12-21T23:18:03.508+05:00 level=DEBUG source=server.go:1382 msg="model load progress 1.00"
дек 21 23:18:03 fw13 ollama[7179]: time=2025-12-21T23:18:03.759+05:00 level=DEBUG source=server.go:1382 msg="model load progress 1.00"
дек 21 23:18:03 fw13 ollama[7179]: time=2025-12-21T23:18:03.884+05:00 level=DEBUG source=ggml.go:282 msg="key with type not found" key=qwen3moe.pooling_type default=0
дек 21 23:18:03 fw13 ollama[7179]: time=2025-12-21T23:18:03.884+05:00 level=TRACE source=runner.go:479 msg="forwardBatch no pending batch detected" batchID=0
дек 21 23:18:04 fw13 ollama[7179]: time=2025-12-21T23:18:04.010+05:00 level=INFO source=server.go:1376 msg="llama runner started in 37.54 seconds"
```
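
A quick way to reproduce the measurement without scraping DEBUG logs: sending an empty prompt to `/api/generate` makes Ollama load the model without generating anything, and the response includes a `load_duration` field in nanoseconds. A minimal sketch; the model name here is an assumption, so substitute whichever model shows the slow load:

```shell
# Unload first so the next request is a cold load (model name is an assumption)
ollama stop qwen3:30b

# Empty prompt = load only; the response reports load_duration in nanoseconds
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen3:30b", "prompt": "", "stream": false}' \
  | grep -o '"load_duration":[0-9]*'
```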

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.13.5

GiteaMirror added the bug label 2026-05-04 23:37:27 -05:00

@0x7CFE commented on GitHub (Dec 22, 2025):

I've just performed the same test on the iGPU with the eGPU disconnected, and the results were basically the same (slow load). So it probably has nothing to do with the eGPU at all.

```shell
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.259+05:00 level=DEBUG source=server.go:965 msg="available gpu" id=00000000-c100-0000-0000-000000000000 library=Vulkan "available layer vram"="36.9 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="88.0 MiB"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=DEBUG source=server.go:782 msg="new layout created" layers="49[ID:00000000-c100-0000-0000-000000000000 Layers:49(0..48)]"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:12 GPULayers:49[ID:00000000-c100-0000-0000-000000000000 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=ggml.go:482 msg="offloading 48 repeating layers to GPU"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=ggml.go:494 msg="offloaded 49/49 layers to GPU"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=device.go:240 msg="model weights" device=Vulkan0 size="17.1 GiB"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="166.9 MiB"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=device.go:251 msg="kv cache" device=Vulkan0 size="384.0 MiB"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=device.go:262 msg="compute graph" device=Vulkan0 size="88.0 MiB"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="4.0 MiB"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=device.go:272 msg="total memory" size="17.7 GiB"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=sched.go:517 msg="loaded runners" count=1
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.260+05:00 level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.261+05:00 level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.512+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.01"
дек 22 16:35:08 fw13 ollama[114248]: time=2025-12-22T16:35:08.763+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.02"
дек 22 16:35:09 fw13 ollama[114248]: time=2025-12-22T16:35:09.015+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.03"
дек 22 16:35:09 fw13 ollama[114248]: time=2025-12-22T16:35:09.266+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.04"
дек 22 16:35:09 fw13 ollama[114248]: time=2025-12-22T16:35:09.517+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.05"
дек 22 16:35:09 fw13 ollama[114248]: time=2025-12-22T16:35:09.768+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.06"
дек 22 16:35:10 fw13 ollama[114248]: time=2025-12-22T16:35:10.019+05:00 level=DEBUG source=server.go:1382 msg="model load progress 0.07"
```
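
One detail worth checking from the log above: Vulkan reports "available layer vram"="36.9 GiB" for the iGPU, which on an APU is mostly GTT (system RAM) rather than dedicated VRAM, so weight uploads may be going through host-visible memory. A hedged diagnostic sketch using the standard `vulkaninfo` tool (the heap flag names come from the Vulkan spec, not from Ollama):

```shell
# List the Vulkan memory heaps and their flags. Heaps without
# MEMORY_HEAP_DEVICE_LOCAL_BIT are host memory; weights placed there
# take a very different upload path than a ROCm VRAM allocation.
vulkaninfo | grep -A4 'memoryHeaps'
```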

@0x7CFE commented on GitHub (Dec 26, 2025):

Another data point: even when running Qwen 8B, which definitely fits in the eGPU's 8 GB of VRAM, the load is still slow. LACT shows the eGPU VRAM being occupied almost instantly, whereas the full load takes around a minute.

[ollama-slow-load-qwen8b.log.txt](https://github.com/user-attachments/files/24350392/ollama-slow-load-qwen8b.log.txt)
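
The instant-allocation-but-slow-copy pattern points at the transfer path rather than disk, but disk I/O is easy to rule out. A rough sketch, assuming a default user install (a systemd service install keeps blobs under /usr/share/ollama/.ollama instead):

```shell
# Time a raw read of the largest model blob; if this finishes quickly,
# disk I/O is not the bottleneck. The path and the pick-the-largest-blob
# heuristic are assumptions, not guaranteed by Ollama.
BLOB=$(ls -S ~/.ollama/models/blobs/sha256-* | head -1)
dd if="$BLOB" of=/dev/null bs=1M status=progress
```

For a cold-read number, drop the page cache first with `sync; echo 3 | sudo tee /proc/sys/vm/drop_caches`.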
