[GH-ISSUE #13573] Vulkan use causes context size to be ignored and models to fail. #70997

Open
opened 2026-05-04 23:42:03 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @D337z on GitHub (Dec 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13573

What is the issue?

When I enable any GPU layers, the model fails to produce output. It also ignores the context size set for the embedding model entirely and claims 100% GPU use regardless of the GPU layer setting. Regular Qwen3 models simply produce no output, though CPU and GPU usage are reported correctly and the context is set properly. Before anyone asks: I have it set to use only GPU 1, and I have also tried GPU 0 and leaving the GPU unset. I added GPU 1 manually because the initial GPU 0 wasn't reporting resource availability properly. The num_experts_used warning is unrelated; it works fine and sets the experts properly when only the CPU is in use.
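For reference, a minimal sketch (not the reporter's exact setup) of requesting an explicit layer count, context size, and device through Ollama's /api/chat options. The num_ctx, num_gpu, and main_gpu option names are real Ollama options; the model name and values here are placeholders.

```go
// Sketch: pin context size, offload layer count, and GPU index via /api/chat.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model":  "qwen3", // placeholder model name
		"stream": false,
		"messages": []map[string]string{
			{"role": "user", "content": "hello"},
		},
		"options": map[string]any{
			"num_ctx":  8192, // context size the scheduler should honor
			"num_gpu":  32,   // layers to offload; 0 forces CPU-only as a workaround
			"main_gpu": 1,    // pin to the second device, as the reporter did
		},
	})
	resp, err := http.Post("http://127.0.0.1:11434/api/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```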

Ollama.log: https://github.com/user-attachments/files/24351567/Ollama.log

Relevant log output

[GIN] 2025/12/26 - 16:23:32 | 400 |   53.793898ms |       127.0.0.1 | POST     "/api/chat"
...
time=2025-12-26T16:24:07.112-06:00 level=INFO source=sched.go:450 msg="gpu memory" id=8680c59b-0500-0000-0002-000000000000 library=Vulkan available="45.4 GiB" free="45.9 GiB"
time=2025-12-26T16:24:07.112-06:00 level=INFO source=sched.go:450 msg="gpu memory" id=00000000-0000-0000-0000-000000000000 library=Vulkan available="0 B" free="0 B"
...
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) UHD Graphics 630 (CML GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = Intel(R) UHD Graphics 630 (CML GT2) () | uma: 1 | fp16: 1 | bf16: 0 | warp size: 0 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /usr/local/lib/ollama/vulkan/libggml-vulkan.so
...
ggml_backend_vk_get_device_memory called: uuid 8680c59b-0500-0000-0002-000000000000
ggml_backend_vk_get_device_memory called: uuid 00000000-0000-0000-0000-000000000000
...
time=2025-12-26T16:24:12.247-06:00 level=INFO source=ggml.go:494 msg="offloaded 32/68 layers to GPU"
time=2025-12-26T16:24:12.247-06:00 level=INFO source=device.go:240 msg="model weights" device=Vulkan0 size="19.8 GiB"
time=2025-12-26T16:24:12.247-06:00 level=INFO source=device.go:251 msg="kv cache" device=Vulkan0 size="2.0 GiB"
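The second device entry in the log (all-zero UUID, warp size 0, 0 B available, 0 B free) looks like a stale or duplicate ICD registration rather than a usable GPU. A hedged sketch, with a hypothetical vulkanDevice type and not Ollama's actual scheduler code, of filtering such entries before planning offload:

```go
// Sketch: drop Vulkan device entries that report no memory at all.
package main

import "fmt"

type vulkanDevice struct {
	UUID      string
	Available uint64 // bytes reported available
	Free      uint64 // bytes reported free
}

// usableDevices keeps only devices that report some memory.
func usableDevices(devs []vulkanDevice) []vulkanDevice {
	var out []vulkanDevice
	for _, d := range devs {
		if d.Available == 0 && d.Free == 0 {
			// Likely a stale or duplicate ICD entry, not a usable GPU.
			fmt.Printf("skipping device %s: reports no memory\n", d.UUID)
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	devs := []vulkanDevice{
		// Values approximate the "45.4 GiB available / 45.9 GiB free" log line.
		{UUID: "8680c59b-0500-0000-0002-000000000000", Available: 48_749_000_000, Free: 49_285_000_000},
		{UUID: "00000000-0000-0000-0000-000000000000"}, // the 0 B entry from the log
	}
	fmt.Println(usableDevices(devs))
}
```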

OS

Linux

GPU

Intel

CPU

Intel

Ollama version

0.13.5

GiteaMirror added the bug label 2026-05-04 23:42:03 -05:00
Author
Owner

@D337z commented on GitHub (Dec 29, 2025):

I should probably add: look at the log file I uploaded, since it actually has the error in it. But I'm sure you would know to do that.

Author
Owner

@marco-hofmann commented on GitHub (Dec 30, 2025):

I’m having similar issues. Do you have any update on this?

Author
Owner

@D337z commented on GitHub (Dec 30, 2025):

panic: failed to sample token
goroutine 1025 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc000234f00, {0x1, {0x572eeb48d250, 0xc005952000}, {0x572eeb497b20, 0xc0001573f8}, {0xc000242008, 0xea, 0x11f}, {{0x572eeb497b20, ...}, ...}, ...})
github.com/ollama/ollama/runner/ollamarunner/runner.go:763 +0x1a85
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 38
github.com/ollama/ollama/runner/ollamarunner/runner.go:458 +0x2cd

I managed to track it down to this. But not sure why it's happening just yet.
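One explanation consistent with the trace (an assumption, not confirmed by it) is that the Vulkan backend returns NaN or Inf logits for the offloaded layers, leaving the sampler with no valid candidate token. An illustrative Go sketch of that failure mode, with a hypothetical checkLogits helper:

```go
// Sketch of the suspected failure mode: non-finite logits make sampling fail.
package main

import (
	"fmt"
	"math"
)

// checkLogits reports an error if the backend produced non-finite logits,
// which would make greedy or top-k sampling impossible.
func checkLogits(logits []float32) error {
	for i, l := range logits {
		if math.IsNaN(float64(l)) || math.IsInf(float64(l), 0) {
			return fmt.Errorf("non-finite logit at index %d: %v", i, l)
		}
	}
	return nil
}

func main() {
	bad := []float32{0.1, float32(math.NaN()), -2.3}
	if err := checkLogits(bad); err != nil {
		// The runner panics with "failed to sample token" at this point.
		fmt.Println("sampling would fail:", err)
	}
}
```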

Author
Owner

@ribbles commented on GitHub (Mar 8, 2026):

Same issue:

OS: Windows 11
GPU: Intel Arc 130V GPU (16GB)
CPU: Intel Core Ultra 5 238V
Ollama Version: 0.17.4

time=2026-03-08T15:47:44.142-06:00 level=INFO source=sched.go:498 msg="gpu memory" id=8680a064-0400-0000-0002-000000000000 library=Vulkan available="17.0 GiB" free="17.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-08T15:47:44.142-06:00 level=INFO source=server.go:498 msg="loading model" "model layers"=29 requested=-1
time=2026-03-08T15:47:44.142-06:00 level=INFO source=device.go:240 msg="model weights" device=Vulkan0 size="4.1 GiB"
time=2026-03-08T15:47:44.142-06:00 level=INFO source=device.go:251 msg="kv cache" device=Vulkan0 size="1.8 GiB"
time=2026-03-08T15:47:44.142-06:00 level=INFO source=device.go:262 msg="compute graph" device=Vulkan0 size="1.8 GiB"
...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 130V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\luser\AppData\Local\Programs\Ollama\lib\ollama\vulkan\ggml-vulkan.dll
...
ggml_backend_vk_get_device_memory called: uuid 8680a064-0400-0000-0002-000000000000
ggml_backend_vk_get_device_memory called: luid 0x000000000000feb6
ggml_dxgi_pdh_init called
DXGI + PDH Initialized. Getting GPU free memory info
[DXGI] Adapter Description: Intel(R) Arc(TM) 130V GPU (16GB), LUID: 0x000000000000FEB6, Dedicated: 0.12 GB, Shared: 17.99 GB
[DXGI] Adapter Description: Microsoft Basic Render Driver, LUID: 0x00000000000102BE, Dedicated: 0.00 GB, Shared: 17.99 GB
Integrated GPU (Intel(R) Arc(TM) 130V GPU (16GB)) with LUID 0x000000000000feb6 detected. Shared Total: 19320695193.00 bytes (17.99 GB), Shared Usage: 765382656.00 bytes (0.71 GB), Dedicated Total: 134217728.00 bytes (0.12 GB), Dedicated Usage: 0.00 bytes (0.00 GB)
ggml_backend_vk_get_device_memory utilizing DXGI + PDH memory reporting free: 18689530265 total: 19454912921
---
time=2026-03-08T15:48:20.580-06:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:62163/completion\": read tcp 127.0.0.1:62173->127.0.0.1:62163: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2026/03/08 - 15:48:20 | 500 |   37.2008654s |       127.0.0.1 | POST     "/v1/chat/completions"

server.log: https://github.com/user-attachments/files/25828237/server.log

Same issue as #13585
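A worked check of the DXGI + PDH numbers in the log above: assuming free = (shared total - shared usage) + (dedicated total - dedicated usage), a formula inferred from the log rather than taken from Ollama's source, the reported free: 18689530265 and total: 19454912921 are reproduced exactly.

```go
// Worked check of the DXGI + PDH memory figures from the log above.
package main

import "fmt"

func main() {
	const (
		sharedTotal    int64 = 19_320_695_193 // "Shared Total" from the log (17.99 GB)
		sharedUsage    int64 = 765_382_656    // "Shared Usage" (0.71 GB)
		dedicatedTotal int64 = 134_217_728    // "Dedicated Total" (0.12 GB)
		dedicatedUsage int64 = 0              // "Dedicated Usage"
	)
	free := (sharedTotal - sharedUsage) + (dedicatedTotal - dedicatedUsage)
	total := sharedTotal + dedicatedTotal
	// Prints: free: 18689530265 total: 19454912921, matching the log exactly.
	fmt.Println("free:", free, "total:", total)
}
```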

Author
Owner

@D337z commented on GitHub (Mar 8, 2026):

It's a known existing problem that mostly affects Intel GPUs. This might get merged with the original report.


Reference: github-starred/ollama#70997