[GH-ISSUE #14854] WSL2 + Intel Arc 140T: Vulkan runner hangs after completion request on /dev/dxg while llama.cpp works #56094

Open
opened 2026-04-29 10:15:23 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @oldeucryptoboi on GitHub (Mar 14, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14854

What is the issue?

On WSL2 (Ubuntu 24.04.4, kernel 6.6.87.2-microsoft-standard-WSL2) with an Intel Arc 140T iGPU exposed through /dev/dxg, Ollama 0.18.0 detects the GPU, fully loads Qwen3.5-9B-Q8_0.gguf on Vulkan, and then hangs after completion request before generating the first token.

This does not look like a general machine/configuration issue because:

  • vulkaninfo sees Microsoft Direct3D12 (Intel(R) Arc(TM) 140T GPU (16GB)) via Dozen
  • clinfo sees the GPU
  • the same model works on the same machine with official llama.cpp b8348 over Vulkan (--device Vulkan0 --gpu-layers all)
  • while Ollama is hung, the WSL kernel logs /dev/dxg sync-object failures

Possibly related to #13097 and #13621, but this repro is specific to WSL2 + Dozen + /dev/dxg, and with the new engine it hangs rather than immediately panicking.

How to reproduce

  1. Create a model from the public GGUF unsloth/Qwen3.5-9B-GGUF / Qwen3.5-9B-Q8_0.gguf
  2. Start Ollama with Vulkan forced:
export OLLAMA_HOST=127.0.0.1:11443
export OLLAMA_LLM_LIBRARY=vulkan
export OLLAMA_VULKAN=1
export OLLAMA_NEW_ENGINE=true
export OLLAMA_DEBUG=1
/usr/local/bin/ollama serve
  3. Run:
OLLAMA_HOST=127.0.0.1:11443 ollama run qwen35-9b-q8-local "Reply with exactly OK."
  4. The model loads on Vulkan0, then hangs before the first token.

I also reproduced with /api/generate and with OLLAMA_FLASH_ATTENTION=false, so this does not look specific to the CLI or to flash attention.
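
For completeness, an equivalent /api/generate repro as a small Go client. The host, model tag, and prompt match the steps above; model, prompt, and stream are standard /api/generate request fields. On this stack the POST stalls with no response instead of returning a completion.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Same host and model tag as in the repro steps above.
    body, _ := json.Marshal(map[string]any{
        "model":  "qwen35-9b-q8-local",
        "prompt": "Reply with exactly OK.",
        "stream": false,
    })

    // On the failing stack this request never produces a token; it stalls
    // until the client or server times out.
    resp, err := http.Post("http://127.0.0.1:11443/api/generate", "application/json", bytes.NewReader(body))
    if err != nil {
        fmt.Println("request error:", err)
        return
    }
    defer resp.Body.Close()

    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out))
}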

Relevant log output

time=2026-03-14T18:35:46.347-04:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit ... FlashAttention:Disabled ... GPULayers:33[...] }"
ggml_vulkan: 0 = Microsoft Direct3D12 (Intel(R) Arc(TM) 140T GPU (16GB)) (Dozen)
time=2026-03-14T18:35:53.243-04:00 level=INFO source=ggml.go:494 msg="offloaded 33/33 layers to GPU"
time=2026-03-14T18:36:40.132-04:00 level=INFO source=server.go:1388 msg="llama runner started in 55.60 seconds"
time=2026-03-14T18:36:40.169-04:00 level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=22 format=""
time=2026-03-14T18:36:40.198-04:00 level=DEBUG source=cache.go:151 msg="loading cache slot" id=0 cache=0 prompt=5 used=0 remaining=5

At that point the request stalls and no first token is produced.

The WSL kernel logs at the same time show the /dev/dxg sync path failing:

misc dxg: dxgk: dxgvmb_send_sync_msg: wait_for_completion failed: fffffe00
misc dxg: dxgk: dxgkio_wait_sync_object_cpu: Ioctl failed: -512
misc dxg: dxgk: process_completion_packet: did not find packet to complete

For completeness: with the old engine on the same machine, I also hit the older crash path instead of the hang:

llama_sampler_dist_apply(...): Assertion 'found' failed.
SIGABRT: abort

Control case

llama.cpp b8348 on the same WSL2 machine, same GPU, same GGUF, works over Vulkan and generates text successfully. So the GPU/Vulkan path is functional enough for inference outside Ollama; the failure seems specific to Ollama's Vulkan runner on this stack.

OS

Linux (WSL2 Ubuntu 24.04.4)

GPU

Intel Arc 140T iGPU via Dozen (Mesa 25.3.6) / /dev/dxg

CPU

Intel Core Ultra 7 255H

Ollama version

0.18.0

GiteaMirror added the intel, vulkan labels 2026-04-29 10:15:23 -05:00
Author
Owner

@oldeucryptoboi commented on GitHub (Mar 14, 2026):

I dug into the v0.18.0 source locally and found two code-level problems that look like plausible root causes for this WSL2/Vulkan hang.

  1. ComputeWithNotify violates its own contract

ml/backend.go says:

ComputeWithNotify(func(), ...Tensor) // notify callback once compute has begun

But ml/backend/ggml/ggml.go currently does:

func (c *Context) ComputeWithNotify(cb func(), tensors ...ml.Tensor) {
    c.b.schedMu.Lock()
    defer c.b.schedMu.Unlock()
    if cb != nil {
        go cb()
    }

    if status := C.ggml_backend_sched_graph_compute_async(c.b.sched, c.graph); status != C.GGML_STATUS_SUCCESS {
        panic(...)
    }
    C.ggml_backend_sched_reset(c.b.sched)
    ...
}

So the callback fires before ggml_backend_sched_graph_compute_async() is even called.

That matters because runner/ollamarunner/runner.go uses that callback to let forwardBatch() start setting up the next batch:

// If we have a pending batch still processing, wait until Compute has started
if pendingBatch.ctx != nil {
    <-pendingBatch.computeStartedCh
    nextBatch.inputsReadyCh = pendingBatch.outputsReadyCh
}

In other words, the new engine thinks the previous compute has started when in fact it may not even have been submitted yet.
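
For concreteness, the reordering I have in mind (and that I later patched locally, see below) only fires the notification after the graph has actually been submitted; a sketch that keeps the elisions from the snippet above:

func (c *Context) ComputeWithNotify(cb func(), tensors ...ml.Tensor) {
    c.b.schedMu.Lock()
    defer c.b.schedMu.Unlock()

    // Submit the graph first...
    if status := C.ggml_backend_sched_graph_compute_async(c.b.sched, c.graph); status != C.GGML_STATUS_SUCCESS {
        panic(...)
    }

    // ...and only then notify, so computeStartedCh really means
    // "the previous compute has been submitted".
    if cb != nil {
        go cb()
    }

    C.ggml_backend_sched_reset(c.b.sched)
    ...
}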

  2. The new runner enables async pipelining without checking backend async capability

In runner/ollamarunner/runner.go:

supportsAsync := pooling.Type(s.model.Backend().Config().Uint("pooling_type")) == pooling.TypeNone

So async execution is enabled for all non-pooling models, regardless of backend capability.

But Ollama's vendored Vulkan backend currently advertises the opposite in ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp:

props->caps = {
    /* .async                 = */ false,
    /* .host_buffer           = */ true,
    /* .buffer_from_host_ptr  = */ false,
    /* .events                = */ false,
};

So the runner is pipelining batches as if the backend were async-safe, while the backend explicitly says it is not.
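
A sketch of the guard I have in mind for this second point; backendSupportsAsync is a hypothetical stand-in (not an existing helper) for however the backend's async capability would be surfaced to the runner:

// Hypothetical: gate pipelining on the backend caps as well as the model type.
supportsAsync := pooling.Type(s.model.Backend().Config().Uint("pooling_type")) == pooling.TypeNone &&
    backendSupportsAsync(s.model.Backend())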

Why I think this is relevant to this issue specifically:

  • the hang happens after completion request, before first token
  • llama.cpp on the same machine/model works
  • the WSL kernel logs show /dev/dxg fence/sync failures while Ollama is stuck
  • the old engine hits a different failure family (llama_sampler_dist_apply(...): Assertion 'found' failed), which suggests the new engine's batching/scheduling path is part of the problem

My best current hypothesis is:

  • the new engine is overlapping or advancing work too early on the Vulkan path
  • WSL2/Dozen is more sensitive to this than the plain llama.cpp CLI path
  • the immediate thing to try is:
    • move the cb() call in ComputeWithNotify to after successful ggml_backend_sched_graph_compute_async()
    • disable runner-side async pipelining when the selected backend reports async = false

I patched the first change locally in a source clone, but I did not get far enough to run a full patched Vulkan repro because my local build tree was not packaging the runtime libraries the same way as the installed binary.

Author
Owner

@oldeucryptoboi commented on GitHub (Mar 14, 2026):

Follow-up after controlled A/B testing on the same machine.

The earlier ComputeWithNotify callback-order theory has not been isolated as the cause. I re-ran the tests against the unmodified stock 0.18.0 binary and found a narrower reproducer:

  • Same machine: WSL2 + Intel Arc 140T + Dozen Vulkan + /dev/dxg
  • Same model: qwen35-9b-q8-local
  • Same binary: /usr/local/bin/ollama 0.18.0
  • Same env except for flash-attention handling:
    • OLLAMA_LLM_LIBRARY=vulkan
    • OLLAMA_VULKAN=1
    • OLLAMA_NEW_ENGINE=true

Results:

  1. OLLAMA_FLASH_ATTENTION=false exported:

    • model loads on 100% GPU
    • request completes successfully
    • ollama run qwen35-9b-q8-local "Reply with exactly OK." returns OK
  2. OLLAMA_FLASH_ATTENTION unset:

    • Ollama logs enabling flash attention
    • load request shows FlashAttention:Enabled
    • model still fully loads on GPU
    • runner reaches completion request and loading cache slot
    • then it stalls until client timeout, with no token produced

Relevant log evidence from the failing run:

INFO source=server.go:246 msg="enabling flash attention"
INFO source=runner.go:1284 msg=load request="{... FlashAttention:Enabled KvSize:4096 ... GPULayers:33 ...}"
INFO source=server.go:1388 msg="llama runner started in 95.81 seconds"
DEBUG source=server.go:1536 msg="completion request" images=0 prompt=22 format=""
DEBUG source=cache.go:151 msg="loading cache slot" id=0 cache=0 prompt=5 used=0 remaining=5

At the same time, dmesg shows the same WSL dxg sync-object failure pattern:

misc dxg: dxgk: dxgvmb_send_sync_msg: wait_for_completion failed: fffffe00
misc dxg: dxgk: dxgkio_wait_sync_object_cpu: Ioctl failed: -512
misc dxg: dxgk: process_completion_packet: did not find packet to complete

This lines up with the source too:

  • qwen35 defaults flash attention on when not explicitly overridden:
    • fs/ggml/ggml.go FlashAttention() returns true for qwen35
  • the engine enables it automatically when the env var is unset:
    • llm/server.go
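
A rough paraphrase of that decision path as I read it (not the literal source; modelDefaultsFlashAttention is a hypothetical stand-in for the fs/ggml/ggml.go check):

// Paraphrase, not the actual llm/server.go code: when OLLAMA_FLASH_ATTENTION
// is unset, the model's own default wins, and qwen35 defaults it to true.
useFA := modelDefaultsFlashAttention(arch) // true for "qwen35"
if v, ok := os.LookupEnv("OLLAMA_FLASH_ATTENTION"); ok {
    useFA = v == "1" || v == "true"
}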

So the narrowed conclusion is:

  • General Ollama Vulkan on this machine is not broken
  • qwen35 on the WSL2/Dozen flash-attention path is broken
  • forcing OLLAMA_FLASH_ATTENTION=false is a working local workaround here

If useful, I can test whether this is specific to qwen35 or any Vulkan run that uses FA on this WSL/Dozen stack.

Reference: github-starred/ollama#56094