[GH-ISSUE #13964] Intel Arrow Lake (ARL) GPU produces garbage output with Vulkan backend on larger models (3B+) #55647

Open
opened 2026-04-29 09:31:48 -05:00 by GiteaMirror · 10 comments

Originally created by @chefboyrdave21 on GitHub (Jan 29, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13964

Description

When using Ollama with an Intel Arrow Lake GPU via the Vulkan backend, smaller models (1B) work correctly with OLLAMA_FLASH_ATTENTION=0, but larger models (3B and above) produce garbage/gibberish output instead of coherent text.

Environment

  • OS: Ubuntu 24.04 (kernel 6.14.0-37-generic)
  • GPU: Intel Arrow Lake-P [Intel Graphics] (device ID 0x7d51)
  • GPU Driver: xe (Intel Xe KMD)
  • Vulkan: Mesa 25.0.7-0ubuntu0.24.04.2 (Intel open-source Mesa driver, Vulkan 1.4.305)
  • Ollama Version: 0.15.2
  • Docker: Tested both in Docker (ollama/ollama:latest with --device /dev/dri) and native install - same results

Steps to Reproduce

  1. Start Ollama with Vulkan enabled:

    docker run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama \
      --device /dev/dri -e OLLAMA_VULKAN=1 -e OLLAMA_FLASH_ATTENTION=0 \
      ollama/ollama:latest
    
  2. Test with 1B model (works):

    ollama run llama3.2:1b "Say hello"
    # Output: Hello. (correct)
    
  3. Test with 3B model (fails):

    ollama run llama3.2:3b "Say hello"
    # Output: binder Binder htags Mig laus ragen 旋 kne laus iras ... (garbage)
    

Expected Behavior

All models should produce coherent output when using the Vulkan backend with Intel GPUs.

Actual Behavior

  • llama3.2:1b: Works correctly with OLLAMA_FLASH_ATTENTION=0
  • llama3.2:3b: Produces garbage output
  • Larger models (8B, 30B): Also produce garbage output

The garbage output includes random words, Unicode characters, and fragments like:

binder Binder htags Mig laus ragen 旋 kne laus iras thức emean Crime nels Fields mium...

Diagnostic Information

GPU Detection (working)

msg="inference compute" id=8680517d-0300-0000-0100-000000000000 library=Vulkan name=Vulkan0 
description="Intel(R) Graphics (ARL)" type=iGPU total="18.1 GiB" available="16.2 GiB"

Model Loading (working)

load_tensors: offloading 29 repeating layers to GPU
load_tensors: offloading output layer to GPU  
load_tensors: offloaded 29/29 layers to GPU
load_tensors: Vulkan0 model buffer size = 1918.35 MiB

Vulkan Info

GPU0:
  apiVersion         = 1.4.305
  driverVersion      = 25.0.7
  vendorID           = 0x8086
  deviceID           = 0x7d51
  deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
  deviceName         = Intel(R) Graphics (ARL)
  driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
  driverName         = Intel open-source Mesa driver

dmesg (GPU driver loaded correctly after loading MEI modules)

xe 0000:01:00.0: [drm] Using GuC firmware from i915/mtl_guc_70.bin version 70.36.0
xe 0000:01:00.0: [drm] Using HuC firmware from i915/mtl_huc_gsc.bin version 8.5.4
xe 0000:01:00.0: [drm] Using GSC firmware from i915/mtl_gsc_1.bin version 102.1.15.1926
[drm] Initialized xe 1.1.0 for 0000:01:00.0 on minor 0

Workarounds Tested

| Configuration | 1B Model | 3B Model |
|---------------|----------|----------|
| OLLAMA_NUM_GPU=0 (CPU only) | Works | Works |
| OLLAMA_VULKAN=1 (GPU) | Garbage | Garbage |
| OLLAMA_VULKAN=1 OLLAMA_FLASH_ATTENTION=0 | Works | Garbage |
| Native install (non-Docker) | Same results | Same results |

Current Workaround

Using CPU-only mode (OLLAMA_NUM_GPU=0) works for all models but loses GPU acceleration.

Additional Notes

  • Initially the GPU wasn't detected due to missing MEI kernel modules (mei_me, mei_gsc, mei_pxp). After loading these and reloading the xe driver, the GPU was detected (see the commands after this list).
  • The issue appears to be in the Vulkan compute shaders for larger models, possibly related to memory handling or shader compilation for models with more layers.
  • Intel Arrow Lake is a relatively new GPU architecture, which may not be fully supported yet.
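
For reference, the module-loading step mentioned in the first note above amounts to something like this (module names are the ones listed there; run as root and verify with dmesg afterwards):

    # load the MEI helper modules the xe driver needs, then reload xe
    sudo modprobe -a mei_me mei_gsc mei_pxp
    sudo modprobe -r xe && sudo modprobe xe
    sudo dmesg | tail -n 20   # check that the GPU initializes cleanly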
GiteaMirror added the vulkan, linux labels 2026-04-29 09:31:48 -05:00

@rick-github commented on GitHub (Jan 29, 2026):

OLLAMA_NUM_GPU is not an ollama configuration variable; it's unusual that setting it affects the outcome.
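
As a side note (not from the original comment): recent Ollama builds print the configuration variables the server actually recognizes, which is a quick way to check whether a given OLLAMA_* variable does anything:

    # list the environment variables the server honours
    ollama serve --help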


@scrapes commented on GitHub (Jan 31, 2026):

Can you try the q8 version of the model (llama3.2:3b-instruct-q8_0)? In my first tests with pretty much exactly this problem, that fixed it.

I suspect the problem lies somewhere in the handling of the lower quantization(s) via Vulkan on Intel.
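
A minimal comparison along those lines, using the default tag versus the q8_0 tag named above (a sketch; whether the default tag is a lower-bit quant can be checked on the model's library page):

    ollama run llama3.2:3b "Say hello"                  # default quantization
    ollama run llama3.2:3b-instruct-q8_0 "Say hello"    # 8-bit quantization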


@dadi72 commented on GitHub (Feb 3, 2026):

I can confirm this, as I have the same problem (on an Intel Arc 750). Here is my test of llama3.1:

    ollama run llama3.1 "say hello"
    everediaansen_unregisterstrarebra MindDDS dersstrarheimerzsche ringingillonledonansenborumium

    ollama run llama3.2:3b-instruct-q8_0 "say hello"
    Hello!

and a bigger one:

    ollama run llama3:8b-instruct-q8_0 "say hello"
    Hello! How are you today?

@DrazorV commented on GitHub (Feb 4, 2026):

Same issue here, Arc A770 with qwen3:8b


@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Update: Model runner crashes with larger MoE models (qwen3-coder-next)

I've discovered that the issue escalates from "garbage output" to full crashes when running larger Mixture-of-Experts models like qwen3-coder-next:latest (80B params, 3B activated).

New Crash Behavior

With Ollama v0.15.5-rc2, running qwen3-coder-next:latest causes the model runner to crash with:

Error: 500 Internal Server Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error

System Info

  • Kernel: 6.14.0-37-generic (Ubuntu 24.04.3 LTS)
  • GPU: Intel Arrow Lake-P [Intel Graphics] [8086:7d51]
  • Driver: xe 1.1.0
  • Mesa: 25.0.7-0ubuntu0.24.04.2
  • Vulkan: 1.4.305
  • Ollama: 0.15.5-rc2
  • VM: 64GB RAM allocated (Proxmox VM with GPU passthrough)

Server Logs (crash)

goroutine 1176 gp=0xc000103dc0 m=nil [chan receive]:
runtime.gopark(0x30?, 0x5d6ec34bbd00?, 0x1?, 0x12?, 0xc000086b20?)
...
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(...)
...
time=2026-02-04T12:39:25.959Z level=ERROR source=server.go:1609 msg="post predict" error="Post \"http://127.0.0.1:45141/completion\": EOF"

dmesg GPU errors

xe 0000:01:00.0: [drm] *ERROR* GT1: GSC proxy component not bound!
workqueue: output_poll_execute hogged CPU for >10000us 19 times, consider switching to WQ_UNBOUND
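
Not in the original comment, but a simple way to correlate runner crashes with GPU resets is to watch the kernel log while reproducing (requires root for dmesg on most distributions):

    # follow kernel messages and filter for xe timeouts/resets while running the model
    sudo dmesg --follow | grep -iE 'xe.*(timedout|reset|coredump|error)'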

Working Model (for comparison)

qwen3-coder:30b works correctly on the same hardware, suggesting the issue may be related to:

  1. Model size/memory handling
  2. MoE-specific tensor operations in Vulkan shaders
  3. The new Gated DeltaNet architecture in qwen3-coder-next

Related Issues

  • llama.cpp #17389 - Vulkan crashes on Intel iGPU with certain models
  • llama.cpp #10528 - Inconsistent Vulkan segfaults
  • intel/ipex-llm #12318 - Arc Xe2 iGPU crashes with k-quant models

Potential Root Causes

  1. Intel xe driver bug - The GSC proxy error and workqueue hogging suggest possible driver-level issues with the new xe driver on Arrow Lake
  2. Mesa Vulkan shader bugs - The issue seems to worsen with more complex/larger models, possibly due to shader compilation issues
  3. ggml-vulkan memory handling - Could be related to how VRAM is allocated for larger models on Intel iGPU with shared memory

Questions

  1. Should this be filed separately in ggml-org/llama.cpp since the crash appears to be in the runner?
  2. Has anyone tested Arrow Lake with the SYCL/oneAPI backend as an alternative to Vulkan?
  3. Would it help to test with an older Mesa version to isolate if this is a Mesa regression?
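
Not part of the original questions, but a quick environment check before trying question 2 or 3 could look like this (sycl-ls ships with Intel's oneAPI toolkits and is assumed to be installed; vulkaninfo comes from vulkan-tools):

    # current Mesa/ANV driver version as seen by Vulkan
    vulkaninfo --summary | grep -iE 'drivername|driverinfo|apiversion'
    # devices visible to the oneAPI/SYCL runtime, if the Level Zero stack is installed
    sycl-ls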

@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Cross-posted to ggml-org/llama.cpp#19327 for tracking in the upstream Vulkan backend.


@DrazorV commented on GitHub (Feb 4, 2026):

  1. Should this be filed separately in ggml-org/llama.cpp since the crash appears to be in the runner?
  2. Has anyone tested Arrow Lake with the SYCL/oneAPI backend as an alternative to Vulkan?

Using vulkan on llama.cpp with my Arc A770 does not produce the same issue.
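
For anyone who wants to run the same isolation test against upstream llama.cpp, a plain Vulkan build and invocation looks roughly like this (model path and layer count are placeholders, not from the comment):

    # build llama.cpp with the Vulkan backend and run fully offloaded
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release
    ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Say hello"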


@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Major Update: Root Cause Identified and Fixes In Progress

After extensive debugging on my Intel Arrow Lake system, I've identified the root cause and am working on fixes.

Root Cause: xe Kernel Driver Job Timeout

The Intel xe kernel driver has a hardcoded maximum job timeout of 10 seconds (CONFIG_DRM_XE_JOB_TIMEOUT_MAX = 10000ms). When running MoE (Mixture of Experts) models with 128 experts, the Vulkan shader operations exceed this timeout, causing:

  1. GPU job timeout → GPU reset
  2. Vulkan device lost error
  3. Model runner crash

What Works vs What Doesn't

| Model Type | Size | Works? | Speed |
|------------|------|--------|-------|
| TinyLlama | 1.1B | Yes | 25 t/s |
| Mistral | 7B | Yes | 12 t/s |
| Llama3 | 8B | Yes | 10 t/s |
| Qwen3-Coder (MoE) | 30B | No | Timeout |
| Qwen3-Coder-Next (MoE) | 80B | No | Timeout |

Key finding: Standard dense models up to 8B work fine on Vulkan! The issue is specifically with MoE models that have many experts (128 in qwen3-coder's case).

Evidence from dmesg

xe 0000:01:00.0: [drm] GT0: Timedout job: seqno=4294967185, guc_id=2, flags=0x0 in llama-cli
xe 0000:01:00.0: [drm] Xe device coredump has been created
xe 0000:01:00.0: [drm] GT0: Engine reset: engine_class=ccs

Fixes I'm Working On

  1. Kernel module rebuild - Building the xe driver with CONFIG_DRM_XE_JOB_TIMEOUT_MAX=60000 (60 seconds); a rough sketch follows this list

  2. llama.cpp patch - Submitted details to ggml-org/llama.cpp#19327 to split MUL_MAT_ID operations for Intel GPUs

  3. Testing - Will report back with results
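
A rough outline of the rebuild in item 1, assuming a kernel source tree that matches the running kernel (not a tested recipe; Secure Boot, DKMS, and distro packaging will change the details):

    # start from the running kernel's config, raise the xe job-timeout ceiling, rebuild
    cp /boot/config-"$(uname -r)" .config
    scripts/config --set-val DRM_XE_JOB_TIMEOUT_MAX 60000
    make olddefconfig
    make -j"$(nproc)"   # then install modules/kernel as usual for your distro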

Current Workaround

For now, use CPU-only mode for MoE models:

OLLAMA_NUM_GPU=0 ollama run qwen3-coder:30b

Or use non-MoE models which work fine on Vulkan:

ollama run llama3.2:8b  # Works at 10 t/s on Vulkan

Related

  • Full technical details in ggml-org/llama.cpp#19327
  • This likely affects all Intel iGPUs with xe driver running MoE models

@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Update: Working Fix Found!

I've identified a working fix for Intel Arrow Lake + MoE models. The issue is in llama.cpp's Vulkan backend.

Root Cause Confirmed

The Intel xe kernel driver has a hardcoded 10-second job timeout. MoE models with 128 experts (like qwen3-coder) exceed this timeout during MUL_MAT_ID operations.

The Fix

I've posted a patch to ggml-org/llama.cpp#19327 that forces CPU fallback for MUL_MAT_ID operations on Intel GPUs when there are many experts.
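
For testing without the patch, recent llama.cpp builds also have a tensor-override flag that can keep expert weights (and therefore the MUL_MAT_ID work) on the CPU; the tensor-name regex below is an assumption and may need adjusting per model:

    # keep MoE expert tensors on the CPU, everything else on the Vulkan device
    ./build/bin/llama-cli -m /path/to/qwen3-coder-30b.gguf -ngl 99 \
      --override-tensor '\.ffn_.*_exps\.=CPU' -p "Say hello"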

Results with Patched llama.cpp

> Hi
Hello! How can I help you today?

[ Prompt: 12.8 t/s | Generation: 6.4 t/s ]

Before: GPU timeout crash
After: Stable at 6.4 t/s (vs 4.6 t/s pure CPU)

For Ollama Users

Until this is fixed upstream, workarounds:

  1. CPU only (works but slower):

    OLLAMA_NUM_GPU=0 ollama run qwen3-coder:30b
    
  2. Use non-MoE models (work fine on Vulkan):

    ollama run llama3.2:8b  # Works at 10 t/s
    ollama run mistral:7b   # Works at 12 t/s
    

Next Steps

I've proposed the fix to llama.cpp maintainers. Once merged, Ollama should pick it up in a future release.


@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Update: Proxmox PCI Passthrough Environment Details

For completeness, here are the virtualization details. This issue occurs with the Intel iGPU passed through to a VM via PCI passthrough.

Host (Proxmox)

  • Hypervisor: Proxmox VE 9.1.0
  • Host Kernel: 6.17.2-1-pve
  • CPU: Intel Core Ultra 9 285H (Arrow Lake)
  • GPU on Host: Intel Arrow Lake-P [8086:7d51] bound to vfio-pci for passthrough

VM Configuration (102 - ollama)

cpu: host
hostpci0: 00:02.0,pcie=1,x-vga=0,rombar=1
machine: q35
memory: 65536
vga: none

Guest VM

  • OS: Ubuntu 24.04.3 LTS
  • Kernel: 6.14.0-37-generic (with xe.force_probe=7d51 iommu=pt)
  • GPU Driver: xe (Intel Xe KMD)
  • GPU visible as: 01:00.0 inside VM
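
To confirm the passthrough device and its driver binding inside the guest, something like the following can be used (device ID taken from the report):

    # show the passed-through iGPU and which kernel driver is bound to it
    lspci -nnk -d 8086:7d51
    # expected: "Kernel driver in use: xe"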

Does Proxmox/Passthrough Affect This?

Likely not. The xe driver's CONFIG_DRM_XE_JOB_TIMEOUT_MAX=10000ms is a kernel-level constant that would apply equally on bare metal. The PCI passthrough provides near-native GPU access - the timeout mechanism is enforced by the guest kernel's xe driver, not the hypervisor.

The same issue would occur on bare metal with the same kernel and driver versions. I'm documenting this for anyone searching for similar issues in virtualized environments.


Reference: github-starred/ollama#55647