[GH-ISSUE #13964] Intel Arrow Lake (ARL) GPU produces garbage output with Vulkan backend on larger models (3B+) #55647

Open
opened 2026-04-29 09:31:48 -05:00 by GiteaMirror · 10 comments

Originally created by @chefboyrdave21 on GitHub (Jan 29, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13964

Description

When using Ollama with an Intel Arrow Lake GPU via the Vulkan backend, smaller models (1B) work correctly with OLLAMA_FLASH_ATTENTION=0, but larger models (3B and above) produce garbage/gibberish output instead of coherent text.

Environment

  • OS: Ubuntu 24.04 (kernel 6.14.0-37-generic)
  • GPU: Intel Arrow Lake-P [Intel Graphics] (device ID 0x7d51)
  • GPU Driver: xe (Intel Xe KMD)
  • Vulkan: Mesa 25.0.7-0ubuntu0.24.04.2 (Intel open-source Mesa driver, Vulkan 1.4.305)
  • Ollama Version: 0.15.2
  • Docker: Tested both in Docker (ollama/ollama:latest with --device /dev/dri) and native install - same results

Steps to Reproduce

  1. Start Ollama with Vulkan enabled:

    docker run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama \
      --device /dev/dri -e OLLAMA_VULKAN=1 -e OLLAMA_FLASH_ATTENTION=0 \
      ollama/ollama:latest
    
  2. Test with 1B model (works):

    ollama run llama3.2:1b "Say hello"
    # Output: Hello. (correct)
    
  3. Test with 3B model (fails):

    ollama run llama3.2:3b "Say hello"
    # Output: binder Binder htags Mig laus ragen 旋 kne laus iras ... (garbage)
    

Expected Behavior

All models should produce coherent output when using the Vulkan backend with Intel GPUs.

Actual Behavior

  • llama3.2:1b: Works correctly with OLLAMA_FLASH_ATTENTION=0
  • llama3.2:3b: Produces garbage output
  • Larger models (8B, 30B): Also produce garbage output

The garbage output includes random words, Unicode characters, and fragments like:

binder Binder htags Mig laus ragen 旋 kne laus iras thức emean Crime nels Fields mium...

Diagnostic Information

GPU Detection (working)

msg="inference compute" id=8680517d-0300-0000-0100-000000000000 library=Vulkan name=Vulkan0 
description="Intel(R) Graphics (ARL)" type=iGPU total="18.1 GiB" available="16.2 GiB"

Model Loading (working)

load_tensors: offloading 29 repeating layers to GPU
load_tensors: offloading output layer to GPU  
load_tensors: offloaded 29/29 layers to GPU
load_tensors: Vulkan0 model buffer size = 1918.35 MiB

Vulkan Info

GPU0:
  apiVersion         = 1.4.305
  driverVersion      = 25.0.7
  vendorID           = 0x8086
  deviceID           = 0x7d51
  deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
  deviceName         = Intel(R) Graphics (ARL)
  driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
  driverName         = Intel open-source Mesa driver

dmesg (GPU driver loaded correctly after loading MEI modules)

xe 0000:01:00.0: [drm] Using GuC firmware from i915/mtl_guc_70.bin version 70.36.0
xe 0000:01:00.0: [drm] Using HuC firmware from i915/mtl_huc_gsc.bin version 8.5.4
xe 0000:01:00.0: [drm] Using GSC firmware from i915/mtl_gsc_1.bin version 102.1.15.1926
[drm] Initialized xe 1.1.0 for 0000:01:00.0 on minor 0

Workarounds Tested

| Configuration | 1B Model | 3B Model |
|---------------|----------|----------|
| OLLAMA_NUM_GPU=0 (CPU only) | Works | Works |
| OLLAMA_VULKAN=1 (GPU) | Garbage | Garbage |
| OLLAMA_VULKAN=1 OLLAMA_FLASH_ATTENTION=0 | Works | Garbage |
| Native install (non-Docker) | Same results | Same results |

Current Workaround

Using CPU-only mode (OLLAMA_NUM_GPU=0) works for all models but loses GPU acceleration.

Additional Notes

  • Initially the GPU wasn't detected due to missing MEI kernel modules (mei_me, mei_gsc, mei_pxp). After loading these and reloading the xe driver, the GPU was detected (see the commands after this list).
  • The issue appears to be in the Vulkan compute shaders for larger models, possibly related to memory handling or shader compilation for models with more layers.
  • Intel Arrow Lake is a relatively new GPU architecture, which may not be fully supported yet.
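
For reference, the module-loading step mentioned in the first note above amounts to something like this (module names are the ones listed there; run as root and verify with dmesg afterwards):

    # load the MEI helper modules the xe driver needs, then reload xe
    sudo modprobe -a mei_me mei_gsc mei_pxp
    sudo modprobe -r xe && sudo modprobe xe
    sudo dmesg | tail -n 20   # check that the GPU initializes cleanly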
GiteaMirror added the vulkan, linux labels 2026-04-29 09:31:48 -05:00

@rick-github commented on GitHub (Jan 29, 2026):

OLLAMA_NUM_GPU is not an ollama configuration variable; it's unusual that setting it affects the outcome.
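
As a side note (not from the original comment): recent Ollama builds print the configuration variables the server actually recognizes, which is a quick way to check whether a given OLLAMA_* variable does anything:

    # list the environment variables the server honours
    ollama serve --help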


@scrapes commented on GitHub (Jan 31, 2026):

Can you try the q8 version of the model (llama3.2:3b-instruct-q8_0)? In my first tests with pretty much exactly this problem, that fixed it.

I suspect the problem lies somewhere in the handling of the lower quantization(s) via Vulkan on Intel.
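
A minimal comparison along those lines, using the default tag versus the q8_0 tag named above (a sketch; whether the default tag is a lower-bit quant can be checked on the model's library page):

    ollama run llama3.2:3b "Say hello"                  # default quantization
    ollama run llama3.2:3b-instruct-q8_0 "Say hello"    # 8-bit quantization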


@dadi72 commented on GitHub (Feb 3, 2026):

I can confirm this, as I have the same problem (on an Intel Arc 750). Here is my test of llama3.1:

    ollama run llama3.1 "say hello"
    everediaansen_unregisterstrarebra MindDDS dersstrarheimerzsche ringingillonledonansenborumium

    ollama run llama3.2:3b-instruct-q8_0 "say hello"
    Hello!

and a bigger one:

    ollama run llama3:8b-instruct-q8_0 "say hello"
    Hello! How are you today?

@DrazorV commented on GitHub (Feb 4, 2026):

Same issue here, Arc A770 with qwen3:8b


@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Update: Model runner crashes with larger MoE models (qwen3-coder-next)

I've discovered that the issue escalates from "garbage output" to full crashes when running larger Mixture-of-Experts models like qwen3-coder-next:latest (80B params, 3B activated).

New Crash Behavior

With Ollama v0.15.5-rc2, running qwen3-coder-next:latest causes the model runner to crash with:

Error: 500 Internal Server Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error

System Info

  • Kernel: 6.14.0-37-generic (Ubuntu 24.04.3 LTS)
  • GPU: Intel Arrow Lake-P [Intel Graphics] [8086:7d51]
  • Driver: xe 1.1.0
  • Mesa: 25.0.7-0ubuntu0.24.04.2
  • Vulkan: 1.4.305
  • Ollama: 0.15.5-rc2
  • VM: 64GB RAM allocated (Proxmox VM with GPU passthrough)

Server Logs (crash)

goroutine 1176 gp=0xc000103dc0 m=nil [chan receive]:
runtime.gopark(0x30?, 0x5d6ec34bbd00?, 0x1?, 0x12?, 0xc000086b20?)
...
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(...)
...
time=2026-02-04T12:39:25.959Z level=ERROR source=server.go:1609 msg="post predict" error="Post \"http://127.0.0.1:45141/completion\": EOF"

dmesg GPU errors

xe 0000:01:00.0: [drm] *ERROR* GT1: GSC proxy component not bound!
workqueue: output_poll_execute hogged CPU for >10000us 19 times, consider switching to WQ_UNBOUND
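
Not in the original comment, but a simple way to correlate runner crashes with GPU resets is to watch the kernel log while reproducing (requires root for dmesg on most distributions):

    # follow kernel messages and filter for xe timeouts/resets while running the model
    sudo dmesg --follow | grep -iE 'xe.*(timedout|reset|coredump|error)'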

Working Model (for comparison)

qwen3-coder:30b works correctly on the same hardware, suggesting the issue may be related to:

  1. Model size/memory handling
  2. MoE-specific tensor operations in Vulkan shaders
  3. The new Gated DeltaNet architecture in qwen3-coder-next

Related Issues

  • llama.cpp #17389 - Vulkan crashes on Intel iGPU with certain models
  • llama.cpp #10528 - Inconsistent Vulkan segfaults
  • intel/ipex-llm #12318 - Arc Xe2 iGPU crashes with k-quant models

Potential Root Causes

  1. Intel xe driver bug - The GSC proxy error and workqueue hogging suggest possible driver-level issues with the new xe driver on Arrow Lake
  2. Mesa Vulkan shader bugs - The issue seems to worsen with more complex/larger models, possibly due to shader compilation issues
  3. ggml-vulkan memory handling - Could be related to how VRAM is allocated for larger models on Intel iGPU with shared memory

Questions

  1. Should this be filed separately in ggml-org/llama.cpp since the crash appears to be in the runner?
  2. Has anyone tested Arrow Lake with the SYCL/oneAPI backend as an alternative to Vulkan?
  3. Would it help to test with an older Mesa version to isolate if this is a Mesa regression?
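
Not part of the original questions, but a quick environment check before trying question 2 or 3 could look like this (sycl-ls ships with Intel's oneAPI toolkits and is assumed to be installed; vulkaninfo comes from vulkan-tools):

    # current Mesa/ANV driver version as seen by Vulkan
    vulkaninfo --summary | grep -iE 'drivername|driverinfo|apiversion'
    # devices visible to the oneAPI/SYCL runtime, if the Level Zero stack is installed
    sycl-ls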

@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Cross-posted to ggml-org/llama.cpp#19327 for tracking in the upstream Vulkan backend.


@DrazorV commented on GitHub (Feb 4, 2026):

  1. Should this be filed separately in ggml-org/llama.cpp since the crash appears to be in the runner?
  2. Has anyone tested Arrow Lake with the SYCL/oneAPI backend as an alternative to Vulkan?

Using vulkan on llama.cpp with my Arc A770 does not produce the same issue.
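
For anyone who wants to run the same isolation test against upstream llama.cpp, a plain Vulkan build and invocation looks roughly like this (model path and layer count are placeholders, not from the comment):

    # build llama.cpp with the Vulkan backend and run fully offloaded
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release
    ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Say hello"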


@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Major Update: Root Cause Identified and Fixes In Progress

After extensive debugging on my Intel Arrow Lake system, I've identified the root cause and am working on fixes.

Root Cause: xe Kernel Driver Job Timeout

The Intel xe kernel driver has a hardcoded maximum job timeout of 10 seconds (CONFIG_DRM_XE_JOB_TIMEOUT_MAX = 10000ms). When running MoE (Mixture of Experts) models with 128 experts, the Vulkan shader operations exceed this timeout, causing:

  1. GPU job timeout → GPU reset
  2. Vulkan device lost error
  3. Model runner crash

What Works vs What Doesn't

| Model Type | Size | Works? | Speed |
|------------|------|--------|-------|
| TinyLlama | 1.1B | Yes | 25 t/s |
| Mistral | 7B | Yes | 12 t/s |
| Llama3 | 8B | Yes | 10 t/s |
| Qwen3-Coder (MoE) | 30B | No | Timeout |
| Qwen3-Coder-Next (MoE) | 80B | No | Timeout |

Key finding: Standard dense models up to 8B work fine on Vulkan! The issue is specifically with MoE models that have many experts (128 in qwen3-coder's case).

Evidence from dmesg

xe 0000:01:00.0: [drm] GT0: Timedout job: seqno=4294967185, guc_id=2, flags=0x0 in llama-cli
xe 0000:01:00.0: [drm] Xe device coredump has been created
xe 0000:01:00.0: [drm] GT0: Engine reset: engine_class=ccs

Fixes I'm Working On

  1. Kernel module rebuild - Building the xe driver with CONFIG_DRM_XE_JOB_TIMEOUT_MAX=60000 (60 seconds); a rough sketch follows this list

  2. llama.cpp patch - Submitted details to ggml-org/llama.cpp#19327 to split MUL_MAT_ID operations for Intel GPUs

  3. Testing - Will report back with results
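
A rough outline of the rebuild in item 1, assuming a kernel source tree that matches the running kernel (not a tested recipe; Secure Boot, DKMS, and distro packaging will change the details):

    # start from the running kernel's config, raise the xe job-timeout ceiling, rebuild
    cp /boot/config-"$(uname -r)" .config
    scripts/config --set-val DRM_XE_JOB_TIMEOUT_MAX 60000
    make olddefconfig
    make -j"$(nproc)"   # then install modules/kernel as usual for your distro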

Current Workaround

For now, use CPU-only mode for MoE models:

OLLAMA_NUM_GPU=0 ollama run qwen3-coder:30b

Or use non-MoE models which work fine on Vulkan:

ollama run llama3.2:8b  # Works at 10 t/s on Vulkan

Related

  • Full technical details in ggml-org/llama.cpp#19327
  • This likely affects all Intel iGPUs with xe driver running MoE models

@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Update: Working Fix Found!

I've identified a working fix for Intel Arrow Lake + MoE models. The issue is in llama.cpp's Vulkan backend.

Root Cause Confirmed

The Intel xe kernel driver has a hardcoded 10-second job timeout. MoE models with 128 experts (like qwen3-coder) exceed this timeout during MUL_MAT_ID operations.

The Fix

I've posted a patch to ggml-org/llama.cpp#19327 that forces CPU fallback for MUL_MAT_ID operations on Intel GPUs when there are many experts.
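
For testing without the patch, recent llama.cpp builds also have a tensor-override flag that can keep expert weights (and therefore the MUL_MAT_ID work) on the CPU; the tensor-name regex below is an assumption and may need adjusting per model:

    # keep MoE expert tensors on the CPU, everything else on the Vulkan device
    ./build/bin/llama-cli -m /path/to/qwen3-coder-30b.gguf -ngl 99 \
      --override-tensor '\.ffn_.*_exps\.=CPU' -p "Say hello"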

Results with Patched llama.cpp

> Hi
Hello! How can I help you today?

[ Prompt: 12.8 t/s | Generation: 6.4 t/s ]

Before: GPU timeout crash
After: Stable at 6.4 t/s (vs 4.6 t/s pure CPU)

For Ollama Users

Until this is fixed upstream, workarounds:

  1. CPU only (works but slower):

    OLLAMA_NUM_GPU=0 ollama run qwen3-coder:30b
    
  2. Use non-MoE models (work fine on Vulkan):

    ollama run llama3.2:8b  # Works at 10 t/s
    ollama run mistral:7b   # Works at 12 t/s
    

Next Steps

I've proposed the fix to llama.cpp maintainers. Once merged, Ollama should pick it up in a future release.


@chefboyrdave21 commented on GitHub (Feb 4, 2026):

Update: Proxmox PCI Passthrough Environment Details

For completeness, here are the virtualization details. This issue occurs with the Intel iGPU passed through to a VM via PCI passthrough.

Host (Proxmox)

  • Hypervisor: Proxmox VE 9.1.0
  • Host Kernel: 6.17.2-1-pve
  • CPU: Intel Core Ultra 9 285H (Arrow Lake)
  • GPU on Host: Intel Arrow Lake-P [8086:7d51] bound to vfio-pci for passthrough

VM Configuration (102 - ollama)

cpu: host
hostpci0: 00:02.0,pcie=1,x-vga=0,rombar=1
machine: q35
memory: 65536
vga: none

Guest VM

  • OS: Ubuntu 24.04.3 LTS
  • Kernel: 6.14.0-37-generic (with xe.force_probe=7d51 iommu=pt)
  • GPU Driver: xe (Intel Xe KMD)
  • GPU visible as: 01:00.0 inside VM
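
To confirm the passthrough device and its driver binding inside the guest, something like the following can be used (device ID taken from the report):

    # show the passed-through iGPU and which kernel driver is bound to it
    lspci -nnk -d 8086:7d51
    # expected: "Kernel driver in use: xe"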

Does Proxmox/Passthrough Affect This?

Likely not. The xe driver's CONFIG_DRM_XE_JOB_TIMEOUT_MAX=10000ms is a kernel-level constant that would apply equally on bare metal. The PCI passthrough provides near-native GPU access - the timeout mechanism is enforced by the guest kernel's xe driver, not the hypervisor.

The same issue would occur on bare metal with the same kernel and driver versions. I'm documenting this for anyone searching for similar issues in virtualized environments.


Reference: github-starred/ollama#55647