[GH-ISSUE #15877] Vulkan backend ignores maxComputeWorkGroupInvocations and creates 512-invocation pipelines (breaks Raspberry Pi 5 / V3D, cap = 256) #72175

Open
opened 2026-05-05 03:35:39 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @txenoo on GitHub (Apr 29, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15877

What is the issue?

Summary

Ollama's Vulkan backend hardcodes {512, 1, 1} workgroup sizes in dozens of ggml_vk_create_pipeline(...) calls and in the corresponding .comp shaders (layout(local_size_x = 512, ...)). On Raspberry Pi 5 (Broadcom V3D, V3DV driver) the hardware advertises VkPhysicalDeviceLimits::maxComputeWorkGroupInvocations = 256, so every one of those pipelines exceeds the device limit and is rejected by the V3D shader compiler:

nir_to_vir.c:  assert(c->local_invocation_index_bits <= 8);

(8 bits = 256 invocations; a 512-invocation workgroup needs 9 bits for its local invocation index, so the assert fires.)

This is the same root cause behind the "Vulkan doesn't work on Raspberry Pi 5" reports against llama.cpp:

  • ggml-org/llama.cpp#9801 — "Bug: [vulkan] llama.cpp not work on Raspberry Pi 5" (https://github.com/ggml-org/llama.cpp/issues/9801)
  • ggml-org/llama.cpp#5237 — original Jan 2024 report (https://github.com/ggml-org/llama.cpp/issues/5237)

This issue identifies the specific bug and points to the V3DV-side workaround pending to land in Mesa (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41257).

Environment

  • Hardware: Raspberry Pi 5 (Broadcom BCM2712, V3D 7.1.7)
  • OS: Debian Trixie (aarch64)
  • Driver: Mesa V3DV (Vulkan 1.3)
  • Ollama: built with -DGGML_VULKAN=ON, server launched with OLLAMA_VULKAN=1
  • Device limits (relevant ones):
    • maxComputeWorkGroupInvocations = 256
    • maxComputeWorkGroupSize = [256, 256, 256]
    • subgroupSize = 16
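
These limits can be confirmed independently of Ollama, e.g. with `vulkaninfo | grep maxComputeWorkGroup`, or with a minimal standalone query (an illustrative sketch, not part of the Ollama tree):

```cpp
// Standalone limit check. Build with: g++ check_limits.cpp -lvulkan
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app{VK_STRUCTURE_TYPE_APPLICATION_INFO};
    app.apiVersion = VK_API_VERSION_1_3;
    VkInstanceCreateInfo ici{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
    ici.pApplicationInfo = &app;
    VkInstance instance;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);
        // On V3DV this prints maxComputeWorkGroupInvocations = 256.
        printf("%s: maxComputeWorkGroupInvocations = %u, maxComputeWorkGroupSize = [%u, %u, %u]\n",
               props.deviceName,
               props.limits.maxComputeWorkGroupInvocations,
               props.limits.maxComputeWorkGroupSize[0],
               props.limits.maxComputeWorkGroupSize[1],
               props.limits.maxComputeWorkGroupSize[2]);
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}
```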

What's happening

  1. ggml-vulkan.cpp:4478 correctly reads the device limit:

    device->max_workgroup_size_log2 =
        uint32_t(log2f(float(device->properties.limits.maxComputeWorkGroupInvocations)));
    

    On V3D this yields log2(256) = 8.

  2. A handful of code paths honor that cap, but most of the pipelines hardcode {512, 1, 1} and ignore max_workgroup_size_log2. The matching .comp shaders declare the same hardcoded size, e.g. abs.comp, concat.comp, copy.comp, … all local_size_x = 512.
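
Reduced to its essence, the per-call-site fix is a clamp against the limit the backend already computed. A minimal sketch (the helper below is hypothetical, not the ggml-vulkan API):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical helper: clamp a requested 1-D workgroup size to the
// device's maxComputeWorkGroupInvocations, which the backend already
// stores as a log2 (8 on V3DV, i.e. 256 invocations).
std::array<uint32_t, 3> clamp_workgroup(uint32_t requested_x,
                                        uint32_t max_workgroup_size_log2) {
    const uint32_t max_invocations = 1u << max_workgroup_size_log2;
    return { std::min(requested_x, max_invocations), 1u, 1u };
}

// clamp_workgroup(512, 8) yields {256, 1, 1} instead of the
// {512, 1, 1} that V3DV rejects.
```

Clamping the host side is only half of the fix: the matching .comp shaders declare local_size_x = 512 as a literal, so the shader-side size has to shrink together with it (see the specialization-constant sketch under Expected behavior).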

Steps to reproduce

On a Raspberry Pi 5 with stock Mesa V3DV (no workaround applied):

# 1. Start the server with Vulkan enabled (it's off by default).
OLLAMA_VULKAN=1 ollama serve

# 2. In another shell, run any model.
ollama run gemma3:1b

The runner crashes during shader pipeline creation on the first compute dispatch. The affected ops (cpy, get_rows, add/mul, gelu, soft_max, rope, …) are on the hot path of every transformer forward pass, so any model fails immediately.

Expected behavior

Ollama's Vulkan backend should respect VkPhysicalDeviceLimits::maxComputeWorkGroupInvocations for every pipeline it creates, the same way it already does for argsort.
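
One conventional Vulkan mechanism for this (a sketch under assumptions; the helper name is hypothetical and this is not necessarily how the existing argsort path does it) is to declare local_size_x as a specialization constant in the shader and fill it per device at pipeline creation:

```cpp
#include <vulkan/vulkan.h>
#include <algorithm>
#include <cstdint>

// The shader would declare its workgroup size as a specialization
// constant instead of a literal 512:
//
//   layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
//
// The host then clamps and specializes per device. The out-parameters
// must stay valid until vkCreateComputePipelines consumes them.
VkPipelineShaderStageCreateInfo make_compute_stage(
        VkShaderModule module,
        const VkPhysicalDeviceLimits &limits,
        uint32_t preferred_x,
        uint32_t *specialized_x,
        VkSpecializationMapEntry *entry,
        VkSpecializationInfo *spec) {
    *specialized_x = std::min(preferred_x, limits.maxComputeWorkGroupInvocations);

    entry->constantID = 0;                    // matches local_size_x_id = 0
    entry->offset     = 0;
    entry->size       = sizeof(uint32_t);

    spec->mapEntryCount = 1;
    spec->pMapEntries   = entry;
    spec->dataSize      = sizeof(uint32_t);
    spec->pData         = specialized_x;

    VkPipelineShaderStageCreateInfo stage{VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO};
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = module;
    stage.pName  = "main";
    stage.pSpecializationInfo = spec;
    return stage;
}
```

On V3DV this would specialize the kernels to 256 invocations while leaving 512 on devices that allow it; the dispatch's group counts then need the matching adjustment so the total number of invocations covered stays the same.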

Driver-side workaround (V3DV)

We implemented a workaround on the driver side for applications that don't honor maxComputeWorkGroupInvocations: V3DV now lowers oversized workgroups itself instead of asserting (the lowering shrinks the workgroup to the 256-invocation cap and wraps the shader body in a loop so each invocation covers the missing work). Mesa MRs:

  • https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41257 — likely sufficient on its own; lowers oversized compute workgroups to 256 invocations so ggml-vulkan pipelines compile on V3DV.
  • https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41256 — fixes a shader compilation bug.
  • https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255 — needed for performance; enables the v8dot HW.

With these three Mesa MRs the RPi5 can run Ollama, although there are many opportunities for performance improvements, both in Mesa and on the ggml side, by making adjustments for the Broadcom architecture as is done for other vendors.

These let RPi5 users run Ollama today, but the wrapping loop the driver inserts is pure overhead — fixing it in the backend would let V3D run dispatches at native size.

Relevant log output


OS

Linux

GPU

Other

CPU

Other

Ollama version

0.21.2

GiteaMirror added the bug label 2026-05-05 03:35:39 -05:00
Reference: github-starred/ollama#72175