[GH-ISSUE #15285] AMD APU bad allocation of gemma4:eXb models #71838

Open
opened 2026-05-05 02:40:37 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @ByCzech on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15285

What is the issue?

On AMD Ryzen APUs with an iGPU (e.g. Ryzen 7 5700G), loading the gemma4:e4b and gemma4:e2b models looks like this:

NAME          ID              SIZE     PROCESSOR          CONTEXT    UNTIL   
gemma4:e4b    c6eb396dbd59    11 GB    62%/38% CPU/GPU    32768      Forever
NAME          ID              SIZE      PROCESSOR          CONTEXT    UNTIL   
gemma4:e2b    7fbdbf8f5e45    8.4 GB    68%/32% CPU/GPU    32768      Forever

That is, 62% of the model ends up on the CPU / in system RAM.

Other models (even bigger ones) load fine at 100% GPU, including models from the same family, for example:

NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gemma4:26b    5571076f3d70    21 GB    100% GPU     32768      Forever
NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gemma4:31b    6316f0629137    30 GB    100% GPU     32768      Forever
NAME                    ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
mistral-small3.2:24b    5a408ab55df5    21 GB    100% GPU     32768      Forever

On a dedicated GPU, the eXb models are loaded fully onto the GPU (100%):

NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gemma4:e4b    c6eb396dbd59    11 GB    100% GPU     32768      Forever

Debian 13
Kernel 6.18.15
Vulkan 1.4.309

Relevant log output


OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.20.0

GiteaMirror added the bug label 2026-05-05 02:40:37 -05:00
Author
Owner

@chejh-amd commented on GitHub (Apr 8, 2026):

This might be the same Vulkan buffer allocation issue that's showing up in #15261 and #15248. The MoE variants (e4b/e2b) have expert weight tensors that can be pretty large individually, and on iGPUs the Vulkan max buffer allocation size is usually smaller than on discrete cards. When a single tensor exceeds that limit, it falls back to CPU for that layer.

That would explain why the dense models (26b/31b) load fine at 100% GPU even though they're bigger overall, since their per-tensor sizes are more evenly distributed, and why e4b works at 100% on a discrete GPU where the Vulkan heap limit is larger.

Could you grab the Ollama debug logs (OLLAMA_DEBUG=1 ollama serve) while loading gemma4:e4b? I'm looking for any "alloc_tensor_range: failed to allocate Vulkan0 buffer" messages; that would confirm it.
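
For example, something like this should surface them (a minimal sketch; the journalctl variant assumes Ollama runs as a systemd service named ollama):

OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i alloc_tensor_range

# or, if Ollama is already running as a systemd service:
journalctl -u ollama -f | grep -i alloc_tensor_range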

Author
Owner

@ByCzech commented on GitHub (Apr 8, 2026):

You're right, here is the relevant piece of the log:

dub 08 05:09:44 speedy ollama[946703]: alloc_tensor_range: failed to allocate Vulkan0 buffer of size 469762048

Author
Owner

@chejh-amd commented on GitHub (Apr 8, 2026):

Yep, that confirms it: a single buffer allocation of 469,762,048 bytes (448 MiB) is failing on the iGPU's Vulkan heap. The MoE expert weights need one big contiguous allocation per layer, and the iGPU's max buffer size is smaller than what's needed.
Not much you can do on your end right now, unfortunately. This would need either the Vulkan backend to split large tensors across multiple buffers, or Ollama to handle the fallback more gracefully. For now the eXb models will run with that CPU/GPU split on the iGPU. The output should still be correct as long as you're on a version with the MoE precision fix (once it lands in Ollama from upstream llama.cpp #21566).
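
If you want to double-check the limit on your APU, vulkaninfo reports the device's allocation limits; the exact property names vary with driver and extension support, but something along the lines of:

vulkaninfo | grep -iE 'maxMemoryAllocationSize|maxBufferSize'

lets you compare the reported per-allocation ceiling against the 448 MiB allocation from the log above.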

Author
Owner

@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15285
Analyzed: 2026-04-18T18:22:43.213636

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.

Reference: github-starred/ollama#71838