[GH-ISSUE #15749] gemma4:e4b only offloads 2.8 GiB to ROCm GPU despite 7.5 GiB available #56552

Open
opened 2026-04-29 11:00:25 -05:00 by GiteaMirror · 9 comments

Originally created by @Itstommy10 on GitHub (Apr 22, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15749

What is the issue?

GPU: AMD Radeon RX 6600 XT (gfx1032, 8GB VRAM)
OS: Linux
Ollama: 0.21.0
Backend: ROCm

Issue:
Ollama reports 7.5 GiB available on ROCm0 but only offloads 2.8 GiB of
model weights to GPU, putting 6.6 GiB on CPU.

ollama ps shows: 68% CPU / 32% GPU

Logs:
gpu memory id=0 library=ROCm available="7.5 GiB" free="8.0 GiB" minimum="457.0 MiB" overhead="0 B"
model weights device=ROCm0 size="2.8 GiB"
model weights device=CPU size="6.6 GiB"
offloaded 42/43 layers to GPU

Tried:

  • OLLAMA_NUM_GPU=999
  • OLLAMA_GPU_OVERHEAD=0
  • OLLAMA_FLASH_ATTENTION=0
  • HSA_OVERRIDE_GFX_VERSION=10.3.0

None of these changed the weight distribution.

Note: Other models (qwen3.5:9b, qwen3-vl) work fine on this GPU.

Relevant log output

gpu memory id=0 library=ROCm available="7.5 GiB" free="8.0 GiB" minimum="457.0 MiB" overhead="0 B"
model weights device=ROCm0 size="2.8 GiB"
model weights device=CPU   size="6.6 GiB"
offloaded 42/43 layers to GPU
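
A minimal sketch for anyone who wants to compare their own numbers: it parses the four log lines quoted above and computes how much of the weight data actually landed on the GPU. The regular expressions only assume the line layout shown in this report; the helper names (`to_bytes`, `summarize`) are made up for illustration and are not part of Ollama.

```python
import re

# Log lines copied from this report.
LOG = '''\
gpu memory id=0 library=ROCm available="7.5 GiB" free="8.0 GiB" minimum="457.0 MiB" overhead="0 B"
model weights device=ROCm0 size="2.8 GiB"
model weights device=CPU   size="6.6 GiB"
offloaded 42/43 layers to GPU
'''

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}


def to_bytes(text):
    """Convert a size string such as '2.8 GiB' to a byte count."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


def summarize(log):
    """Return the GPU share of weights, the layer share, and VRAM use."""
    weights = {
        device: to_bytes(size)
        for device, size in re.findall(
            r'model weights device=(\S+)\s+size="([^"]+)"', log
        )
    }
    available = to_bytes(re.search(r'available="([^"]+)"', log).group(1))
    done, total = map(
        int, re.search(r"offloaded (\d+)/(\d+) layers", log).groups()
    )
    # Every device other than "CPU" is treated as a GPU (an assumption).
    gpu = sum(size for device, size in weights.items() if device != "CPU")
    return {
        "gpu_weight_share": gpu / sum(weights.values()),
        "layer_share": done / total,
        "available_used": gpu / available,
    }


if __name__ == "__main__":
    for name, value in summarize(LOG).items():
        print(f"{name}: {value:.1%}")
```

For the lines in this report, the GPU holds roughly 30% of the weight bytes and uses about 37% of the reported available VRAM, even though the layer counter says 42 of 43 layers were offloaded.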

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.21.0

GiteaMirror added the bug label 2026-04-29 11:00:25 -05:00

@NBAB42Bq commented on GitHub (Apr 22, 2026):

Have the same issue with an AMD Radeon RX 6700 XT

ollama ps shows: 63% CPU / 37% GPU

Tried the same ENV VARs as you did. Also tried token context 32k (/set parameter num_ctx 32768) and saved it as a new model


@Itstommy10 commented on GitHub (Apr 22, 2026):

Hi @NBAB42Bq, did you manage to get anywhere?


@NBAB42Bq commented on GitHub (Apr 23, 2026):

No, not yet. Yesterday there was an Ollama update on Arch Linux, but the problem stays the same.


@Sasha-BabyBird commented on GitHub (Apr 23, 2026):

Experiencing the same issue with an NVIDIA GeForce RTX 4050 Laptop GPU and the gemma4:e2b model on v0.21.1. The GPU has 6 GB of VRAM, but less than 2 GB gets offloaded to it.


@galuszkak commented on GitHub (Apr 25, 2026):

I didn't have this problem until I upgraded to Ubuntu 26.04. I'm not sure whether the upgrade is the reason, but I now have the same problem.


@jeberger commented on GitHub (Apr 26, 2026):

Same issue here with a Radeon 7600 and gemma4:e2b and e4b, running Ollama from the official 0.21.0-rocm Docker image. I'm attaching the OLLAMA_DEBUG=2 log.

ollama-gemma4-debug.log: https://github.com/user-attachments/files/27096465/ollama-gemma4-debug.log

@deezid commented on GitHub (Apr 28, 2026):

Gemma4:e2b uses like 7.8gb of vram easily.
Won't even fit onto an 8GB GPU due to overhead.


@jeberger commented on GitHub (Apr 28, 2026):

> Gemma4:e2b uses like 7.8gb of vram easily.

So it should easily be able to put 75%-80% of the layers on the GPU (including overhead). The point isn't that it's spilling over onto the CPU; the point is that it's only putting about 1/3 of what it should be able to on the GPU before spilling over, leaving half the VRAM unused.
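
The arithmetic behind that argument can be sketched as follows. This is a rough upper bound, not Ollama's actual scheduling logic: it assumes weights can be split at any granularity and ignores the KV cache and compute buffers. The function name `max_gpu_fraction` and the `reserved_gib` parameter are hypothetical, and the numbers are the ones from the original report (7.5 GiB available, 2.8 GiB + 6.6 GiB of weights), since the attached debug log is not reproduced in this thread.

```python
def max_gpu_fraction(total_gib, available_gib, reserved_gib=0.0):
    """Upper bound on the share of model weights that could sit in VRAM.

    reserved_gib is a hypothetical allowance for overhead; it is not a
    value taken from Ollama.
    """
    usable = max(available_gib - reserved_gib, 0.0)
    return min(usable / total_gib, 1.0)


if __name__ == "__main__":
    total = 2.8 + 6.6                      # GiB of weights, GPU + CPU
    ceiling = max_gpu_fraction(total, 7.5)  # what could fit in theory
    observed = 2.8 / total                  # what the log reports
    print(f"ceiling:  {ceiling:.1%}")
    print(f"observed: {observed:.1%}")
    print(f"observed / ceiling: {observed / ceiling:.2f}")
```

With the original report's figures, the ceiling is close to 80% of the weights, while the log shows roughly 30% on the GPU.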


@Itstommy10 commented on GitHub (Apr 28, 2026):

> Gemma4:e2b

Hi, I actually load Gemma4:e2b at 100% on an 8 GB Radeon 6000xt, and through debugging I see that it weighs 7.4 GB, so it fits without problems and works very well.

Reference: github-starred/ollama#56552