[GH-ISSUE #15274] Gemma 4 31B - Very low GPU power utilization [RTX 5090] Linux = max 50% [Power] #56283

Closed
opened 2026-04-29 10:33:51 -05:00 by GiteaMirror · 12 comments
Owner

Originally created by @eXt73 on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15274

What is the issue?

Under Linux, Gemma 4 31B exhibits very low GPU resource utilization... oscillating between 25% and 50%. Most frequently, it is around 40%. [Power]

[Image: screenshot of GPU utilization oscillating between 25% and 50%]

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.20.0

GiteaMirror added the bug label 2026-04-29 10:33:51 -05:00

@tidely commented on GitHub (Apr 3, 2026):

Your GPU doesn't have enough VRAM to run Gemma 4 31B with a 65536 context window, which means part of the model is offloaded to the CPU, and the CPU ends up doing a lot of the work. For reference, I can only run Gemma 4 31B at a 16k context window with 32GB of VRAM without it spilling over to the CPU.
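The sizing argument above can be sketched with a back-of-the-envelope KV-cache estimate. The hyperparameters below (layer count, KV heads, head dimension) are illustrative assumptions, not Gemma 4 31B's real config; the point is only how the cache scales linearly with context length:

```python
# Rough KV-cache size: 2 tensors (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_element.
# NOTE: the hyperparameters used below are ASSUMED for illustration.
def kv_cache_bytes(layers, kv_heads, head_dim, ctx, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem

GIB = 1024 ** 3
# Assumed config: 48 layers, 8 KV heads, head_dim 128, f16 cache (2 bytes).
at_64k = kv_cache_bytes(48, 8, 128, 65536, 2) / GIB
at_16k = kv_cache_bytes(48, 8, 128, 16384, 2) / GIB
print(f"64k ctx: {at_64k:.1f} GiB, 16k ctx: {at_16k:.1f} GiB")
# With these assumptions: 12.0 GiB at 64k vs 3.0 GiB at 16k -- on top of
# the model weights, which is why a 64k window can overflow a 32GB card.
```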


@eXt73 commented on GitHub (Apr 3, 2026):

With KV cache quantization at Q4, it does fit. See the ollama screenshot. Qwen 27B Q4_K_M runs smoothly with a 62K window, so something's clearly messed up here...

[Images: two full-screen ollama screenshots]

@Wladastic commented on GitHub (Apr 4, 2026):

Same here: 24GB of 32GB VRAM used at 32k context.
KV cache quantization set to q8 and flash attention enabled.
Still the same result.
v0.20 took 7 minutes for 32k, and v0.20.2 took 20 minutes.
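For anyone trying to reproduce the settings this comment describes, the two knobs map to Ollama server environment variables (names per the Ollama FAQ; this is a sketch of one way to set them, not the only one):

```shell
# Enable flash attention and quantize the KV cache to q8_0.
# These must be set in the environment of the server process
# (systemd users: add them to the ollama service unit instead).
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```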


@homjay commented on GitHub (Apr 4, 2026):

> Your GPU doesn't have enough VRAM to run Gemma 4 31B with a 65536 context window, which means the CPU is doing a lot of work while offloading some parts to the GPU. For reference, I can only run Gemma 4 31B at a 16k context window with 32GB of VRAM without it turning to the CPU for help.

With flash attention and KV cache quantization, memory usage is not the problem.

Gemma 4 31B stalls on the CPU with zero GPU utilization (not a memory issue) on 4×4090s,

and I can't stop it:

```shell
NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:31b    6316f0629137    37 GB    100% GPU     262144     Stopping...
```
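One way to cross-check Ollama's reported "100% GPU" against what the card is actually doing, and to unload the model without waiting on "Stopping...", is a loop like this (a sketch; the model name matches the listing above, and `nvidia-smi` must be on the PATH):

```shell
# Scheduler's view of the CPU/GPU split for loaded models.
ollama ps

# The GPU's own counters, sampled once per second -- if utilization.gpu
# sits near 0% while tokens are being generated, the work is on the CPU.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# Ask the server to unload the model explicitly.
ollama stop gemma4:31b
```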

@Wladastic commented on GitHub (Apr 4, 2026):

I just compared to lm-studio, it's working flawlessly there...


@royolsen commented on GitHub (Apr 4, 2026):

I'm seeing the same on an H100 80GB (gemma4:31b-it-q8_0 with a 128k context window and Q8 KV cache). Running at about 1600% CPU and low GPU usage (18-35%). I'll add that I'm seeing 12 tps at 10k context.


@SingKS8 commented on GitHub (Apr 5, 2026):

Same variable settings as above on 4×3090: the PROCESSOR "100% GPU" reading is wrong, almost 1600% CPU is actually in use.
I have tried gemma4:31b on llama.cpp, and it works quite well.
Right now, running Gemma 4 on my phone is faster than ollama on 4×3090. 😅


@mjolley9 commented on GitHub (Apr 5, 2026):

Same issue here with both gemma4:26b-a4b-it-q4_K_M and gemma4:31b-it-q4_K_M on an RTX 6000 Pro. At best it hovers around 35-40% GPU (the same task on Qwen3.5 typically holds at 80% GPU).


@pulsar85 commented on GitHub (Apr 7, 2026):

I have the same problem.


@eXt73 commented on GitHub (Apr 8, 2026):

Version https://github.com/ollama/ollama/releases/tag/v0.20.4-rc2 solves this problem.


@pulsar85 commented on GitHub (Apr 8, 2026):

> Version v0.20.4-rc2 solves this problem

Yes, this version works well. Thank you!


@josuemaisonet-source commented on GitHub (Apr 15, 2026):

I had a similar issue with an RTX 5060 and Gemma 4 4B.

Reference: github-starred/ollama#56283