[GH-ISSUE #15295] Gemma4 very slow on GB10 #9786

Closed
opened 2026-04-12 22:40:03 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @phr0gz on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15295

What is the issue?

On GB10 platforms (aka DGX Spark), the gemma4 31B models (Q4, Q8, BF16) are very slow.

For comparison, other models:
qwen3.5:35b-a3b-q4_K_M, response token/s: 59.68
qwen3.5:122b-a10b-q4_K_M, response token/s: 24
gemma4:26b-a4b-it-q4_K_M, response token/s: 58.67

And the gemma4 31B value:
gemma4:31b-it-q4_K_M, response token/s: 10.34

EDIT: I also tested with llama.cpp, and it's as slow as ollama.
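A minimal sketch of how figures like these can be reproduced, assuming a local Ollama server on the default port; it uses the `/api/generate` endpoint's `eval_count` and `eval_duration` (nanoseconds) fields to compute decode tokens/s. The model names and prompt below are illustrative only.

```python
import json
import urllib.request

# Assumption: Ollama is listening on the default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"

def tokens_per_second(model: str, prompt: str) -> float:
    """Run one non-streaming generation and compute decode tokens/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # eval_count = generated tokens, eval_duration = generation time in ns
    return body["eval_count"] / (body["eval_duration"] / 1e9)

if __name__ == "__main__":
    for model in ("gemma4:31b-it-q4_K_M", "qwen3.5:35b-a3b-q4_K_M"):
        rate = tokens_per_second(model, "Write a haiku about GPUs.")
        print(f"{model}: {rate:.2f} tok/s")
```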

Relevant log output


OS

Docker

GPU

Nvidia

CPU

Other

Ollama version

0.20.0

GiteaMirror added the bug label 2026-04-12 22:40:03 -05:00
Author
Owner

@3DPJamie commented on GitHub (Apr 3, 2026):

Are you aware of the difference between dense models and activated-parameter (MoE) models? Your examples are:
qwen3.5:35b-a3b-q4_K_M, response token/s: 59.68
qwen3.5:122b-a10b-q4_K_M, response token/s: 24
gemma4:26b-a4b-it-q4_K_M, response token/s: 58.67

which all have only 3-10B active parameters, while gemma4:31b is a dense model with 31B active parameters. Try your test with the qwen3.5:27b model and do the comparison again.
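A rough back-of-the-envelope check of this explanation, assuming decode speed is memory-bandwidth-bound (every active weight is streamed from memory once per token). The ~273 GB/s bandwidth commonly cited for GB10 / DGX Spark and the ~4.5 bits/weight average for Q4_K_M are approximations, not measured values.

```python
# Assumptions: approximate GB10 LPDDR5X bandwidth and Q4_K_M average bit width.
BANDWIDTH_BYTES_PER_S = 273e9      # ~273 GB/s, treat as approximate
BYTES_PER_WEIGHT_Q4 = 4.5 / 8      # Q4_K_M averages roughly 4.5 bits per weight

def est_tokens_per_second(active_params_billion: float) -> float:
    """Upper-bound decode rate if generation is purely bandwidth-limited."""
    bytes_per_token = active_params_billion * 1e9 * BYTES_PER_WEIGHT_Q4
    return BANDWIDTH_BYTES_PER_S / bytes_per_token

# Dense 31B model: all 31B weights are read for every token (~16 tok/s ceiling).
print(f"gemma4 31B dense : ~{est_tokens_per_second(31):.0f} tok/s upper bound")
# MoE with ~4B active parameters: only routed experts are read (~120 tok/s ceiling).
print(f"~4B-active MoE   : ~{est_tokens_per_second(4):.0f} tok/s upper bound")
```

The observed 10.34 tok/s vs. ~59 tok/s sit below these ceilings by a similar factor, which is consistent with the dense-vs-MoE explanation rather than a GB10-specific regression.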

Author
Owner

@phr0gz commented on GitHub (Apr 3, 2026):

You are completely right:
qwen3.5:27b-q4_K_M, response token/s: 11.51


Reference: github-starred/ollama#9786