[GH-ISSUE #15286] gemma4:31b Performance issues and high resource usage #71839

Open
opened 2026-05-05 02:40:41 -05:00 by GiteaMirror · 30 comments

Originally created by @Yevhen-Myroshnychenko on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15286

What is the issue?

I have an M1 Max with 64 GB. From the model description it looks like the model should run smoothly, but in reality it's almost unusable; responses are far too slow. I tried changing the context length, but that didn't improve the speed.

The MacBook running the model also heats up significantly, which I have not seen before with models 1.5–2 times larger in size.

gemma4:31b, model size 19 GB

Relevant log output


OS

macOS Tahoe 26.3.1 (a)

GPU

64 GB

CPU

10-core CPU (8 performance, 2 efficiency)

Ollama version

0.20.0

GiteaMirror added the bug, performance labels 2026-05-05 02:40:41 -05:00

@Cephei-OpenSource commented on GitHub (Apr 3, 2026):

I can confirm this. Although the model fits fully on the card (in my case an NVIDIA RTX 6000 Pro with 96 GB VRAM), the GPU runs at only about 30% load while CPU usage is extremely high on all 48 cores of my CPU. I assume that is why this model is so slow. Even though the model is completely loaded into VRAM, it barely uses the GPU and does most of the work on the CPU. Note: other models of comparable size run fully on the GPU and don't show this high CPU usage on my system. Kind regards


@Yevhen-Myroshnychenko commented on GitHub (Apr 3, 2026):

@Cephei-OpenSource In my case with the M1 Max, CPU load is only around 10%.


@Cephei-OpenSource commented on GitHub (Apr 3, 2026):

Maybe a CUDA problem on Linux, or with the NVIDIA driver for Linux? For further detail: I'm running Ubuntu 24.04 with the newest CUDA and NVIDIA driver versions (the newest available in the standard repository, to be specific).


@seamon67 commented on GitHub (Apr 3, 2026):

> I can confirm this. Although the model fits fully in a card (in my case Nvidia RTX 6000 Pro with 96GB VRAM) what happens is, the GPU runs with only about 30% load, while CPU usage is extremely high on all 48 cores of my CPU. And that's why this model is so slow, I assume. Although the model is completely loaded into VRAM, it still uses the GPU just a little bit and yet does most of the work with the CPU. Note: Other models of comparable size operate fully in the GPU and don't show this behaviour of high CPU usage on my system. Kind regards

Can confirm this is happening with an RTX 6000 Pro on Ubuntu 24.04 + CUDA 13.0.


@fcorneli commented on GitHub (Apr 3, 2026):

I notice the same on AlmaLinux 10, CUDA 13.2, RTX 6000 PRO. Simple test:

```
ollama run gemma4:31b-it-q8_0 "Explain E=mc^2" --verbose
```

I only get 29 tokens/sec. Plenty of headroom:

```
ollama ps
NAME                  ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:31b-it-q8_0    53dd8459790f    51 GB    100% GPU     262144     6 minutes from now
```

The GPU only uses 400 W, while the CPU load hits 250%.


@fcorneli commented on GitHub (Apr 3, 2026):

It seems that using `llama.cpp` directly does not burn CPU cycles. The latest git pull gives me, for a similar simple test:

```
./build/bin/llama-cli --hf-repo unsloth/gemma-4-31B-it-GGUF:Q8_0 --prompt "Explain E=mc^2" --device CUDA0 --ctx-size 0
```

already 42 tokens/sec.


@gordan-bobic commented on GitHub (Apr 3, 2026):

I can confirm that the problem seems to be a lack of GPU offload. The log shows all 61 layers offloaded to the GPU, but most of the processing seems to be happening on the CPU. I am on Linux/CUDA; 4 GPUs have memory allocated on them, with only about 2.4 GB allocated on the CPU, yet most of the compute appears to happen on the CPU while the GPUs mostly sit idle.


@greg-hydrogen commented on GitHub (Apr 4, 2026):

I am experiencing the same issue: CPU usage is at 1000% even though everything says the model is offloaded to the GPU. Maybe the original author of this issue could change the title to be more generic, to cover the issue we are all having?


@Cephei-OpenSource commented on GitHub (Apr 4, 2026):

I just installed the new version, 0.20.2. This has improved the problem a bit: GPU usage is now at 50%, and CPU usage, while still heavy, is lower than before. Yet token speed is still only around 20 t/s, which is quite disappointing on an NVIDIA RTX 6000 Pro. Kind regards


@Yevhen-Myroshnychenko commented on GitHub (Apr 4, 2026):

@greg-hydrogen I hope the new title is better. If you'd like, send me more appropriate wording and I will change it. Thank you.


@homjay commented on GitHub (Apr 4, 2026):

> I can confirm that the problem seems to be the lack of GPU offload. Log shows all 61 layers are offloaded to GPU, but most of the processing seems to be happening on the CPU. I am on Linux/CUDA, and 4 GPUs have memory allocated on them, with CPU memory allocation only about 2.4GB, but most of the compute seems to be happening on CPU while GPUs mostly sit idle.

You're not alone; same problem here with multiple 4090 GPUs / CUDA 13.


@Wladastic commented on GitHub (Apr 4, 2026):

I just tried the same models in LM Studio; they work flawlessly there.
So it must be something in the ollama implementation.


@gordan-bobic commented on GitHub (Apr 4, 2026):

I can also confirm it works fine in llama.cpp, without exhibiting the anomaly. What I have noticed when using llama.cpp directly is that the problem doesn't manifest with q4_0/q4_1 KV quantization, but it does manifest with iq4_nl KV quantization. Not sure if that is related, but the symptoms are the same (memory allocated on the GPU, but all compute happening on the CPU).
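
(For anyone trying to reproduce that comparison, a minimal sketch of selecting KV cache quantization types when invoking llama.cpp directly; the flag values assume a recent llama-cli build, and the model path is only an example.)

```shell
# Illustrative sketch: same prompt, two KV cache quantization types.
# Note that quantizing the V cache typically requires flash attention to be enabled.
MODEL=./models/gemma-4-31B-it-Q8_0.gguf   # example path

# q4_0 KV cache (reported above as NOT showing the CPU-bound anomaly)
./build/bin/llama-cli -m "$MODEL" --prompt "Explain E=mc^2" \
  --cache-type-k q4_0 --cache-type-v q4_0

# iq4_nl KV cache (reported above as showing the anomaly)
./build/bin/llama-cli -m "$MODEL" --prompt "Explain E=mc^2" \
  --cache-type-k iq4_nl --cache-type-v iq4_nl
```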


@somera commented on GitHub (Apr 4, 2026):

Same here ... https://github.com/ollama/ollama/issues/15237#issuecomment-4187131443


@hoondy commented on GitHub (Apr 5, 2026):

Same here with M1 Max 64GB. Painfully slow.


@AMcPherran commented on GitHub (Apr 6, 2026):

I'm experiencing the same issue with Ollama running in Docker on Fedora with an RTX 3090. nvidia-smi shows the model fully loaded in GPU memory with headroom remaining, yet all CPU cores are maxed out and responses are too slow for it to be usable.
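
(For reference, a quick way to watch the GPU/CPU split while a prompt is generating; these are standard NVIDIA/Linux tools, nothing ollama-specific.)

```shell
# Refresh the standard nvidia-smi view once per second (utilization, memory, power)
watch -n 1 nvidia-smi

# Or stream per-GPU utilization and memory samples continuously
nvidia-smi dmon -s um

# CPU side: per-core load in top (press 1 to expand the per-core view)
top
```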


@ivleth commented on GitHub (Apr 7, 2026):

Same story: NVIDIA A5000 24 GB, latest NVIDIA driver and latest CUDA 13.2. Gemma3 27b blasts away at 30 tokens/sec; Gemma4 manages only 3 tokens/sec. The big difference is indeed CPU usage, which is much higher with Gemma4 than with Gemma3. I only adjusted these parameters:

temperature=1.0
top_p=0.95
top_k=64

Leaving them at the defaults does not make any difference.


@akniy commented on GitHub (Apr 7, 2026):

I am experiencing similarly low performance with gemma4:e4b on a Tesla V100 with NVIDIA Linux driver 580.126.09 and CUDA 13.0: it generates 15 response tokens/s with high CPU usage across 8 cores, while gemma3:4b generates 81 response tokens/s with little to no CPU usage on the same setup.


@jagmarques commented on GitHub (Apr 7, 2026):

The performance profile you're seeing with gemma4:31b on the M1 Max (slow responses plus strong heating) is consistent with KV cache pressure at longer context lengths: the model's attention pattern forces a full cache traversal at each step, and at a 19 GB model size the M1's unified memory bandwidth gets saturated quickly.

A few things to try:

  1. Explicitly cap the context: gemma4 defaults to a large context window, and capping it reduces both memory use and per-token compute (see the sketch after this list).
  2. Monitor memory pressure while running: if you're in the red zone, the system is compressing/swapping, which explains the heat.
  3. Check whether it's an MoE model: gemma4:31b is an MoE architecture, and the routing overhead at inference can be disproportionate on Apple Silicon compared to dense models of similar size.

For context: KV cache compression can help here; the cache grows with every generated token, and in long conversations it becomes the bottleneck. [NexusQuant](https://github.com/jagmarques/nexusquant) compresses KV caches 7–10x training-free, though it's currently PyTorch-only and not yet integrated into ollama/llama.cpp. For now, limiting the context is the most practical lever.
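
(Illustrative only: a minimal sketch of capping the context window per request in ollama. The `num_ctx` option and the generate API are standard ollama mechanisms; the value 8192 and the model name are just examples.)

```shell
# Cap the context for a single request via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:31b",
  "prompt": "Explain E=mc^2",
  "options": { "num_ctx": 8192 }
}'
```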


@hoondy commented on GitHub (Apr 7, 2026):

> 3. Check whether it's an MoE model: gemma4:31b is an MoE architecture, and the routing overhead at inference can be disproportionate on Apple Silicon compared to dense models of similar size.

gemma4:31b is a dense model. The MoE variant is gemma4:26b (26B A4B MoE).


@rick-github commented on GitHub (Apr 7, 2026):

If you have flash attention enabled, disable it until upgrading to 0.20.4.
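
(For reference, a hedged sketch of how flash attention is commonly toggled for the ollama server via its `OLLAMA_FLASH_ATTENTION` environment variable; the systemd unit name and the exact launch method depend on your install.)

```shell
# Disable flash attention when launching the server manually
OLLAMA_FLASH_ATTENTION=0 ollama serve

# On a systemd-based Linux install, add the setting to the service instead:
#   sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=0"
#   sudo systemctl restart ollama
```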


@ForsakenHarmony commented on GitHub (Apr 7, 2026):

I've disabled flash attention; same thing for me.

ollama is at 6 t/s, unsloth studio at 120 t/s.


@rick-github commented on GitHub (Apr 7, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@ForsakenHarmony commented on GitHub (Apr 7, 2026):

using unsloth gemma 4 e4b UD-Q4_K_XL

[server.log](https://github.com/user-attachments/files/26554192/server.log)


@rick-github commented on GitHub (Apr 7, 2026):

The model is not optimized for the ollama engine.

```
$ for i in gemma-4-e4b-no-vision gemma4:e4b ; do echo $i ; ollama run $i why is the sky blue --verbose 2>&1 | grep eval.rate ; done
gemma-4-e4b-no-vision
prompt eval rate:     42.02 tokens/s
eval rate:            13.64 tokens/s
gemma4:e4b
prompt eval rate:     165.22 tokens/s
eval rate:            184.09 tokens/s
```

There are a number of tensors that are higher precision in the unsloth model compared to the ollama model, so the computational requirements differ.

```
blk.0.inp_gate.weight                Q4_K    [2560 256]   |    blk.0.inp_gate.weight                F32     [2560 256]
blk.0.proj.weight                    Q4_K    [256 2560]   |    blk.0.proj.weight                    F32     [256 2560]
...
blk.39.ffn_up.weight                 Q4_K    [2560 10240] |    blk.39.ffn_up.weight                 unknown [2560 10240]
blk.39.inp_gate.weight               Q4_K    [2560 256]   |    blk.39.inp_gate.weight               F32     [2560 256]
blk.39.proj.weight                   Q4_K    [256 2560]   |    blk.39.proj.weight                   F32     [256 2560]
per_layer_model_proj.weight          Q4_K    [2560 10752] |    per_layer_model_proj.weight          BF16    [2560 10752]
per_layer_token_embd.weight          BF16    [10752 26214 |    per_layer_token_embd.weight          Q5_K    [10752 26214
token_embd.weight                    Q6_K    [2560 262144 |    token_embd.weight                    Q5_K    [2560 262144
```
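
(If you want to compare tensor precisions yourself, one possible approach uses the `gguf` Python package that ships with llama.cpp; treat the package's `gguf-dump` script and the file paths below as assumptions about your setup rather than a fixed recipe.)

```shell
# Assumed setup: `pip install gguf` provides a gguf-dump script that lists
# GGUF metadata and per-tensor quantization types.
pip install gguf

# Find the GGUF blob backing an ollama model (the FROM line in the modelfile)
ollama show gemma4:e4b --modelfile | grep ^FROM

# Dump tensor names/types for each file and compare side by side
gguf-dump /path/to/ollama-blob.gguf  > ollama-tensors.txt
gguf-dump /path/to/unsloth-e4b.gguf  > unsloth-tensors.txt
diff -y ollama-tensors.txt unsloth-tensors.txt | less
```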

@ivleth commented on GitHub (Apr 10, 2026):

I just updated to the latest version of ollama on Windows, v0.20.5, and the performance issues are solved for me now. I am getting a steady 26 tokens/sec on an NVIDIA A5000 24 GB, latest NVIDIA driver and latest CUDA 13.2, with these parameters:

temperature=1.0
top_p=0.95
top_k=64

Model used: gemma4:31b


@Cephei-OpenSource commented on GitHub (Apr 10, 2026):

I thought this thread was closed, which is why I didn't answer before. Yes, I can also confirm that since version 0.20.4 the problem is solved and FLASH_ATTENTION can be enabled again. Many thanks and kind regards


@fcorneli commented on GitHub (Apr 11, 2026):

New benchmark results for ollama version 0.20.5 on an RTX 6000 PRO.

```
ollama run gemma4:31b-it-q8_0 "Explain E=mc^2" --verbose
...
eval rate:            41.35 tokens/s
```

Compared with `llama.cpp` latest pull:

```
./build/bin/llama-cli --hf-repo unsloth/gemma-4-31B-it-GGUF:Q8_0 --prompt "Explain E=mc^2" --device CUDA0 --ctx-size 0
...
[ Prompt: 344,7 t/s | Generation: 42,4 t/s ]
```

they are on par now.


@lundybernard commented on GitHub (Apr 15, 2026):

Same issue on an AMD Ryzen AI 9 HX PRO 375 w/ Radeon 890M × 24.
32 of 64 GB allocated to VRAM.
Ubuntu 24.04
Linux kernel v6.19.10
Ollama v0.20.7
gemma4:31b

Running the model almost maxes out VRAM: 32000M / 32594M.
System RAM jumps from 8.86 GB to 25.1 GB / 31.1 GB.

```
> ollama run gemma4:31b "Explain E=mc^2" --verbose
...
total duration:       7m15.138388203s
load duration:        19.690769893s
prompt eval count:    22 token(s)
prompt eval duration: 1.044065201s
prompt eval rate:     21.07 tokens/s
eval count:           1325 token(s)
eval duration:        6m53.792264103s
eval rate:            3.20 tokens/s
```

Update:
I lowered the context size to 32k, which fixed the paging to system RAM; it now uses 23.4 GB of VRAM.

```
total duration:       4m59.828915446s
prompt eval rate:     90.88 tokens/s
eval rate:            4.31 tokens/s
```

If you're having this issue, try lowering the context size (illustrated below).
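
(As a concrete illustration of lowering the context, two common ollama mechanisms; 32768 matches the 32k value used above, and the environment variable assumes a reasonably recent ollama server.)

```shell
# Per session, inside an interactive `ollama run gemma4:31b`:
#   /set parameter num_ctx 32768

# Or set a server-wide default context length when starting the server
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
```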


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15286
Analyzed: 2026-04-18T18:22:42.325377

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


Reference: github-starred/ollama#71839