[GH-ISSUE #15286] gemma4:31b Performance issues and high resource usage #71839

Open
opened 2026-05-05 02:40:41 -05:00 by GiteaMirror · 30 comments

Originally created by @Yevhen-Myroshnychenko on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15286

What is the issue?

I have an M1 Max with 64 GB. From the model description it looks like the model should run smoothly, but in reality it's almost unusable; responses are far too slow. I tried changing the context length, but that didn't improve the speed.

The MacBook running the model also heats up significantly, which I have not seen before with models 1.5–2 times larger in size.

gemma4:31b, model size 19 GB

Relevant log output


OS

macOS Tahoe 26.3.1 (a)

GPU

64 GB

CPU

10-core CPU (8 performance, 2 efficiency)

Ollama version

0.20.0

GiteaMirror added the bug, performance labels 2026-05-05 02:40:41 -05:00

@Cephei-OpenSource commented on GitHub (Apr 3, 2026):

I can confirm this. Although the model fits fully on the card (in my case an NVIDIA RTX 6000 Pro with 96 GB VRAM), the GPU runs at only about 30% load while CPU usage is extremely high on all 48 cores of my CPU. I assume that is why this model is so slow. Even though the model is completely loaded into VRAM, it barely uses the GPU and does most of the work on the CPU. Note: other models of comparable size run fully on the GPU and don't show this high CPU usage on my system. Kind regards


@Yevhen-Myroshnychenko commented on GitHub (Apr 3, 2026):

@Cephei-OpenSource In my case with the M1 Max, CPU load is only around 10%.


@Cephei-OpenSource commented on GitHub (Apr 3, 2026):

Maybe a CUDA problem on Linux, or with the NVIDIA driver for Linux? For further detail: I'm running Ubuntu 24.04 with the newest CUDA and NVIDIA driver versions (the newest available in the standard repository, to be specific).


@seamon67 commented on GitHub (Apr 3, 2026):

> I can confirm this. Although the model fits fully in a card (in my case Nvidia RTX 6000 Pro with 96GB VRAM) what happens is, the GPU runs with only about 30% load, while CPU usage is extremely high on all 48 cores of my CPU. And that's why this model is so slow, I assume. Although the model is completely loaded into VRAM, it still uses the GPU just a little bit and yet does most of the work with the CPU. Note: Other models of comparable size operate fully in the GPU and don't show this behaviour of high CPU usage on my system. Kind regards

Can confirm this is happening with an RTX 6000 Pro on Ubuntu 24.04 + CUDA 13.0.


@fcorneli commented on GitHub (Apr 3, 2026):

I notice the same on AlmaLinux 10, CUDA 13.2, RTX 6000 PRO. Simple test:

```
ollama run gemma4:31b-it-q8_0 "Explain E=mc^2" --verbose
```

I only get 29 tokens/sec. Plenty of headroom:

```
ollama ps
NAME                  ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:31b-it-q8_0    53dd8459790f    51 GB    100% GPU     262144     6 minutes from now
```

The GPU only uses 400 W, while the CPU load hits 250%.


@fcorneli commented on GitHub (Apr 3, 2026):

It seems that using `llama.cpp` directly does not burn CPU cycles. The latest git pull gives me, for a similar simple test:

```
./build/bin/llama-cli --hf-repo unsloth/gemma-4-31B-it-GGUF:Q8_0 --prompt "Explain E=mc^2" --device CUDA0 --ctx-size 0
```

already 42 tokens/sec.


@gordan-bobic commented on GitHub (Apr 3, 2026):

I can confirm that the problem seems to be a lack of GPU offload. The log shows all 61 layers offloaded to the GPU, but most of the processing seems to be happening on the CPU. I am on Linux/CUDA; 4 GPUs have memory allocated on them, with only about 2.4 GB allocated on the CPU, yet most of the compute appears to happen on the CPU while the GPUs mostly sit idle.


@greg-hydrogen commented on GitHub (Apr 4, 2026):

I am experiencing the same issue: CPU usage is at 1000% even though everything says the model is offloaded to the GPU. Maybe the original author of this issue could change the title to be more generic, to cover the issue we are all having?


@Cephei-OpenSource commented on GitHub (Apr 4, 2026):

I just installed the new version, 0.20.2. This has improved the problem a bit: GPU usage is now at 50%, and CPU usage, while still heavy, is lower than before. Yet token speed is still only around 20 t/s, which is quite disappointing on an NVIDIA RTX 6000 Pro. Kind regards


@Yevhen-Myroshnychenko commented on GitHub (Apr 4, 2026):

@greg-hydrogen I hope the new title is better. If you'd like, send me more appropriate wording and I will change it. Thank you.


@homjay commented on GitHub (Apr 4, 2026):

> I can confirm that the problem seems to be the lack of GPU offload. Log shows all 61 layers are offloaded to GPU, but most of the processing seems to be happening on the CPU. I am on Linux/CUDA, and 4 GPUs have memory allocated on them, with CPU memory allocation only about 2.4GB, but most of the compute seems to be happening on CPU while GPUs mostly sit idle.

You're not alone; same problem here with multiple 4090 GPUs / CUDA 13.


@Wladastic commented on GitHub (Apr 4, 2026):

I just tried the same models in LM Studio; they work flawlessly there.
So it must be something in the ollama implementation.


@gordan-bobic commented on GitHub (Apr 4, 2026):

I can also confirm it works fine in llama.cpp, without exhibiting the anomaly. What I have noticed when using llama.cpp directly is that the problem doesn't manifest with q4_0/q4_1 KV quantization, but it does manifest with iq4_nl KV quantization. Not sure if that is related, but the symptoms are the same (memory allocated on the GPU, but all compute happening on the CPU).
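
(For anyone trying to reproduce that comparison, a minimal sketch of selecting KV cache quantization types when invoking llama.cpp directly; the flag values assume a recent llama-cli build, and the model path is only an example.)

```shell
# Illustrative sketch: same prompt, two KV cache quantization types.
# Note that quantizing the V cache typically requires flash attention to be enabled.
MODEL=./models/gemma-4-31B-it-Q8_0.gguf   # example path

# q4_0 KV cache (reported above as NOT showing the CPU-bound anomaly)
./build/bin/llama-cli -m "$MODEL" --prompt "Explain E=mc^2" \
  --cache-type-k q4_0 --cache-type-v q4_0

# iq4_nl KV cache (reported above as showing the anomaly)
./build/bin/llama-cli -m "$MODEL" --prompt "Explain E=mc^2" \
  --cache-type-k iq4_nl --cache-type-v iq4_nl
```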


@somera commented on GitHub (Apr 4, 2026):

Same here ... https://github.com/ollama/ollama/issues/15237#issuecomment-4187131443


@hoondy commented on GitHub (Apr 5, 2026):

Same here with M1 Max 64GB. Painfully slow.


@AMcPherran commented on GitHub (Apr 6, 2026):

I'm experiencing the same issue with Ollama running in Docker on Fedora with an RTX 3090. nvidia-smi shows the model fully loaded in GPU memory with headroom remaining, yet all CPU cores are maxed out and responses are too slow for it to be usable.
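
(For reference, a quick way to watch the GPU/CPU split while a prompt is generating; these are standard NVIDIA/Linux tools, nothing ollama-specific.)

```shell
# Refresh the standard nvidia-smi view once per second (utilization, memory, power)
watch -n 1 nvidia-smi

# Or stream per-GPU utilization and memory samples continuously
nvidia-smi dmon -s um

# CPU side: per-core load in top (press 1 to expand the per-core view)
top
```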


@ivleth commented on GitHub (Apr 7, 2026):

Same story: NVIDIA A5000 24 GB, latest NVIDIA driver and latest CUDA 13.2. Gemma3 27b blasts away at 30 tokens/sec; Gemma4 manages only 3 tokens/sec. The big difference is indeed CPU usage, which is much higher with Gemma4 than with Gemma3. I only adjusted these parameters:

temperature=1.0
top_p=0.95
top_k=64

Leaving them at the defaults does not make any difference.


@akniy commented on GitHub (Apr 7, 2026):

I am experiencing similarly low performance with gemma4:e4b on a Tesla V100 with NVIDIA Linux driver 580.126.09 and CUDA 13.0: it generates 15 response tokens/s with high CPU usage across 8 cores, while gemma3:4b generates 81 response tokens/s with little to no CPU usage on the same setup.


@jagmarques commented on GitHub (Apr 7, 2026):

The performance profile you're seeing with gemma4:31b on the M1 Max (slow responses plus strong heating) is consistent with KV cache pressure at longer context lengths: the model's attention pattern forces a full cache traversal at each step, and at a 19 GB model size the M1's unified memory bandwidth gets saturated quickly.

A few things to try:

  1. Explicitly cap the context: gemma4 defaults to a large context window, and capping it reduces both memory use and per-token compute (see the sketch after this list).
  2. Monitor memory pressure while running: if you're in the red zone, the system is compressing/swapping, which explains the heat.
  3. Check whether it's an MoE model: gemma4:31b is an MoE architecture, and the routing overhead at inference can be disproportionate on Apple Silicon compared to dense models of similar size.

For context: KV cache compression can help here; the cache grows with every generated token, and in long conversations it becomes the bottleneck. [NexusQuant](https://github.com/jagmarques/nexusquant) compresses KV caches 7–10x training-free, though it's currently PyTorch-only and not yet integrated into ollama/llama.cpp. For now, limiting the context is the most practical lever.
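
(Illustrative only: a minimal sketch of capping the context window per request in ollama. The `num_ctx` option and the generate API are standard ollama mechanisms; the value 8192 and the model name are just examples.)

```shell
# Cap the context for a single request via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:31b",
  "prompt": "Explain E=mc^2",
  "options": { "num_ctx": 8192 }
}'
```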


@hoondy commented on GitHub (Apr 7, 2026):

> 3. Check whether it's an MoE model: gemma4:31b is an MoE architecture, and the routing overhead at inference can be disproportionate on Apple Silicon compared to dense models of similar size.

gemma4:31b is a dense model. The MoE variant is gemma4:26b (26B A4B MoE).


@rick-github commented on GitHub (Apr 7, 2026):

If you have flash attention enabled, disable it until upgrading to 0.20.4.
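
(For reference, a hedged sketch of how flash attention is commonly toggled for the ollama server via its `OLLAMA_FLASH_ATTENTION` environment variable; the systemd unit name and the exact launch method depend on your install.)

```shell
# Disable flash attention when launching the server manually
OLLAMA_FLASH_ATTENTION=0 ollama serve

# On a systemd-based Linux install, add the setting to the service instead:
#   sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=0"
#   sudo systemctl restart ollama
```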


@ForsakenHarmony commented on GitHub (Apr 7, 2026):

I've disabled flash attention; same thing for me.

ollama is at 6 t/s, unsloth studio at 120 t/s.


@rick-github commented on GitHub (Apr 7, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@ForsakenHarmony commented on GitHub (Apr 7, 2026):

using unsloth gemma 4 e4b UD-Q4_K_XL

[server.log](https://github.com/user-attachments/files/26554192/server.log)


@rick-github commented on GitHub (Apr 7, 2026):

The model is not optimized for the ollama engine.

```
$ for i in gemma-4-e4b-no-vision gemma4:e4b ; do echo $i ; ollama run $i why is the sky blue --verbose 2>&1 | grep eval.rate ; done
gemma-4-e4b-no-vision
prompt eval rate:     42.02 tokens/s
eval rate:            13.64 tokens/s
gemma4:e4b
prompt eval rate:     165.22 tokens/s
eval rate:            184.09 tokens/s
```

There are a number of tensors that are higher precision in the unsloth model compared to the ollama model, so the computational requirements differ.

```
blk.0.inp_gate.weight                Q4_K    [2560 256]   |    blk.0.inp_gate.weight                F32     [2560 256]
blk.0.proj.weight                    Q4_K    [256 2560]   |    blk.0.proj.weight                    F32     [256 2560]
...
blk.39.ffn_up.weight                 Q4_K    [2560 10240] |    blk.39.ffn_up.weight                 unknown [2560 10240]
blk.39.inp_gate.weight               Q4_K    [2560 256]   |    blk.39.inp_gate.weight               F32     [2560 256]
blk.39.proj.weight                   Q4_K    [256 2560]   |    blk.39.proj.weight                   F32     [256 2560]
per_layer_model_proj.weight          Q4_K    [2560 10752] |    per_layer_model_proj.weight          BF16    [2560 10752]
per_layer_token_embd.weight          BF16    [10752 26214 |    per_layer_token_embd.weight          Q5_K    [10752 26214
token_embd.weight                    Q6_K    [2560 262144 |    token_embd.weight                    Q5_K    [2560 262144
```
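
(If you want to compare tensor precisions yourself, one possible approach uses the `gguf` Python package that ships with llama.cpp; treat the package's `gguf-dump` script and the file paths below as assumptions about your setup rather than a fixed recipe.)

```shell
# Assumed setup: `pip install gguf` provides a gguf-dump script that lists
# GGUF metadata and per-tensor quantization types.
pip install gguf

# Find the GGUF blob backing an ollama model (the FROM line in the modelfile)
ollama show gemma4:e4b --modelfile | grep ^FROM

# Dump tensor names/types for each file and compare side by side
gguf-dump /path/to/ollama-blob.gguf  > ollama-tensors.txt
gguf-dump /path/to/unsloth-e4b.gguf  > unsloth-tensors.txt
diff -y ollama-tensors.txt unsloth-tensors.txt | less
```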

@ivleth commented on GitHub (Apr 10, 2026):

I just updated to the latest version of ollama on Windows, v0.20.5, and the performance issues are solved for me now. I am getting a steady 26 tokens/sec on an NVIDIA A5000 24 GB, latest NVIDIA driver and latest CUDA 13.2, with these parameters:

temperature=1.0
top_p=0.95
top_k=64

Model used: gemma4:31b


@Cephei-OpenSource commented on GitHub (Apr 10, 2026):

I thought this thread was closed, which is why I didn't answer before. Yes, I can also confirm that since version 0.20.4 the problem is solved and FLASH_ATTENTION can be enabled again. Many thanks and kind regards


@fcorneli commented on GitHub (Apr 11, 2026):

New benchmark results for ollama version 0.20.5 on an RTX 6000 PRO.

```
ollama run gemma4:31b-it-q8_0 "Explain E=mc^2" --verbose
...
eval rate:            41.35 tokens/s
```

Compared with `llama.cpp` latest pull:

```
./build/bin/llama-cli --hf-repo unsloth/gemma-4-31B-it-GGUF:Q8_0 --prompt "Explain E=mc^2" --device CUDA0 --ctx-size 0
...
[ Prompt: 344,7 t/s | Generation: 42,4 t/s ]
```

they are on par now.


@lundybernard commented on GitHub (Apr 15, 2026):

Same issue on an AMD Ryzen AI 9 HX PRO 375 w/ Radeon 890M × 24.
32 of 64 GB allocated to VRAM.
Ubuntu 24.04
Linux kernel v6.19.10
Ollama v0.20.7
gemma4:31b

Running the model almost maxes out VRAM: 32000M / 32594M.
System RAM jumps from 8.86 GB to 25.1 GB / 31.1 GB.

```
> ollama run gemma4:31b "Explain E=mc^2" --verbose
...
total duration:       7m15.138388203s
load duration:        19.690769893s
prompt eval count:    22 token(s)
prompt eval duration: 1.044065201s
prompt eval rate:     21.07 tokens/s
eval count:           1325 token(s)
eval duration:        6m53.792264103s
eval rate:            3.20 tokens/s
```

Update:
I lowered the context size to 32k, which fixed the paging to system RAM; it now uses 23.4 GB of VRAM.

```
total duration:       4m59.828915446s
prompt eval rate:     90.88 tokens/s
eval rate:            4.31 tokens/s
```

If you're having this issue, try lowering the context size (illustrated below).
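
(As a concrete illustration of lowering the context, two common ollama mechanisms; 32768 matches the 32k value used above, and the environment variable assumes a reasonably recent ollama server.)

```shell
# Per session, inside an interactive `ollama run gemma4:31b`:
#   /set parameter num_ctx 32768

# Or set a server-wide default context length when starting the server
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
```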


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15286
Analyzed: 2026-04-18T18:22:42.325377

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


Reference: github-starred/ollama#71839