[GH-ISSUE #15353] Why is Gemma4:26b performance significantly slower on Ollama? #35582

Closed
opened 2026-04-22 20:10:34 -05:00 by GiteaMirror · 3 comments

Originally created by @MMaturax on GitHub (Apr 5, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15353

What is the issue?

I am experiencing a significant performance gap when running the Gemma4:26b model on Ollama. In my tests, Ollama performs substantially worse than other inference engines on the exact same hardware.

Update:
After some investigation, I noticed that llama.cpp achieves faster inference because it defaults to a "text-only" mode for multimodal models when no image is provided. I am not certain if Unsloth Studio employs a similar mechanism, but the performance difference is undeniable.

Rather than a bug, please consider this a feature request:
It would be a significant improvement if Ollama offered an optional "text-only" loading mode for multimodal models. When no image processing is required, bypassing the vision encoder/projector modules the way llama.cpp does (see the sketch below) would recover the roughly 2x speed difference, a major win for users with high-end hardware like the RTX 5090.
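
For context, llama.cpp ships the vision tower of a multimodal model as a separate `mmproj` GGUF, so a plain text invocation never loads it. A minimal sketch of the two load paths, assuming a recent llama.cpp build with `--no-mmproj` support (the projector filename is hypothetical):

```sh
# Text-only: no --mmproj projector file is given, so the vision tower is
# never loaded -- hence "modalities : text" in the llama.cpp banner below.
./llama-cli -m gemma-4-26B-A4B-it-UD-Q4_K_M.gguf -p "5+8/16=?"

# Multimodal requires passing the separate projector GGUF explicitly
# (hypothetical filename):
./llama-cli -m gemma-4-26B-A4B-it-UD-Q4_K_M.gguf --mmproj mmproj-gemma-4.gguf

# With -hf auto-downloads, --no-mmproj forces text-only even when a
# projector ships alongside the model.
```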

Comparison

llama.cpp: roughly 2x faster than Ollama (generation: 212.5 t/s vs 105.66 t/s in the transcripts below).
Unsloth Studio: roughly 2x faster than Ollama.

Environment

I have performed benchmarks on two separate Ubuntu machines with nearly identical high-end specifications:

GPU: NVIDIA RTX 5090
CPU: AMD Ryzen 9 7950X3D
RAM: 64GB
OS: Ubuntu 24.04.3 LTS

Ollama

test@ubuntumainserver:~$ nvidia-smi
Sun Apr  5 22:18:01 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   40C    P8              9W /  450W |   19032MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1643537      C   /usr/local/bin/ollama                 19022MiB |
+-----------------------------------------------------------------------------------------+
test@ubuntumainserver:~$ ollama run gemma4:26b "5+8/16=?" --verbose
Thinking...
The user is asking for the result of the arithmetic expression $5 + \frac{8}{16}$.

    *   Order of operations (PEMDAS/BODMAS) states that Division/Multiplication comes before Addition/Subtraction.
    *   Expression: $5 + (8 \div 16)$

    *   $8 / 16 = \frac{8}{16}$
    *   Simplify the fraction: $\frac{8}{16} = \frac{1}{2}$
    *   Decimal form: $0.5$

    *   $5 + 0.5 = 5.5$
    *   Fraction form: $5 \frac{1}{2}$

    *   Show the simplified fraction version.
    *   Show the decimal version.
...done thinking.

To solve **5 + 8/16**, follow the order of operations (division before addition):

1.  **Simplify the fraction:**
    $8/16$ can be simplified by dividing both the numerator and denominator by 8.
    $8 \div 8 = 1$
    $16 \div 8 = 2$
    So, $8/16 = 1/2$ (or **0.5** in decimal form).

2.  **Add to 5:**
    $5 + 0.5 = 5.5$

**Final Answer:**
**5.5** (or $5 \frac{1}{2}$)

total duration:       3.438829184s
load duration:        212.989801ms
prompt eval count:    22 token(s)
prompt eval duration: 18.179174ms
prompt eval rate:     1210.18 tokens/s
eval count:           328 token(s)
eval duration:        3.10435076s
eval rate:            105.66 tokens/s
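
For reference, `eval rate` in the `--verbose` stats is generation throughput, i.e. `eval count` divided by `eval duration`; checking the run above:

$$\frac{328\ \text{tokens}}{3.10435076\ \text{s}} \approx 105.66\ \text{tokens/s}$$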

llama.cpp

Sun Apr  5 22:18:10 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   40C    P3             37W /  450W |       2MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
test@xserver:~/models$ ../llama.cpp/build/bin/llama-cli -m gemma-4-26B-A4B-it-UD-Q4_K_M.gguf   -p "5+8/16=?"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32109 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8669-761797ffd
model      : gemma-4-26B-A4B-it-UD-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 5+8/16=?

[Start thinking]
The user wants to solve the mathematical expression $5 + 8/16$.

    *   Addition ($+$)
    *   Division ($/$)

    *   Order of operations (PEMDAS/BODMAS) states that division should be performed before addition.

    *   Expression: $8 / 16$
    *   Calculation: $\frac{8}{16}$
    *   Simplification: Both 8 and 16 are divisible by 8.
    *   $\frac{8 \div 8}{16 \div 8} = \frac{1}{2}$
    *   Decimal form: $0.5$

    *   Expression: $5 + 0.5$
    *   Calculation: $5.5$

    *   Fraction form: $5 \frac{1}{2}$ or $\frac{11}{2}$
    *   Decimal form: $5.5$
[End thinking]

To solve this, follow the order of operations (PEMDAS/BODMAS), which dictates that you perform division before addition.

1.  **Divide 8 by 16:**
    $8 / 16 = 0.5$ (or $\frac{1}{2}$)

2.  **Add 5 to the result:**
    $5 + 0.5 = 5.5$

**Answer:**
**5.5** (or $5\frac{1}{2}$)

[ Prompt: 181.9 t/s | Generation: 212.5 t/s ]

> 

Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 9209 + (22311 = 16071 +    5420 +     819) +         587 |
llama_memory_breakdown_print: |   - Host               |                  1274 =   748 +       0 +     526                |

Relevant log output

Apr 05 22:31:09 ubuntumainserver ollama[4010283]: [GIN] 2026/04/05 - 22:31:09 | 200 |       20.29µs |       127.0.0.1 | HEAD     "/"
Apr 05 22:31:09 ubuntumainserver ollama[4010283]: [GIN] 2026/04/05 - 22:31:09 | 200 |  226.505134ms |       127.0.0.1 | POST     "/api/show"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.117Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 38853"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=server.go:247 msg="enabling flash attention"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-7121486771cbfe218851513210c40b35dbdee93ab1ef43fe36283c883980f0df --port 42019"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=sched.go:484 msg="system memory" total="61.9 GiB" free="57.5 GiB" free_swap="7.1 GiB"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 library=CUDA available="30.9 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=server.go:759 msg="loading model" "model layers"=31 requested=-1
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.457Z level=INFO source=runner.go:1417 msg="starting ollama engine"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.457Z level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:42019"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.467Z level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:31[ID:GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.519Z level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=1014 num_key_values=52
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: ggml_cuda_init: found 1 CUDA devices:
Apr 05 22:31:10 ubuntumainserver ollama[4010283]:   Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, ID: GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.671Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.675Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.688Z level=INFO source=model.go:138 msg="vision: decode" elapsed=3.098772ms bounds=(0,0)-(2048,2048)
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=75.059281ms size="[768 768]"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=78.862504ms shape="[2816 256]"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.118Z level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:31[ID:GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.161Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.168Z level=INFO source=model.go:138 msg="vision: decode" elapsed=308.777µs bounds=(0,0)-(2048,2048)
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.231Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=62.685692ms size="[768 768]"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.233Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.233Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.234Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=65.965872ms shape="[2816 256]"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:31[ID:GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=ggml.go:482 msg="offloading 30 repeating layers to GPU"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=ggml.go:494 msg="offloaded 31/31 layers to GPU"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="16.6 GiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:245 msg="model weights" device=CPU size="667.5 MiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.0 GiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="318.7 MiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="72.0 MiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:272 msg="total memory" size="18.7 GiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=sched.go:561 msg="loaded runners" count=1
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm server loading model"
Apr 05 22:31:12 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:12.794Z level=INFO source=server.go:1390 msg="llama runner started in 2.35 seconds"
Apr 05 22:31:15 ubuntumainserver ollama[4010283]: [GIN] 2026/04/05 - 22:31:15 | 200 |   5.93703499s |       127.0.0.1 | POST     "/api/generate"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.20.2

GiteaMirror added the bug label 2026-04-22 20:10:34 -05:00

@rick-github commented on GitHub (Apr 5, 2026):

Disable flash attention.
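
A minimal sketch of how to try that on a systemd-managed Linux install, assuming this Ollama build honors `OLLAMA_FLASH_ATTENTION=0` as an override:

```sh
# One-off: run a foreground server with flash attention disabled.
OLLAMA_FLASH_ATTENTION=0 ollama serve

# Persistent: add an environment override to the system service,
# then restart it.
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=0"
sudo systemctl restart ollama
```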


@MMaturax commented on GitHub (Apr 5, 2026):

Interesting. After disabling Flash Attention, it reached the same speed as the others. I have been using Ollama since its early versions, and Flash Attention used to provide a speed boost, so it seems something has changed in recent versions that I hadn't noticed until now. Thanks for the help!

test@ubuntumainserver:~$ ollama run gemma4:26b "5+8/16=?" --verbose
Thinking...
The user wants to calculate the value of the expression $5 + \frac{8}{16}$.

    *   The expression is $5 + 8 / 16$.
    *   According to the Order of Operations (PEMDAS/BODMAS), division comes before addition.

    *   $8 / 16$
    *   Both numbers are divisible by 8.
    *   $8 \div 8 = 1$
    *   $16 \div 8 = 2$
    *   So, $8 / 16 = 1/2 = 0.5$.

    *   $5 + 0.5 = 5.5$

    *   Decimal: $5.5$
    *   Fraction: $5 \frac{1}{2}$ or $\frac{11}{2}$
...done thinking.

To solve **5 + 8/16**, follow the order of operations (division before addition):

1.  **Simplify the fraction:**
    8/16 can be simplified by dividing both the numerator and the denominator by 8.
    8 ÷ 8 = 1
    16 ÷ 8 = 2
    So, **8/16 = 1/2** (or **0.5** in decimal form).

2.  **Add to 5:**
    5 + 0.5 = **5.5**

**Final Answer:**
**5.5** (or $5 \frac{1}{2}$)

total duration:       1.930919737s
load duration:        207.212607ms
prompt eval count:    22 token(s)
prompt eval duration: 17.351291ms
prompt eval rate:     1267.92 tokens/s
eval count:           335 token(s)
eval duration:        1.607537904s
eval rate:            208.39 tokens/s

@rick-github commented on GitHub (Apr 5, 2026):

It's likely a temporary issue. FA is primarily a space optimization, not a speed optimization, but the slowdown observed when FA is enabled for this model is unexpected, so it will probably be addressed in upcoming releases.
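
As a back-of-the-envelope sketch of the space-vs-speed distinction (a general property of the algorithm, not a claim about Ollama's specific kernels): for context length $n$ and $h$ heads, naive attention materializes the full $n \times n$ score matrix, while flash attention streams it in tiles and never stores it,

$$\text{mem}_{\text{naive}} = O(h \cdot n^2), \qquad \text{mem}_{\text{flash}} = O(h \cdot n),$$

at the cost of extra recomputation, so wall-clock speed depends on the kernel and can regress even while memory usage improves.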

Reference: github-starred/ollama#35582