[GH-ISSUE #9730] Gemma 3 12b uses 24 GB VRAM ??? | Flash Attention | KV Cache Quantization #32121

Closed
opened 2026-04-22 13:04:34 -05:00 by GiteaMirror · 21 comments
Owner

Originally created by @ALLMI78 on GitHub (Mar 13, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9730

Originally assigned to: @mxyng on GitHub.

What is the issue?

I'm experiencing problems when running the Gemma-3-12b (Q4_K_M ~ 8.1GB ) model in Ollama.

  • WIN 10
  • ollama 0.6.0
  • rtx 4060 ti (16GB)

Current Configuration: (from https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288)

  • OLLAMA_FLASH_ATTENTION: 1 (enabled)
  • OLLAMA_NUM_PARALLEL: 1
  • Context Size: 32k (needed!)

PROBLEM:

  • OLLAMA_KV_CACHE_TYPE:
    • with the default f16: VRAM requirement is around 22–23 GB !?, with 100% GPU utilization - slow, it runs forever
    • q8_0: VRAM usage decreases, but the computational load shifts to 100% CPU (GPU utilization drops to approximately 20–30%) but it also runs forever

After reading https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/#interactive-vram-estimator i checked the VRAM configurator and entered the corresponding parameters there; in fact, a 12B model with a 32k context should never be using that much VRAM, right?

Example with a Qwen-14b-Q4_K_M (OLLAMA_KV_CACHE_TYPE q8_0) >>> VRAM 13 GB >>> 100% GPU >>> runs fine

I'm curious why Qwen-14b (even with OLLAMA_KV_CACHE_TYPE=f16) runs without issues and can comfortably operate with a 32k context on my GPU, while Gemma-3-12b (which actually has fewer parameters) requires significantly more VRAM.

  • Is this expected behavior, or might there be an underlying issue with how Gemma 3 12b is handled in Ollama?
  • Has anyone encountered similar issues with Gemma 3 12b?
  • Is Gemma 3 12b more sensitive to Flash Attention and KV cache quantization compared to other models?
  • Are there any recommended adjustments or configurations to run Gemma 3 12b stably and performantly solely on the GPU?

Please specify which parts of the logs (e.g., GPU memory allocation messages, offload logs, or any errors) are relevant?

Thank you for your support!

OS Windows

GPU Nvidia

CPU Intel

Ollama version 0.6.0

Originally created by @ALLMI78 on GitHub (Mar 13, 2025). Original GitHub issue: https://github.com/ollama/ollama/issues/9730 Originally assigned to: @mxyng on GitHub. ### What is the issue? I'm experiencing problems when running the Gemma-3-12b (Q4_K_M ~ 8.1GB ) model in Ollama. - WIN 10 - ollama 0.6.0 - rtx 4060 ti (16GB) **Current Configuration:** (from https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288) - **OLLAMA_FLASH_ATTENTION:** 1 (enabled) - **OLLAMA_NUM_PARALLEL:** 1 - **Context Size:** 32k (needed!) **_PROBLEM:_** - **OLLAMA_KV_CACHE_TYPE:** - with the default **f16:** VRAM requirement is around 22–23 GB !?, with 100% GPU utilization - slow, it runs forever - **q8_0:** VRAM usage decreases, but the computational load shifts to 100% CPU (GPU utilization drops to approximately 20–30%) but it also runs forever After reading https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/#interactive-vram-estimator i checked the VRAM configurator and entered the corresponding parameters there; in fact, a 12B model with a 32k context should never be using that much VRAM, right? Example with a Qwen-14b-Q4_K_M (OLLAMA_KV_CACHE_TYPE q8_0) >>> VRAM 13 GB >>> 100% GPU >>> runs fine I'm curious why Qwen-14b (even with OLLAMA_KV_CACHE_TYPE=f16) runs without issues and can comfortably operate with a 32k context on my GPU, while Gemma-3-12b (which actually has fewer parameters) requires significantly more VRAM. - Is this expected behavior, or might there be an underlying issue with how Gemma 3 12b is handled in Ollama? - Has anyone encountered similar issues with Gemma 3 12b? - Is Gemma 3 12b more sensitive to Flash Attention and KV cache quantization compared to other models? - Are there any recommended adjustments or configurations to run Gemma 3 12b stably and performantly solely on the GPU? Please specify which parts of the logs (e.g., GPU memory allocation messages, offload logs, or any errors) are relevant? Thank you for your support! ### OS Windows ### GPU Nvidia ### CPU Intel ### Ollama version 0.6.0
GiteaMirror added the bug label 2026-04-22 13:04:35 -05:00
Author
Owner

@mbretter commented on GitHub (Mar 14, 2025):

On my system I can load gemma3 1b/4b/12b simultaneously on the Tesla T4/16GB:

ollama ps
NAME          ID              SIZE      PROCESSOR    UNTIL               
gemma3:12b    6fd036cefda5    9.3 GB    100% GPU     4 minutes from now     
gemma3:1b     2d27a774bc62    2.1 GB    100% GPU     27 seconds from now    
gemma3:4b     c0494fe00251    3.9 GB    100% GPU     6 seconds from now   

nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:01:00.0 Off |                  Off |
| N/A   83C    P0             33W /   70W |   12585MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          484569      C   /usr/local/bin/ollama                  3588MiB |
|    0   N/A  N/A          484770      C   /usr/local/bin/ollama                  1104MiB |
|    0   N/A  N/A          486331      C   /usr/local/bin/ollama                  7890MiB |
+-----------------------------------------------------------------------------------------+
<!-- gh-comment-id:2723829230 --> @mbretter commented on GitHub (Mar 14, 2025): On my system I can load gemma3 1b/4b/12b simultaneously on the Tesla T4/16GB: ``` ollama ps NAME ID SIZE PROCESSOR UNTIL gemma3:12b 6fd036cefda5 9.3 GB 100% GPU 4 minutes from now gemma3:1b 2d27a774bc62 2.1 GB 100% GPU 27 seconds from now gemma3:4b c0494fe00251 3.9 GB 100% GPU 6 seconds from now ``` nvidia-smi: ``` +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 Tesla T4 On | 00000000:01:00.0 Off | Off | | N/A 83C P0 33W / 70W | 12585MiB / 16384MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 484569 C /usr/local/bin/ollama 3588MiB | | 0 N/A N/A 484770 C /usr/local/bin/ollama 1104MiB | | 0 N/A N/A 486331 C /usr/local/bin/ollama 7890MiB | +-----------------------------------------------------------------------------------------+ ```
Author
Owner

@mbretter commented on GitHub (Mar 14, 2025):

My first impression is that the gemma3 models are doing pretty well and fast:

>>> Alice has four brothers and she also has a sister. How many sisters does Alice's brother have?
This is a bit of a trick question! Here's how to solve it:

*   Alice's brothers all share the same sisters.
*   Alice is one sister.
*   Therefore, Alice's brothers each have **two** sisters (Alice and her sister).

total duration:       3.04409804s
load duration:        43.357277ms
prompt eval count:    30 token(s)
prompt eval duration: 345ms
prompt eval rate:     86.96 tokens/s
eval count:           61 token(s)
eval duration:        2.653s
eval rate:            22.99 tokens/s
>>> How many 'r's are in the word strawberry?
There are three "r"s in the word "strawberry".

total duration:       973.844571ms
load duration:        41.808834ms
prompt eval count:    21 token(s)
prompt eval duration: 228ms
prompt eval rate:     92.11 tokens/s
eval count:           15 token(s)
eval duration:        702ms
eval rate:            21.37 tokens/s

Many of the other models are stumbling over this prompts while taking more resources.

<!-- gh-comment-id:2723850005 --> @mbretter commented on GitHub (Mar 14, 2025): My first impression is that the gemma3 models are doing pretty well and fast: ``` >>> Alice has four brothers and she also has a sister. How many sisters does Alice's brother have? This is a bit of a trick question! Here's how to solve it: * Alice's brothers all share the same sisters. * Alice is one sister. * Therefore, Alice's brothers each have **two** sisters (Alice and her sister). total duration: 3.04409804s load duration: 43.357277ms prompt eval count: 30 token(s) prompt eval duration: 345ms prompt eval rate: 86.96 tokens/s eval count: 61 token(s) eval duration: 2.653s eval rate: 22.99 tokens/s ``` ``` >>> How many 'r's are in the word strawberry? There are three "r"s in the word "strawberry". total duration: 973.844571ms load duration: 41.808834ms prompt eval count: 21 token(s) prompt eval duration: 228ms prompt eval rate: 92.11 tokens/s eval count: 15 token(s) eval duration: 702ms eval rate: 21.37 tokens/s ``` Many of the other models are stumbling over this prompts while taking more resources.
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

Did you checked it with Context Size: 32k ???

I can download it again and post some more information...

<!-- gh-comment-id:2724317971 --> @ALLMI78 commented on GitHub (Mar 14, 2025): Did you checked it with Context Size: 32k ??? I can download it again and post some more information...
Author
Owner

@mbretter commented on GitHub (Mar 14, 2025):

nope, just with defaults

  Model
    architecture        gemma3    
    parameters          12.2B     
    context length      8192      
    embedding length    3840      
    quantization        Q4_K_M    
<!-- gh-comment-id:2724323073 --> @mbretter commented on GitHub (Mar 14, 2025): nope, just with defaults ``` Model architecture gemma3 parameters 12.2B context length 8192 embedding length 3840 quantization Q4_K_M ```
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

With the default settings, you might not notice the issue. I observed VRAM requirements of around 22–23 GB when running Gemma 3 with a 32k context and the default OLLAMA_KV_CACHE_TYPE=f16.

In comparison, Qwen-14B (also with f16 and a 32k context) only uses around 16–17 GB of VRAM.

Trying to reduce the VRAM usage by setting OLLAMA_KV_CACHE_TYPE=q8_0 for Gemma 3 didn’t help either, as it became very slow—the model suddenly shifted a significant portion of the workload to the CPU.

<!-- gh-comment-id:2724351489 --> @ALLMI78 commented on GitHub (Mar 14, 2025): With the default settings, you might not notice the issue. I observed VRAM requirements of around 22–23 GB when running Gemma 3 with a 32k context and the default `OLLAMA_KV_CACHE_TYPE=f16`. In comparison, Qwen-14B (also with `f16` and a 32k context) only uses around 16–17 GB of VRAM. Trying to reduce the VRAM usage by setting `OLLAMA_KV_CACHE_TYPE=q8_0` for Gemma 3 didn’t help either, as it became very slow—the model suddenly shifted a significant portion of the workload to the CPU.
Author
Owner

@Asher9971 commented on GitHub (Mar 14, 2025):

Same here. Using Flash Attention with Gemma3 -> 10% GPU usage 100% CPU usage
without flash attention in ollama everything works fine.
With llama, qwen or all other models i tried flash attenation was never a problem

<!-- gh-comment-id:2724372026 --> @Asher9971 commented on GitHub (Mar 14, 2025): Same here. Using Flash Attention with Gemma3 -> 10% GPU usage 100% CPU usage without flash attention in ollama everything works fine. With llama, qwen or all other models i tried flash attenation was never a problem
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

I'm currently not sure how OLLAMA_FLASH_ATTENTION affects this as well. Thanks for the hint—I’ll test it later.

OLLAMA_NEW_ENGINE:false ???

whats the default of OLLAMA_NEW_ENGINE?

<!-- gh-comment-id:2724383843 --> @ALLMI78 commented on GitHub (Mar 14, 2025): I'm currently not sure how OLLAMA_FLASH_ATTENTION affects this as well. Thanks for the hint—I’ll test it later. OLLAMA_NEW_ENGINE:false ??? whats the default of OLLAMA_NEW_ENGINE?
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

Ahh OLLAMA_NEW_ENGINE seems to be new, it is false for me....

2025/03/14 12:25:25 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\OLLAMA\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"

num_ctx @ 32k via api call...

"options":{ "seed":219929875, "num_predict":4096, "top_k":2, "top_p":0.60000000, "min_p":0.10000000, "temperature": 0.40000000, "num_ctx": 32768, "num_batch":128, "num_gpu":100, "main_gpu":0, "repeat_last_n":128, "repeat_penalty":1.10000000, "use_mmap":false, "use_mlock":true, "num_thread":8},"stream":false,"cache_prompt":false,"keep_alive":0}';

gemma3:12b 22 GB 28%/72% CPU/GPU Stopping...

RAM usage goes also up, CPU usage is @ 20-30%

It runs but does not answer, it runs forever...

<!-- gh-comment-id:2724398230 --> @ALLMI78 commented on GitHub (Mar 14, 2025): Ahh OLLAMA_NEW_ENGINE seems to be new, it is false for me.... > 2025/03/14 12:25:25 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\\OLLAMA\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" num_ctx @ 32k via api call... > "options":{ "seed":219929875, "num_predict":4096, "top_k":2, "top_p":0.60000000, "min_p":0.10000000, "temperature": 0.40000000, "num_ctx": 32768, "num_batch":128, "num_gpu":100, "main_gpu":0, "repeat_last_n":128, "repeat_penalty":1.10000000, "use_mmap":false, "use_mlock":true, "num_thread":8},"stream":false,"cache_prompt":false,"keep_alive":0}'; gemma3:12b **_22_** GB 28%/72% CPU/GPU Stopping... RAM usage goes also up, CPU usage is @ 20-30% It runs but does not answer, it runs forever...
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

time=2025-03-14T12:32:45.014+01:00 level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-14T12:32:45.014+01:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""
time=2025-03-14T12:32:45.142+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]|\s[\r\n]+|\s+(?!\S)|\s+"
time=2025-03-14T12:32:45.149+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-14T12:32:45.152+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]|\s[\r\n]+|\s+(?!\S)|\s+"
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-14T12:32:45.160+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\Users\admin\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model M:\OLLAMA\models\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3 --ctx-size 32768 --batch-size 128 --n-gpu-layers 100 --threads 8 --flash-attn --no-mmap --mlock --parallel 1 --port 49159"

  • kv cache type not supported by model ???
  • a lot of key not found
<!-- gh-comment-id:2724421882 --> @ALLMI78 commented on GitHub (Mar 14, 2025): > time=2025-03-14T12:32:45.014+01:00 level=INFO source=server.go:185 msg="enabling flash attention" time=2025-03-14T12:32:45.014+01:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type="" time=2025-03-14T12:32:45.142+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-14T12:32:45.149+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-03-14T12:32:45.152+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-03-14T12:32:45.160+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\admin\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model M:\\OLLAMA\\models\\blobs\\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3 --ctx-size 32768 --batch-size 128 --n-gpu-layers 100 --threads 8 --flash-attn --no-mmap --mlock --parallel 1 --port 49159" - kv cache type not supported by model ??? - a lot of key not found
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

2025/03/14 12:40:09 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\OLLAMA\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"

with OLLAMA_FLASH_ATTENTION false

gemma3:12b 22 GB 28%/72% CPU/GPU Stopping...

VRAM 22GB and normal RAM usage goes up by 10 GB, CPU usage is @ 20-30%

It runs but does not answer, it runs into my 5 min timeout...

<!-- gh-comment-id:2724432640 --> @ALLMI78 commented on GitHub (Mar 14, 2025): > 2025/03/14 12:40:09 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\\OLLAMA\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" with OLLAMA_FLASH_ATTENTION false gemma3:12b 22 GB 28%/72% CPU/GPU Stopping... VRAM 22GB and normal RAM usage goes up by 10 GB, CPU usage is @ 20-30% It runs but does not answer, it runs into my 5 min timeout...
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

[GIN] 2025/03/14 - 12:41:35 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/03/14 - 12:41:35 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/03/14 - 12:45:44 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/03/14 - 12:45:44 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/03/14 - 12:46:23 | 500 | 4m59s | 127.0.0.1 | POST "/api/chat"
time=2025-03-14T12:46:42.445+01:00 level=INFO source=server.go:105 msg="system memory" total="31.0 GiB" free="24.8 GiB" free_swap="28.6 GiB"
time=2025-03-14T12:46:42.447+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=100 layers.model=49 layers.offload=35 layers.split="" memory.available="[14.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.6 GiB" memory.required.partial="14.8 GiB" memory.required.kv="12.0 GiB" memory.required.allocations="[14.8 GiB]" memory.weights.total="18.0 GiB" memory.weights.repeating="17.3 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="279.8 MiB" memory.graph.partial="917.4 MiB"

  • memory.required.kv="12.0 GiB" ???

Driver Version: 572.47 CUDA Version: 12.8

<!-- gh-comment-id:2724449640 --> @ALLMI78 commented on GitHub (Mar 14, 2025): > [GIN] 2025/03/14 - 12:41:35 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/03/14 - 12:41:35 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/03/14 - 12:45:44 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/03/14 - 12:45:44 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/03/14 - 12:46:23 | 500 | 4m59s | 127.0.0.1 | POST "/api/chat" time=2025-03-14T12:46:42.445+01:00 level=INFO source=server.go:105 msg="system memory" total="31.0 GiB" free="24.8 GiB" free_swap="28.6 GiB" time=2025-03-14T12:46:42.447+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=100 layers.model=49 layers.offload=35 layers.split="" memory.available="[14.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.6 GiB" memory.required.partial="14.8 GiB" memory.required.kv="12.0 GiB" memory.required.allocations="[14.8 GiB]" memory.weights.total="18.0 GiB" memory.weights.repeating="17.3 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="279.8 MiB" memory.graph.partial="917.4 MiB" - memory.required.kv="12.0 GiB" ??? Driver Version: 572.47 CUDA Version: 12.8
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

Sorry, but I’m having a bit of trouble following your comments. I’ve noticed that you’ve been very active in other threads as well, but despite the detailed input, I haven’t really been able to find a solution from it. No offense intended—I just prefer a more direct approach instead of trial and error.

For now, I’ve removed Gemma-3 12b, as I couldn’t get it to run properly with a 32k context, while Qwen-14B works flawlessly on my setup.

<!-- gh-comment-id:2724491644 --> @ALLMI78 commented on GitHub (Mar 14, 2025): Sorry, but I’m having a bit of trouble following your comments. I’ve noticed that you’ve been very active in other threads as well, but despite the detailed input, I haven’t really been able to find a solution from it. No offense intended—I just prefer a more direct approach instead of trial and error. For now, I’ve removed Gemma-3 12b, as I couldn’t get it to run properly with a 32k context, while Qwen-14B works flawlessly on my setup.
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

Hey, I appreciate you taking the time to respond, but honestly, your comments haven’t been very helpful. They often seem quite general or not really relevant to the actual issue.

I’m looking for concrete solutions or at least well-founded insights that can genuinely help—random guesses unfortunately don’t get me any further. Hope you understand.

<!-- gh-comment-id:2724790307 --> @ALLMI78 commented on GitHub (Mar 14, 2025): Hey, I appreciate you taking the time to respond, but honestly, your comments haven’t been very helpful. They often seem quite general or not really relevant to the actual issue. I’m looking for concrete solutions or at least well-founded insights that can genuinely help—random guesses unfortunately don’t get me any further. Hope you understand.
Author
Owner

@wisepmlin commented on GitHub (Mar 14, 2025):

Error: llama runner process has terminated: this model is not supported by your version of Ollama. You may need to upgrade

<!-- gh-comment-id:2724907192 --> @wisepmlin commented on GitHub (Mar 14, 2025): Error: llama runner process has terminated: this model is not supported by your version of Ollama. You may need to upgrade
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

ollama 0.6.0

<!-- gh-comment-id:2724912239 --> @ALLMI78 commented on GitHub (Mar 14, 2025): ollama 0.6.0
Author
Owner

@gregbarbosa commented on GitHub (Mar 14, 2025):

For anyone running into issues with Ollama, Gemma, Flash and KV Cache, try out the 0.6.1 pre-release. Previously, gemma3:4b on my M1 Pro was choking on the most mundane 'hello' message and memory usage was skyrocketing. Now it can ingest context and reply as expected.

<!-- gh-comment-id:2725040704 --> @gregbarbosa commented on GitHub (Mar 14, 2025): For anyone running into issues with Ollama, Gemma, Flash and KV Cache, try out the [0.6.1 pre-release](https://github.com/ollama/ollama/releases/tag/v0.6.1-rc0). Previously, gemma3:4b on my M1 Pro was choking on the most mundane 'hello' message and memory usage was skyrocketing. Now it can ingest context and reply as expected.
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

With Ollama version 0.6.1 and OLLAMA_KV_CACHE_TYPE=q8_0:

  • 32k context size
  • still extremely high CPU usage instead of GPU usage.

Image

2025/03/14 17:45:04 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\OLLAMA\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"

,"options":{ "seed":239066125, "num_predict":4096, "top_k":2, "top_p":0.60000000, "min_p":0.10000000, "temperature": 0.40000000, "num_ctx": 32768, "num_batch":128, "num_gpu":100, "main_gpu":0, "repeat_last_n":128, "repeat_penalty":1.10000000, "use_mmap":false, "use_mlock":true, "num_thread":8},"stream":false,"cache_prompt":false,"keep_alive":0}';

you can see the high CPU load here in this example...

<!-- gh-comment-id:2725238528 --> @ALLMI78 commented on GitHub (Mar 14, 2025): With Ollama version 0.6.1 and OLLAMA_KV_CACHE_TYPE=q8_0: - 32k context size - still extremely high CPU usage instead of GPU usage. ![Image](https://github.com/user-attachments/assets/c7d0a024-4675-4f8a-bb09-64f0a0fbc920) > 2025/03/14 17:45:04 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\\OLLAMA\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" > ,"options":{ "seed":239066125, "num_predict":4096, "top_k":2, "top_p":0.60000000, "min_p":0.10000000, "temperature": 0.40000000, "num_ctx": 32768, "num_batch":128, "num_gpu":100, "main_gpu":0, "repeat_last_n":128, "repeat_penalty":1.10000000, "use_mmap":false, "use_mlock":true, "num_thread":8},"stream":false,"cache_prompt":false,"keep_alive":0}'; you can see the high CPU load here in this example...
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

With Ollama version 0.6.1 and the default OLLAMA_KV_CACHE_TYPE = f16 it runs @ 100% GPU but again high (24 GB) VRAM usage:

2025/03/14 17:57:38 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\OLLAMA\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"

NAME ID SIZE PROCESSOR UNTIL
gemma3:12b 6fd036cefda5 24 GB 34%/66% CPU/GPU Stopping...

Both examples with 32k context size...

<!-- gh-comment-id:2725257794 --> @ALLMI78 commented on GitHub (Mar 14, 2025): With Ollama version 0.6.1 and the default OLLAMA_KV_CACHE_TYPE = f16 it runs @ 100% GPU but again high (24 GB) VRAM usage: > 2025/03/14 17:57:38 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\\OLLAMA\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" > NAME ID SIZE PROCESSOR UNTIL gemma3:12b 6fd036cefda5 **_24 GB_** 34%/66% CPU/GPU Stopping... Both examples with 32k context size...
Author
Owner

@ALLMI78 commented on GitHub (Mar 14, 2025):

compared to qwen 14b | 32k context size

Ollama version 0.6.1 | default OLLAMA_KV_CACHE_TYPE = f16

Qwen2.5-14B-Q4_K_M 17 GB 7%/93% CPU/GPU

runs fine...

<!-- gh-comment-id:2725302039 --> @ALLMI78 commented on GitHub (Mar 14, 2025): compared to qwen 14b | 32k context size Ollama version 0.6.1 | default OLLAMA_KV_CACHE_TYPE = f16 > Qwen2.5-14B-Q4_K_M **_17 GB_** 7%/93% CPU/GPU runs fine...
Author
Owner

@LFd3v commented on GitHub (Mar 14, 2025):

I also have a similar problem here (Linux/Nvidia): gemma3:4b and gemma3:12b use an awful amount of RAM, even if Ollama says it is fully loaded in the GPU:

$ ollama ps
NAME         ID              SIZE      PROCESSOR    UNTIL
gemma3:4b    c0494fe00251    4.2 GB    100% GPU     4 minutes from now

The 4B seems to run fine here (related to tps, albeit using almost all free RAM), but after just a "hello", the 12B model simply causes Ollama to quit due to OOM.

I have seem something similar with llama3.2-vision, but only when sending an image to the model: from what I have read, the projector uses a different architecture than other models that cannot be loaded using both the GPU and CPU, then for some reason it just use all RAM available when an image needs to be processed and crashes Ollama.

As the gemma3 models above 1B have the projector included in the main GGUF, maybe this is why this happens with gemma3 > 1B?

<!-- gh-comment-id:2725803779 --> @LFd3v commented on GitHub (Mar 14, 2025): I also have a similar problem here (Linux/Nvidia): gemma3:4b and gemma3:12b use an awful amount of RAM, even if Ollama says it is fully loaded in the GPU: ``` $ ollama ps NAME ID SIZE PROCESSOR UNTIL gemma3:4b c0494fe00251 4.2 GB 100% GPU 4 minutes from now ``` The 4B seems to run fine here (related to tps, albeit using almost all free RAM), but after just a "hello", the 12B model simply causes Ollama to quit due to OOM. I have seem something similar with llama3.2-vision, but only when sending an image to the model: from what I have read, the projector uses a different architecture than other models that cannot be loaded using both the GPU and CPU, then for some reason it just use all RAM available when an image needs to be processed and crashes Ollama. As the gemma3 models above 1B have the projector included in the main GGUF, maybe this is why this happens with gemma3 > 1B?
Author
Owner

@ALLMI78 commented on GitHub (Mar 16, 2025):

gemma3: with 8 gb model size for a 12b model i'm @ 24 gb...

16 gb for the 32k context ? + around 10-15 GB ram?

that is normal or not?

how can qwen-14b with 32k do the same with only 16-17 gb?

<!-- gh-comment-id:2727610452 --> @ALLMI78 commented on GitHub (Mar 16, 2025): gemma3: with 8 gb model size for a 12b model i'm @ 24 gb... 16 gb for the 32k context ? + around 10-15 GB ram? that is normal or not? how can qwen-14b with 32k do the same with only 16-17 gb?
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/ollama#32121