[GH-ISSUE #9730] Gemma 3 12b uses 24 GB VRAM ??? | Flash Attention | KV Cache Quantization #68418

New Issue

GiteaMirror · 2026-05-04T13:53:19-05:00

GiteaMirror commented

2026-05-04 13:53:19 -05:00

Originally created by @ALLMI78 on GitHub (Mar 13, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9730

Originally assigned to: @mxyng on GitHub.

What is the issue?

I'm experiencing problems when running the Gemma-3-12b (Q4_K_M ~ 8.1GB ) model in Ollama.

WIN 10
ollama 0.6.0
rtx 4060 ti (16GB)

Current Configuration: (from https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288)

OLLAMA_FLASH_ATTENTION: 1 (enabled)
OLLAMA_NUM_PARALLEL: 1
Context Size: 32k (needed!)

PROBLEM:

OLLAMA_KV_CACHE_TYPE:
- with the default f16: VRAM requirement is around 22–23 GB !?, with 100% GPU utilization - slow, it runs forever
- q8_0: VRAM usage decreases, but the computational load shifts to 100% CPU (GPU utilization drops to approximately 20–30%) but it also runs forever

After reading https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/#interactive-vram-estimator i checked the VRAM configurator and entered the corresponding parameters there; in fact, a 12B model with a 32k context should never be using that much VRAM, right?

Example with a Qwen-14b-Q4_K_M (OLLAMA_KV_CACHE_TYPE q8_0) >>> VRAM 13 GB >>> 100% GPU >>> runs fine

I'm curious why Qwen-14b (even with OLLAMA_KV_CACHE_TYPE=f16) runs without issues and can comfortably operate with a 32k context on my GPU, while Gemma-3-12b (which actually has fewer parameters) requires significantly more VRAM.

Is this expected behavior, or might there be an underlying issue with how Gemma 3 12b is handled in Ollama?
Has anyone encountered similar issues with Gemma 3 12b?
Is Gemma 3 12b more sensitive to Flash Attention and KV cache quantization compared to other models?
Are there any recommended adjustments or configurations to run Gemma 3 12b stably and performantly solely on the GPU?

Please specify which parts of the logs (e.g., GPU memory allocation messages, offload logs, or any errors) are relevant?

Thank you for your support!

OS Windows

GPU Nvidia

CPU Intel

Ollama version 0.6.0

Originally created by @ALLMI78 on GitHub (Mar 13, 2025). Original GitHub issue: https://github.com/ollama/ollama/issues/9730 Originally assigned to: @mxyng on GitHub. ### What is the issue? I'm experiencing problems when running the Gemma-3-12b (Q4_K_M ~ 8.1GB ) model in Ollama. - WIN 10 - ollama 0.6.0 - rtx 4060 ti (16GB) **Current Configuration:** (from https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288) - **OLLAMA_FLASH_ATTENTION:** 1 (enabled) - **OLLAMA_NUM_PARALLEL:** 1 - **Context Size:** 32k (needed!) **_PROBLEM:_** - **OLLAMA_KV_CACHE_TYPE:** - with the default **f16:** VRAM requirement is around 22–23 GB !?, with 100% GPU utilization - slow, it runs forever - **q8_0:** VRAM usage decreases, but the computational load shifts to 100% CPU (GPU utilization drops to approximately 20–30%) but it also runs forever After reading https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/#interactive-vram-estimator i checked the VRAM configurator and entered the corresponding parameters there; in fact, a 12B model with a 32k context should never be using that much VRAM, right? Example with a Qwen-14b-Q4_K_M (OLLAMA_KV_CACHE_TYPE q8_0) >>> VRAM 13 GB >>> 100% GPU >>> runs fine I'm curious why Qwen-14b (even with OLLAMA_KV_CACHE_TYPE=f16) runs without issues and can comfortably operate with a 32k context on my GPU, while Gemma-3-12b (which actually has fewer parameters) requires significantly more VRAM. - Is this expected behavior, or might there be an underlying issue with how Gemma 3 12b is handled in Ollama? - Has anyone encountered similar issues with Gemma 3 12b? - Is Gemma 3 12b more sensitive to Flash Attention and KV cache quantization compared to other models? - Are there any recommended adjustments or configurations to run Gemma 3 12b stably and performantly solely on the GPU? Please specify which parts of the logs (e.g., GPU memory allocation messages, offload logs, or any errors) are relevant? Thank you for your support! ### OS Windows ### GPU Nvidia ### CPU Intel ### Ollama version 0.6.0

GiteaMirror added the bug label 2026-05-04 13:53:19 -05:00

GiteaMirror closed this issue

2026-05-04 13:53:20 -05:00

GiteaMirror commented

2026-05-04 13:53:22 -05:00

@mbretter commented on GitHub (Mar 14, 2025):

On my system I can load gemma3 1b/4b/12b simultaneously on the Tesla T4/16GB:

ollama ps
NAME          ID              SIZE      PROCESSOR    UNTIL               
gemma3:12b    6fd036cefda5    9.3 GB    100% GPU     4 minutes from now     
gemma3:1b     2d27a774bc62    2.1 GB    100% GPU     27 seconds from now    
gemma3:4b     c0494fe00251    3.9 GB    100% GPU     6 seconds from now

nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:01:00.0 Off |                  Off |
| N/A   83C    P0             33W /   70W |   12585MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          484569      C   /usr/local/bin/ollama                  3588MiB |
|    0   N/A  N/A          484770      C   /usr/local/bin/ollama                  1104MiB |
|    0   N/A  N/A          486331      C   /usr/local/bin/ollama                  7890MiB |
+-----------------------------------------------------------------------------------------+

@mbretter commented on GitHub (Mar 14, 2025): On my system I can load gemma3 1b/4b/12b simultaneously on the Tesla T4/16GB: ``` ollama ps NAME ID SIZE PROCESSOR UNTIL gemma3:12b 6fd036cefda5 9.3 GB 100% GPU 4 minutes from now gemma3:1b 2d27a774bc62 2.1 GB 100% GPU 27 seconds from now gemma3:4b c0494fe00251 3.9 GB 100% GPU 6 seconds from now ``` nvidia-smi: ``` +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 Tesla T4 On | 00000000:01:00.0 Off | Off | | N/A 83C P0 33W / 70W | 12585MiB / 16384MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 484569 C /usr/local/bin/ollama 3588MiB | | 0 N/A N/A 484770 C /usr/local/bin/ollama 1104MiB | | 0 N/A N/A 486331 C /usr/local/bin/ollama 7890MiB | +-----------------------------------------------------------------------------------------+ ```

GiteaMirror commented

2026-05-04 13:53:24 -05:00

@mbretter commented on GitHub (Mar 14, 2025):

My first impression is that the gemma3 models are doing pretty well and fast:

>>> Alice has four brothers and she also has a sister. How many sisters does Alice's brother have?
This is a bit of a trick question! Here's how to solve it:

*   Alice's brothers all share the same sisters.
*   Alice is one sister.
*   Therefore, Alice's brothers each have **two** sisters (Alice and her sister).

total duration:       3.04409804s
load duration:        43.357277ms
prompt eval count:    30 token(s)
prompt eval duration: 345ms
prompt eval rate:     86.96 tokens/s
eval count:           61 token(s)
eval duration:        2.653s
eval rate:            22.99 tokens/s

>>> How many 'r's are in the word strawberry?
There are three "r"s in the word "strawberry".

total duration:       973.844571ms
load duration:        41.808834ms
prompt eval count:    21 token(s)
prompt eval duration: 228ms
prompt eval rate:     92.11 tokens/s
eval count:           15 token(s)
eval duration:        702ms
eval rate:            21.37 tokens/s

Many of the other models are stumbling over this prompts while taking more resources.

@mbretter commented on GitHub (Mar 14, 2025): My first impression is that the gemma3 models are doing pretty well and fast: ``` >>> Alice has four brothers and she also has a sister. How many sisters does Alice's brother have? This is a bit of a trick question! Here's how to solve it: * Alice's brothers all share the same sisters. * Alice is one sister. * Therefore, Alice's brothers each have **two** sisters (Alice and her sister). total duration: 3.04409804s load duration: 43.357277ms prompt eval count: 30 token(s) prompt eval duration: 345ms prompt eval rate: 86.96 tokens/s eval count: 61 token(s) eval duration: 2.653s eval rate: 22.99 tokens/s ``` ``` >>> How many 'r's are in the word strawberry? There are three "r"s in the word "strawberry". total duration: 973.844571ms load duration: 41.808834ms prompt eval count: 21 token(s) prompt eval duration: 228ms prompt eval rate: 92.11 tokens/s eval count: 15 token(s) eval duration: 702ms eval rate: 21.37 tokens/s ``` Many of the other models are stumbling over this prompts while taking more resources.

GiteaMirror commented

2026-05-04 13:53:25 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

Did you checked it with Context Size: 32k ???

I can download it again and post some more information...

@ALLMI78 commented on GitHub (Mar 14, 2025): Did you checked it with Context Size: 32k ??? I can download it again and post some more information...

GiteaMirror commented

2026-05-04 13:53:26 -05:00

@mbretter commented on GitHub (Mar 14, 2025):

nope, just with defaults

  Model
    architecture        gemma3    
    parameters          12.2B     
    context length      8192      
    embedding length    3840      
    quantization        Q4_K_M

@mbretter commented on GitHub (Mar 14, 2025): nope, just with defaults ``` Model architecture gemma3 parameters 12.2B context length 8192 embedding length 3840 quantization Q4_K_M ```

GiteaMirror commented

2026-05-04 13:53:27 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

With the default settings, you might not notice the issue. I observed VRAM requirements of around 22–23 GB when running Gemma 3 with a 32k context and the default OLLAMA_KV_CACHE_TYPE=f16.

In comparison, Qwen-14B (also with f16 and a 32k context) only uses around 16–17 GB of VRAM.

Trying to reduce the VRAM usage by setting OLLAMA_KV_CACHE_TYPE=q8_0 for Gemma 3 didn’t help either, as it became very slow—the model suddenly shifted a significant portion of the workload to the CPU.

@ALLMI78 commented on GitHub (Mar 14, 2025): With the default settings, you might not notice the issue. I observed VRAM requirements of around 22–23 GB when running Gemma 3 with a 32k context and the default `OLLAMA_KV_CACHE_TYPE=f16`. In comparison, Qwen-14B (also with `f16` and a 32k context) only uses around 16–17 GB of VRAM. Trying to reduce the VRAM usage by setting `OLLAMA_KV_CACHE_TYPE=q8_0` for Gemma 3 didn’t help either, as it became very slow—the model suddenly shifted a significant portion of the workload to the CPU.

GiteaMirror commented

2026-05-04 13:53:28 -05:00

@Asher9971 commented on GitHub (Mar 14, 2025):

Same here. Using Flash Attention with Gemma3 -> 10% GPU usage 100% CPU usage
without flash attention in ollama everything works fine.
With llama, qwen or all other models i tried flash attenation was never a problem

@Asher9971 commented on GitHub (Mar 14, 2025): Same here. Using Flash Attention with Gemma3 -> 10% GPU usage 100% CPU usage without flash attention in ollama everything works fine. With llama, qwen or all other models i tried flash attenation was never a problem

GiteaMirror commented

2026-05-04 13:53:30 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

I'm currently not sure how OLLAMA_FLASH_ATTENTION affects this as well. Thanks for the hint—I’ll test it later.

OLLAMA_NEW_ENGINE:false ???

whats the default of OLLAMA_NEW_ENGINE?

@ALLMI78 commented on GitHub (Mar 14, 2025): I'm currently not sure how OLLAMA_FLASH_ATTENTION affects this as well. Thanks for the hint—I’ll test it later. OLLAMA_NEW_ENGINE:false ??? whats the default of OLLAMA_NEW_ENGINE?

GiteaMirror commented

2026-05-04 13:53:31 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

Ahh OLLAMA_NEW_ENGINE seems to be new, it is false for me....

2025/03/14 12:25:25 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\OLLAMA\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"

num_ctx @ 32k via api call...

"options":{ "seed":219929875, "num_predict":4096, "top_k":2, "top_p":0.60000000, "min_p":0.10000000, "temperature": 0.40000000, "num_ctx": 32768, "num_batch":128, "num_gpu":100, "main_gpu":0, "repeat_last_n":128, "repeat_penalty":1.10000000, "use_mmap":false, "use_mlock":true, "num_thread":8},"stream":false,"cache_prompt":false,"keep_alive":0}';

gemma3:12b 22 GB 28%/72% CPU/GPU Stopping...

RAM usage goes also up, CPU usage is @ 20-30%

It runs but does not answer, it runs forever...

@ALLMI78 commented on GitHub (Mar 14, 2025): Ahh OLLAMA_NEW_ENGINE seems to be new, it is false for me.... > 2025/03/14 12:25:25 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\\OLLAMA\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" num_ctx @ 32k via api call... > "options":{ "seed":219929875, "num_predict":4096, "top_k":2, "top_p":0.60000000, "min_p":0.10000000, "temperature": 0.40000000, "num_ctx": 32768, "num_batch":128, "num_gpu":100, "main_gpu":0, "repeat_last_n":128, "repeat_penalty":1.10000000, "use_mmap":false, "use_mlock":true, "num_thread":8},"stream":false,"cache_prompt":false,"keep_alive":0}'; gemma3:12b **_22_** GB 28%/72% CPU/GPU Stopping... RAM usage goes also up, CPU usage is @ 20-30% It runs but does not answer, it runs forever...

GiteaMirror commented

2026-05-04 13:53:32 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

time=2025-03-14T12:32:45.014+01:00 level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-14T12:32:45.014+01:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""
time=2025-03-14T12:32:45.142+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]|\s[\r\n]+|\s+(?!\S)|\s+"
time=2025-03-14T12:32:45.149+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-14T12:32:45.152+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]|\s[\r\n]+|\s+(?!\S)|\s+"
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30
time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-14T12:32:45.160+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\Users\admin\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model M:\OLLAMA\models\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3 --ctx-size 32768 --batch-size 128 --n-gpu-layers 100 --threads 8 --flash-attn --no-mmap --mlock --parallel 1 --port 49159"

kv cache type not supported by model ???
a lot of key not found

@ALLMI78 commented on GitHub (Mar 14, 2025): > time=2025-03-14T12:32:45.014+01:00 level=INFO source=server.go:185 msg="enabling flash attention" time=2025-03-14T12:32:45.014+01:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type="" time=2025-03-14T12:32:45.142+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-14T12:32:45.149+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-03-14T12:32:45.152+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30 time=2025-03-14T12:32:45.159+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-03-14T12:32:45.160+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\admin\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model M:\\OLLAMA\\models\\blobs\\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3 --ctx-size 32768 --batch-size 128 --n-gpu-layers 100 --threads 8 --flash-attn --no-mmap --mlock --parallel 1 --port 49159" - kv cache type not supported by model ??? - a lot of key not found

GiteaMirror commented

2026-05-04 13:53:35 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

2025/03/14 12:40:09 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\OLLAMA\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"

with OLLAMA_FLASH_ATTENTION false

gemma3:12b 22 GB 28%/72% CPU/GPU Stopping...

VRAM 22GB and normal RAM usage goes up by 10 GB, CPU usage is @ 20-30%

It runs but does not answer, it runs into my 5 min timeout...

@ALLMI78 commented on GitHub (Mar 14, 2025): > 2025/03/14 12:40:09 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\\OLLAMA\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" with OLLAMA_FLASH_ATTENTION false gemma3:12b 22 GB 28%/72% CPU/GPU Stopping... VRAM 22GB and normal RAM usage goes up by 10 GB, CPU usage is @ 20-30% It runs but does not answer, it runs into my 5 min timeout...

GiteaMirror commented

2026-05-04 13:53:36 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

[GIN] 2025/03/14 - 12:41:35 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/03/14 - 12:41:35 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/03/14 - 12:45:44 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/03/14 - 12:45:44 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/03/14 - 12:46:23 | 500 | 4m59s | 127.0.0.1 | POST "/api/chat"
time=2025-03-14T12:46:42.445+01:00 level=INFO source=server.go:105 msg="system memory" total="31.0 GiB" free="24.8 GiB" free_swap="28.6 GiB"
time=2025-03-14T12:46:42.447+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=100 layers.model=49 layers.offload=35 layers.split="" memory.available="[14.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.6 GiB" memory.required.partial="14.8 GiB" memory.required.kv="12.0 GiB" memory.required.allocations="[14.8 GiB]" memory.weights.total="18.0 GiB" memory.weights.repeating="17.3 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="279.8 MiB" memory.graph.partial="917.4 MiB"

memory.required.kv="12.0 GiB" ???

Driver Version: 572.47 CUDA Version: 12.8

@ALLMI78 commented on GitHub (Mar 14, 2025): > [GIN] 2025/03/14 - 12:41:35 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/03/14 - 12:41:35 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/03/14 - 12:45:44 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/03/14 - 12:45:44 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/03/14 - 12:46:23 | 500 | 4m59s | 127.0.0.1 | POST "/api/chat" time=2025-03-14T12:46:42.445+01:00 level=INFO source=server.go:105 msg="system memory" total="31.0 GiB" free="24.8 GiB" free_swap="28.6 GiB" time=2025-03-14T12:46:42.447+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=100 layers.model=49 layers.offload=35 layers.split="" memory.available="[14.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.6 GiB" memory.required.partial="14.8 GiB" memory.required.kv="12.0 GiB" memory.required.allocations="[14.8 GiB]" memory.weights.total="18.0 GiB" memory.weights.repeating="17.3 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="279.8 MiB" memory.graph.partial="917.4 MiB" - memory.required.kv="12.0 GiB" ??? Driver Version: 572.47 CUDA Version: 12.8

GiteaMirror commented

2026-05-04 13:53:37 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

Sorry, but I’m having a bit of trouble following your comments. I’ve noticed that you’ve been very active in other threads as well, but despite the detailed input, I haven’t really been able to find a solution from it. No offense intended—I just prefer a more direct approach instead of trial and error.

For now, I’ve removed Gemma-3 12b, as I couldn’t get it to run properly with a 32k context, while Qwen-14B works flawlessly on my setup.

@ALLMI78 commented on GitHub (Mar 14, 2025): Sorry, but I’m having a bit of trouble following your comments. I’ve noticed that you’ve been very active in other threads as well, but despite the detailed input, I haven’t really been able to find a solution from it. No offense intended—I just prefer a more direct approach instead of trial and error. For now, I’ve removed Gemma-3 12b, as I couldn’t get it to run properly with a 32k context, while Qwen-14B works flawlessly on my setup.

GiteaMirror commented

2026-05-04 13:53:38 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

Hey, I appreciate you taking the time to respond, but honestly, your comments haven’t been very helpful. They often seem quite general or not really relevant to the actual issue.

I’m looking for concrete solutions or at least well-founded insights that can genuinely help—random guesses unfortunately don’t get me any further. Hope you understand.

@ALLMI78 commented on GitHub (Mar 14, 2025): Hey, I appreciate you taking the time to respond, but honestly, your comments haven’t been very helpful. They often seem quite general or not really relevant to the actual issue. I’m looking for concrete solutions or at least well-founded insights that can genuinely help—random guesses unfortunately don’t get me any further. Hope you understand.

GiteaMirror commented

2026-05-04 13:53:41 -05:00

@wisepmlin commented on GitHub (Mar 14, 2025):

Error: llama runner process has terminated: this model is not supported by your version of Ollama. You may need to upgrade

@wisepmlin commented on GitHub (Mar 14, 2025): Error: llama runner process has terminated: this model is not supported by your version of Ollama. You may need to upgrade

GiteaMirror commented

2026-05-04 13:53:45 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

ollama 0.6.0

@ALLMI78 commented on GitHub (Mar 14, 2025): ollama 0.6.0

GiteaMirror commented

2026-05-04 13:53:49 -05:00

@gregbarbosa commented on GitHub (Mar 14, 2025):

For anyone running into issues with Ollama, Gemma, Flash and KV Cache, try out the 0.6.1 pre-release. Previously, gemma3:4b on my M1 Pro was choking on the most mundane 'hello' message and memory usage was skyrocketing. Now it can ingest context and reply as expected.

@gregbarbosa commented on GitHub (Mar 14, 2025): For anyone running into issues with Ollama, Gemma, Flash and KV Cache, try out the [0.6.1 pre-release](https://github.com/ollama/ollama/releases/tag/v0.6.1-rc0). Previously, gemma3:4b on my M1 Pro was choking on the most mundane 'hello' message and memory usage was skyrocketing. Now it can ingest context and reply as expected.

GiteaMirror commented

2026-05-04 13:53:51 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

With Ollama version 0.6.1 and OLLAMA_KV_CACHE_TYPE=q8_0:

32k context size
still extremely high CPU usage instead of GPU usage.

2025/03/14 17:45:04 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\OLLAMA\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"

,"options":{ "seed":239066125, "num_predict":4096, "top_k":2, "top_p":0.60000000, "min_p":0.10000000, "temperature": 0.40000000, "num_ctx": 32768, "num_batch":128, "num_gpu":100, "main_gpu":0, "repeat_last_n":128, "repeat_penalty":1.10000000, "use_mmap":false, "use_mlock":true, "num_thread":8},"stream":false,"cache_prompt":false,"keep_alive":0}';

you can see the high CPU load here in this example...

@ALLMI78 commented on GitHub (Mar 14, 2025): With Ollama version 0.6.1 and OLLAMA_KV_CACHE_TYPE=q8_0: - 32k context size - still extremely high CPU usage instead of GPU usage. ![Image](https://github.com/user-attachments/assets/c7d0a024-4675-4f8a-bb09-64f0a0fbc920) > 2025/03/14 17:45:04 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\\OLLAMA\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" > ,"options":{ "seed":239066125, "num_predict":4096, "top_k":2, "top_p":0.60000000, "min_p":0.10000000, "temperature": 0.40000000, "num_ctx": 32768, "num_batch":128, "num_gpu":100, "main_gpu":0, "repeat_last_n":128, "repeat_penalty":1.10000000, "use_mmap":false, "use_mlock":true, "num_thread":8},"stream":false,"cache_prompt":false,"keep_alive":0}'; you can see the high CPU load here in this example...

GiteaMirror commented

2026-05-04 13:53:53 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

With Ollama version 0.6.1 and the default OLLAMA_KV_CACHE_TYPE = f16 it runs @ 100% GPU but again high (24 GB) VRAM usage:

2025/03/14 17:57:38 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\OLLAMA\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"

NAME ID SIZE PROCESSOR UNTIL
gemma3:12b 6fd036cefda5 24 GB 34%/66% CPU/GPU Stopping...

Both examples with 32k context size...

@ALLMI78 commented on GitHub (Mar 14, 2025): With Ollama version 0.6.1 and the default OLLAMA_KV_CACHE_TYPE = f16 it runs @ 100% GPU but again high (24 GB) VRAM usage: > 2025/03/14 17:57:38 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:M:\\OLLAMA\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" > NAME ID SIZE PROCESSOR UNTIL gemma3:12b 6fd036cefda5 **_24 GB_** 34%/66% CPU/GPU Stopping... Both examples with 32k context size...

GiteaMirror commented

2026-05-04 13:53:55 -05:00

@ALLMI78 commented on GitHub (Mar 14, 2025):

compared to qwen 14b | 32k context size

Ollama version 0.6.1 | default OLLAMA_KV_CACHE_TYPE = f16

Qwen2.5-14B-Q4_K_M 17 GB 7%/93% CPU/GPU

runs fine...

@ALLMI78 commented on GitHub (Mar 14, 2025): compared to qwen 14b | 32k context size Ollama version 0.6.1 | default OLLAMA_KV_CACHE_TYPE = f16 > Qwen2.5-14B-Q4_K_M **_17 GB_** 7%/93% CPU/GPU runs fine...

GiteaMirror commented

2026-05-04 13:53:56 -05:00

@LFd3v commented on GitHub (Mar 14, 2025):

I also have a similar problem here (Linux/Nvidia): gemma3:4b and gemma3:12b use an awful amount of RAM, even if Ollama says it is fully loaded in the GPU:

$ ollama ps
NAME         ID              SIZE      PROCESSOR    UNTIL
gemma3:4b    c0494fe00251    4.2 GB    100% GPU     4 minutes from now

The 4B seems to run fine here (related to tps, albeit using almost all free RAM), but after just a "hello", the 12B model simply causes Ollama to quit due to OOM.

I have seem something similar with llama3.2-vision, but only when sending an image to the model: from what I have read, the projector uses a different architecture than other models that cannot be loaded using both the GPU and CPU, then for some reason it just use all RAM available when an image needs to be processed and crashes Ollama.

As the gemma3 models above 1B have the projector included in the main GGUF, maybe this is why this happens with gemma3 > 1B?

@LFd3v commented on GitHub (Mar 14, 2025): I also have a similar problem here (Linux/Nvidia): gemma3:4b and gemma3:12b use an awful amount of RAM, even if Ollama says it is fully loaded in the GPU: ``` $ ollama ps NAME ID SIZE PROCESSOR UNTIL gemma3:4b c0494fe00251 4.2 GB 100% GPU 4 minutes from now ``` The 4B seems to run fine here (related to tps, albeit using almost all free RAM), but after just a "hello", the 12B model simply causes Ollama to quit due to OOM. I have seem something similar with llama3.2-vision, but only when sending an image to the model: from what I have read, the projector uses a different architecture than other models that cannot be loaded using both the GPU and CPU, then for some reason it just use all RAM available when an image needs to be processed and crashes Ollama. As the gemma3 models above 1B have the projector included in the main GGUF, maybe this is why this happens with gemma3 > 1B?

GiteaMirror commented

2026-05-04 13:53:57 -05:00

@ALLMI78 commented on GitHub (Mar 16, 2025):

gemma3: with 8 gb model size for a 12b model i'm @ 24 gb...

16 gb for the 32k context ? + around 10-15 GB ram?

that is normal or not?

how can qwen-14b with 32k do the same with only 16-17 gb?

@ALLMI78 commented on GitHub (Mar 16, 2025): gemma3: with 8 gb model size for a 12b model i'm @ 24 gb... 16 gb for the 32k context ? + around 10-15 GB ram? that is normal or not? how can qwen-14b with 32k do the same with only 16-17 gb?

Sign in to join this conversation.

Branches Tags

main

hoyyeva/fix-claude-channels-env

parth-update-hermes-launch

hoyyeva/vscode-extension-docs-update

parth-gemma4-chat-template-renderer

parth-api-status-context-length

hoyyeva/wire-up-context-length

hoyyeva/claude-code-context-doc

jmorganca/investigate-issue-17046

hoyyeva/hermes-docs

jmorganca/agent-loop-style

hoyyeva/openclaw

parth-agent-loop

hoyyeva/ollama-vscode-extension

brucemacd/cache-metrics

brucemacd/hermes-desktop

hoyyeva/docs-vscode

parth-input-style-experiment

brucemacd/docs-glm52

hoyyeva/poc-docs

Parth/mlx-launch-recommendations

parth-first-time-app-cli-experience

test/darwin-xcode-pin

improve-cloud-model-recommendations

hoyyeva/goose-docs

jmorganca/context-limit-fixes

hoyyeva/qwen-doc

hoyyeva/vscode-docs

jmorganca/remove-mlx-imagegen-code

parth-copilot-token-length-defaults

hoyyeva/poolside-windows

laguna-support

jmorganca/harden-markdown-rendering

laguna-renderer-parser

laguna-llamacpp

codex/make-integration-hidden-and-lunchable

brucemacd/omp-docs

pdevine/gguf-mtp-oldstyle

hoyyeva/migrate-pi

hoyyeva/anthropic-local-image-path

parth-launch-codex-app

hoyyeva/anthropic-reference-images-path

parth-anthropic-reference-images-path

brucemacd/download-before-remove

hoyyeva/editor-config-repair

parth-mlx-decode-checkpoints

parth/hide-claude-desktop-till-release

parth-add-claude-code-autoinstall

release_v0.22.0

pdevine/manifest-list

codex/fix-codex-model-metadata-warning

pdevine/addressable-manifest

brucemacd/launch-fetch-reccomended

jmorganca/llama-compat

launch-copilot-cli

release_v0.20.7

parth-auto-save-backup

parth-test

jmorganca/gemma4-audio-replacements

fix-manifest-digest-on-pull

hoyyeva/vscode-improve

brucemacd/install-server-wait

parth/update-claude-docs

brucemac/start-ap-install

pdevine/mlx-update

pdevine/qwen35_vision

drifkin/api-show-fallback

mintlify/image-generation-1773352582

hoyyeva/server-context-length-local-config

jmorganca/faster-reptition-penalties

jmorganca/convert-nemotron

parth-pi-thinking

pdevine/sampling-penalties

jmorganca/fix-create-quantization-memory

dongchen/resumable_transfer_fix

pdevine/sampling-cache-error

jessegross/mlx-usage

hoyyeva/openclaw-config

hoyyeva/app-html

pdevine/qwen3next

brucemacd/sign-sh-install

brucemacd/tui-update

brucemacd/usage-api

jmorganca/launch-empty

fix-app-dist-embed

mxyng/mlx-compile

mxyng/mlx-quant

mxyng/mlx-glm4.7

mxyng/mlx

brucemacd/simplify-model-picker

jmorganca/qwen3-concurrent

fix-glm-4.7-flash-mla-config

drifkin/qwen3-coder-opening-tag

brucemacd/usage-cli

fix-cuda12-fattn-shmem

ollama-imagegen-docs

parth/fix-multiline-inputs

brucemacd/config-docs

mxyng/model-files

mxyng/simple-execute

fix-imagegen-ollama-models

mxyng/async-upload

jmorganca/lazy-no-dtype-changes

imagegen-auto-detect-create

parth/decrease-concurrent-download-hf

fix-mlx-quantize-init

jmorganca/x-cleanup

usage

imagegen-readme

jmorganca/glm-image

mlx-gpu-cd

jmorganca/imagegen-modelfile

parth/agent-skills

parth/agent-allowlist

parth/signed-in-offline

parth/agents

parth/fix-context-chopping

improve-cloud-flow

parth/add-models-websearch

parth/prompt-renderer-mcp

jmorganca/native-settings

jmorganca/download-stream-hash

jmorganca/client2-rebased

brucemacd/oai-chat-req-multipart

jessegross/multi_chunk_reserve

grace/additional-omit-empty

grace/mistral-3-large

mxyng/tokenizer2

mxyng/tokenizer

jessegross/flash

hoyyeva/windows-nacked-app

mxyng/cleanup-attention

grace/deepseek-parser

hoyyeva/remember-unsent-prompt

parth/add-lfs-pointer-error-conversion

parth/olmo2-test2

hoyyeva/ollama-launchagent-plist

nicole/olmo-model

parth/olmo-test

mxyng/remove-embedded

parth/render-template

jmorganca/intellect-3

parth/remove-prealloc-linter

jmorganca/cmd-eval

nicole/nomic-embed-text-fix

mxyng/lint-2

hoyyeva/add-gemini-3-pro-preview

hoyyeva/load-model-list

mxyng/expand-path

mxyng/environ-2

hoyyeva/deeplink-json-encoding

parth/improve-tool-calling-tests

hoyyeva/conversation

hoyyeva/assistant-edit-response

hoyyeva/thinking

origin/brucemacd/invalid-char-i-err

parth/improve-tool-calling

jmorganca/required-omitempty

grace/qwen3-vl-tests

mxyng/iter-client

parth/docs-readme

nicole/embed-test

pdevine/integration-benchstat

parth/remove-generate-cmd

parth/add-toolcall-id

mxyng/server-tests

jmorganca/glm-4.6

jmorganca/gin-h-compat

drifkin/stable-tool-args

pdevine/qwen3-more-thinking

parth/add-websearch-client

nicole/websearch_local

jmorganca/qwen3-coder-updates

grace/deepseek-v3-migration-tests

mxyng/fix-create

jmorganca/cloud-errors

pdevine/parser-tidy

revert-12233-parth/simplify-entrypoints-runner

parth/enable-so-gpt-oss

brucemacd/qwen3vl

jmorganca/readme-simplify

parth/gpt-oss-structured-outputs

revert-12039-jmorganca/tools-braces

mxyng/embeddings

mxyng/gguf

mxyng/benchmark

mxyng/types-null

parth/move-parsing

mxyng/gemma2

jmorganca/docs

mxyng/16-bit

mxyng/create-stdin

pdevine/authorizedkeys

mxyng/quant

parth/opt-in-error-context-window

brucemacd/cache-models

brucemacd/runner-completion

jmorganca/llama-update-6

brucemacd/benchmark-list

brucemacd/partial-read-caps

parth/deepseek-r1-tools

mxyng/omit-array

parth/tool-prefix-temp

brucemacd/runner-test

jmorganca/qwen25vl

brucemacd/model-forward-test-ext

parth/python-function-parsing

jmorganca/cuda-compression-none

drifkin/num-parallel

drifkin/chat-truncation-fix

jmorganca/sync

parth/python-tools-calling

drifkin/array-head-count

brucemacd/create-no-loop

parth/server-enable-content-stream-with-tools

qwen25omni

mxyng/v3

brucemacd/ropeconfig

jmorganca/silence-tokenizer

parth/sample-so-test

parth/sampling-structured-outputs

brucemacd/doc-go-engine

parth/constrained-sampling-json

jmorganca/mistral-wip

brucemacd/mistral-small-convert

parth/sample-unmarshal-json-for-params

brucemacd/jomorganca/mistral

pdevine/bfloat16

jmorganca/mistral

brucemacd/mistral

pdevine/logging

parth/sample-correctness-fix

parth/sample-fix-sorting

jmorgan/sample-fix-sorting-extras

jmorganca/temp-0-images

brucemacd/parallel-embed-models

brucemacd/shim-grammar

jmorganca/fix-gguf-error

bmizerany/nameswork

jmorganca/faster-releases

bmizerany/validatenames

brucemacd/err-no-vocab

brucemacd/rope-config

brucemacd/err-hint

brucemacd/qwen2_5

brucemacd/logprobs

brucemacd/new_runner_graph_bench

progress-flicker

brucemacd/forward-test

brucemacd/go_qwen2

pdevine/gemma2

jmorganca/add-missing-symlink-eval

mxyng/next-debug

parth/set-context-size-openai

brucemacd/next-bpe-bench

brucemacd/next-bpe-test

brucemacd/new_runner_e2e

brucemacd/new_runner_qwen2

pdevine/convert-cohere2

brucemacd/convert-cli

parth/log-probs

mxyng/next-mlx

mxyng/cmd-history

parth/templating

parth/tokenize-detokenize

brucemacd/check-key-register

bmizerany/grammar

jmorganca/vendor-081b29bd

mxyng/func-checks

jmorganca/fix-null-format

parth/fix-default-to-warn-json

jmorganca/qwen2vl

jmorganca/no-concat

parth/cmd-cleanup-SO

brucemacd/check-key-register-structured-err

parth/openai-stream-usage

parth/fix-referencing-so

stream-tools-stop

jmorganca/degin-1

brucemacd/install-path-clean

brucemacd/push-name-validation

brucemacd/browser-key-register

jmorganca/openai-fix-first-message

jmorganca/fix-proxy

jessegross/sample

parth/disallow-streaming-tools

dhiltgen/remove_submodule

jmorganca/ga

jmorganca/mllama

pdevine/newlines

pdevine/geems-2b

jmorganca/llama-bump

mxyng/modelname-7

mxyng/gin-slog

mxyng/modelname-6

jyan/convert-prog

jyan/quant5

paligemma-support

pdevine/import-docs

jmorganca/openai-context

jyan/paligemma

jyan/p2

jyan/palitest

bmizerany/embedspeedup

jmorganca/llama-vit

brucemacd/allow-ollama

royh/ep-methods

royh/whisper

mxyng/api-models

mxyng/fix-memory

jyan/q4_4/8

jyan/ollama-v

royh/stream-tools

roy-embed-parallel

bmizerany/hrm

revert-5963-revert-5924-mxyng/llama3.1-rope

royh/embed-viz

jyan/local2

jyan/auth

jyan/local

jyan/parse-temp

jmorganca/template-mistral

jyan/reord-g

royh-openai-suffixdocs

royh-imgembed

royh-embed-parallel

jyan/quant4

royh-precision

jyan/progress

pdevine/fix-template

jyan/quant3

pdevine/ggla

mxyng/update-registry-domain

jmorganca/ggml-static

mxyng/create-context

jyan/v0.146

mxyng/layers-from-files

build_dist

bmizerany/noseek

royh-ls

royh-name

timeout

mxyng/server-timestamp

bmizerany/nosillyggufslurps

royh-params

jmorganca/llama-cpp-7c26775

royh-openai-delete

royh-show-rigid

jmorganca/enable-fa

jmorganca/no-error-template

jyan/format

royh-testdelete

bmizerany/fastverify

language_support

pdevine/ps-glitches

brucemacd/tokenize

bruce/iq-quants

bmizerany/filepathwithcoloninhost

mxyng/split-bin

bmizerany/client-registry

jmorganca/if-none-match

native

jmorganca/native

jmorganca/batch-embeddings

jmorganca/initcmake

jmorganca/mm

pdevine/showggmlinfo

modenameenforcealphanum

bmizerany/modenameenforcealphanum

jmorganca/done-reason

jmorganca/llama-cpp-8960fe8

ollama.com

bmizerany/filepathnobuild

bmizerany/types/model/defaultfix

rmdisplaylong

nogogen

bmizerany/x

modelfile-readme

bmizerany/replacecolon

jmorganca/limit

jmorganca/execstack

jmorganca/replace-assets

mxyng/tune-concurrency

jmorganca/testing

whitespace-detection

jmorganca/options

upgrade-all

scratch

cuda-search

mattw/airenamer

mattw/allmodelsonhuggingface

mattw/quantcontext

mattw/whatneedstorun

brucemacd/llama-mem-calc

mattw/faq-context

mattw/communitylinks

mattw/noprune

mattw/python-functioncalling

rename

mxyng/install

pulse

remove-first

editor

mattw/selfqueryingretrieval

cgo

mattw/howtoquant

api

matt/streamingapi

format-config

mxyng/extra-args

shell

update-nous-hermes

cp-model

upload-progress

fix-unknown-model

fix-model-names

delete-fix

insecure-registry

ls

deletemodels

progressbar

readme-updates

license-layers

skip-list

list-models

modelpath

matt/examplemodelfiles

distribution

go-opts

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: github-starred/ollama#68418