[GH-ISSUE #9678] Unusually high VRAM usage of Gemma 3 27B #6315

New Issue

GiteaMirror · 2026-04-12T17:47:46-05:00

GiteaMirror commented

2026-04-12 17:47:46 -05:00

Originally created by @vYLQs6 on GitHub (Mar 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9678

What is the issue?

I'm using Gemma 3 27B Q4KM: https://www.ollama.com/library/gemma3:27b

GPU: 4090

set OLLAMA_FLASH_ATTENTION=1 && set OLLAMA_KV_CACHE_TYPE=q8_0 && ollama serve

When using Gemma 3 27B with a context length of 20,000 (20k), I only have about 1 GB of VRAM left.

However, when using Qwen2.5 32B IQ4XS, which is basically the same size as Gemma 3 27B Q4KM, with a full 32K context, I still have 2 GB of VRAM left.

Is this a bug, or is Gemma 3's context cache just less efficient?

Relevant log output



time=2025-03-12T17:03:54.752+08:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-f47e9117-13d8-d21e-7b80-735c8d31444d library=cuda total="24.0 GiB" available="17.9 GiB"
time=2025-03-12T17:03:55.250+08:00 level=INFO source=server.go:105 msg="system memory" total="63.6 GiB" free="53.6 GiB"free_swap="107.2 GiB"
time=2025-03-12T17:03:55.265+08:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=256 layers.model=63 layers.offload=62 layers.split="" memory.available="[22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.5 GiB" memory.required.partial="21.4 GiB" memory.required.kv="4.7 GiB" memory.required.allocations="[21.4 GiB]" memory.weights.total="19.1 GiB" memory.weights.repeating="18.0 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.6 GiB"
time=2025-03-12T17:03:55.265+08:00 level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-12T17:03:55.331+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-12T17:03:55.334+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-12T17:03:55.335+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcappingdefault=30
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-12T17:03:55.342+08:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\***\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\LLM\\.ollama\\models\\blobs\\sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 20000 --batch-size 512 --n-gpu-layers 256 --threads 16 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --port 65374"
time=2025-03-12T17:03:55.346+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-12T17:03:55.346+08:00 level=INFO source=server.go:585 msg="waiting for llama runner to start responding"
time=2025-03-12T17:03:55.346+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-12T17:03:55.364+08:00 level=INFO source=runner.go:882 msg="starting ollama engine"
time=2025-03-12T17:03:55.369+08:00 level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:65374"
time=2025-03-12T17:03:55.431+08:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-03-12T17:03:55.431+08:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-03-12T17:03:55.431+08:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\***\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\***\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-03-12T17:03:55.511+08:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-03-12T17:03:55.599+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
time=2025-03-12T17:03:55.610+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-03-12T17:03:55.610+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
time=2025-03-12T17:03:59.907+08:00 level=INFO source=ggml.go:356 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
time=2025-03-12T17:03:59.907+08:00 level=INFO source=ggml.go:356 msg="compute graph" backend=CPU buffer_type=CUDA_Host
time=2025-03-12T17:03:59.908+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-12T17:03:59.911+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-12T17:03:59.913+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcappingdefault=30
time=2025-03-12T17:03:59.918+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-12T17:04:00.137+08:00 level=INFO source=server.go:624 msg="llama runner started in 4.79 seconds"
[GIN] 2025/03/12 - 17:04:09 | 200 |   14.5549536s |       127.0.0.1 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

v0.6.0

Originally created by @vYLQs6 on GitHub (Mar 12, 2025). Original GitHub issue: https://github.com/ollama/ollama/issues/9678 ### What is the issue? I'm using Gemma 3 27B Q4KM: https://www.ollama.com/library/gemma3:27b GPU: 4090 `set OLLAMA_FLASH_ATTENTION=1 && set OLLAMA_KV_CACHE_TYPE=q8_0 && ollama serve` When using Gemma 3 27B with a context length of 20,000 (20k), I only have about 1 GB of VRAM left. However, when using Qwen2.5 32B IQ4XS, which is basically the same size as Gemma 3 27B Q4KM, with a full 32K context, I still have 2 GB of VRAM left. Is this a bug, or is Gemma 3's context cache just less efficient? ### Relevant log output ```shell time=2025-03-12T17:03:54.752+08:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-f47e9117-13d8-d21e-7b80-735c8d31444d library=cuda total="24.0 GiB" available="17.9 GiB" time=2025-03-12T17:03:55.250+08:00 level=INFO source=server.go:105 msg="system memory" total="63.6 GiB" free="53.6 GiB"free_swap="107.2 GiB" time=2025-03-12T17:03:55.265+08:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=256 layers.model=63 layers.offload=62 layers.split="" memory.available="[22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.5 GiB" memory.required.partial="21.4 GiB" memory.required.kv="4.7 GiB" memory.required.allocations="[21.4 GiB]" memory.weights.total="19.1 GiB" memory.weights.repeating="18.0 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.6 GiB" time=2025-03-12T17:03:55.265+08:00 level=INFO source=server.go:185 msg="enabling flash attention" time=2025-03-12T17:03:55.331+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-12T17:03:55.334+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-03-12T17:03:55.335+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcappingdefault=30 time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-03-12T17:03:55.342+08:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\***\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\LLM\\.ollama\\models\\blobs\\sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 20000 --batch-size 512 --n-gpu-layers 256 --threads 16 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --port 65374" time=2025-03-12T17:03:55.346+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2025-03-12T17:03:55.346+08:00 level=INFO source=server.go:585 msg="waiting for llama runner to start responding" time=2025-03-12T17:03:55.346+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error" time=2025-03-12T17:03:55.364+08:00 level=INFO source=runner.go:882 msg="starting ollama engine" time=2025-03-12T17:03:55.369+08:00 level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:65374" time=2025-03-12T17:03:55.431+08:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" time=2025-03-12T17:03:55.431+08:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" time=2025-03-12T17:03:55.431+08:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes load_backend: loaded CUDA backend from C:\Users\***\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll load_backend: loaded CPU backend from C:\Users\***\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll time=2025-03-12T17:03:55.511+08:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2025-03-12T17:03:55.599+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model" time=2025-03-12T17:03:55.610+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB" time=2025-03-12T17:03:55.610+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB" time=2025-03-12T17:03:59.907+08:00 level=INFO source=ggml.go:356 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 time=2025-03-12T17:03:59.907+08:00 level=INFO source=ggml.go:356 msg="compute graph" backend=CPU buffer_type=CUDA_Host time=2025-03-12T17:03:59.908+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-12T17:03:59.911+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-03-12T17:03:59.913+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcappingdefault=30 time=2025-03-12T17:03:59.918+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-03-12T17:04:00.137+08:00 level=INFO source=server.go:624 msg="llama runner started in 4.79 seconds" [GIN] 2025/03/12 - 17:04:09 | 200 | 14.5549536s | 127.0.0.1 | POST "/api/chat" ``` ### OS Windows ### GPU Nvidia ### CPU AMD ### Ollama version v0.6.0

GiteaMirror added the bug label 2026-04-12 17:47:46 -05:00

GiteaMirror commented

2026-04-12 17:47:48 -05:00

@vYLQs6 commented on GitHub (Mar 12, 2025):

Edit: add log

@vYLQs6 commented on GitHub (Mar 12, 2025): Edit: add log

GiteaMirror commented

2026-04-12 17:47:49 -05:00

@Hioness commented on GitHub (Mar 12, 2025):

There are some models that have a much more efficient implementation of context-window memory usage. The qwen2.5 models, exaone3.5, and falcon3 models are very good examples. Assuming gemma3 has a similar arch to gemma2, they'll scale relatively inefficiently for long-context.
I also have experienced high VRAM usage when using vision models, but the gemma3 implementation doesn't seem to have a projector, so I'm not sure if that's a factor here.

@Hioness commented on GitHub (Mar 12, 2025): There are some models that have a much more efficient implementation of context-window memory usage. The qwen2.5 models, exaone3.5, and falcon3 models are very good examples. Assuming gemma3 has a similar arch to gemma2, they'll scale relatively inefficiently for long-context. I also have experienced high VRAM usage when using vision models, but the gemma3 implementation doesn't seem to have a projector, so I'm not sure if that's a factor here.

GiteaMirror commented

2026-04-12 17:47:50 -05:00

@focomfy commented on GitHub (Mar 12, 2025):

Same. When the KV Cache is set to q8, I can run QwQ 32b-q4 16k-ctx and Mistral 24b-q4 32k-ctx on an RTX 4060 Ti 16GB without OOM, but Gemma3 27b-q4 8k-ctx causes OOM.

Edit: OOM still occurs even at 2k-ctx

@focomfy commented on GitHub (Mar 12, 2025): Same. When the KV Cache is set to q8, I can run `QwQ 32b-q4 16k-ctx` and `Mistral 24b-q4 32k-ctx` on an RTX 4060 Ti 16GB without OOM, but `Gemma3 27b-q4 8k-ctx` causes OOM. Edit: OOM still occurs even at `2k-ctx`

GiteaMirror commented

2026-04-12 17:47:51 -05:00

@sirajperson commented on GitHub (Mar 12, 2025):

I'm having the same issue on Ubuntu 24.04. The model seems to have loaded to the GPU, but there are like 47 processes associated wit attempting to run it. Ollama eventually times out:

ollama run gemma3:27b-it-fp16
Error: timed out waiting for llama runner to start - progress 0.00 -

After exiting ollama, Gemma stays in the GPUs and the process that were trying to continue to run on.

@sirajperson commented on GitHub (Mar 12, 2025): I'm having the same issue on Ubuntu 24.04. The model seems to have loaded to the GPU, but there are like 47 processes associated wit attempting to run it. Ollama eventually times out: ollama run gemma3:27b-it-fp16 Error: timed out waiting for llama runner to start - progress 0.00 - After exiting ollama, Gemma stays in the GPUs and the process that were trying to continue to run on.

GiteaMirror commented

2026-04-12 17:47:52 -05:00

@jujaga commented on GitHub (Mar 12, 2025):

Still need to do more tests, but looks like if OLLAMA_KV_CACHE_TYPE is not set to the default f16, it overflows and eats a bunch of system memory. Both tests ran with OLLAMA_FLASH_ATTENTION=1.

gemma3:12b with OLLAMA_KV_CACHE_TYPE=f16, 8k num_ctx - 41.08 tps, ~0.7GB system memory overflow
gemma3:12b with OLLAMA_KV_CACHE_TYPE=q8_0 8k num_ctx - 11.86 tps, ~3GB system memory overflow

In both cases I still have plenty of VRAM available to use still, so something isn't being offloaded to GPU correctly is my preliminary guess. (Tested on Windows 10, Nvidia GPU w/ 16GB VRAM available)

@jujaga commented on GitHub (Mar 12, 2025): Still need to do more tests, but looks like if `OLLAMA_KV_CACHE_TYPE` is not set to the default f16, it overflows and eats a bunch of system memory. Both tests ran with `OLLAMA_FLASH_ATTENTION=1`. - `gemma3:12b` with `OLLAMA_KV_CACHE_TYPE=f16`, 8k num_ctx - 41.08 tps, ~0.7GB system memory overflow - `gemma3:12b` with `OLLAMA_KV_CACHE_TYPE=q8_0` 8k num_ctx - 11.86 tps, ~3GB system memory overflow In both cases I still have plenty of VRAM available to use still, so something isn't being offloaded to GPU correctly is my preliminary guess. (Tested on Windows 10, Nvidia GPU w/ 16GB VRAM available)

GiteaMirror commented

2026-04-12 17:47:55 -05:00

@sapphirepro commented on GitHub (Mar 12, 2025):

As my report was closed, mention here. Quadro mobile p5000. Issues is not just higher video ram usage, but insane, litery leaving only 0.1-0.2 GB free, making whole other UI stuff fully unusable, can not make screenshots, even browser freezes, as vram all 100% used. Typically all models tried before on ollama 0.5.13 left 1 gb free of vram, which was good and system remained fully operational.

@sapphirepro commented on GitHub (Mar 12, 2025): As my report was closed, mention here. Quadro mobile p5000. Issues is not just higher video ram usage, but insane, litery leaving only 0.1-0.2 GB free, making whole other UI stuff fully unusable, can not make screenshots, even browser freezes, as vram all 100% used. Typically all models tried before on ollama 0.5.13 left 1 gb free of vram, which was good and system remained fully operational.

GiteaMirror commented

2026-04-12 17:47:56 -05:00

@sapphirepro commented on GitHub (Mar 12, 2025):

And I found interesting thing. if to run model from terminal over ollama run, works normally (without flag listed below). Over api it doesn't even start, tries to 2 times to take ram and dies.

With update to ollama 0.6.0 something seems broken with API. OpenWebUI fails to run model without environmental flag GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /usr/local/bin/ollama serve and with that flag then eats 100% of ram. My suspect something got broken with API side

@sapphirepro commented on GitHub (Mar 12, 2025): And I found interesting thing. if to run model from terminal over ollama run, works normally (without flag listed below). Over api it doesn't even start, tries to 2 times to take ram and dies. With update to ollama 0.6.0 something seems broken with API. OpenWebUI fails to run model without environmental flag ```GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /usr/local/bin/ollama serve``` and with that flag then eats 100% of ram. My suspect something got broken with API side

GiteaMirror commented

2026-04-12 17:47:57 -05:00

@FelikZ commented on GitHub (Mar 12, 2025):

Similar experience on M1 mac, it just consumes all the memory - regardless of 12b or 27b (on 32Gb)

@FelikZ commented on GitHub (Mar 12, 2025): Similar experience on M1 mac, it just consumes all the memory - regardless of 12b or 27b (on 32Gb)

GiteaMirror commented

2026-04-12 17:47:58 -05:00

@sapphirepro commented on GitHub (Mar 12, 2025):

Similar experience on M1 mac, it just consumes all the memory - regardless of 12b or 27b (on 32Gb)

Just curious, did you use from terminal or over API? For me over API 12B works perfectly, as all fits in vram. 27B goes mad. Either some sorta memory leaks or broken API I suspect. Try both direct from console and over API. See if any difference.

@sapphirepro commented on GitHub (Mar 12, 2025): > Similar experience on M1 mac, it just consumes all the memory - regardless of 12b or 27b (on 32Gb) Just curious, did you use from terminal or over API? For me over API 12B works perfectly, as all fits in vram. 27B goes mad. Either some sorta memory leaks or broken API I suspect. Try both direct from console and over API. See if any difference.

GiteaMirror commented

2026-04-12 17:48:00 -05:00

@Ezbaze commented on GitHub (Mar 13, 2025):

During my testing with an RTX 3080 Ti and 64GB of RAM (40GB free), I found I couldn't set the context_length above 500 when using an image with the gemma3:12b, as it would run out of RAM. Without an image, I could set the context_length to around 32,000 without any issues.

@Ezbaze commented on GitHub (Mar 13, 2025): During my testing with an RTX 3080 Ti and 64GB of RAM (40GB free), I found I couldn't set the context_length above 500 when using an image with the gemma3:12b, as it would run out of RAM. Without an image, I could set the context_length to around 32,000 without any issues.

GiteaMirror commented

2026-04-12 17:48:02 -05:00

@sncix commented on GitHub (Mar 13, 2025):

I have a similar issue with CPU only (no GPU), with Gemma 3 27B consuming around 42.9 GiB of memory (including swap), much higher than other 32B Q4_K_M models in my experience.

@sncix commented on GitHub (Mar 13, 2025): I have a similar issue with CPU only (no GPU), with Gemma 3 27B consuming around 42.9 GiB of memory (including swap), much higher than other 32B Q4_K_M models in my experience.

GiteaMirror commented

2026-04-12 17:48:04 -05:00

@Igorgro commented on GitHub (Mar 13, 2025):

@Ezbaze what do you mean by 'with image' or 'without image'? What parameters did you set?

@Igorgro commented on GitHub (Mar 13, 2025): @Ezbaze what do you mean by 'with image' or 'without image'? What parameters did you set?

GiteaMirror commented

2026-04-12 17:48:05 -05:00

@xxvvii commented on GitHub (Mar 13, 2025):

gemma3:27b failed to run on my MBP M2 Max 32GB
Got "Error: llama runner process has terminated: signal: killed"

ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-icelake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-haswell.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-alderlake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-sandybridge.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-skylakex.so
time=2025-03-13T13:25:41.116+08:00 level=INFO source=ggml.go:109 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-03-13T13:25:41.225+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=Metal size="16.2 GiB"
time=2025-03-13T13:25:41.225+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-03-13T13:25:41.302+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
time=2025-03-13T13:25:51.743+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding"

@xxvvii commented on GitHub (Mar 13, 2025): gemma3:27b failed to run on my MBP M2 Max 32GB Got **"Error: llama runner process has terminated: signal: killed"** ```log ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-icelake.so ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-haswell.so ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-alderlake.so ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-sandybridge.so ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-skylakex.so time=2025-03-13T13:25:41.116+08:00 level=INFO source=ggml.go:109 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang) time=2025-03-13T13:25:41.225+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=Metal size="16.2 GiB" time=2025-03-13T13:25:41.225+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB" time=2025-03-13T13:25:41.302+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model" time=2025-03-13T13:25:51.743+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding" ```

GiteaMirror commented

2026-04-12 17:48:05 -05:00

@dongshimou commented on GitHub (Mar 13, 2025):

time=2025-03-13T07:20:20.303Z level=INFO source=server.go:624 msg="llama runner started in 117.82 seconds"
time=2025-03-13T07:20:20.581Z level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-a0ed380f-31b3-70ae-ef00-f2a86b00f0bc library=cuda total="23.9 GiB" available="2.4 GiB"
time=2025-03-13T07:20:20.582Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-13T07:20:20.582Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-13T07:20:20.582Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-13T07:20:20.583Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-13T07:20:20.583Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-13T07:20:20.583Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-13T07:20:20.584Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-13T07:20:20.584Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128

I can only restart ollama.

docker version : ollama/ollama latest b9162cd6df73 31 hours ago

@dongshimou commented on GitHub (Mar 13, 2025): ``` time=2025-03-13T07:20:20.303Z level=INFO source=server.go:624 msg="llama runner started in 117.82 seconds" time=2025-03-13T07:20:20.581Z level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-a0ed380f-31b3-70ae-ef00-f2a86b00f0bc library=cuda total="23.9 GiB" available="2.4 GiB" time=2025-03-13T07:20:20.582Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128 time=2025-03-13T07:20:20.582Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128 time=2025-03-13T07:20:20.582Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128 time=2025-03-13T07:20:20.583Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128 time=2025-03-13T07:20:20.583Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128 time=2025-03-13T07:20:20.583Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128 time=2025-03-13T07:20:20.584Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128 time=2025-03-13T07:20:20.584Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128 ``` I can only restart ollama. docker version : ollama/ollama latest b9162cd6df73 31 hours ago

GiteaMirror commented

2026-04-12 17:48:06 -05:00

@nigelks commented on GitHub (Mar 13, 2025):

"Error loading llama server" error="llama runner process has terminated: signal: killed"

Gemma3:27B Q4_K_M fails to run on 2x RTX 3080. From the logs: memory.required.full="20.5 GiB", the model won't fit entirely in VRAM but should should use some system memory. However, this causes OOM and my whole system will freeze.

Running Gemma3:12b occasionally freezes my system, it is using VRAM from both cards and 95% of my system RAM (around 15GB out of total 32GB). Ollama ps shows 6%CPU/94%GPU. Did not modify any parameters including context length, the model was freshly pulled from the ollama model registry.

I have tried OLLAMA_FLASH_ATTENTION: 0, OLLAMA_KV_CACHE_TYPE: "f16", GGML_CUDA_ENABLE_UNIFIED_MEMORY: 1, all to no avail.

Running on the lastest version of Ollama:

root@e2008d6391f3:/# ollama -v
ollama version is 0.6.0

Logs:

time=2025-03-13T07:31:15.153Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=58 layers.split=29,29 memory.available="[9.4 GiB 9.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.5 GiB" memory.required.partial="18.4 GiB" memory.required.kv="496.0 MiB" memory.required.allocations="[9.2 GiB 9.2 GiB]" memory.weights.total="14.8 GiB" memory.weights.repeating="13.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.6 GiB"
time=2025-03-13T07:31:15.153Z level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-13T07:31:15.262Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-13T07:31:15.267Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-13T07:31:15.269Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-13T07:31:15.276Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 2048 --batch-size 512 --n-gpu-layers 58 --threads 12 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 29,29 --port 40847"
time=2025-03-13T07:31:15.277Z level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-13T07:31:15.277Z level=INFO source=server.go:585 msg="waiting for llama runner to start responding"
time=2025-03-13T07:31:15.277Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-13T07:31:15.290Z level=INFO source=runner.go:882 msg="starting ollama engine"
time=2025-03-13T07:31:15.294Z level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:40847"
time=2025-03-13T07:31:15.388Z level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-03-13T07:31:15.388Z level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-03-13T07:31:15.388Z level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
time=2025-03-13T07:31:15.529Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-03-13T07:31:15.909Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-03-13T07:31:16.014Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="6.6 GiB"
time=2025-03-13T07:31:16.014Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA1 size="6.7 GiB"
time=2025-03-13T07:31:16.014Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="3.9 GiB"
time=2025-03-13T07:31:23.576Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-13T07:31:29.202Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-13T07:31:29.452Z level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: killed"
time=2025-03-13T07:31:29.452Z level=WARN source=server.go:505 msg="llama runner process no longer running" sys=9 string="signal: killed"
time=2025-03-13T07:31:34.487Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.032135713 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:31:34.779Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.324459222 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:31:35.069Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.613885742 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:31:37.926Z level=INFO source=server.go:105 msg="system memory" total="31.0 GiB" free="17.8 GiB" free_swap="0 B"
time=2025-03-13T07:31:38.216Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=58 layers.split=29,29 memory.available="[9.4 GiB 9.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.5 GiB" memory.required.partial="18.4 GiB" memory.required.kv="496.0 MiB" memory.required.allocations="[9.2 GiB 9.2 GiB]" memory.weights.total="14.8 GiB" memory.weights.repeating="13.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.6 GiB"
time=2025-03-13T07:31:38.216Z level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-13T07:31:38.314Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-13T07:31:38.316Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-13T07:31:38.318Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-13T07:31:38.323Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 2048 --batch-size 512 --n-gpu-layers 58 --threads 12 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 29,29 --port 42935"
time=2025-03-13T07:31:38.324Z level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-13T07:31:38.324Z level=INFO source=server.go:585 msg="waiting for llama runner to start responding"
time=2025-03-13T07:31:38.324Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-13T07:31:38.341Z level=INFO source=runner.go:882 msg="starting ollama engine"
time=2025-03-13T07:31:38.344Z level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:42935"
time=2025-03-13T07:31:38.451Z level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-03-13T07:31:38.451Z level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-03-13T07:31:38.451Z level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
time=2025-03-13T07:31:38.576Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-03-13T07:31:38.981Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-03-13T07:31:39.088Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="3.9 GiB"
time=2025-03-13T07:31:39.088Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="6.6 GiB"
time=2025-03-13T07:31:39.088Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA1 size="6.7 GiB"
time=2025-03-13T07:31:47.176Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-13T07:31:47.749Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
time=2025-03-13T07:36:39.166Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-13T07:37:15.998Z level=ERROR source=sched.go:456 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.00 - "
time=2025-03-13T07:37:35.422Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=18.005152578 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:37:36.773Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=19.3454669 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:38:14.929Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=57.529535337 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541

@nigelks commented on GitHub (Mar 13, 2025): **"Error loading llama server" error="llama runner process has terminated: signal: killed"** Gemma3:27B Q4_K_M fails to run on 2x RTX 3080. From the logs: `memory.required.full="20.5 GiB"`, the model won't fit entirely in VRAM but should should use some system memory. However, this causes OOM and my whole system will freeze. Running Gemma3:12b occasionally freezes my system, it is using VRAM from both cards and 95% of my system RAM (around 15GB out of total 32GB). `Ollama ps` shows 6%CPU/94%GPU. Did not modify any parameters including context length, the model was freshly pulled from the ollama model registry. I have tried `OLLAMA_FLASH_ATTENTION: 0`, `OLLAMA_KV_CACHE_TYPE: "f16"`, `GGML_CUDA_ENABLE_UNIFIED_MEMORY: 1`, all to no avail. Running on the lastest version of Ollama: ``` root@e2008d6391f3:/# ollama -v ollama version is 0.6.0 ``` Logs: ``` time=2025-03-13T07:31:15.153Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=58 layers.split=29,29 memory.available="[9.4 GiB 9.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.5 GiB" memory.required.partial="18.4 GiB" memory.required.kv="496.0 MiB" memory.required.allocations="[9.2 GiB 9.2 GiB]" memory.weights.total="14.8 GiB" memory.weights.repeating="13.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.6 GiB" time=2025-03-13T07:31:15.153Z level=INFO source=server.go:185 msg="enabling flash attention" time=2025-03-13T07:31:15.262Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-13T07:31:15.267Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-03-13T07:31:15.269Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30 time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-03-13T07:31:15.276Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 2048 --batch-size 512 --n-gpu-layers 58 --threads 12 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 29,29 --port 40847" time=2025-03-13T07:31:15.277Z level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2025-03-13T07:31:15.277Z level=INFO source=server.go:585 msg="waiting for llama runner to start responding" time=2025-03-13T07:31:15.277Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error" time=2025-03-13T07:31:15.290Z level=INFO source=runner.go:882 msg="starting ollama engine" time=2025-03-13T07:31:15.294Z level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:40847" time=2025-03-13T07:31:15.388Z level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" time=2025-03-13T07:31:15.388Z level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" time=2025-03-13T07:31:15.388Z level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36 time=2025-03-13T07:31:15.529Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model" ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so time=2025-03-13T07:31:15.909Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2025-03-13T07:31:16.014Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="6.6 GiB" time=2025-03-13T07:31:16.014Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA1 size="6.7 GiB" time=2025-03-13T07:31:16.014Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="3.9 GiB" time=2025-03-13T07:31:23.576Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding" time=2025-03-13T07:31:29.202Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error" time=2025-03-13T07:31:29.452Z level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: killed" time=2025-03-13T07:31:29.452Z level=WARN source=server.go:505 msg="llama runner process no longer running" sys=9 string="signal: killed" time=2025-03-13T07:31:34.487Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.032135713 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 time=2025-03-13T07:31:34.779Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.324459222 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 time=2025-03-13T07:31:35.069Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.613885742 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 time=2025-03-13T07:31:37.926Z level=INFO source=server.go:105 msg="system memory" total="31.0 GiB" free="17.8 GiB" free_swap="0 B" time=2025-03-13T07:31:38.216Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=58 layers.split=29,29 memory.available="[9.4 GiB 9.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.5 GiB" memory.required.partial="18.4 GiB" memory.required.kv="496.0 MiB" memory.required.allocations="[9.2 GiB 9.2 GiB]" memory.weights.total="14.8 GiB" memory.weights.repeating="13.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.6 GiB" time=2025-03-13T07:31:38.216Z level=INFO source=server.go:185 msg="enabling flash attention" time=2025-03-13T07:31:38.314Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-13T07:31:38.316Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false time=2025-03-13T07:31:38.318Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000 time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1 time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30 time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256 time=2025-03-13T07:31:38.323Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 2048 --batch-size 512 --n-gpu-layers 58 --threads 12 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 29,29 --port 42935" time=2025-03-13T07:31:38.324Z level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2025-03-13T07:31:38.324Z level=INFO source=server.go:585 msg="waiting for llama runner to start responding" time=2025-03-13T07:31:38.324Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error" time=2025-03-13T07:31:38.341Z level=INFO source=runner.go:882 msg="starting ollama engine" time=2025-03-13T07:31:38.344Z level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:42935" time=2025-03-13T07:31:38.451Z level=WARN source=ggml.go:149 msg="key not found" key=general.name default="" time=2025-03-13T07:31:38.451Z level=WARN source=ggml.go:149 msg="key not found" key=general.description default="" time=2025-03-13T07:31:38.451Z level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36 time=2025-03-13T07:31:38.576Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model" ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so time=2025-03-13T07:31:38.981Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2025-03-13T07:31:39.088Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="3.9 GiB" time=2025-03-13T07:31:39.088Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="6.6 GiB" time=2025-03-13T07:31:39.088Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA1 size="6.7 GiB" time=2025-03-13T07:31:47.176Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding" time=2025-03-13T07:31:47.749Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model" time=2025-03-13T07:36:39.166Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding" time=2025-03-13T07:37:15.998Z level=ERROR source=sched.go:456 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.00 - " time=2025-03-13T07:37:35.422Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=18.005152578 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 time=2025-03-13T07:37:36.773Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=19.3454669 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 time=2025-03-13T07:38:14.929Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=57.529535337 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 ```

GiteaMirror commented

2026-04-12 17:48:06 -05:00

@FelikZ commented on GitHub (Mar 13, 2025):

Just curious, did you use from terminal or over API?

@sapphirepro over terminal it is indeed better but still, 12b model consumes 21Gb while from API it consumes 29Gb (I have 32Gb m1) which is I am not expect for that model size.

For comparison, 32B DeepSeek Distilled model consumes ~30GB and do not crash or anything.

@FelikZ commented on GitHub (Mar 13, 2025): > Just curious, did you use from terminal or over API? @sapphirepro over terminal it is indeed better but still, 12b model consumes 21Gb while from API it consumes 29Gb (I have 32Gb m1) which is I am not expect for that model size. For comparison, 32B DeepSeek Distilled model consumes ~30GB and do not crash or anything.

GiteaMirror commented

2026-04-12 17:48:07 -05:00

@sapphirepro commented on GitHub (Mar 13, 2025):

Just curious, did you use from terminal or over API?

@sapphirepro over terminal it is indeed better but still, 12b model consumes 21Gb while from API it consumes 29Gb (I have 32Gb m1) which is I am not expect for that model size.

For comparison, 32B DeepSeek Distilled model consumes ~30GB and do not crash or anything.

Well for me main problem is not memory usage as such, but vram filled till 100% used. Previous versions used only 90% leaving approx 1GB free of vram. Here problem is it stalls any gui stuff at all, total freeze of system visual

@sapphirepro commented on GitHub (Mar 13, 2025): > > Just curious, did you use from terminal or over API? > > [@sapphirepro](https://github.com/sapphirepro) over terminal it is indeed better but still, 12b model consumes 21Gb while from API it consumes 29Gb (I have 32Gb m1) which is I am not expect for that model size. > > For comparison, 32B DeepSeek Distilled model consumes ~30GB and do not crash or anything. Well for me main problem is not memory usage as such, but vram filled till 100% used. Previous versions used only 90% leaving approx 1GB free of vram. Here problem is it stalls any gui stuff at all, total freeze of system visual

GiteaMirror commented

2026-04-12 17:48:07 -05:00

@zeroward commented on GitHub (Mar 13, 2025):

Running into the same issue.

gemma3:27b-q4_k_m uses roughly 21098 MiB of VRAM (from a total of 24GB on one card and 6GB on another) but also uses 41.2% of my systems RAM.

gemma3:12b-it-q8_0 uses roughly 16420 MiB of VRAM, but also roughly 39.1% of my RAM.

gemma3:12b-q4_k_m uses roughly 12408 MiB of VRAM but also roughly 39.1% of my RAM.

Comparatively, mistral-small:24b-instruct-2501-q8_0 uses a combined 25594 MiB split across my two cards and no RAM at all.

I'm running with flash attention enabled.

@zeroward commented on GitHub (Mar 13, 2025): Running into the same issue. gemma3:27b-q4_k_m uses roughly 21098 MiB of VRAM (from a total of 24GB on one card and 6GB on another) but also uses 41.2% of my systems RAM. gemma3:12b-it-q8_0 uses roughly 16420 MiB of VRAM, but also roughly 39.1% of my RAM. gemma3:12b-q4_k_m uses roughly 12408 MiB of VRAM but also roughly 39.1% of my RAM. Comparatively, mistral-small:24b-instruct-2501-q8_0 uses a combined 25594 MiB split across my two cards and no RAM at all. I'm running with flash attention enabled.

GiteaMirror commented

2026-04-12 17:48:07 -05:00

@sapphirepro commented on GitHub (Mar 14, 2025):

It is a bit annoying that my topic was just simply closed "as duplicate", while it's not. I run into specific issue.

Nvidia GPU P5000 mobile. 16 GB VRAM. Fatal issue is using WHOLE vram and at start almost whole ram too. Fatal is it's NOT allowed to use whole vram as it blocks system leaving totally nothing to another processes left. All other models exceeding GPU vram available leave 1GB of video ram free, while gemma3:27B doesn't.

Also background, it crashes without this envirenmental flag GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

And video memory usage shown on screenshots not normal. Need somehow to enforce ollama denial of using over 90% of vram, ok maybe 93% as maximum.

This is 0.6.1-rc0 tested. behavior 1:1 same as 0.6.0

@sapphirepro commented on GitHub (Mar 14, 2025): It is a bit annoying that my topic was just simply closed "as duplicate", while it's not. I run into specific issue. Nvidia GPU P5000 mobile. 16 GB VRAM. Fatal issue is using WHOLE vram and at start almost whole ram too. Fatal is it's NOT allowed to use whole vram as it blocks system leaving totally nothing to another processes left. All other models exceeding GPU vram available leave 1GB of video ram free, while gemma3:27B doesn't. Also background, it crashes without this envirenmental flag ```GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ``` ![Image](https://github.com/user-attachments/assets/061153e2-364c-480b-b883-c1f10600ba2d) ![Image](https://github.com/user-attachments/assets/a391d704-a68a-4fe2-8d8a-181b04e80525) And video memory usage shown on screenshots not normal. Need somehow to enforce ollama denial of using over 90% of vram, ok maybe 93% as maximum. This is 0.6.1-rc0 tested. behavior 1:1 same as 0.6.0

GiteaMirror commented

2026-04-12 17:48:07 -05:00

@sapphirepro commented on GitHub (Mar 14, 2025):

And in addition to message above this is how looks DeepSeek-R1:32B. Look and video ram usage. It uses extra normal ram, but keeps 9% of vram free which is normal to keep all apps operational. So gemma model is total disaster in setting VRAM/RAM usage

@sapphirepro commented on GitHub (Mar 14, 2025): And in addition to message above this is how looks DeepSeek-R1:32B. Look and video ram usage. It uses extra normal ram, but keeps 9% of vram free which is normal to keep all apps operational. So gemma model is total disaster in setting VRAM/RAM usage ![Image](https://github.com/user-attachments/assets/3e194cd0-5109-445d-a29d-c630bab20c12)

GiteaMirror commented

2026-04-12 17:48:08 -05:00

@huankumo commented on GitHub (Mar 15, 2025):

Hi. I have the same issue on Nvidia GeForce RTX 4090 + cuda driver version 12.3 + ollama version is 0.6.0 and noticed that when invoking gemma:27b model through ollama server there's a subtle difference with the e.g. llama3.1 model invocation:

gemma model run with --ollama-engine flag and don't know if it has anything with the issue or not.

@huankumo commented on GitHub (Mar 15, 2025): Hi. I have the same issue on *Nvidia GeForce RTX 4090* + *cuda driver version 12.3* + *ollama version is 0.6.0* and noticed that when invoking gemma:27b model through ollama server there's a subtle difference with the e.g. llama3.1 model invocation: ![Image](https://github.com/user-attachments/assets/57aef10e-775a-4929-a6e6-e9e1dd02b506) gemma model run with --ollama-engine flag and don't know if it has anything with the issue or not.

GiteaMirror commented

2026-04-12 17:48:08 -05:00

@mehditahmasebi commented on GitHub (Mar 15, 2025):

Not fixed yet in ollama version 0.6.1
ollama run gemma3:27b-it-q8_0
Error: Post "http://127.0.0.1:11434/api/generate": EOF

I have overall 64gb RTX vram but it using my RAM not VRAM why?

@mehditahmasebi commented on GitHub (Mar 15, 2025): Not fixed yet in ollama version 0.6.1 ollama run gemma3:27b-it-q8_0 Error: Post "http://127.0.0.1:11434/api/generate": EOF I have overall 64gb RTX vram but it using my RAM not VRAM why? <img width="633" alt="Image" src="https://github.com/user-attachments/assets/9baafb00-2717-472d-9c66-ab6a6be97568" /> <img width="1499" alt="Image" src="https://github.com/user-attachments/assets/777fd88b-130b-4add-8953-4e3f67212c4d" />

GiteaMirror commented

2026-04-12 17:48:09 -05:00

@ALLMI78 commented on GitHub (Mar 16, 2025):

same here https://github.com/ollama/ollama/issues/9730 (0.6.1) (ctx @ 32k)

q4_k_m Gemma-3-12b and OLLAMA_KV_CACHE_TYPE f16 | 24 GB VRAM + around 10-12 RM usage -> gemma runs on GPU and answers but is very slow <<< problem high VRAM and RAM usage

q4_k_m Gemma-3-12b and OLLAMA_KV_CACHE_TYPE q8_0 | less VRAM usage <<< problem runs on CPU now and GPU load goes down to 20-30 % = slow

@ALLMI78 commented on GitHub (Mar 16, 2025): same here https://github.com/ollama/ollama/issues/9730 (0.6.1) (ctx @ 32k) > q4_k_m Gemma-3-12b and OLLAMA_KV_CACHE_TYPE f16 | 24 GB VRAM + around 10-12 RM usage -> gemma runs on GPU and answers but is very slow <<< problem high VRAM and RAM usage > q4_k_m Gemma-3-12b and OLLAMA_KV_CACHE_TYPE q8_0 | less VRAM usage <<< problem runs on CPU now and GPU load goes down to 20-30 % = slow

GiteaMirror commented

2026-04-12 17:48:09 -05:00

@nickcwilkins commented on GitHub (Mar 24, 2025):

I just installed 0.6.3rc0 and it seems to be using less memory, but splits 50%/50% between cpu and gpu. I'm seeing that I have 12 gigs of vram that aren't being used. This is with an RTX 3090 and Gemma 27B

@nickcwilkins commented on GitHub (Mar 24, 2025): I just installed 0.6.3rc0 and it seems to be using less memory, but splits 50%/50% between cpu and gpu. I'm seeing that I have 12 gigs of vram that aren't being used. This is with an RTX 3090 and Gemma 27B

GiteaMirror commented

2026-04-12 17:48:09 -05:00

@aablsk commented on GitHub (Apr 23, 2025):

For me this is fixed when using the new quantization-aware trained (QAT) models (27B, 12B). These are first party quants from Google that aren't quantized post-training.

Quoting from Google's blog-post.

Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.

There also seem to have been changes that reduce this issue for Q4_K_M on my 4090, which previously crashed with any context sizes larger than ~4k. Q4_K_M is still using more VRAM and RAM than the QAT version.

@aablsk commented on GitHub (Apr 23, 2025): For me this is fixed when using the new quantization-aware trained (QAT) models ([27B](https://ollama.com/library/gemma3:27b-it-qat), [12B](https://ollama.com/library/gemma3:12b-it-qat)). These are first party quants from Google that aren't quantized post-training. Quoting from [Google's blog-post](https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/). > Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0. There also seem to have been changes that reduce this issue for Q4_K_M on my 4090, which previously crashed with any context sizes larger than ~4k. Q4_K_M is still using more VRAM and RAM than the QAT version.

GiteaMirror commented

2026-04-12 17:48:10 -05:00

@sapphirepro commented on GitHub (Apr 24, 2025):

Q4_K_M is still using more VRAM and RAM than the QAT version.

What is QAT version? Where to get that?

@sapphirepro commented on GitHub (Apr 24, 2025): > Q4_K_M is still using more VRAM and RAM than the QAT version. What is QAT version? Where to get that?

GiteaMirror commented

2026-04-12 17:48:10 -05:00

@aablsk commented on GitHub (Apr 24, 2025):

@sapphirepro I've updated my previous comment with more details and links 👍

@aablsk commented on GitHub (Apr 24, 2025): @sapphirepro I've updated my previous comment with more details and links 👍

GiteaMirror commented

2026-04-12 17:48:11 -05:00

@sapphirepro commented on GitHub (Apr 24, 2025):

After many tests I figure one definite minus and probably a bug of ollama. It's totally unable to use nvidia_uvm correctly. I read a lot docs already, and it should work by swapping memory block to and from gpu vram over cuda toolkit. Instead ollama does parallel compute on both which is slow and unproductive. Hopefully will get time in next few months to fork ollama and fix that to work properly without using cpu as compute unit.

For example. If to set num_gpu above gpu vram capacity it just crashed, which should swap blocks to ram but not let cpu compute it, but make cuda side controlled to swap memory blocks on demand.

@sapphirepro commented on GitHub (Apr 24, 2025): After many tests I figure one definite minus and probably a bug of ollama. It's totally unable to use nvidia_uvm correctly. I read a lot docs already, and it should work by swapping memory block to and from gpu vram over cuda toolkit. Instead ollama does parallel compute on both which is slow and unproductive. Hopefully will get time in next few months to fork ollama and fix that to work properly without using cpu as compute unit. For example. If to set num_gpu above gpu vram capacity it just crashed, which should swap blocks to ram but not let cpu compute it, but make cuda side controlled to swap memory blocks on demand.

GiteaMirror referenced this issue

2026-04-22 08:53:33 -05:00

[GH-ISSUE #6315] Sharing computing power in a decentralized P2P network #29722

GiteaMirror referenced this issue

2026-04-28 15:58:52 -05:00

[GH-ISSUE #6315] Sharing computing power in a decentralized P2P network #50473

GiteaMirror referenced this issue

2026-05-03 23:30:39 -05:00

[GH-ISSUE #6315] Sharing computing power in a decentralized P2P network #65999

GiteaMirror referenced this issue

2026-05-09 11:48:29 -05:00

[GH-ISSUE #6315] Sharing computing power in a decentralized P2P network #81642

Sign in to join this conversation.

Branches Tags

main

hoyyeva/fix-claude-channels-env

parth-update-hermes-launch

hoyyeva/vscode-extension-docs-update

parth-gemma4-chat-template-renderer

parth-api-status-context-length

hoyyeva/wire-up-context-length

hoyyeva/claude-code-context-doc

jmorganca/investigate-issue-17046

hoyyeva/hermes-docs

jmorganca/agent-loop-style

hoyyeva/openclaw

parth-agent-loop

hoyyeva/ollama-vscode-extension

brucemacd/cache-metrics

brucemacd/hermes-desktop

hoyyeva/docs-vscode

parth-input-style-experiment

brucemacd/docs-glm52

hoyyeva/poc-docs

Parth/mlx-launch-recommendations

parth-first-time-app-cli-experience

test/darwin-xcode-pin

improve-cloud-model-recommendations

hoyyeva/goose-docs

jmorganca/context-limit-fixes

hoyyeva/qwen-doc

hoyyeva/vscode-docs

jmorganca/remove-mlx-imagegen-code

parth-copilot-token-length-defaults

hoyyeva/poolside-windows

laguna-support

jmorganca/harden-markdown-rendering

laguna-renderer-parser

laguna-llamacpp

codex/make-integration-hidden-and-lunchable

brucemacd/omp-docs

pdevine/gguf-mtp-oldstyle

hoyyeva/migrate-pi

hoyyeva/anthropic-local-image-path

parth-launch-codex-app

hoyyeva/anthropic-reference-images-path

parth-anthropic-reference-images-path

brucemacd/download-before-remove

hoyyeva/editor-config-repair

parth-mlx-decode-checkpoints

parth/hide-claude-desktop-till-release

parth-add-claude-code-autoinstall

release_v0.22.0

pdevine/manifest-list

codex/fix-codex-model-metadata-warning

pdevine/addressable-manifest

brucemacd/launch-fetch-reccomended

jmorganca/llama-compat

launch-copilot-cli

release_v0.20.7

parth-auto-save-backup

parth-test

jmorganca/gemma4-audio-replacements

fix-manifest-digest-on-pull

hoyyeva/vscode-improve

brucemacd/install-server-wait

parth/update-claude-docs

brucemac/start-ap-install

pdevine/mlx-update

pdevine/qwen35_vision

drifkin/api-show-fallback

mintlify/image-generation-1773352582

hoyyeva/server-context-length-local-config

jmorganca/faster-reptition-penalties

jmorganca/convert-nemotron

parth-pi-thinking

pdevine/sampling-penalties

jmorganca/fix-create-quantization-memory

dongchen/resumable_transfer_fix

pdevine/sampling-cache-error

jessegross/mlx-usage

hoyyeva/openclaw-config

hoyyeva/app-html

pdevine/qwen3next

brucemacd/sign-sh-install

brucemacd/tui-update

brucemacd/usage-api

jmorganca/launch-empty

fix-app-dist-embed

mxyng/mlx-compile

mxyng/mlx-quant

mxyng/mlx-glm4.7

mxyng/mlx

brucemacd/simplify-model-picker

jmorganca/qwen3-concurrent

fix-glm-4.7-flash-mla-config

drifkin/qwen3-coder-opening-tag

brucemacd/usage-cli

fix-cuda12-fattn-shmem

ollama-imagegen-docs

parth/fix-multiline-inputs

brucemacd/config-docs

mxyng/model-files

mxyng/simple-execute

fix-imagegen-ollama-models

mxyng/async-upload

jmorganca/lazy-no-dtype-changes

imagegen-auto-detect-create

parth/decrease-concurrent-download-hf

fix-mlx-quantize-init

jmorganca/x-cleanup

usage

imagegen-readme

jmorganca/glm-image

mlx-gpu-cd

jmorganca/imagegen-modelfile

parth/agent-skills

parth/agent-allowlist

parth/signed-in-offline

parth/agents

parth/fix-context-chopping

improve-cloud-flow

parth/add-models-websearch

parth/prompt-renderer-mcp

jmorganca/native-settings

jmorganca/download-stream-hash

jmorganca/client2-rebased

brucemacd/oai-chat-req-multipart

jessegross/multi_chunk_reserve

grace/additional-omit-empty

grace/mistral-3-large

mxyng/tokenizer2

mxyng/tokenizer

jessegross/flash

hoyyeva/windows-nacked-app

mxyng/cleanup-attention

grace/deepseek-parser

hoyyeva/remember-unsent-prompt

parth/add-lfs-pointer-error-conversion

parth/olmo2-test2

hoyyeva/ollama-launchagent-plist

nicole/olmo-model

parth/olmo-test

mxyng/remove-embedded

parth/render-template

jmorganca/intellect-3

parth/remove-prealloc-linter

jmorganca/cmd-eval

nicole/nomic-embed-text-fix

mxyng/lint-2

hoyyeva/add-gemini-3-pro-preview

hoyyeva/load-model-list

mxyng/expand-path

mxyng/environ-2

hoyyeva/deeplink-json-encoding

parth/improve-tool-calling-tests

hoyyeva/conversation

hoyyeva/assistant-edit-response

hoyyeva/thinking

origin/brucemacd/invalid-char-i-err

parth/improve-tool-calling

jmorganca/required-omitempty

grace/qwen3-vl-tests

mxyng/iter-client

parth/docs-readme

nicole/embed-test

pdevine/integration-benchstat

parth/remove-generate-cmd

parth/add-toolcall-id

mxyng/server-tests

jmorganca/glm-4.6

jmorganca/gin-h-compat

drifkin/stable-tool-args

pdevine/qwen3-more-thinking

parth/add-websearch-client

nicole/websearch_local

jmorganca/qwen3-coder-updates

grace/deepseek-v3-migration-tests

mxyng/fix-create

jmorganca/cloud-errors

pdevine/parser-tidy

revert-12233-parth/simplify-entrypoints-runner

parth/enable-so-gpt-oss

brucemacd/qwen3vl

jmorganca/readme-simplify

parth/gpt-oss-structured-outputs

revert-12039-jmorganca/tools-braces

mxyng/embeddings

mxyng/gguf

mxyng/benchmark

mxyng/types-null

parth/move-parsing

mxyng/gemma2

jmorganca/docs

mxyng/16-bit

mxyng/create-stdin

pdevine/authorizedkeys

mxyng/quant

parth/opt-in-error-context-window

brucemacd/cache-models

brucemacd/runner-completion

jmorganca/llama-update-6

brucemacd/benchmark-list

brucemacd/partial-read-caps

parth/deepseek-r1-tools

mxyng/omit-array

parth/tool-prefix-temp

brucemacd/runner-test

jmorganca/qwen25vl

brucemacd/model-forward-test-ext

parth/python-function-parsing

jmorganca/cuda-compression-none

drifkin/num-parallel

drifkin/chat-truncation-fix

jmorganca/sync

parth/python-tools-calling

drifkin/array-head-count

brucemacd/create-no-loop

parth/server-enable-content-stream-with-tools

qwen25omni

mxyng/v3

brucemacd/ropeconfig

jmorganca/silence-tokenizer

parth/sample-so-test

parth/sampling-structured-outputs

brucemacd/doc-go-engine

parth/constrained-sampling-json

jmorganca/mistral-wip

brucemacd/mistral-small-convert

parth/sample-unmarshal-json-for-params

brucemacd/jomorganca/mistral

pdevine/bfloat16

jmorganca/mistral

brucemacd/mistral

pdevine/logging

parth/sample-correctness-fix

parth/sample-fix-sorting

jmorgan/sample-fix-sorting-extras

jmorganca/temp-0-images

brucemacd/parallel-embed-models

brucemacd/shim-grammar

jmorganca/fix-gguf-error

bmizerany/nameswork

jmorganca/faster-releases

bmizerany/validatenames

brucemacd/err-no-vocab

brucemacd/rope-config

brucemacd/err-hint

brucemacd/qwen2_5

brucemacd/logprobs

brucemacd/new_runner_graph_bench

progress-flicker

brucemacd/forward-test

brucemacd/go_qwen2

pdevine/gemma2

jmorganca/add-missing-symlink-eval

mxyng/next-debug

parth/set-context-size-openai

brucemacd/next-bpe-bench

brucemacd/next-bpe-test

brucemacd/new_runner_e2e

brucemacd/new_runner_qwen2

pdevine/convert-cohere2

brucemacd/convert-cli

parth/log-probs

mxyng/next-mlx

mxyng/cmd-history

parth/templating

parth/tokenize-detokenize

brucemacd/check-key-register

bmizerany/grammar

jmorganca/vendor-081b29bd

mxyng/func-checks

jmorganca/fix-null-format

parth/fix-default-to-warn-json

jmorganca/qwen2vl

jmorganca/no-concat

parth/cmd-cleanup-SO

brucemacd/check-key-register-structured-err

parth/openai-stream-usage

parth/fix-referencing-so

stream-tools-stop

jmorganca/degin-1

brucemacd/install-path-clean

brucemacd/push-name-validation

brucemacd/browser-key-register

jmorganca/openai-fix-first-message

jmorganca/fix-proxy

jessegross/sample

parth/disallow-streaming-tools

dhiltgen/remove_submodule

jmorganca/ga

jmorganca/mllama

pdevine/newlines

pdevine/geems-2b

jmorganca/llama-bump

mxyng/modelname-7

mxyng/gin-slog

mxyng/modelname-6

jyan/convert-prog

jyan/quant5

paligemma-support

pdevine/import-docs

jmorganca/openai-context

jyan/paligemma

jyan/p2

jyan/palitest

bmizerany/embedspeedup

jmorganca/llama-vit

brucemacd/allow-ollama

royh/ep-methods

royh/whisper

mxyng/api-models

mxyng/fix-memory

jyan/q4_4/8

jyan/ollama-v

royh/stream-tools

roy-embed-parallel

bmizerany/hrm

revert-5963-revert-5924-mxyng/llama3.1-rope

royh/embed-viz

jyan/local2

jyan/auth

jyan/local

jyan/parse-temp

jmorganca/template-mistral

jyan/reord-g

royh-openai-suffixdocs

royh-imgembed

royh-embed-parallel

jyan/quant4

royh-precision

jyan/progress

pdevine/fix-template

jyan/quant3

pdevine/ggla

mxyng/update-registry-domain

jmorganca/ggml-static

mxyng/create-context

jyan/v0.146

mxyng/layers-from-files

build_dist

bmizerany/noseek

royh-ls

royh-name

timeout

mxyng/server-timestamp

bmizerany/nosillyggufslurps

royh-params

jmorganca/llama-cpp-7c26775

royh-openai-delete

royh-show-rigid

jmorganca/enable-fa

jmorganca/no-error-template

jyan/format

royh-testdelete

bmizerany/fastverify

language_support

pdevine/ps-glitches

brucemacd/tokenize

bruce/iq-quants

bmizerany/filepathwithcoloninhost

mxyng/split-bin

bmizerany/client-registry

jmorganca/if-none-match

native

jmorganca/native

jmorganca/batch-embeddings

jmorganca/initcmake

jmorganca/mm

pdevine/showggmlinfo

modenameenforcealphanum

bmizerany/modenameenforcealphanum

jmorganca/done-reason

jmorganca/llama-cpp-8960fe8

ollama.com

bmizerany/filepathnobuild

bmizerany/types/model/defaultfix

rmdisplaylong

nogogen

bmizerany/x

modelfile-readme

bmizerany/replacecolon

jmorganca/limit

jmorganca/execstack

jmorganca/replace-assets

mxyng/tune-concurrency

jmorganca/testing

whitespace-detection

jmorganca/options

upgrade-all

scratch

cuda-search

mattw/airenamer

mattw/allmodelsonhuggingface

mattw/quantcontext

mattw/whatneedstorun

brucemacd/llama-mem-calc

mattw/faq-context

mattw/communitylinks

mattw/noprune

mattw/python-functioncalling

rename

mxyng/install

pulse

remove-first

editor

mattw/selfqueryingretrieval

cgo

mattw/howtoquant

api

matt/streamingapi

format-config

mxyng/extra-args

shell

update-nous-hermes

cp-model

upload-progress

fix-unknown-model

fix-model-names

delete-fix

insecure-registry

ls

deletemodels

progressbar

readme-updates

license-layers

skip-list

list-models

modelpath

matt/examplemodelfiles

distribution

go-opts

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: github-starred/ollama#6315