[GH-ISSUE #9678] Unusually high VRAM usage of Gemma 3 27B #6315

Open
opened 2026-04-12 17:47:46 -05:00 by GiteaMirror · 28 comments
Owner

Originally created by @vYLQs6 on GitHub (Mar 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9678

What is the issue?

I'm using Gemma 3 27B Q4KM: https://www.ollama.com/library/gemma3:27b

GPU: 4090

set OLLAMA_FLASH_ATTENTION=1 && set OLLAMA_KV_CACHE_TYPE=q8_0 && ollama serve

When using Gemma 3 27B with a context length of 20,000 (20k), I only have about 1 GB of VRAM left.

However, when using Qwen2.5 32B IQ4XS, which is basically the same size as Gemma 3 27B Q4KM, with a full 32K context, I still have 2 GB of VRAM left.

Is this a bug, or is Gemma 3's context cache just less efficient?

Relevant log output



time=2025-03-12T17:03:54.752+08:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-f47e9117-13d8-d21e-7b80-735c8d31444d library=cuda total="24.0 GiB" available="17.9 GiB"
time=2025-03-12T17:03:55.250+08:00 level=INFO source=server.go:105 msg="system memory" total="63.6 GiB" free="53.6 GiB" free_swap="107.2 GiB"
time=2025-03-12T17:03:55.265+08:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=256 layers.model=63 layers.offload=62 layers.split="" memory.available="[22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.5 GiB" memory.required.partial="21.4 GiB" memory.required.kv="4.7 GiB" memory.required.allocations="[21.4 GiB]" memory.weights.total="19.1 GiB" memory.weights.repeating="18.0 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.6 GiB"
time=2025-03-12T17:03:55.265+08:00 level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-12T17:03:55.331+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-12T17:03:55.334+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-12T17:03:55.335+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30
time=2025-03-12T17:03:55.340+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-12T17:03:55.342+08:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\***\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\LLM\\.ollama\\models\\blobs\\sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 20000 --batch-size 512 --n-gpu-layers 256 --threads 16 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --port 65374"
time=2025-03-12T17:03:55.346+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-12T17:03:55.346+08:00 level=INFO source=server.go:585 msg="waiting for llama runner to start responding"
time=2025-03-12T17:03:55.346+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-12T17:03:55.364+08:00 level=INFO source=runner.go:882 msg="starting ollama engine"
time=2025-03-12T17:03:55.369+08:00 level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:65374"
time=2025-03-12T17:03:55.431+08:00 level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-03-12T17:03:55.431+08:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-03-12T17:03:55.431+08:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\***\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\***\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-03-12T17:03:55.511+08:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-03-12T17:03:55.599+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
time=2025-03-12T17:03:55.610+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-03-12T17:03:55.610+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="16.2 GiB"
time=2025-03-12T17:03:59.907+08:00 level=INFO source=ggml.go:356 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
time=2025-03-12T17:03:59.907+08:00 level=INFO source=ggml.go:356 msg="compute graph" backend=CPU buffer_type=CUDA_Host
time=2025-03-12T17:03:59.908+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-12T17:03:59.911+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-12T17:03:59.913+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-12T17:03:59.917+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30
time=2025-03-12T17:03:59.918+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-12T17:04:00.137+08:00 level=INFO source=server.go:624 msg="llama runner started in 4.79 seconds"
[GIN] 2025/03/12 - 17:04:09 | 200 |   14.5549536s |       127.0.0.1 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

v0.6.0
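
For readers comparing runs, the `msg=offload` line in the log above carries the numbers that matter (layers offloaded, KV cache estimate, total requirement). A minimal sketch for pulling them out, assuming the line keeps this key=value layout; `parse_logfmt` and `gib` are hypothetical helper names, not part of Ollama:

```python
import shlex

# Shortened copy of the msg=offload line from the log above.
OFFLOAD_LINE = (
    'time=2025-03-12T17:03:55.265+08:00 level=INFO source=server.go:138 '
    'msg=offload library=cuda layers.requested=256 layers.model=63 '
    'layers.offload=62 memory.available="[22.5 GiB]" '
    'memory.required.full="22.5 GiB" memory.required.partial="21.4 GiB" '
    'memory.required.kv="4.7 GiB" memory.weights.total="19.1 GiB" '
    'memory.graph.full="1.3 GiB"'
)


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict; shlex strips the quotes."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def gib(value: str) -> float:
    """Convert '4.7 GiB' or '[22.5 GiB]' to GiB (single-GPU values only)."""
    number, unit = value.strip("[]").split()
    scale = {"B": 1 / 2**30, "KiB": 1 / 2**20, "MiB": 1 / 2**10, "GiB": 1.0}
    return float(number) * scale[unit]


fields = parse_logfmt(OFFLOAD_LINE)
kv = gib(fields["memory.required.kv"])
full = gib(fields["memory.required.full"])
print(f"layers offloaded: {fields['layers.offload']} of {fields['layers.model']}")
print(f"KV cache share of full requirement: {kv / full:.0%}")
```

Multi-GPU lines such as `memory.available="[9.4 GiB 9.4 GiB]"` hold several values per field and would need a small extension to `gib`.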

GiteaMirror added the bug label 2026-04-12 17:47:46 -05:00

@vYLQs6 commented on GitHub (Mar 12, 2025):

Edit: add log


@Hioness commented on GitHub (Mar 12, 2025):

Some models handle context-window memory much more efficiently. The qwen2.5, exaone3.5, and falcon3 models are very good examples. Assuming gemma3 has a similar arch to gemma2, it will scale relatively inefficiently for long context.
I also have experienced high VRAM usage when using vision models, but the gemma3 implementation doesn't seem to have a projector, so I'm not sure if that's a factor here.
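
A rough way to check the per-token cost is to count key and value elements per layer. The sketch below is an estimate only: the layer and KV-head counts are assumptions taken from the published model configs rather than from this thread, and 1 byte per element is an approximation of q8_0. With those assumptions it reproduces the `memory.required.kv="4.7 GiB"` reported for `--ctx-size 20000` in the issue's log.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_element):
    """Keys plus values for every layer, KV head, and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element


GIB = 2**30

# Assumed from the published model configs, not stated in this thread.
GEMMA3_27B = dict(n_layers=62, n_kv_heads=16, head_dim=128)
QWEN25_32B = dict(n_layers=64, n_kv_heads=8, head_dim=128)

# 1 byte per element approximates q8_0 (assumption; real q8_0 blocks
# also store a small per-block scale).
gemma_20k = kv_cache_bytes(**GEMMA3_27B, n_ctx=20000, bytes_per_element=1)
qwen_32k = kv_cache_bytes(**QWEN25_32B, n_ctx=32768, bytes_per_element=1)

print(f"Gemma 3 27B @ 20000 ctx: {gemma_20k / GIB:.2f} GiB")  # 4.73 GiB
print(f"Qwen2.5 32B @ 32768 ctx: {qwen_32k / GIB:.2f} GiB")  # 4.00 GiB
```

Under these assumptions, Gemma 3 27B stores roughly twice as many KV elements per token as Qwen2.5 32B (16 KV heads versus 8), which would account for much of the gap described in the issue when every layer is cached at full context length.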


@focomfy commented on GitHub (Mar 12, 2025):

Same. When the KV Cache is set to q8, I can run QwQ 32b-q4 16k-ctx and Mistral 24b-q4 32k-ctx on an RTX 4060 Ti 16GB without OOM, but Gemma3 27b-q4 8k-ctx causes OOM.

Edit: OOM still occurs even at 2k-ctx


@sirajperson commented on GitHub (Mar 12, 2025):

I'm having the same issue on Ubuntu 24.04. The model seems to have loaded to the GPU, but there are like 47 processes associated with attempting to run it. Ollama eventually times out:

ollama run gemma3:27b-it-fp16
Error: timed out waiting for llama runner to start - progress 0.00 -

After exiting ollama, Gemma stays loaded on the GPUs, and the processes that were trying to run it keep running.


@jujaga commented on GitHub (Mar 12, 2025):

Still need to do more tests, but it looks like if OLLAMA_KV_CACHE_TYPE is not set to the default f16, it overflows and eats a bunch of system memory. Both tests ran with OLLAMA_FLASH_ATTENTION=1.

  • gemma3:12b with OLLAMA_KV_CACHE_TYPE=f16, 8k num_ctx - 41.08 tps, ~0.7GB system memory overflow
  • gemma3:12b with OLLAMA_KV_CACHE_TYPE=q8_0 8k num_ctx - 11.86 tps, ~3GB system memory overflow

In both cases I still have plenty of VRAM available, so my preliminary guess is that something isn't being offloaded to the GPU correctly. (Tested on Windows 10, Nvidia GPU w/ 16GB VRAM available)
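
For anyone who wants to repeat this kind of comparison, here is a minimal sketch of measuring tokens per second through Ollama's `/api/generate` endpoint. It assumes a server on the default local port; the prompt text and helper names are placeholders. `OLLAMA_KV_CACHE_TYPE` is an environment variable read by the server, so the server would need a restart between runs with different cache types.

```python
import json
import urllib.request


def build_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Body for /api/generate with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response: dict) -> float:
    """eval_count tokens over eval_duration, which is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(body: dict, host: str = "http://127.0.0.1:11434") -> dict:
    """POST the body to a locally running Ollama server; return the JSON reply."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage against a running server (not executed here):
#   body = build_request("gemma3:12b", "Explain KV caches briefly.", 8192)
#   print(f"{tokens_per_second(generate(body)):.2f} tps")
```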


@sapphirepro commented on GitHub (Mar 12, 2025):

As my report was closed, I'll mention it here. Quadro mobile P5000. The issue is not just higher VRAM usage but insane usage, literally leaving only 0.1-0.2 GB free, which makes the rest of the UI unusable: I cannot take screenshots and even the browser freezes, because VRAM is 100% used. Typically, all the models I tried before on ollama 0.5.13 left 1 GB of VRAM free, which was good, and the system remained fully operational.


@sapphirepro commented on GitHub (Mar 12, 2025):

And I found an interesting thing: if I run the model from the terminal with ollama run, it works normally (without the flag listed below). Over the API it doesn't even start; it tries twice to allocate RAM and dies.

With the update to ollama 0.6.0, something seems broken with the API. OpenWebUI fails to run the model without the environment flag GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /usr/local/bin/ollama serve, and with that flag it eats 100% of RAM. I suspect something got broken on the API side.


@FelikZ commented on GitHub (Mar 12, 2025):

Similar experience on M1 mac, it just consumes all the memory - regardless of 12b or 27b (on 32Gb)


@sapphirepro commented on GitHub (Mar 12, 2025):

> Similar experience on M1 mac, it just consumes all the memory - regardless of 12b or 27b (on 32Gb)

Just curious, did you run it from the terminal or over the API? For me, 12B works perfectly over the API, as it all fits in VRAM. 27B goes mad. I suspect either some sort of memory leak or a broken API. Try both directly from the console and over the API, and see if there is any difference.


@Ezbaze commented on GitHub (Mar 13, 2025):

During my testing with an RTX 3080 Ti and 64GB of RAM (40GB free), I found I couldn't set the context_length above 500 when using an image with the gemma3:12b, as it would run out of RAM. Without an image, I could set the context_length to around 32,000 without any issues.


@sncix commented on GitHub (Mar 13, 2025):

I have a similar issue with CPU only (no GPU), with Gemma 3 27B consuming around 42.9 GiB of memory (including swap), much higher than other 32B Q4_K_M models in my experience.


@Igorgro commented on GitHub (Mar 13, 2025):

@Ezbaze what do you mean by 'with image' or 'without image'? What parameters did you set?


@xxvvii commented on GitHub (Mar 13, 2025):

gemma3:27b failed to run on my MBP M2 Max 32GB
Got "Error: llama runner process has terminated: signal: killed"

ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-icelake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-haswell.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-alderlake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-sandybridge.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-skylakex.so
time=2025-03-13T13:25:41.116+08:00 level=INFO source=ggml.go:109 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-03-13T13:25:41.225+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=Metal size="16.2 GiB"
time=2025-03-13T13:25:41.225+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-03-13T13:25:41.302+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
time=2025-03-13T13:25:51.743+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding"

@dongshimou commented on GitHub (Mar 13, 2025):

time=2025-03-13T07:20:20.303Z level=INFO source=server.go:624 msg="llama runner started in 117.82 seconds"
time=2025-03-13T07:20:20.581Z level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-a0ed380f-31b3-70ae-ef00-f2a86b00f0bc library=cuda total="23.9 GiB" available="2.4 GiB"
time=2025-03-13T07:20:20.582Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-13T07:20:20.582Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-13T07:20:20.582Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-13T07:20:20.583Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-13T07:20:20.583Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-13T07:20:20.583Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-13T07:20:20.584Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-13T07:20:20.584Z level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128

I can only restart ollama.

docker version : ollama/ollama latest b9162cd6df73 31 hours ago


@nigelks commented on GitHub (Mar 13, 2025):

"Error loading llama server" error="llama runner process has terminated: signal: killed"

Gemma3:27B Q4_K_M fails to run on 2x RTX 3080. From the logs: memory.required.full="20.5 GiB", so the model won't fit entirely in VRAM but should use some system memory. However, this causes OOM and my whole system freezes.

Running Gemma3:12b occasionally freezes my system; it uses VRAM from both cards and 95% of my system RAM (around 15GB out of total 32GB). Ollama ps shows 6%CPU/94%GPU. I did not modify any parameters, including context length; the model was freshly pulled from the ollama model registry.

I have tried OLLAMA_FLASH_ATTENTION: 0, OLLAMA_KV_CACHE_TYPE: "f16", GGML_CUDA_ENABLE_UNIFIED_MEMORY: 1, all to no avail.

Running on the latest version of Ollama:

root@e2008d6391f3:/# ollama -v
ollama version is 0.6.0

Logs:

time=2025-03-13T07:31:15.153Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=58 layers.split=29,29 memory.available="[9.4 GiB 9.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.5 GiB" memory.required.partial="18.4 GiB" memory.required.kv="496.0 MiB" memory.required.allocations="[9.2 GiB 9.2 GiB]" memory.weights.total="14.8 GiB" memory.weights.repeating="13.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.6 GiB"
time=2025-03-13T07:31:15.153Z level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-13T07:31:15.262Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-13T07:31:15.267Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-13T07:31:15.269Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30
time=2025-03-13T07:31:15.276Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-13T07:31:15.276Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 2048 --batch-size 512 --n-gpu-layers 58 --threads 12 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 29,29 --port 40847"
time=2025-03-13T07:31:15.277Z level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-13T07:31:15.277Z level=INFO source=server.go:585 msg="waiting for llama runner to start responding"
time=2025-03-13T07:31:15.277Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-13T07:31:15.290Z level=INFO source=runner.go:882 msg="starting ollama engine"
time=2025-03-13T07:31:15.294Z level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:40847"
time=2025-03-13T07:31:15.388Z level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-03-13T07:31:15.388Z level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-03-13T07:31:15.388Z level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
time=2025-03-13T07:31:15.529Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-03-13T07:31:15.909Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-03-13T07:31:16.014Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="6.6 GiB"
time=2025-03-13T07:31:16.014Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA1 size="6.7 GiB"
time=2025-03-13T07:31:16.014Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="3.9 GiB"
time=2025-03-13T07:31:23.576Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-13T07:31:29.202Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-13T07:31:29.452Z level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: killed"
time=2025-03-13T07:31:29.452Z level=WARN source=server.go:505 msg="llama runner process no longer running" sys=9 string="signal: killed"
time=2025-03-13T07:31:34.487Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.032135713 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:31:34.779Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.324459222 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:31:35.069Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.613885742 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:31:37.926Z level=INFO source=server.go:105 msg="system memory" total="31.0 GiB" free="17.8 GiB" free_swap="0 B"
time=2025-03-13T07:31:38.216Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=58 layers.split=29,29 memory.available="[9.4 GiB 9.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.5 GiB" memory.required.partial="18.4 GiB" memory.required.kv="496.0 MiB" memory.required.allocations="[9.2 GiB 9.2 GiB]" memory.weights.total="14.8 GiB" memory.weights.repeating="13.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.6 GiB"
time=2025-03-13T07:31:38.216Z level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-13T07:31:38.314Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-13T07:31:38.316Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-13T07:31:38.318Z level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.final_logit_softcapping default=30
time=2025-03-13T07:31:38.323Z level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-13T07:31:38.323Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 2048 --batch-size 512 --n-gpu-layers 58 --threads 12 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 29,29 --port 42935"
time=2025-03-13T07:31:38.324Z level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-13T07:31:38.324Z level=INFO source=server.go:585 msg="waiting for llama runner to start responding"
time=2025-03-13T07:31:38.324Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-13T07:31:38.341Z level=INFO source=runner.go:882 msg="starting ollama engine"
time=2025-03-13T07:31:38.344Z level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:42935"
time=2025-03-13T07:31:38.451Z level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
time=2025-03-13T07:31:38.451Z level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-03-13T07:31:38.451Z level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
time=2025-03-13T07:31:38.576Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-03-13T07:31:38.981Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-03-13T07:31:39.088Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="3.9 GiB"
time=2025-03-13T07:31:39.088Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="6.6 GiB"
time=2025-03-13T07:31:39.088Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA1 size="6.7 GiB"
time=2025-03-13T07:31:47.176Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-13T07:31:47.749Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
time=2025-03-13T07:36:39.166Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-13T07:37:15.998Z level=ERROR source=sched.go:456 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.00 - "
time=2025-03-13T07:37:35.422Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=18.005152578 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:37:36.773Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=19.3454669 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
time=2025-03-13T07:38:14.929Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=57.529535337 model=/root/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541
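
The offload line in the logs above already carries the numbers needed to see why the model cannot sit fully in VRAM. The following is a minimal sketch, not part of Ollama: the helper names `parse_offload_line` and `gib_values` are made up for illustration, and the sample line is shortened from the log above.

```python
import re
import shlex

def parse_offload_line(line: str) -> dict:
    """Split an Ollama 'msg=offload' log line into key/value pairs."""
    fields = {}
    for token in shlex.split(line):  # shlex keeps quoted values together
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields

def gib_values(text: str) -> list[float]:
    """Extract every '<number> GiB' figure, e.g. from '[9.4 GiB 9.4 GiB]'."""
    return [float(m) for m in re.findall(r"([\d.]+) GiB", text)]

# Shortened copy of the offload line from the log above.
line = (
    'level=INFO source=server.go:138 msg=offload library=cuda '
    'layers.model=63 layers.offload=58 '
    'memory.available="[9.4 GiB 9.4 GiB]" '
    'memory.required.full="20.5 GiB" memory.required.partial="18.4 GiB"'
)

fields = parse_offload_line(line)
available = sum(gib_values(fields["memory.available"]))
required_full = gib_values(fields["memory.required.full"])[0]
fits_fully = required_full <= available
cpu_layers = int(fields["layers.model"]) - int(fields["layers.offload"])

print(f"available={available:.1f} GiB, required={required_full:.1f} GiB, "
      f"fits={fits_fully}, layers left on CPU={cpu_layers}")
```

With these figures the full requirement exceeds the combined free VRAM, so some layers stay on the CPU, which is consistent with the "model weights" buffer=CPU line in the log.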

@FelikZ commented on GitHub (Mar 13, 2025):

> Just curious, did you use from terminal or over API?

@sapphirepro over terminal it is indeed better, but still, the 12b model consumes 21GB, while from the API it consumes 29GB (I have a 32GB M1), which is not what I'd expect for that model size.

For comparison, the 32B DeepSeek distilled model consumes ~30GB and does not crash or anything.


@sapphirepro commented on GitHub (Mar 13, 2025):

> > Just curious, did you use from terminal or over API?
>
> @sapphirepro over terminal it is indeed better, but still, the 12b model consumes 21GB, while from the API it consumes 29GB (I have a 32GB M1), which is not what I'd expect for that model size.
>
> For comparison, the 32B DeepSeek distilled model consumes ~30GB and does not crash or anything.

Well, for me the main problem is not memory usage as such, but VRAM being filled to 100%. Previous versions used only 90%, leaving approx. 1GB of VRAM free. The problem here is that it stalls all GUI rendering: a total freeze of the system's display.


@zeroward commented on GitHub (Mar 13, 2025):

Running into the same issue.

gemma3:27b-q4_k_m uses roughly 21098 MiB of VRAM (from a total of 24GB on one card and 6GB on another) but also uses 41.2% of my system's RAM.

gemma3:12b-it-q8_0 uses roughly 16420 MiB of VRAM, but also roughly 39.1% of my RAM.

gemma3:12b-q4_k_m uses roughly 12408 MiB of VRAM but also roughly 39.1% of my RAM.

Comparatively, mistral-small:24b-instruct-2501-q8_0 uses a combined 25594 MiB split across my two cards and no RAM at all.

I'm running with flash attention enabled.


@sapphirepro commented on GitHub (Mar 14, 2025):

It is a bit annoying that my topic was simply closed "as duplicate" when it is not. I ran into a specific issue.

Nvidia GPU P5000 mobile, 16 GB VRAM. The fatal issue is that it uses the WHOLE VRAM and, at start, almost the whole RAM too. Using the whole VRAM must NOT be allowed, as it blocks the system by leaving nothing for other processes. All other models that exceed the available GPU VRAM leave 1GB of video RAM free, while gemma3:27B doesn't.

Also, for background: it crashes without this environment variable: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

![Image](https://github.com/user-attachments/assets/061153e2-364c-480b-b883-c1f10600ba2d)
![Image](https://github.com/user-attachments/assets/a391d704-a68a-4fe2-8d8a-181b04e80525)

The video memory usage shown in the screenshots is not normal. There needs to be some way to stop Ollama from using more than 90% of VRAM, or maybe 93% at most.

This was tested on 0.6.1-rc0. The behavior is exactly the same as on 0.6.0.
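
One knob that may be relevant to the 90% request is Ollama's `OLLAMA_GPU_OVERHEAD` environment variable, which reserves a number of bytes of VRAM per GPU (the `memory.gpu_overhead="0 B"` field in the logs earlier in this thread appears to report it). Whether it actually prevents the Gemma 3 behavior described here is untested. The sketch below only does the arithmetic; the function name `overhead_bytes` is made up for illustration.

```python
GIB = 1024 ** 3

def overhead_bytes(total_vram_gib: float, max_fraction: float) -> int:
    """Bytes to keep free so that at most max_fraction of VRAM is used."""
    if not 0.0 < max_fraction <= 1.0:
        raise ValueError("max_fraction must be in (0, 1]")
    return int(total_vram_gib * GIB * (1.0 - max_fraction))

# 16 GB card from the comment above, capped at 90% usage.
reserve = overhead_bytes(16, 0.90)
print(f"OLLAMA_GPU_OVERHEAD={reserve}")
```

The printed value would be exported in the environment of `ollama serve` before the model is loaded.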

@sapphirepro commented on GitHub (Mar 14, 2025):

In addition to the message above, this is how DeepSeek-R1:32B looks. Look at the video RAM usage: it uses extra system RAM but keeps 9% of VRAM free, which is what's needed to keep all apps operational. So the Gemma model is a total disaster in how it sets VRAM/RAM usage.
![Image](https://github.com/user-attachments/assets/3e194cd0-5109-445d-a29d-c630bab20c12)

@huankumo commented on GitHub (Mar 15, 2025):

Hi. I have the same issue on an Nvidia GeForce RTX 4090 with CUDA driver version 12.3 and ollama version 0.6.0. I noticed that when invoking the gemma:27b model through the ollama server, there's a subtle difference from, e.g., the llama3.1 model invocation:

![Image](https://github.com/user-attachments/assets/57aef10e-775a-4929-a6e6-e9e1dd02b506)

The gemma model is run with the --ollama-engine flag; I don't know whether that has anything to do with the issue.

@mehditahmasebi commented on GitHub (Mar 15, 2025):

Not fixed yet in ollama version 0.6.1
ollama run gemma3:27b-it-q8_0
Error: Post "http://127.0.0.1:11434/api/generate": EOF

I have 64GB of RTX VRAM overall, but it is using my RAM instead of VRAM. Why?

![Image](https://github.com/user-attachments/assets/9baafb00-2717-472d-9c66-ab6a6be97568) ![Image](https://github.com/user-attachments/assets/777fd88b-130b-4add-8953-4e3f67212c4d)

@ALLMI78 commented on GitHub (Mar 16, 2025):

same here https://github.com/ollama/ollama/issues/9730 (0.6.1) (ctx @ 32k)

q4_k_m Gemma-3-12b and OLLAMA_KV_CACHE_TYPE f16 | 24 GB VRAM + around 10-12 RAM usage -> gemma runs on GPU and answers but is very slow <<< problem: high VRAM and RAM usage

q4_k_m Gemma-3-12b and OLLAMA_KV_CACHE_TYPE q8_0 | less VRAM usage <<< problem: runs on CPU now and GPU load goes down to 20-30% = slow
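
For a rough sense of why the cache type matters at 32k context, here is a back-of-the-envelope estimate of KV cache size. It is a sketch under stated assumptions: the model shape below is hypothetical and chosen for illustration only, and the formula is a naive upper bound that ignores any architecture-specific savings (such as sliding-window layers) and whatever Ollama actually allocates.

```python
# Bytes per cached element for two ggml cache types.
# f16 stores 2 bytes per value; q8_0 stores blocks of 32 int8 values
# plus one 2-byte scale, i.e. 34 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, cache_type):
    """Upper-bound KV cache size: keys and values for every layer and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx_len
    return elements * BYTES_PER_ELEMENT[cache_type]

# Hypothetical model shape (assumption, not read from any model file).
shape = dict(n_layers=48, n_kv_heads=8, head_dim=256, ctx_len=32768)

for cache_type in ("f16", "q8_0"):
    gib = kv_cache_bytes(cache_type=cache_type, **shape) / 1024 ** 3
    print(f"{cache_type}: {gib:.2f} GiB")
```

Under this estimate q8_0 needs a little over half the memory of f16, which matches the "less VRAM usage" observation; it says nothing about why the q8_0 run falls back to the CPU.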

<!-- gh-comment-id:2727319621 --> @ALLMI78 commented on GitHub (Mar 16, 2025): same here https://github.com/ollama/ollama/issues/9730 (0.6.1) (ctx @ 32k) > q4_k_m Gemma-3-12b and OLLAMA_KV_CACHE_TYPE f16 | 24 GB VRAM + around 10-12 RM usage -> gemma runs on GPU and answers but is very slow <<< problem high VRAM and RAM usage > q4_k_m Gemma-3-12b and OLLAMA_KV_CACHE_TYPE q8_0 | less VRAM usage <<< problem runs on CPU now and GPU load goes down to 20-30 % = slow
Author
Owner
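
To get a feel for why the cache type matters so much at 32k context, here is a rough back-of-the-envelope sketch of KV cache size for f16 versus q8_0. It assumes every layer caches keys and values for the full context, which can overestimate for models that use sliding-window layers. The architecture numbers are placeholders chosen for illustration, not values read from any Gemma 3 model file, and `kv_cache_bytes` is a hypothetical helper.

```python
# Bytes per cached element for each cache type. f16 is 2 bytes; ggml's q8_0
# stores 32 values as 32 int8 plus one f16 scale (34 bytes per block), and
# q4_0 stores 32 values in 16 bytes plus one f16 scale (18 bytes per block).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate K+V cache size, assuming every layer caches the full context."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return elements * BYTES_PER_ELEMENT[cache_type]


# Placeholder architecture values (assumption), NOT taken from a real model.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
for cache_type in ("f16", "q8_0"):
    gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
# f16: 6.00 GiB
# q8_0: 3.19 GiB
```

With these placeholder numbers q8_0 cuts the cache nearly in half. That matches the lower VRAM usage reported above, but does not explain why the work moved to the CPU.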

@nickcwilkins commented on GitHub (Mar 24, 2025):

I just installed 0.6.3rc0 and it seems to use less memory, but it splits the model 50%/50% between CPU and GPU. I'm seeing 12 GB of VRAM that isn't being used. This is with an RTX 3090 and Gemma 27B.

@aablsk commented on GitHub (Apr 23, 2025):

For me this is fixed when using the new quantization-aware trained (QAT) models ([27B](https://ollama.com/library/gemma3:27b-it-qat), [12B](https://ollama.com/library/gemma3:12b-it-qat)). These are first-party quants from Google that aren't quantized post-training.

Quoting from [Google's blog post](https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/):

> Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.

There also seem to have been changes that reduce this issue for Q4_K_M on my 4090, which previously crashed with any context size larger than ~4k. Q4_K_M still uses more VRAM and RAM than the QAT version.
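
The "simulates low-precision operations during training" part of the quote is commonly illustrated with a quantize-then-dequantize step, often called fake quantization. The sketch below is a simplified symmetric 4-bit scheme with one shared scale per block of 32 values. It is an assumption-laden illustration, not Google's training code and not llama.cpp's exact Q4_0 routine; `fake_quant_block` is a hypothetical name and the weights are made up.

```python
def fake_quant_block(values, bits=4):
    """Quantize a block to signed integers with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    amax = max(abs(v) for v in values)
    if amax == 0:
        return list(values)
    scale = amax / qmax
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # integer code for this value
        out.append(q * scale)                       # value the next layer would see
    return out


block = [i / 10 - 1.6 for i in range(32)]  # 32 made-up weights
approx = fake_quant_block(block)
worst = max(abs(a - b) for a, b in zip(block, approx))
print(f"max round-trip error: {worst:.4f}")
```

In a QAT setup, the forward pass would typically use values like `approx` while the optimizer keeps updating the full-precision weights, so the model learns to tolerate the rounding error before the final quantization.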

@sapphirepro commented on GitHub (Apr 24, 2025):

> Q4_K_M is still using more VRAM and RAM than the QAT version.

What is the QAT version? Where can I get it?

@aablsk commented on GitHub (Apr 24, 2025):

@sapphirepro I've updated my previous comment with more details and links 👍


@sapphirepro commented on GitHub (Apr 24, 2025):

After many tests I have found one definite drawback, and probably a bug, in Ollama: it is unable to use nvidia_uvm correctly. I have already read a lot of documentation, and it should work by swapping memory blocks to and from GPU VRAM via the CUDA toolkit. Instead, Ollama computes on the CPU and GPU in parallel, which is slow and unproductive. Hopefully I will get time in the next few months to fork Ollama and fix this so that it works properly without using the CPU as a compute unit.

For example, if num_gpu is set above what the GPU's VRAM can hold, it just crashes. Instead, it should swap blocks out to RAM without letting the CPU compute on them, with the CUDA side controlling the on-demand swapping of memory blocks.
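
For readers who want to experiment with the `num_gpu` setting mentioned above: it is a per-request option giving the number of layers to offload to the GPU. Here is a minimal sketch of building such a request body for `/api/generate`, with a rough cap so the value stays within what might fit. The helper names and all numbers are placeholders, not measurements from this thread, and nothing here changes how Ollama behaves when the layers do not fit.

```python
import json


def build_generate_request(model, prompt, num_gpu, num_ctx):
    """Return the JSON body for POST /api/generate with explicit offload options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu, "num_ctx": num_ctx},
    }


def layers_that_fit(free_vram_gib, per_layer_gib, reserved_gib, total_layers):
    """Rough upper bound on offloadable layers; ignores allocator fragmentation."""
    usable = free_vram_gib - reserved_gib  # reserved: KV cache, graph, overhead
    if usable <= 0 or per_layer_gib <= 0:
        return 0
    return min(total_layers, int(usable // per_layer_gib))


# Placeholder numbers (assumptions), not measurements from this thread.
n = layers_that_fit(free_vram_gib=22.0, per_layer_gib=0.5, reserved_gib=6.0, total_layers=63)
body = build_generate_request("gemma3:27b", "Hello", num_gpu=n, num_ctx=8192)
print(json.dumps(body["options"]))
```

The body can then be sent with any HTTP client. Keeping `num_gpu` at or below such an estimate avoids asking for more layers than the card can hold, which is the crash case described above.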
Reference: github-starred/ollama#6315