[GH-ISSUE #9934] ollama v0.6.2 gemma 3 OOM: Killed process #32261

Closed
opened 2026-04-22 13:21:35 -05:00 by GiteaMirror · 2 comments

Originally created by @akshaal on GitHub (Mar 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9934

Originally assigned to: @jessegross on GitHub.

What is the issue?

Gemma 3 IQ4_XS, num_ctx 32k, num_predict 32k. NVIDIA RTX 4090 (24 GB VRAM), 128 GB system RAM.

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ./ollama serve
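The load itself is triggered by a generate request; a minimal sketch of one matching the settings above (the model tag and prompt are placeholders; `num_ctx` and `num_predict` are standard ollama request options):

```shell
# Hypothetical reproduction request; the model tag is a placeholder.
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "gemma3-27b-iq4xs",
  "prompt": "...",
  "options": { "num_ctx": 32768, "num_predict": 32768 }
}'
```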

[Screenshot: https://github.com/user-attachments/assets/87b39473-7007-4591-b241-07a68106c1c5]
[Screenshot: https://github.com/user-attachments/assets/0802e37f-0bb5-4ee6-8771-67827c320e9d]

Relevant log output

ollama stdout+stderr:
  2025/03/21 19:47:08    routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/user/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-03-21T19:47:08.925+01:00 level=INFO source=images.go:432 msg="total blobs: 40"
time=2025-03-21T19:47:08.926+01:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-21T19:47:08.927+01:00 level=INFO source=routes.go:1297 msg="Listening on 127.0.0.1:11434 (version 0.6.2)"
time=2025-03-21T19:47:08.927+01:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-21T19:47:09.049+01:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-22eaf998-1aa8-14bc-3c72-c7275965de5e library=cuda variant=v12 compute=8.9 driver=12.2 name="NVIDIA GeForce RTX 4090" total="23.6 GiB" available="22.1 GiB"
time=2025-03-21T19:47:20.652+01:00 level=INFO source=server.go:105 msg="system memory" total="125.7 GiB" free="115.3 GiB" free_swap="0 B"
time=2025-03-21T19:47:20.732+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=62 layers.split="" memory.available="[22.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.8 GiB" memory.required.partial="21.7 GiB" memory.required.kv="5.8 GiB" memory.required.allocations="[21.7 GiB]" memory.weights.total="12.7 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB" projector.weights="818.0 MiB" projector.graph="0 B"
time=2025-03-21T19:47:20.732+01:00 level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-21T19:47:20.798+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-21T19:47:20.813+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.image_size default=0
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.patch_size default=0
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.num_channels default=0
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.block_count default=0
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.embedding_length default=0
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.attention.head_count default=0
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.image_size default=0
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.patch_size default=0
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0
time=2025-03-21T19:47:20.818+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-21T19:47:20.823+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-21T19:47:20.823+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-21T19:47:20.823+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-21T19:47:20.823+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-21T19:47:20.824+01:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/home/user/Downloads/ollama/bin/ollama runner --ollama-engine --model /home/user/.ollama/models/blobs/sha256-bd2f188c66d8ccb0bffcb0c91e4dbbb72754bb1732e0bca323a2f266a35e01c8 --ctx-size 24576 --batch-size 512 --n-gpu-layers 62 --threads 12 --flash-attn --kv-cache-type q8_0 --parallel 1 --port 41731"
time=2025-03-21T19:47:20.824+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-21T19:47:20.824+01:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-03-21T19:47:20.825+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-03-21T19:47:20.837+01:00 level=INFO source=runner.go:763 msg="starting ollama engine"
time=2025-03-21T19:47:20.837+01:00 level=INFO source=runner.go:823 msg="Server listening on 127.0.0.1:41731"
time=2025-03-21T19:47:20.902+01:00 level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
time=2025-03-21T19:47:20.902+01:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=IQ4_XS name="Gemma 3 27b It" description="" num_tensors=808 num_key_values=45
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /home/user/Downloads/ollama/lib/ollama/cuda_v12/libggml-cuda.so
load_backend: loaded CPU backend from /home/user/Downloads/ollama/lib/ollama/libggml-cpu-haswell.so
time=2025-03-21T19:47:20.942+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-03-21T19:47:21.001+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="2.2 GiB"
time=2025-03-21T19:47:21.001+01:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="12.7 GiB"
time=2025-03-21T19:47:21.276+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-21T19:47:22.009+01:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
time=2025-03-21T19:47:24.735+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
time=2025-03-21T19:47:24.735+01:00 level=INFO source=ggml.go:358 msg="compute graph" backend=CPU buffer_type=CUDA_Host
time=2025-03-21T19:47:24.735+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-21T19:47:24.738+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.image_size default=0
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.patch_size default=0
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.num_channels default=0
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.block_count default=0
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.embedding_length default=0
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.attention.head_count default=0
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.image_size default=0
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.patch_size default=0
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0
time=2025-03-21T19:47:24.741+01:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-21T19:47:24.747+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-21T19:47:24.747+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-21T19:47:24.747+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-21T19:47:24.747+01:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-21T19:47:24.775+01:00 level=INFO source=server.go:619 msg="llama runner started in 3.95 seconds"
[GIN] 2025/03/21 - 19:56:26 | 200 |      22.462µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/21 - 19:56:26 | 200 |      30.257µs |       127.0.0.1 | GET      "/api/ps"
Since the process is killed by the kernel, there is no error message in the ollama log itself.

journald:
[2597940.603198] Out of memory: Killed process 4044306 (ollama) total-vm:248418888kB, anon-rss:9889244kB, file-rss:71688kB, shmem-rss:106964460kB, UID:1000 pgtables:324024kB oom_score_adj:0
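To confirm a kernel OOM kill after the fact, the kernel log can be checked directly (standard journalctl/dmesg usage):

```shell
# Kernel messages only (-k); OOM-killer entries name the killed process
journalctl -k | grep -i "out of memory"
# Or read the kernel ring buffer (may require root)
dmesg | grep -i "killed process"
```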

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.6.2

GiteaMirror added the bug label 2026-04-22 13:21:35 -05:00

@pdevine commented on GitHub (Mar 24, 2025):

There are going to be a couple of issues running bartowski's quants on ollama:

  • the memory estimation is going to be off w/ cuda because we don't (yet) optimize this for the kvcache (it should work fine w/ unified memory); a rough estimate is sketched after this list
  • we haven't tested iquants/imatrix based weights w/ the ollama engine (the gemma3 architecture uses the ollama engine, and not llama.cpp)
  • bartowski doesn't include the vision tensors in the main model, so you're only going to get the text part of the model. I'm not 100% sure if this will work correctly (it may), but you certainly won't get the vision part
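For context on the first point, a back-of-the-envelope KV-cache estimate. This is a sketch, not ollama's actual accounting: the layer/head dimensions below are the commonly cited Gemma 3 27B values and are assumptions here, and q8_0 is treated as roughly one byte per element. Note the runner in the log above was actually started with --ctx-size 24576, and at that size the estimate lands right on the log's memory.required.kv="5.8 GiB":

```shell
# KV-cache bytes ~ 2 (K and V) x n_layers x n_kv_heads x head_dim x ctx x bytes/elt
# Assumed Gemma 3 27B dims: 62 layers, 16 KV heads, head_dim 128; q8_0 ~ 1 byte/elt.
echo "ctx=24576: $(( 2 * 62 * 16 * 128 * 24576 / 1024 / 1024 )) MiB"  # 5952 MiB ~ 5.8 GiB, matches the log
echo "ctx=32768: $(( 2 * 62 * 16 * 128 * 32768 / 1024 / 1024 )) MiB"  # 7936 MiB ~ 7.8 GiB
```

On a 24 GB card already holding 12.7 GiB of weights, the 32k figure leaves little headroom, which is consistent with the estimation being the fragile part here.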

@jetnet commented on GitHub (Mar 29, 2025):

Same issue when sending lots of requests to /api/generate.
Workaround: a cronjob that stops the model every hour :(
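A sketch of that workaround (the model tag is a placeholder): a request with keep_alive: 0 asks ollama to unload the model immediately, so an hourly crontab entry could look like:

```shell
# crontab entry: unload the model at the top of every hour (model tag is a placeholder)
0 * * * * curl -s http://127.0.0.1:11434/api/generate -d '{"model": "gemma3-27b-iq4xs", "keep_alive": 0}'
```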


Reference: github-starred/ollama#32261