[GH-ISSUE #10341] Gemma 3 12b (Q4_K_M) fills system RAM despite available VRAM (OLLAMA 0.6.5) #32552

Closed
opened 2026-04-22 13:56:44 -05:00 by GiteaMirror · 14 comments

Originally created by @ALLMI78 on GitHub (Apr 18, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10341

What is the issue?

Hi there,

I'm using WIN10+OLLAMA 0.6.5 on a system with a 16 GB RTX 4060 Ti and 32 GB of RAM.

Running the Gemma 3 12b model (Q4_K_M) leads to a memory issue:

During inference with the context set to 32k(!), the model uses around 10 GB of VRAM, leaving 6 GB unused.

However, system RAM usage keeps increasing with each run until it eventually exhausts all memory and crashes with an out-of-memory error.

Interestingly, larger models like Qwen 14b run smoothly on the same setup and use the available VRAM effectively.

I've attached screenshots showing how RAM usage increases over time while VRAM stays constant.

Question:

Why is Gemma 3 not utilizing the available VRAM and instead offloading to system memory? Is this a known issue, and are there any workarounds?

![Image](https://github.com/user-attachments/assets/062c1b12-b2f6-42f1-8f00-e2e281c1eab6)

![Image](https://github.com/user-attachments/assets/797348c4-c0ce-4594-a17c-c4439147d8bd)

![Image](https://github.com/user-attachments/assets/15b0f6e6-43ef-410d-9b5c-38a017b9c287)

![Image](https://github.com/user-attachments/assets/43b4239d-efaa-4728-9d59-a5edc28f4f7d)

You can also see that gemma3 is running unstably; GPU load and VRAM usage keep jumping around...
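A quick way to confirm from the command line whether ollama has spilled layers into system RAM, rather than inferring it from Task Manager, is to check the scheduler's own report. A minimal sketch, assuming ollama 0.6.x with default install paths:

```shell
# Load the model, then ask ollama how it split the model between CPU and GPU.
ollama run gemma3:12b "hi"
ollama ps
# The PROCESSOR column reads e.g. "100% GPU" when the model is fully in VRAM,
# or "24%/76% CPU/GPU" when part of it is held in system RAM.

# The server log shows the memory estimate the split was based on, e.g.:
#   msg=offload ... layers.model=49 layers.offload=48 memory.required.full="11.8 GiB"
# On Windows the server log is typically found under %LOCALAPPDATA%\Ollama.
```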

GiteaMirror added the bug label 2026-04-22 13:56:44 -05:00

@ALLMI78 commented on GitHub (Apr 18, 2025):

Same setup, only a different model: qwen 14b Q4_K_M runs nice and clean.

![Image](https://github.com/user-attachments/assets/1a344f5f-1c7e-4ec2-9483-7911a6d0a50b)


@rick-github commented on GitHub (Apr 18, 2025):

#10040


@esciron commented on GitHub (Apr 19, 2025):

I'm having the opposite problem (gemma 3 27b it qat, Q4 quant): the VRAM fills but it doesn't offload the context window, and when it does, it takes around 9 minutes to respond (with a 48k window), while shared VRAM is not used the way it is with every other model I run, from 8b to 70b. With a context size of around 4k or 8k it works correctly, since nothing needs to be offloaded to RAM.
Windows 10, ollama 0.6.6 rc2
4090
64 GB RAM


@ALLMI78 commented on GitHub (Apr 19, 2025):

Ohhh, ok, I was just about to post this:

Update on Gemma 3 12b with OLLAMA 0.6.6

After updating to OLLAMA version 0.6.6, Gemma 3 12b (Q4_K_M) shows notable improvements but also some remaining issues:

Positives:

  • The RAM memory leak observed in earlier versions seems to be resolved. Even with two alternating models (gemma3-12b/qwen14b) running, system memory usage remains stable over several hours.

  • The model now runs reliably at around 10 GB VRAM, well within the 16 GB limit of the RTX 4060 Ti.

Remaining issues:

  • During inference, there's a pronounced sawtooth pattern in both CPU and GPU usage, suggesting load is shifting inefficiently between them.

  • Despite available VRAM, performance seems suboptimal due to this back-and-forth, likely leaving GPU capacity underutilized.

  • By contrast, a parallel-running Qwen 14b model shows stable, expected workload behavior without such fluctuations.

These findings suggest that (in my setup) Gemma 3 is now stable, but inference performance and GPU load management may still need refinement.

![Image](https://github.com/user-attachments/assets/616d5683-d116-4a49-be7b-f53e8f2e9cdc)

![Image](https://github.com/user-attachments/assets/143be9e7-aba7-45e2-8a38-747f9ace813b)
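To capture the sawtooth numerically rather than via Task Manager screenshots, a simple sampling loop with nvidia-smi is enough; a sketch using standard nvidia-smi query flags, sampled once per second while a prompt runs:

```shell
# Log GPU utilization and VRAM use to a CSV for later plotting.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1 > gpu_trace.csv
```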


@ALLMI78 commented on GitHub (Apr 19, 2025):

I'd noticed that the sawtooth-like CPU/GPU load pattern might be influenced by batch size; it seems to become more stable with larger batches, and I can now confirm the relationship.

I continued testing Gemma 3 12b with different batch sizes and noticed a possible performance sweet spot around batch sizes of 512 to 1024. In that range, GPU load appears more stable, and performance improves significantly — with about 1700 tokens/sec during prompt processing and around 19 tokens/sec during token generation.

However, there is a clear pattern: at batch size 2048, performance drops again and GPU load becomes unstable once more. So while performance improves with moderate batch sizes, it's still unclear whether this is the maximum achievable or whether underlying inefficiencies remain in the current implementation.

![Image](https://github.com/user-attachments/assets/ecfc8ef0-4eef-4a97-bac9-c7f31c32b62f)

![Image](https://github.com/user-attachments/assets/4c45c57b-3b71-4274-ad48-ee71ebfd1815)
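For anyone repeating these batch-size experiments, the batch size can be varied per request through the API options rather than by rebuilding a model. A sketch, assuming a default local server and that `num_batch` is the option ollama maps onto the runner's `--batch-size` (prompt and values are only illustrative):

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "Summarize the plot of Faust in three sentences.",
  "options": { "num_batch": 1024, "num_ctx": 32768 }
}'
```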


@meltyness commented on GitHub (Apr 20, 2025):

This hard-coded, unconfigurable value, whose provenance I cannot determine:
https://github.com/ollama/ollama/blob/88738b357bcd25eea860b59bf7de2f6b94cfc352/discover/gpu.go#L41
is incident on the following calculation:
https://github.com/ollama/ollama/blob/1e7f62cb429e5a962dd9c448e7b1b3371879e48b/llm/memory.go#L193

This seems to be at play in my case. I took the arbitrary 457 value down to zero in my configuration and was then able to fit much larger models into my GPU's VRAM, though for some reason I had to completely recompile, including gguf, which took a very long time. Something else also prevents partial loads for me, and I haven't traced the cause there. I'm also not sure what the consequences are, since presumably this value was set for a reason, but I'm not hosting the service externally, so I'm not terribly concerned.

For posterity, I got gemma3:latest / gemma3:4b stuffed into my RTX 3050 Ti by making this simple change. GPU-based multimodal inference, here I go, oh boy! Also of note: inference on my platform is indeed much faster this way, and, mysteriously, so is loading; I guess those PCIe lanes offer pretty good bandwidth compared to loading the model into virtualized main memory.


@semidark commented on GitHub (Apr 24, 2025):

I have a similar issue, but only with gemma3:12b-it-qat; gemma3:12b works fine on my setup.

The issue is that even though my 3060 has 12 GB of VRAM, ollama only runs part of gemma3:12b-it-qat in VRAM.

![Image](https://github.com/user-attachments/assets/a2167537-ca8e-4d70-9c31-fdab9b4311f7)

Log for loading gemma3:12b-it-qat:

Apr 24 15:54:54 ollama ollama[5616]: 2025/04/24 15:54:54 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA>
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.503Z level=INFO source=images.go:458 msg="total blobs: 29"
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.503Z level=INFO source=images.go:465 msg="total unused blobs removed: 0"
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.504Z level=INFO source=routes.go:1299 msg="Listening on [::]:11434 (version 0.6.6)"
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.504Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.809Z level=INFO source=types.go:130 msg="inference compute" id=GPU-a7015e82-3d05-23cf-2390-ce645357fd6d library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="11.6 GiB" available="11.5 GiB"
Apr 24 15:55:50 ollama ollama[5616]: [GIN] 2025/04/24 - 15:55:50 | 200 | 41.651µs | 127.0.0.1 | HEAD "/"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.173Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.209Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: [GIN] 2025/04/24 - 15:55:50 | 200 | 73.168836ms | 127.0.0.1 | POST "/api/show"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.247Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.454Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.489Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.654Z level=INFO source=server.go:105 msg="system memory" total="32.0 GiB" free="31.8 GiB" free_swap="512.0 MiB"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.657Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=48 layers.split="" memory.available="[11.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="9.9 GiB" memory.required.kv="608.0 MiB" memory.require>
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.718Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.724Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.736Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.736Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.736Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.736Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.737Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 2048 --batch-size 512 --n-gpu-layers 48 --threads 4 -->
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.737Z level=INFO source=sched.go:451 msg="loaded runners" count=1
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.737Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.737Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.748Z level=INFO source=runner.go:866 msg="starting ollama engine"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.749Z level=INFO source=runner.go:929 msg="Server listening on 127.0.0.1:39483"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.809Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.811Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.811Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.811Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
Apr 24 15:55:50 ollama ollama[5616]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Apr 24 15:55:50 ollama ollama[5616]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 24 15:55:50 ollama ollama[5616]: ggml_cuda_init: found 1 CUDA devices:
Apr 24 15:55:50 ollama ollama[5616]: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Apr 24 15:55:50 ollama ollama[5616]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Apr 24 15:55:50 ollama ollama[5616]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.893Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.989Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Apr 24 15:55:51 ollama ollama[5616]: time=2025-04-24T15:55:51.003Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="4.5 GiB"
Apr 24 15:55:51 ollama ollama[5616]: time=2025-04-24T15:55:51.003Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="5.6 GiB"
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.518Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.531Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.531Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.531Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.531Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.559Z level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="102.0 MiB"
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.559Z level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.757Z level=INFO source=server.go:619 msg="llama runner started in 3.02 seconds"

The gemma3:12b model is running 100% in VRAM:

![Image](https://github.com/user-attachments/assets/c2f8af72-6779-4f13-a995-b90ce0cda28b)

Log for loading gemma3:12b:

Apr 24 16:01:29 ollama ollama[6888]: 2025/04/24 16:01:29 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA>
Apr 24 16:01:29 ollama ollama[6888]: time=2025-04-24T16:01:29.746Z level=INFO source=images.go:458 msg="total blobs: 29"
Apr 24 16:01:29 ollama ollama[6888]: time=2025-04-24T16:01:29.746Z level=INFO source=images.go:465 msg="total unused blobs removed: 0"
Apr 24 16:01:29 ollama ollama[6888]: time=2025-04-24T16:01:29.747Z level=INFO source=routes.go:1299 msg="Listening on [::]:11434 (version 0.6.6)"
Apr 24 16:01:29 ollama ollama[6888]: time=2025-04-24T16:01:29.747Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Apr 24 16:01:30 ollama ollama[6888]: time=2025-04-24T16:01:30.025Z level=INFO source=types.go:130 msg="inference compute" id=GPU-a7015e82-3d05-23cf-2390-ce645357fd6d library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="11.6 GiB" available="11.5 GiB"
Apr 24 16:01:36 ollama ollama[6888]: [GIN] 2025/04/24 - 16:01:36 | 200 | 47.985µs | 127.0.0.1 | HEAD "/"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.220Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.272Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: [GIN] 2025/04/24 - 16:01:36 | 200 | 106.021177ms | 127.0.0.1 | POST "/api/show"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.327Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.540Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.591Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.598Z level=INFO source=sched.go:722 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de gpu=GPU-a7015e82-3d05-23cf-2390-ce645357fd6d parallel=1 available=12382371840 re>
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.751Z level=INFO source=server.go:105 msg="system memory" total="32.0 GiB" free="31.8 GiB" free_swap="512.0 MiB"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.753Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[11.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.3 GiB" memory.required.partial="10.3 GiB" memory.required.kv="608.0 MiB" memory.requir>
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.880Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.886Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.897Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 2048 --batch-size 512 --n-gpu-layers 49 --threads 4 -->
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=INFO source=sched.go:451 msg="loaded runners" count=1
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.910Z level=INFO source=runner.go:866 msg="starting ollama engine"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.910Z level=INFO source=runner.go:929 msg="Server listening on 127.0.0.1:42889"
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.039Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.041Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.041Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.041Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=37
Apr 24 16:01:37 ollama ollama[6888]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Apr 24 16:01:37 ollama ollama[6888]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 24 16:01:37 ollama ollama[6888]: ggml_cuda_init: found 1 CUDA devices:
Apr 24 16:01:37 ollama ollama[6888]: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Apr 24 16:01:37 ollama ollama[6888]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Apr 24 16:01:37 ollama ollama[6888]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.122Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.149Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.243Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="7.6 GiB"
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.243Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="787.5 MiB"
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.160Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.195Z level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="102.0 MiB"
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.195Z level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.408Z level=INFO source=server.go:619 msg="llama runner started in 2.51 seconds"
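As an aside, the relevant difference between the two loads above is easiest to see by filtering the journal for the scheduler's offload estimate; a sketch, assuming a systemd-managed install as the log prefix suggests:

```shell
# Show only the memory-estimate lines ollama logged when loading each model.
journalctl -u ollama --since today | grep "msg=offload"
# layers.offload=48 of layers.model=49  -> one layer left in system RAM (the qat build)
# layers.offload=49 of layers.model=49  -> fully resident in VRAM
```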


@bshor commented on GitHub (Apr 24, 2025):

I have the same problem as @semidark: the 12b-qat model overfills VRAM on my 12 GB 4070, while the regular 12b Q4_K_M model does just fine.


@rick-github commented on GitHub (Apr 24, 2025):

gemma3:12b-it-qat

Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.657Z level=INFO source=server.go:138 msg=offload library=cuda
 layers.requested=-1 layers.model=49 layers.offload=48 layers.split="" memory.available="[11.5 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="11.8 GiB" memory.required.partial="9.9 GiB" memory.required.kv="608.0 MiB" memory.require>

gemma3:12b

Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.753Z level=INFO source=server.go:138 msg=offload library=cuda
 layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[11.5 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="10.3 GiB" memory.required.partial="10.3 GiB" memory.required.kv="608.0 MiB" memory.requir>

The QAT quant requires an extra 1.5 GiB, causing one layer to be spilled into system RAM. You can try working around this by overriding ollama's memory estimation and [setting](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650) num_gpu to 49. Other ways to reduce the memory footprint can be found [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288).
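For reference, a minimal sketch of that override, assuming num_gpu is accepted as a Modelfile parameter as the linked comment describes (the derived tag name gemma3-qat-gpu is just an example):

```shell
# Bake a num_gpu override into a derived model so all 49 layers go to the GPU.
cat > Modelfile <<'EOF'
FROM gemma3:12b-it-qat
PARAMETER num_gpu 49
EOF
ollama create gemma3-qat-gpu -f Modelfile
ollama run gemma3-qat-gpu
```

If that pushes the GPU slightly over its limit, lowering num_ctx is the simplest lever before trying the other footprint-reduction tips in the second link.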


@semidark commented on GitHub (Apr 24, 2025):

Hello @rick-github: how can I try working around ollama's memory estimation?


@rick-github commented on GitHub (Apr 24, 2025):

https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650


@semidark commented on GitHub (Apr 24, 2025):

Thanks again for the link. I already followed it the first time you sent it, but I thought you meant something other than just setting num_gpu to the desired value. But if it is that simple, I will try it for sure.


@rick-github commented on GitHub (Apr 24, 2025):

Be aware that, depending on your OS/drivers, overriding can cause [performance issues](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900) and/or OOMs.


@Numieo commented on GitHub (May 21, 2025):

I have 8 GB of VRAM, and I also tried deploying Gemma 3 12B with Ollama. However, the Q4_K_M version used only a little VRAM while using a lot of RAM, and the QAT version used only RAM. Gemma 3 4B's Q4_K_M version used 4 GB of VRAM; the QAT version used 2 GB of VRAM while also using a lot of RAM. Ollama seems to have some issues here.

I switched to LM Studio, where the Gemma 3 12B Q4_K_M and QAT versions both use 7.7 GB of VRAM and 4 GB of RAM, and the Gemma 3 4B Q4_K_M and QAT versions load normally into VRAM, so I'd suggest switching to LM Studio.

Reference: github-starred/ollama#32552