[GH-ISSUE #10341] Gemma 3 12b (Q4_K_M) fills system RAM despite available VRAM (OLLAMA 0.6.5) #32552

Closed
opened 2026-04-22 13:56:44 -05:00 by GiteaMirror · 14 comments

Originally created by @ALLMI78 on GitHub (Apr 18, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10341

What is the issue?

Hi there,

I'm using WIN10+OLLAMA 0.6.5 on a system with a 16 GB RTX 4060 Ti and 32 GB of RAM.

Running the Gemma 3 12b model (Q4_K_M) leads to a memory issue:

During inference with the context set to 32k(!), the model uses around 10 GB of VRAM, leaving 6 GB unused.

However, system RAM usage keeps increasing with each run until it eventually exhausts all memory and crashes with an out-of-memory error.

Interestingly, larger models like Qwen 14b run smoothly on the same setup and use the available VRAM effectively.

I've attached screenshots showing how RAM usage increases over time while VRAM stays constant.

Question:

Why is Gemma 3 not utilizing the available VRAM and instead offloading to system memory? Is this a known issue, and are there any workarounds?

![Image](https://github.com/user-attachments/assets/062c1b12-b2f6-42f1-8f00-e2e281c1eab6)

![Image](https://github.com/user-attachments/assets/797348c4-c0ce-4594-a17c-c4439147d8bd)

![Image](https://github.com/user-attachments/assets/15b0f6e6-43ef-410d-9b5c-38a017b9c287)

![Image](https://github.com/user-attachments/assets/43b4239d-efaa-4728-9d59-a5edc28f4f7d)

You can also see that gemma3 is running unstably; GPU load and VRAM usage keep jumping around...
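A quick way to confirm from the command line whether ollama has spilled layers into system RAM, rather than inferring it from Task Manager, is to check the scheduler's own report. A minimal sketch, assuming ollama 0.6.x with default install paths:

```shell
# Load the model, then ask ollama how it split the model between CPU and GPU.
ollama run gemma3:12b "hi"
ollama ps
# The PROCESSOR column reads e.g. "100% GPU" when the model is fully in VRAM,
# or "24%/76% CPU/GPU" when part of it is held in system RAM.

# The server log shows the memory estimate the split was based on, e.g.:
#   msg=offload ... layers.model=49 layers.offload=48 memory.required.full="11.8 GiB"
# On Windows the server log is typically found under %LOCALAPPDATA%\Ollama.
```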

GiteaMirror added the bug label 2026-04-22 13:56:44 -05:00

@ALLMI78 commented on GitHub (Apr 18, 2025):

Same setup, only a different model: qwen 14b Q4_K_M runs nice and clean.

![Image](https://github.com/user-attachments/assets/1a344f5f-1c7e-4ec2-9483-7911a6d0a50b)


@rick-github commented on GitHub (Apr 18, 2025):

#10040


@esciron commented on GitHub (Apr 19, 2025):

I'm having the opposite problem (gemma 3 27b it qat, Q4 quant): the VRAM fills but it doesn't offload the context window, and when it does, it takes around 9 minutes to respond (with a 48k window), while shared VRAM is not used the way it is with every other model I run, from 8b to 70b. With a context size of around 4k or 8k it works correctly, since nothing needs to be offloaded to RAM.
Windows 10, ollama 0.6.6 rc2
4090
64 GB RAM


@ALLMI78 commented on GitHub (Apr 19, 2025):

Ohhh, ok, I was just about to post this:

Update on Gemma 3 12b with OLLAMA 0.6.6

After updating to OLLAMA version 0.6.6, Gemma 3 12b (Q4_K_M) shows notable improvements but also some remaining issues:

Positives:

  • The RAM memory leak observed in earlier versions seems to be resolved. Even with two alternating models (gemma3-12b/qwen14b) running, system memory usage remains stable over several hours.

  • The model now runs reliably at around 10 GB VRAM, well within the 16 GB limit of the RTX 4060 Ti.

Remaining issues:

  • During inference, there's a pronounced sawtooth pattern in both CPU and GPU usage, suggesting load is shifting inefficiently between them.

  • Despite available VRAM, performance seems suboptimal due to this back-and-forth, likely leaving GPU capacity underutilized.

  • By contrast, a parallel-running Qwen 14b model shows stable, expected workload behavior without such fluctuations.

These findings suggest that (in my setup) Gemma 3 is now stable, but inference performance and GPU load management may still need refinement.

![Image](https://github.com/user-attachments/assets/616d5683-d116-4a49-be7b-f53e8f2e9cdc)

![Image](https://github.com/user-attachments/assets/143be9e7-aba7-45e2-8a38-747f9ace813b)
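To capture the sawtooth numerically rather than via Task Manager screenshots, a simple sampling loop with nvidia-smi is enough; a sketch using standard nvidia-smi query flags, sampled once per second while a prompt runs:

```shell
# Log GPU utilization and VRAM use to a CSV for later plotting.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1 > gpu_trace.csv
```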


@ALLMI78 commented on GitHub (Apr 19, 2025):

I'd noticed that the sawtooth-like CPU/GPU load pattern might be influenced by batch size; it seems to become more stable with larger batches, and I can now confirm the relationship.

I continued testing Gemma 3 12b with different batch sizes and noticed a possible performance sweet spot around batch sizes of 512 to 1024. In that range, GPU load appears more stable, and performance improves significantly — with about 1700 tokens/sec during prompt processing and around 19 tokens/sec during token generation.

However, there is a clear pattern: at batch size 2048, performance drops again and GPU load becomes unstable once more. So while performance improves with moderate batch sizes, it's still unclear whether this is the maximum achievable or whether underlying inefficiencies remain in the current implementation.

![Image](https://github.com/user-attachments/assets/ecfc8ef0-4eef-4a97-bac9-c7f31c32b62f)

![Image](https://github.com/user-attachments/assets/4c45c57b-3b71-4274-ad48-ee71ebfd1815)
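For anyone repeating these batch-size experiments, the batch size can be varied per request through the API options rather than by rebuilding a model. A sketch, assuming a default local server and that `num_batch` is the option ollama maps onto the runner's `--batch-size` (prompt and values are only illustrative):

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "Summarize the plot of Faust in three sentences.",
  "options": { "num_batch": 1024, "num_ctx": 32768 }
}'
```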


@meltyness commented on GitHub (Apr 20, 2025):

This hard-coded, unconfigurable value, whose provenance I cannot determine:
https://github.com/ollama/ollama/blob/88738b357bcd25eea860b59bf7de2f6b94cfc352/discover/gpu.go#L41
is incident on the following calculation:
https://github.com/ollama/ollama/blob/1e7f62cb429e5a962dd9c448e7b1b3371879e48b/llm/memory.go#L193

This seems to be at play in my case. I took the arbitrary 457 value down to zero in my configuration and was then able to fit much larger models into my GPU's VRAM, though for some reason I had to completely recompile, including gguf, which took a very long time. Something else also prevents partial loads for me, and I haven't traced the cause there. I'm also not sure what the consequences are, since presumably this value was set for a reason, but I'm not hosting the service externally, so I'm not terribly concerned.

For posterity, I got gemma3:latest / gemma3:4b stuffed into my RTX 3050 Ti by making this simple change. GPU-based multimodal inference, here I go, oh boy! Also of note: inference on my platform is indeed much faster this way, and, mysteriously, so is loading; I guess those PCIe lanes offer pretty good bandwidth compared to loading the model into virtualized main memory.


@semidark commented on GitHub (Apr 24, 2025):

I have a similar issue, but only with gemma3:12b-it-qat; gemma3:12b works fine on my setup.

The issue is that even though my 3060 has 12 GB of VRAM, ollama only runs part of gemma3:12b-it-qat in VRAM.

![Image](https://github.com/user-attachments/assets/a2167537-ca8e-4d70-9c31-fdab9b4311f7)

Log for loading gemma3:12b-it-qat:

Apr 24 15:54:54 ollama ollama[5616]: 2025/04/24 15:54:54 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA>
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.503Z level=INFO source=images.go:458 msg="total blobs: 29"
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.503Z level=INFO source=images.go:465 msg="total unused blobs removed: 0"
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.504Z level=INFO source=routes.go:1299 msg="Listening on [::]:11434 (version 0.6.6)"
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.504Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Apr 24 15:54:54 ollama ollama[5616]: time=2025-04-24T15:54:54.809Z level=INFO source=types.go:130 msg="inference compute" id=GPU-a7015e82-3d05-23cf-2390-ce645357fd6d library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="11.6 GiB" available="11.5 GiB"
Apr 24 15:55:50 ollama ollama[5616]: [GIN] 2025/04/24 - 15:55:50 | 200 | 41.651µs | 127.0.0.1 | HEAD "/"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.173Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.209Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: [GIN] 2025/04/24 - 15:55:50 | 200 | 73.168836ms | 127.0.0.1 | POST "/api/show"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.247Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.454Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.489Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.654Z level=INFO source=server.go:105 msg="system memory" total="32.0 GiB" free="31.8 GiB" free_swap="512.0 MiB"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.657Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=48 layers.split="" memory.available="[11.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="9.9 GiB" memory.required.kv="608.0 MiB" memory.require>
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.718Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.724Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.736Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.736Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.736Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.736Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.737Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 2048 --batch-size 512 --n-gpu-layers 48 --threads 4 -->
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.737Z level=INFO source=sched.go:451 msg="loaded runners" count=1
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.737Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.737Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.748Z level=INFO source=runner.go:866 msg="starting ollama engine"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.749Z level=INFO source=runner.go:929 msg="Server listening on 127.0.0.1:39483"
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.809Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.811Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.811Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.811Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
Apr 24 15:55:50 ollama ollama[5616]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Apr 24 15:55:50 ollama ollama[5616]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 24 15:55:50 ollama ollama[5616]: ggml_cuda_init: found 1 CUDA devices:
Apr 24 15:55:50 ollama ollama[5616]: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Apr 24 15:55:50 ollama ollama[5616]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Apr 24 15:55:50 ollama ollama[5616]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.893Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.989Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Apr 24 15:55:51 ollama ollama[5616]: time=2025-04-24T15:55:51.003Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="4.5 GiB"
Apr 24 15:55:51 ollama ollama[5616]: time=2025-04-24T15:55:51.003Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="5.6 GiB"
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.518Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.531Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.531Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.531Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.531Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.559Z level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="102.0 MiB"
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.559Z level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
Apr 24 15:55:53 ollama ollama[5616]: time=2025-04-24T15:55:53.757Z level=INFO source=server.go:619 msg="llama runner started in 3.02 seconds"

The gemma3:12b model is running 100% in VRAM:

![Image](https://github.com/user-attachments/assets/c2f8af72-6779-4f13-a995-b90ce0cda28b)

Log for loading gemma3:12b:

Apr 24 16:01:29 ollama ollama[6888]: 2025/04/24 16:01:29 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA>
Apr 24 16:01:29 ollama ollama[6888]: time=2025-04-24T16:01:29.746Z level=INFO source=images.go:458 msg="total blobs: 29"
Apr 24 16:01:29 ollama ollama[6888]: time=2025-04-24T16:01:29.746Z level=INFO source=images.go:465 msg="total unused blobs removed: 0"
Apr 24 16:01:29 ollama ollama[6888]: time=2025-04-24T16:01:29.747Z level=INFO source=routes.go:1299 msg="Listening on [::]:11434 (version 0.6.6)"
Apr 24 16:01:29 ollama ollama[6888]: time=2025-04-24T16:01:29.747Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Apr 24 16:01:30 ollama ollama[6888]: time=2025-04-24T16:01:30.025Z level=INFO source=types.go:130 msg="inference compute" id=GPU-a7015e82-3d05-23cf-2390-ce645357fd6d library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="11.6 GiB" available="11.5 GiB"
Apr 24 16:01:36 ollama ollama[6888]: [GIN] 2025/04/24 - 16:01:36 | 200 | 47.985µs | 127.0.0.1 | HEAD "/"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.220Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.272Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: [GIN] 2025/04/24 - 16:01:36 | 200 | 106.021177ms | 127.0.0.1 | POST "/api/show"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.327Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.540Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.591Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.598Z level=INFO source=sched.go:722 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de gpu=GPU-a7015e82-3d05-23cf-2390-ce645357fd6d parallel=1 available=12382371840 re>
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.751Z level=INFO source=server.go:105 msg="system memory" total="32.0 GiB" free="31.8 GiB" free_swap="512.0 MiB"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.753Z level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[11.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.3 GiB" memory.required.partial="10.3 GiB" memory.required.kv="608.0 MiB" memory.requir>
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.880Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.886Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.897Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 2048 --batch-size 512 --n-gpu-layers 49 --threads 4 -->
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=INFO source=sched.go:451 msg="loaded runners" count=1
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.898Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.910Z level=INFO source=runner.go:866 msg="starting ollama engine"
Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.910Z level=INFO source=runner.go:929 msg="Server listening on 127.0.0.1:42889"
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.039Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.041Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.041Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.041Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=37
Apr 24 16:01:37 ollama ollama[6888]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Apr 24 16:01:37 ollama ollama[6888]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 24 16:01:37 ollama ollama[6888]: ggml_cuda_init: found 1 CUDA devices:
Apr 24 16:01:37 ollama ollama[6888]: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Apr 24 16:01:37 ollama ollama[6888]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Apr 24 16:01:37 ollama ollama[6888]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.122Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.149Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.243Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="7.6 GiB"
Apr 24 16:01:37 ollama ollama[6888]: time=2025-04-24T16:01:37.243Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="787.5 MiB"
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.160Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.169Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.195Z level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="102.0 MiB"
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.195Z level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
Apr 24 16:01:39 ollama ollama[6888]: time=2025-04-24T16:01:39.408Z level=INFO source=server.go:619 msg="llama runner started in 2.51 seconds"
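As an aside, the relevant difference between the two loads above is easiest to see by filtering the journal for the scheduler's offload estimate; a sketch, assuming a systemd-managed install as the log prefix suggests:

```shell
# Show only the memory-estimate lines ollama logged when loading each model.
journalctl -u ollama --since today | grep "msg=offload"
# layers.offload=48 of layers.model=49  -> one layer left in system RAM (the qat build)
# layers.offload=49 of layers.model=49  -> fully resident in VRAM
```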


@bshor commented on GitHub (Apr 24, 2025):

I have the same problem as @semidark: the 12b-qat model overfills VRAM on my 12 GB 4070, while the regular 12b Q4_K_M model does just fine.


@rick-github commented on GitHub (Apr 24, 2025):

gemma3:12b-it-qat

Apr 24 15:55:50 ollama ollama[5616]: time=2025-04-24T15:55:50.657Z level=INFO source=server.go:138 msg=offload library=cuda
 layers.requested=-1 layers.model=49 layers.offload=48 layers.split="" memory.available="[11.5 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="11.8 GiB" memory.required.partial="9.9 GiB" memory.required.kv="608.0 MiB" memory.require>

gemma3:12b

Apr 24 16:01:36 ollama ollama[6888]: time=2025-04-24T16:01:36.753Z level=INFO source=server.go:138 msg=offload library=cuda
 layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[11.5 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="10.3 GiB" memory.required.partial="10.3 GiB" memory.required.kv="608.0 MiB" memory.requir>

The QAT quant requires an extra 1.5 GiB, causing one layer to be spilled into system RAM. You can try working around this by overriding ollama's memory estimation and [setting](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650) num_gpu to 49. Other ways to reduce the memory footprint can be found [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288).
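For reference, a minimal sketch of that override, assuming num_gpu is accepted as a Modelfile parameter as the linked comment describes (the derived tag name gemma3-qat-gpu is just an example):

```shell
# Bake a num_gpu override into a derived model so all 49 layers go to the GPU.
cat > Modelfile <<'EOF'
FROM gemma3:12b-it-qat
PARAMETER num_gpu 49
EOF
ollama create gemma3-qat-gpu -f Modelfile
ollama run gemma3-qat-gpu
```

If that pushes the GPU slightly over its limit, lowering num_ctx is the simplest lever before trying the other footprint-reduction tips in the second link.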


@semidark commented on GitHub (Apr 24, 2025):

Hello @rick-github: how can I try working around ollama's memory estimation?


@rick-github commented on GitHub (Apr 24, 2025):

https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650


@semidark commented on GitHub (Apr 24, 2025):

Thanks again for the link. I already followed it the first time you sent it, but I thought you meant something other than just setting num_gpu to the desired value. But if it is that simple, I will try it for sure.


@rick-github commented on GitHub (Apr 24, 2025):

Be aware that, depending on your OS/drivers, overriding can cause [performance issues](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900) and/or OOMs.


@Numieo commented on GitHub (May 21, 2025):

I have 8 GB of VRAM, and I also tried deploying Gemma 3 12B with Ollama. However, the Q4_K_M version used only a little VRAM while using a lot of RAM, and the QAT version used only RAM. Gemma 3 4B's Q4_K_M version used 4 GB of VRAM; the QAT version used 2 GB of VRAM while also using a lot of RAM. Ollama seems to have some issues here.

I switched to LM Studio, where the Gemma 3 12B Q4_K_M and QAT versions both use 7.7 GB of VRAM and 4 GB of RAM, and the Gemma 3 4B Q4_K_M and QAT versions load normally into VRAM, so I'd suggest switching to LM Studio.

Reference: github-starred/ollama#32552