[GH-ISSUE #11258] Gemma3n E4B Q4_K_M unusually high RAM usage #7419

Open
opened 2026-04-12 19:30:08 -05:00 by GiteaMirror · 6 comments

Originally created by @Ryderjj89 on GitHub (Jul 1, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11258

Originally assigned to: @mxyng on GitHub.

What is the issue?

I have a Quadro RTX 4000 with 8 GB of VRAM, and when I use gemma3n:e4b-it-q4_K_M, it uses up 92% of the available VRAM.

To me, this seems highly unusual, considering that other models with more parameters (qwen3:8b-q4_K_M) use only about 70% of the VRAM.

I'm on Ollama 0.9.4.

Is this really to be expected with Gemma3n?

Screenshot from nvidia-smi:

![Image](https://github.com/user-attachments/assets/10f4ad68-1c57-4fc0-900c-e95d17248331)

Relevant log output


OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.9.4

GiteaMirror added the bug label 2026-04-12 19:30:08 -05:00

@rick-github commented on GitHub (Jul 1, 2025):

Probably context or parallelism. [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.
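
Since this is a Docker install, a minimal sketch of pulling the relevant log lines (assuming the container is named `ollama`; adjust to your container name):

```shell
# Dump the server log and keep the memory-accounting and runner-startup lines
docker logs ollama 2>&1 | grep -E 'msg=offload|starting llama server'
```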


@Ryderjj89 commented on GitHub (Jul 1, 2025):

Here they are. [_ollama_logs.txt](https://github.com/user-attachments/files/21006389/_ollama_logs.txt)


@rick-github commented on GitHub (Jul 1, 2025):

```
time=2025-07-01T18:51:02.410Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=36 layers.offload=36 layers.split="" memory.available="[7.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.7 GiB" memory.required.partial="5.7 GiB" memory.required.kv="560.0 MiB" memory.required.allocations="[5.7 GiB]" memory.weights.total="2.6 GiB" memory.weights.repeating="2.2 GiB" memory.weights.nonrepeating="420.4 MiB" memory.graph.full="2.0 GiB" memory.graph.partial="3.7 GiB"

time=2025-07-01T18:51:02.486Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-38e8dcc30df4eb0e29eaf5c74ba6ce3f2cd66badad50768fc14362acfb8b8cb6 --ctx-size 8192 --batch-size 512 --n-gpu-layers 36 --threads 6 --flash-attn --parallel 2 --port 33239"
```
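
Reading the offload line, the required 5.7 GiB roughly decomposes into the reported parts (an approximate accounting from these values, not an exact formula; the remainder is runtime/CUDA buffer overhead): weights.total 2.6 GiB + kv 560 MiB + graph.full 2.0 GiB ≈ 5.2 GiB, allocated as 5.7 GiB. Note also that `--ctx-size 8192` is the total across slots, presumably the default 4096-token context multiplied by `--parallel 2`, so the 560 MiB KV cache covers both slots.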

`gemma3n:e4b-it-q4_K_M` is a 6.9B parameter model. Because you have sufficient VRAM and `OLLAMA_NUM_PARALLEL` is unset, Ollama has decided to use a default parallelism of 2. `qwen3:8b-q4_K_M` is a bit larger and uses more VRAM, likely leaving only enough VRAM for 1 buffer (i.e., `--parallel 1`). If you want to minimize VRAM usage, set `OLLAMA_NUM_PARALLEL=1` in the server environment, as in the sketch below.
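
For a Docker setup, a minimal sketch of recreating the container with that variable set (container and volume names follow the standard Ollama Docker instructions; keep whatever other flags you already use):

```shell
# Pin parallelism to 1 so only one set of context buffers is allocated
docker run -d --gpus=all \
  -e OLLAMA_NUM_PARALLEL=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```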


@Ryderjj89 commented on GitHub (Jul 1, 2025):

I added that environment variable and recreated the container. Still using up almost all the available VRAM.

![Image](https://github.com/user-attachments/assets/344946a2-5369-4e21-bd94-3f3af964f8d7)

[_ollama_logs (1).txt](https://github.com/user-attachments/files/21006606/_ollama_logs.1.txt)


@Ryderjj89 commented on GitHub (Jul 7, 2025):

Good evening. Just checking in on this. Is there any other troubleshooting I can do, or is this a bug that needs to be fixed?


@Ryderjj89 commented on GitHub (Jul 22, 2025):

Good morning. Bumping this again. Any updates?


Reference: github-starred/ollama#7419