[GH-ISSUE #10884] Help: Ollama Only Uses 1.8/4.0 GB VRAM, Swapping 2.5 GB to RAM #53665

Closed
opened 2026-04-29 04:26:10 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @TDuy22 on GitHub (May 28, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10884

Help: Ollama Only Uses 1.8/4.0 GB VRAM, Swapping 2.5 GB to RAM

Hi everyone,

I'm running a local LLM using Ollama on my machine, but it only uses 1.8 GB of the 4.0 GB of VRAM available on my GPU, with 2.5 GB being swapped to RAM. This slows down computation (my GPU only runs at about 30%), and I'd like to maximize VRAM usage to improve performance. I'd appreciate any insights or solutions from the community!
System and Model Details

GPU: NVIDIA RTX 3050 laptop (4 GB VRAM)

Model: gemma3:4b (3.3 GB)
Architecture: gemma3
Parameters: 4.3B
Context Length: 131072 (but set to num_ctx=1024)
Embedding Length: 2560
Quantization: Q4_K_M
Capabilities: Completion, Vision
Parameters:
num_ctx: 1024
stop: "<end_of_turn>"
temperature: 1
top_k: 64
top_p: 0.95
License: Gemma Terms of Use (Last modified: February 21, 2024)
Model Size: Approximately 3.3 GB
Ollama Version: 0.7.1

Relevant log output


OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.7.1

GiteaMirror added the bug label 2026-04-29 04:26:10 -05:00

@rick-github commented on GitHub (May 28, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will give more information, but the likely cause is that ollama is not correctly estimating the number of layers it can offload. You can override this by setting `num_gpu` as described [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650).
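
As a rough illustration of the `num_gpu` override mentioned above, here is a minimal sketch that passes it as a per-request option to Ollama's `/api/generate` endpoint. The endpoint URL assumes a default local install, the helper names `build_request` and `send` are made up for this example, and the layer count of 20 used in the usage note is an arbitrary placeholder, not a recommended value for this GPU. Requesting more layers than actually fit in VRAM may lead to out-of-memory errors, so the value probably needs some trial and error.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_gpu, num_ctx=1024):
    """Build a POST request that asks Ollama to offload `num_gpu` layers."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return a single JSON object instead of a stream
        "options": {
            "num_gpu": num_gpu,  # number of layers to place on the GPU
            "num_ctx": num_ctx,  # context size, matching the issue's setting
        },
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, something like `send(build_request("gemma3:4b", "Hello", num_gpu=20))` would issue the request. Comparing VRAM usage before and after, and checking the server logs, should show whether the override took effect.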
Reference: github-starred/ollama#53665