[GH-ISSUE #5670] The usage of VRAM has significantly increased #3536

Open
opened 2026-04-12 14:15:10 -05:00 by GiteaMirror · 5 comments

Originally created by @lingyezhixing on GitHub (Jul 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5670

What is the issue?

In previous versions, I set the context length of each of my models to the maximum value that could still be fully loaded into GPU memory. However, after the update, I found that they were being partially loaded onto the CPU instead. I wonder what could be causing this. The following table shows some examples.

| NAME | SIZE | PROCESSOR |
| :-: | :-: | :-: |
| glm4:9b-chat-2K-q5_K_M | 8.3 GB | 10%/90% CPU/GPU |
| glm4:9b-chat-10K-q4_K_M | 7.8 GB | 7%/93% CPU/GPU |
| codegeex4:9b-all-10K-q4_K_M | 7.8 GB | 7%/93% CPU/GPU |
| qwen2:7b-instruct-19K-q5_K_M | 8.3 GB | 13%/87% CPU/GPU |
| internlm2:7b-chat-v2.5-8K-q5_K_M | 7.7 GB | 4%/96% CPU/GPU |
| llama3:8b-instruct-5K-q6_K | 8.2 GB | 10%/90% CPU/GPU |
My graphics card is a laptop RTX 4060 with only 8 GB of VRAM. Interestingly, even before the update, none of these models was actually using the full capacity of my GPU memory.
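
For reference, fixed-context tags like the ones above are typically built from a Modelfile that pins `num_ctx`; a minimal sketch, where the base tag and the 2048-token value are illustrative rather than taken from this report:

```
# Modelfile: pin the context length so the resulting model fits within 8 GB of VRAM
FROM glm4:9b-chat-q5_K_M
PARAMETER num_ctx 2048
```

created with `ollama create glm4:9b-chat-2K-q5_K_M -f Modelfile`.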

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.2.3

GiteaMirror added the memory and bug labels 2026-04-12 14:15:10 -05:00

@xuexiaojingquan commented on GitHub (Jul 15, 2024):

Same situation here! This issue makes me have to stay on 0.2.1.


@lingyezhixing commented on GitHub (Jul 16, 2024):

> Same situation here! This issue makes me have to stay on 0.2.1.

I am also stuck on version 0.2.1.


@sbera77 commented on GitHub (Jul 18, 2024):

Same here. WSL2 + NVIDIA GPU


@chrisoutwright commented on GitHub (Jul 25, 2024):

Is there any update?
With 0.3.0 I am still on:

```
offloading 79 repeating layers to GPU
llm_load_tensors: offloaded 79/81 layers to GPU
```

for qwen2:

```
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q3_K - Large
llm_load_print_meta: model params     = 72.71 B
llm_load_print_meta: model size       = 36.79 GiB (4.35 BPW)
llm_load_print_meta: general.name     = Qwen2-72B-Instruct
```

With 0.2.1 I could load all of it into VRAM.


@chrisoutwright commented on GitHub (Jul 25, 2024):

I added `"num_gpu": 81` to the params file of the model and now it loads all of it!
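
(A sketch of the equivalent Modelfile approach, assuming the 81-layer Qwen2 72B model from the log above; the base tag and name are illustrative:)

```
# Modelfile: force all 81 layers onto the GPU instead of relying on Ollama's estimate
FROM qwen2:72b-instruct-q3_K_L
PARAMETER num_gpu 81
```

then `ollama create qwen2-fullgpu -f Modelfile`. The same option can also be passed per request through the API, e.g. `"options": {"num_gpu": 81}`.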
