[GH-ISSUE #4995] Ollama GPU not loading properly #3159

Closed
opened 2026-04-12 13:38:38 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @tankvpython on GitHub (Jun 12, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4995

Originally assigned to: @jmorganca on GitHub.

What is the issue?

I am facing an issue with the Ollama service. I have an **RTX 4090 GPU** with 80GB of RAM and 24GB of VRAM. When I run the Llama 3 70B model and ask it a question, it initially loads on the GPU, but after 5-10 seconds it shifts entirely to the CPU. This makes responses slow. Please provide me with a solution. Thank you in advance.

Note: GPU load is 6-12% and CPU load is 70%.

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

v0.1.43

GiteaMirror added the question and performance labels 2026-04-12 13:38:39 -05:00
Author
Owner

@phong-phuong commented on GitHub (Jun 12, 2024):

Your GPU has too little VRAM to run a 70 billion parameter model entirely on the GPU. Try a smaller model like a 22 billion parameter model.

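As a rough illustration of the suggestion above (the model tag below is only an example from the Ollama library, not something named in this thread):

```shell
# Illustrative only: run a model small enough to fit entirely in 24 GB of VRAM.
ollama run llama3:8b

# Then check how it was placed; PROCESSOR should show something like "100% GPU".
ollama ps
```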
Author
Owner

@tankvpython commented on GitHub (Jun 12, 2024):

> Your GPU has too little VRAM to run a 70 billion parameter model entirely on the GPU. Try a smaller model like a 22 billion parameter model.

Do you have any idea what configuration is needed?

Author
Owner

@phong-phuong commented on GitHub (Jun 12, 2024):

See https://huggingface.co/TheBloke/Llama-2-70B-Chat-GPTQ/discussions/2
A user used a 48GB card to run a 70b model.

Author
Owner

@jmorganca commented on GitHub (Jun 12, 2024):

Hi @tankvpython, sorry about this – I have a similar system and it should definitely be partially loading the 70B model to GPU – would it be possible to check with `ollama ps` after loading a model?

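For reference, `ollama ps` prints a small table whose PROCESSOR column shows how a loaded model is split between CPU and GPU; the row below is illustrative only, not output from this system:

```shell
ollama ps
# NAME          ID            SIZE     PROCESSOR          UNTIL
# llama3:70b    <model id>    ~40 GB   44%/56% CPU/GPU    4 minutes from now
```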
Author
Owner

@tankvpython commented on GitHub (Jun 13, 2024):

> Hi @tankvpython, sorry about this – I have a similar system and it should definitely be partially loading the 70B model to GPU – would it be possible to check with `ollama ps` after loading a model?

Yes, I am able to do this. I will check it.

Author
Owner

@phong-phuong commented on GitHub (Jun 13, 2024):

If you are using the standard llama3:70b version, that one is a 4-bit quant and uses ~33GB of VRAM, which is a lot more than your GPU can handle. When that happens, the rest of the model gets offloaded to the CPU, which is much slower.
If you want faster inference times, you can try a lower quant like the 2-bit one, which uses ~17GB and is much more suitable for a 24GB card: `ollama run llama3:70b-instruct-q2_K`.
Just bear in mind that lowering the quant size may degrade the quality of the answers (think of it like compressing an image or video).

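A minimal sketch of trying that lower quant and then confirming where it landed (sizes and behaviour depend on context length and the exact build, so treat this as a rough check rather than a guarantee):

```shell
# Pull and run the 2-bit quant mentioned above.
ollama pull llama3:70b-instruct-q2_K
ollama run llama3:70b-instruct-q2_K

# In another terminal, check the split: "100% GPU" means it fits entirely in VRAM,
# while an "NN%/NN% CPU/GPU" split means part of it is still being served by the CPU.
ollama ps
```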
Author
Owner

@tankvpython commented on GitHub (Jun 13, 2024):

> If you are using the standard llama3:70b version, that one is a 4-bit quant and uses ~33GB of VRAM, which is a lot more than your GPU can handle. When that happens, the rest of the model gets offloaded to the CPU, which is much slower. If you want faster inference times, you can try a lower quant like the 2-bit one, which uses ~17GB and is much more suitable for a 24GB card: `ollama run llama3:70b-instruct-q2_K`. Just bear in mind that lowering the quant size may degrade the quality of the answers (think of it like compressing an image or video).

Thanks for the help. I don't want a lightweight model, because I need more accuracy.

Author
Owner

@dhiltgen commented on GitHub (Sep 5, 2024):

I think we can close this one, as it seems the system is behaving as intended. `ollama ps` will show the model is loaded split between CPU and GPU, which results in slower performance, bottlenecked by the CPU.

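For anyone who hits this later, a quick way to see the same thing from the GPU side is to watch VRAM usage while the model answers (a sketch; `nvidia-smi` ships with the NVIDIA driver, including on Windows, and the flags below are standard ones):

```shell
# Poll GPU memory usage every 2 seconds while the model loads and generates.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2

# Compare against the SIZE column of `ollama ps`: if the model's size exceeds
# total VRAM, the remainder necessarily stays in system RAM and runs on the CPU.
ollama ps
```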
Reference: github-starred/ollama#3159