Ollama cannot use as much gpu memory as possible at 2080ti 22g #7247

Closed
opened 2025-11-12 14:00:58 -06:00 by GiteaMirror · 2 comments

Originally created by @lvyonghuan on GitHub (Jun 5, 2025).

I just recently bought a 2080 Ti 22G to replace my M40 24G. However, when I use Ollama to run a 32B model for inference, I found that GPU memory usage tops out at about 19-20 GB, and the remaining 2-3 GB spills over into shared memory. Since my computer's CPU is still an E5 2698B v3, this makes 32B inference feel surprisingly slow.
I also remember that when I bought this card, the listing claimed a single card could run a 32B model.
Due to the environment limitations of the workstation, my computer can only run Windows. But if WSL could solve this problem, maybe I could try that too.
![Image](https://github.com/user-attachments/assets/f6766844-226f-4ad5-b19b-65a14ce7e0e1)


@rick-github commented on GitHub (Jun 5, 2025):

Ollama sometimes underestimates the amount of VRAM it can use. You can override this by setting `num_gpu` as described [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650). For example, setting `num_gpu:999` will force all of the layers onto the GPU. If you are using a model that is actually bigger than the available VRAM, this will cause some of the layers to be loaded into shared RAM. This avoids an OOM condition but can result in a [performance penalty](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900), as the GPU accesses the shared RAM via the PCI bus.
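
For illustration, a minimal sketch of applying that override per-request through Ollama's `/api/generate` REST endpoint (the 32B model tag below is a placeholder; substitute whichever model you have pulled):

```python
import requests

# Minimal sketch: pass num_gpu through the "options" field of Ollama's
# /api/generate endpoint so every layer is offloaded to the GPU.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:32b",        # placeholder; use your pulled 32B model tag
        "prompt": "Why is the sky blue?",
        "stream": False,               # return a single JSON object
        "options": {"num_gpu": 999},   # force all layers onto the GPU
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```

The same parameter can also be set interactively with `/set parameter num_gpu 999` inside `ollama run`, or made persistent with a `PARAMETER num_gpu 999` line in a Modelfile.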


@lvyonghuan commented on GitHub (Jun 5, 2025):

> Ollama sometimes underestimates the amount of VRAM it can use. You can override this by setting `num_gpu` as described [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650). For example, setting `num_gpu:999` will force all of the layers onto the GPU. If you are using a model that is actually bigger than the available VRAM, this will cause some of the layers to be loaded into shared RAM. This avoids an OOM condition but can result in a [performance penalty](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900), as the GPU accesses the shared RAM via the PCI bus.

It works, thank you.

Reference: github-starred/ollama-ollama#7247