[GH-ISSUE #1742] Low VRAM mode? #47508

Closed
opened 2026-04-28 04:03:08 -05:00 by GiteaMirror · 4 comments

Originally created by @JumboTortoise on GitHub (Dec 30, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1742

I have a 12GB RTX 3060 that can easily run 7B models, but fails on the larger ones. Does ollama have a low-VRAM mode? Is there any way to move model layers from VRAM to system RAM? I would really like to try out larger LLMs without having to rent a cloud compute server or buy a new GPU, even if those optimizations make inference much slower.

I am not very knowledgeable on the subject, but maybe DeepSpeed could be used to boost inference performance?
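For anyone who finds this later: ollama exposes a `num_gpu` option (settable in a Modelfile or per-request via the API) that controls how many model layers are offloaded to the GPU, with the remaining layers kept in system RAM and run on the CPU. A minimal sketch against the local HTTP API; the model name and layer count are placeholders to tune for your own hardware:

```python
import requests

# Ask ollama to offload only some layers to the 12 GB GPU; the remaining
# layers stay in system RAM and run on the CPU. "num_gpu" is ollama's
# option for the number of GPU-offloaded layers; the model name and the
# layer count below are placeholders, not recommendations.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2:13b",           # any model that overflows VRAM
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {
            "num_gpu": 24,               # layers placed on the GPU; tune per model
        },
    },
    timeout=600,
)
response.raise_for_status()
print(response.json()["response"])
```

Lowering `num_gpu` until the model loads is the usual way to squeeze an oversized model onto a 12 GB card; expect a roughly proportional slowdown for the layers that end up on the CPU.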

@DrGood01 commented on GitHub (Dec 30, 2023):

#1727

@JumboTortoise commented on GitHub (Dec 30, 2023):

Thanks, I saw that issue in the tracker but didn't realize it was what I was looking for.

@easp commented on GitHub (Jan 2, 2024):

Ollama automatically spills models into system RAM when they don't fit in VRAM, except when that spillover doesn't work properly. I don't know why it sometimes fails; I suspect it may be an issue with models that have larger context sizes, but I don't have a PC with an NVIDIA GPU, so I can't test it myself.
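If you want to check whether a loaded model actually spilled into system RAM, newer ollama builds report the split via `ollama ps` and the `/api/ps` endpoint. A small sketch, assuming the `size`/`size_vram` fields described in ollama's API docs (older builds may lack this endpoint entirely):

```python
import requests

# Query ollama's /api/ps endpoint for currently loaded models. Per the API
# docs, each entry reports "size" (total bytes) and "size_vram" (bytes held
# in GPU memory); anything left over was spilled to system RAM.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    total = m["size"]
    in_vram = m.get("size_vram", 0)
    in_ram = total - in_vram
    print(
        f"{m['name']}: {total / 2**30:.1f} GiB total, "
        f"{in_vram / 2**30:.1f} GiB in VRAM, "
        f"{in_ram / 2**30:.1f} GiB spilled to system RAM"
    )
```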

@kilitary commented on GitHub (Jul 3, 2024):

Also, if you have an integrated graphics card, its memory is shared with the CPU and is drawn from total system memory.
