[GH-ISSUE #11748] Ollama does not natively support offloading model weights to shared memory (system RAM or CPU memory) when dedicated GPU memory is full #69843

Closed
opened 2026-05-04 19:32:48 -05:00 by GiteaMirror · 6 comments

Originally created by @Hacoor on GitHub (Aug 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11748

@rick-github commented on GitHub (Aug 6, 2025):

What operating system? It should offload automatically on Windows (depending on the driver); Linux users need to set an environment variable.

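The variable isn't named in the thread; llama.cpp's CUDA backend, which Ollama builds on, honors `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` to let allocations spill into system RAM via unified memory. A minimal sketch of launching the server with it set, assuming that is the variable meant here and that `ollama` is on PATH on a Linux host:

```python
import os
import subprocess

# Assumption: GGML_CUDA_ENABLE_UNIFIED_MEMORY is the variable in question.
# It is read by llama.cpp's CUDA backend and lets allocations fall back to
# unified (system) memory instead of failing with a cudaMalloc OOM.
env = os.environ.copy()
env["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"

# Start the server with the variable set; on a systemd install you would
# put this in the ollama.service unit instead of launching it by hand.
subprocess.run(["ollama", "serve"], env=env, check=True)
```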

@jhall39 commented on GitHub (Aug 6, 2025):

I get an error with larger models now. With gpt-oss:20b, and even models that worked before like qwen3:30b-a3b, on Windows 11 I get errors like `500 Internal Server Error: llama runner process has terminated: cudaMalloc failed: out of memory`.
I'm on 0.11.3.

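The same failure can be reproduced outside the GUI against the REST API. A minimal sketch, assuming a local server on the default port 11434; the model name and prompt are placeholders:

```python
import requests

# Ask the local Ollama server to load and run a model that may not fit in
# VRAM; a runner crash comes back as an HTTP 500 with an "error" field.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gpt-oss:20b", "prompt": "hello", "stream": False},
    timeout=600,
)
if resp.status_code == 500:
    # e.g. "llama runner process has terminated: cudaMalloc failed: out of memory"
    print("server error:", resp.json().get("error"))
else:
    print(resp.json()["response"])
```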

@Hacoor commented on GitHub (Aug 6, 2025):

> I get an error with larger models now.. gpt-oss:20b and even models that worked before like qwen3:30b-a3b on Windows 11 I get errors like 500 Internal Server Error: llama runner process has terminated: cudaMalloc failed: out of memory I'm on 0.11.3

Check your GPU VRAM: to run it, you need at least 16 GB of VRAM, or more than 20 GB of system RAM.

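A minimal sketch of that check on Linux, assuming `nvidia-smi` is available; Windows users would read VRAM from Task Manager instead:

```python
import subprocess

# Total VRAM of the first GPU, in MiB, via nvidia-smi's CSV query output.
vram_mib = int(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
    text=True).splitlines()[0])

# Total system RAM, in KiB, from /proc/meminfo.
with open("/proc/meminfo") as f:
    ram_kib = int(next(line for line in f if line.startswith("MemTotal")).split()[1])

print(f"VRAM: {vram_mib / 1024:.1f} GiB, RAM: {ram_kib / (1024 * 1024):.1f} GiB")
if vram_mib < 16 * 1024 and ram_kib < 20 * 1024 * 1024:
    print("Below both thresholds mentioned above (16 GB VRAM / 20 GB RAM).")
```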

@jhall39 commented on GitHub (Aug 6, 2025):

I have a 5070 Ti with 16 GB of VRAM, and 64 GB of system RAM as well. qwen3:30b-a3b ran perfectly fine on an older version of Ollama, before the GUI was added.

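One way to keep such a model running while the scheduler misjudges memory is to cap GPU offload yourself: Ollama's `num_gpu` option sets how many layers go to the GPU, and the rest stay in system RAM. A hedged sketch via the API; the value 30 is an arbitrary starting point to tune downward until the OOM stops, not a recommendation from this thread:

```python
import requests

# Explicitly limit how many layers are offloaded to the GPU; the remaining
# layers are kept in system RAM and run on the CPU.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:30b-a3b",
        "prompt": "hello",
        "stream": False,
        "options": {"num_gpu": 30},  # arbitrary placeholder; tune for your VRAM
    },
    timeout=600,
)
body = resp.json()
print(body.get("response", body.get("error")))
```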

@azomDev commented on GitHub (Aug 10, 2025):

@jhall39 This appears to be related to #11676


@rick-github commented on GitHub (Sep 1, 2025):

The memory-management changes in recent Ollama releases may have resolved this. Upgrade and add a comment if the problem persists.

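When reporting back, it helps to state the exact server version; it can be read from the `/api/version` endpoint. A minimal sketch, assuming the default local port:

```python
import requests

# The running server reports its version as JSON, e.g. {"version": "0.11.3"}.
info = requests.get("http://localhost:11434/api/version", timeout=5).json()
print("ollama server version:", info["version"])
```
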
Reference: github-starred/ollama#69843