[GH-ISSUE #5504] 0xc0000409 CUDA error | was working fine before - OOM crash #49953

Closed
opened 2026-04-28 13:32:41 -05:00 by GiteaMirror · 1 comment

Originally created by @gaduffl on GitHub (Jul 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5504

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Ollama was working fine with all the small models I have tested so far (4 GB VRAM).
After upgrading to 0.1.48, I get a CUDA error with all models, e.g. Llama3 8B:

Error: llama runner process has terminated: exit status 0xc0000409 CUDA error

This model was running perfectly fine before.
server.log: https://github.com/user-attachments/files/16113333/server.log

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.1.48

GiteaMirror added the memory, bug, windows labels 2026-04-28 13:32:44 -05:00

@dhiltgen commented on GitHub (Jul 9, 2024):

Digging into this a bit, it looks like there's a skew between what the management library and the driver libraries report on Windows when the GPU is the primary display device. I believe the management library is unaware of some amount of allocation on the card (probably OS-level), and as a result, our recent change to favor the management library for concurrency scheduling leads us to over-allocate. I'll get a PR up soon to record the skew between these two libraries and remember that overhead, so we can set that VRAM aside; that should resolve this.
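
For readers unfamiliar with the two reporting paths, here is a minimal sketch (not Ollama's actual code) of the kind of cross-check the fix implies: query free VRAM from both NVML (the management library) and the CUDA runtime (the driver side), and treat any positive difference as overhead to set aside before scheduling models. It assumes a machine with the CUDA toolkit and NVML installed; note that initializing the CUDA context itself consumes some VRAM, so the measured skew is approximate.

```c
/* skew_check.c -- illustrative sketch, not Ollama's implementation.
 * Compares free VRAM reported by NVML against the CUDA runtime.
 * Build (assumption: CUDA toolkit + NVML available):
 *   nvcc skew_check.c -lnvidia-ml -o skew_check   (link nvml.lib on Windows)
 */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvml.h>

int main(void) {
    /* Free VRAM as seen by the management library (NVML). */
    if (nvmlInit_v2() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit_v2 failed\n");
        return 1;
    }
    nvmlDevice_t dev;
    nvmlMemory_t mem;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS ||
        nvmlDeviceGetMemoryInfo(dev, &mem) != NVML_SUCCESS) {
        fprintf(stderr, "NVML memory query failed\n");
        nvmlShutdown();
        return 1;
    }

    /* Free VRAM as seen by the CUDA runtime on the same device.
     * This call creates a CUDA context, which itself uses some VRAM. */
    size_t cudaFree = 0, cudaTotal = 0;
    if (cudaSetDevice(0) != cudaSuccess ||
        cudaMemGetInfo(&cudaFree, &cudaTotal) != cudaSuccess) {
        fprintf(stderr, "CUDA memory query failed\n");
        nvmlShutdown();
        return 1;
    }

    /* On Windows, when the GPU is also driving the display, NVML can
     * report more free memory than CUDA can actually allocate. Record
     * that skew and reserve it as overhead instead of scheduling
     * against it. */
    long long skew = (long long)mem.free - (long long)cudaFree;
    unsigned long long overhead = skew > 0 ? (unsigned long long)skew : 0;

    printf("NVML free: %llu MiB\n", mem.free / (1024ULL * 1024ULL));
    printf("CUDA free: %zu MiB\n", cudaFree / (1024 * 1024));
    printf("VRAM to set aside as overhead: %llu MiB\n",
           overhead / (1024ULL * 1024ULL));

    nvmlShutdown();
    return 0;
}
```

If NVML reports more free memory than the CUDA side can actually deliver, scheduling against the NVML number loads layers that cannot fit, and the runner dies with the allocation failure reported above.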
