[GH-ISSUE #5204] Can't even attempt to load Deepseek-Coder-v2:236B due to arbitrary timeout #65308

Closed
opened 2026-05-03 20:26:42 -05:00 by GiteaMirror · 8 comments

Originally created by @Nantris on GitHub (Jun 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5204

What is the issue?

This issue thread describes the overarching issue, and this specific comment suggests a potential workaround: https://github.com/ollama/ollama/issues/630#issuecomment-2182371780

My understanding is that it should be feasible to load the 236B model into less RAM than the model's full size, since not all parameters need to be loaded simultaneously - but I can't find out whether that's true because ollama decides it's giving up after an arbitrary amount of time.

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.1.44

GiteaMirror added the bug label 2026-05-03 20:26:42 -05:00

@dhiltgen commented on GitHub (Jun 21, 2024):

On the topic of timeouts when loading large models: new in 0.1.45 is adjusted logic that detects if the model is larger than available system memory and disables mmap-based loading. (You can still set `"use_mmap": false` to force this at the API level.) As long as the model load progresses and doesn't stall for more than 5 minutes, we won't time out and cancel the load.

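A minimal sketch of forcing this at the API level, assuming a local Ollama server on its default port and the standard `/api/generate` endpoint; the model name and prompt are placeholders:

```python
import json
import urllib.request

# Ask the local Ollama server to run the model with mmap disabled, so the
# weights are read into memory up front instead of being memory-mapped.
payload = {
    "model": "deepseek-coder-v2:236b",   # placeholder model name
    "prompt": "Say hello.",
    "stream": False,
    "options": {"use_mmap": False},      # force non-mmap loading
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```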

@Nantris commented on GitHub (Jun 21, 2024):

Thanks for the tip! Could I suggest a configurable timeout? I'd love to be able to try loading the model from an HDD rather than an SSD for trial purposes, even if it's not ideal.

As it stands now, I'd have to clear space on my SSD just to know if the model could even theoretically load.

I'd rather know I could load the model before I have to clear the space for it.

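A configurable timeout could look something like the sketch below. The `OLLAMA_LOAD_TIMEOUT` environment variable and the helper are hypothetical illustrations of this request, not an existing Ollama setting in the versions discussed here (Ollama itself is written in Go; Python is used only to sketch the idea):

```python
import os

DEFAULT_LOAD_STALL_TIMEOUT_SECONDS = 5 * 60  # the 5-minute default described above


def load_stall_timeout() -> float:
    """Hypothetical: read OLLAMA_LOAD_TIMEOUT (in seconds) if set,
    otherwise fall back to the default 5 minutes."""
    raw = os.environ.get("OLLAMA_LOAD_TIMEOUT", "")
    try:
        return float(raw) if raw else DEFAULT_LOAD_STALL_TIMEOUT_SECONDS
    except ValueError:
        return DEFAULT_LOAD_STALL_TIMEOUT_SECONDS
```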

@Nantris commented on GitHub (Jun 21, 2024):

Oh, I may have misinterpreted. It's not a 5-minute timeout for the load to complete, but for some progress to occur?


@dhiltgen commented on GitHub (Jun 21, 2024):

Correct: as long as the progress percentage is changing, we reset the timer. We only time out if the progress hasn't changed in 5 minutes.

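To illustrate the behavior described above, here is a rough sketch of progress-based stall detection. The `current_progress` callback and the polling loop are illustrative assumptions, not Ollama's actual implementation (which is in Go):

```python
import time

STALL_TIMEOUT_SECONDS = 5 * 60  # only give up after 5 minutes with no change


def wait_for_load(current_progress, poll_interval=1.0):
    """Fail only when loading stalls, not after a fixed total time.

    `current_progress` is assumed to return the load progress as a
    percentage from 0 to 100.
    """
    last_progress = current_progress()
    last_change = time.monotonic()

    while last_progress < 100:
        time.sleep(poll_interval)
        progress = current_progress()
        if progress != last_progress:
            # Any change in progress resets the stall timer.
            last_progress = progress
            last_change = time.monotonic()
        elif time.monotonic() - last_change > STALL_TIMEOUT_SECONDS:
            raise TimeoutError("model load stalled for more than 5 minutes")
```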

@Nantris commented on GitHub (Jun 21, 2024):

@dhiltgen With 0.1.45 it now just errors instantly when loading the 236B model: `Error: llama runner process has terminated: exit status 0xc0000409`

I have 64 GB of system RAM, another 64 GB of swap, and 16 GB of VRAM. I understand this may still not be enough, but the behavior is still unexpected.


@dhiltgen commented on GitHub (Jun 21, 2024):

It sounds like you're hitting #4955

The current validation logic only kicks in on the 2nd+ model load, and we blindly attempt to load a first model regardless of size. Once that issue is resolved, the system will detect models that can't possibly load in a combination of VRAM and system memory and prevent the attempt with a clear error message.

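Conceptually, the pre-load check described here might look like the sketch below. The sizes and the simple "must fit in VRAM plus system RAM" rule are assumptions drawn from this comment, not Ollama's actual code:

```python
def can_fit(model_size_bytes, vram_bytes, ram_bytes):
    """Reject a load up front if the model cannot fit in VRAM + system RAM.

    Simplified: ignores KV cache, context size, and per-layer overheads.
    """
    return model_size_bytes <= vram_bytes + ram_bytes


# Example: a roughly 133 GB 4-bit quantization of the 236B model against
# the reporter's 16 GB of VRAM and 64 GB of system RAM.
GiB = 1024**3
if not can_fit(133 * GiB, 16 * GiB, 64 * GiB):
    raise RuntimeError("model needs more than VRAM + system RAM; refusing to load")
```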

@Nantris commented on GitHub (Jun 21, 2024):

Thanks for the reply! That sounds like the issue.

Related question: Am I misunderstanding how "active parameters" works in this context? I thought that only a portion of the model needs to be loaded for the gating network to decide which parameters do need to be loaded for use? Or is the entire model actually required to be loaded at initialization?

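For context on this question: in a mixture-of-experts model, only a subset of experts is computed per token, but the router can select different experts for every token and layer, so in general all expert weights still need to be resident in (or at least mappable into) memory. A rough back-of-envelope, assuming DeepSeek-V2's published 236B total / 21B active parameter counts and roughly 0.6 bytes per parameter for a ~4-bit quantization:

```python
total_params = 236e9    # parameters that must be stored somewhere
active_params = 21e9    # parameters actually used per token
bytes_per_param = 0.6   # rough figure for a ~4-bit quantization

GiB = 1024**3
print(f"weights to keep available: ~{total_params * bytes_per_param / GiB:.0f} GB")   # ~132 GB
print(f"weights touched per token: ~{active_params * bytes_per_param / GiB:.0f} GB")  # ~12 GB
```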

@hemangjoshi37a commented on GitHub (Jul 16, 2024):

I tried to run it on dual H100 80 GB GPUs with 256 GB of RAM. I got a GPU out-of-memory error. Has anyone gotten it running?


Reference: github-starred/ollama#65308