[GH-ISSUE #6283] attempt to load llama 3.1 on system with insufficient system memory and crash with host alloc failure #3935

Open
opened 2026-04-12 14:48:43 -05:00 by GiteaMirror · 5 comments

Originally created by @razvanab on GitHub (Aug 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6283

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I get this error when I am trying to load this model. Other llama 3.1 models in the Ollama library work great.

```
(base) PS C:\Users\razva> ollama run CognitiveComputations/dolphin-llama3.1
Error: llama runner process has terminated: error:failed to create context with model 'C:\Users\razva\.ollama\blobs\sha256-c4e04968e3ca697b947c4820d7d4e58873e9f93908a043e7280863b31019b7df'
```

[verbose.txt](https://github.com/user-attachments/files/16560593/verbose.txt)

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.3.4

GiteaMirror added the bug label 2026-04-12 14:48:43 -05:00

@rick-github commented on GitHub (Aug 9, 2024):

This model has a default context size of 128000 and you have `OLLAMA_NUM_PARALLEL=2`, so llama.cpp is trying to allocate ~31 GiB just for the KV cache, and your machine simply doesn't have the resources:

```
time=2024-08-09T14:28:55.092+03:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.9 GiB" before.free="17.1 GiB" before.free_swap="19.4 GiB" now.total="31.9 GiB" now.free="17.1 GiB" now.free_swap="19.5 GiB"
time=2024-08-09T14:28:55.142+03:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[4.6 GiB]" memory.required.full="35.3 GiB" memory.required.partial="0 B" memory.required.kv="31.2 GiB" memory.required.allocations="[0 B]" memory.weights.total="34.9 GiB" memory.weights.repeating="34.5 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="16.1 GiB" memory.graph.partial="17.1 GiB"
```
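
As a rough sanity check on that `memory.required.kv="31.2 GiB"` figure: assuming the llama 3.1 8B attention geometry (32 layers, 8 KV heads of dimension 128, consistent with the log's `layers.model=33`) and an fp16 cache, a back-of-envelope calculation reproduces it almost exactly. This is only a sketch of the sizing, not Ollama's actual accounting:

```powershell
# Back-of-envelope KV cache sizing (assumed llama 3.1 8B geometry, fp16 cache):
# 2 tensors (K and V) x layers x context x KV heads x head dim x bytes/element.
$perSlotBytes = 2 * 32 * 128000 * 8 * 128 * 2
$perSlotGiB   = $perSlotBytes / 1GB      # PowerShell's 1GB literal is 2^30
$totalGiB     = $perSlotGiB * 2          # OLLAMA_NUM_PARALLEL=2 -> two slots
"$perSlotGiB GiB per slot, $totalGiB GiB total"   # 15.625 GiB, 31.25 GiB
```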

If you need the full 128k context you can try setting `OLLAMA_NUM_PARALLEL=1` or increasing the swap on your machine. If that doesn't work, you can load the model with a smaller context window: https://github.com/ollama/ollama/issues/5965#issuecomment-2252354726
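
For reference, here is one concrete way to apply those suggestions (the 8192 value is only an illustration; pick whatever fits your RAM, and note that `OLLAMA_NUM_PARALLEL` must be set in the environment of the Ollama server process, which then needs a restart):

```
# Run one slot instead of two: set OLLAMA_NUM_PARALLEL=1 for the Ollama
# server (on Windows, e.g. a user environment variable), then restart it.

# Or shrink the context window from the interactive prompt before generating:
PS C:\Users\razva> ollama run CognitiveComputations/dolphin-llama3.1
>>> /set parameter num_ctx 8192
```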


@razvanab commented on GitHub (Aug 9, 2024):

Oh, that makes sense. I didn't think about that. Thank you.


@razvanab commented on GitHub (Aug 9, 2024):

Setting `OLLAMA_NUM_PARALLEL=1` didn't work. I had to lower the context size for the model to load.
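
For anyone landing here, a persistent way to do that (using Ollama's standard Modelfile syntax; `dolphin-small-ctx` and the 8192 value are arbitrary examples) is to bake a smaller `num_ctx` into a derived model:

```
# Modelfile
FROM CognitiveComputations/dolphin-llama3.1
PARAMETER num_ctx 8192
```

Then build and run it with `ollama create dolphin-small-ctx -f Modelfile` followed by `ollama run dolphin-small-ctx`.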


@dhiltgen commented on GitHub (Aug 9, 2024):

We've been improving our detection of models that can't possibly load, so that we fail with a better error message instead of crashing, but this scenario was close enough to the limit that we attempted the load and ultimately failed. Some users are OK with swapping and running very slowly, so we're trying not to be too aggressive, but this is an example where we should do better at detecting the problem up front instead of crashing. The KV cache needed ~32 GiB, which was effectively more than the physical memory of the system, so the swap file couldn't help.
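
To illustrate the kind of up-front check described here (purely a hypothetical user-side sketch, not Ollama's internal logic; the 31.25 GiB figure comes from the sizing arithmetic in the first comment):

```powershell
# Hypothetical preflight: warn if the estimated KV cache alone exceeds
# physical RAM, since swap can't rescue a working set that large.
$os       = Get-CimInstance Win32_OperatingSystem
$totalGiB = $os.TotalVisibleMemorySize / 1MB   # value is in KiB; /1MB yields GiB
$kvGiB    = 31.25                              # estimated KV cache, from above
if ($kvGiB -gt $totalGiB) {
    Write-Warning ("Model needs ~{0} GiB for KV cache but only {1:N1} GiB RAM is present." -f $kvGiB, $totalGiB)
}
```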


@razvanab commented on GitHub (Aug 10, 2024):

I don't know how hard this may be to implement, but it would be great if, when you run `ollama run <model>`, you first got a warning that the model may not work on your system.
