[GH-ISSUE #7867] Deepseek (various) 236b crashes on run #67089

Closed
opened 2026-05-04 09:26:26 -05:00 by GiteaMirror · 20 comments
Originally created by @Maltz42 on GitHub (Nov 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7867

What is the issue?

Deepseek V2, V2.5, and V2-coder all crash with an OOM error when loading the 236b size. Other versions of Deepseek may as well; those are all I've tested. Hardware is dual A6000s with 48GB each.

Error: llama runner process has terminated: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 882903040
llama_new_context_with_model: failed to allocate compute buffers

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

v0.4.5

GiteaMirror added the bug and needs more info labels 2026-05-04 09:26:27 -05:00

@igorschlum commented on GitHub (Nov 28, 2024):

@Maltz42, what behavior were you expecting? To run a 236B model, you would need at least 236GB of VRAM on your system. If you're encountering an out-of-memory error, that's expected behavior due to insufficient VRAM.

I recommend using the 16B model, which should fit within your available VRAM and avoid memory allocation errors.

You can consider closing the issue, as the "out of memory" error is a result of the hardware limitations and not a bug.


@Maltz42 commented on GitHub (Nov 29, 2024):

@igorschlum - Ollama (normally) falls back to using system RAM when it runs out of VRAM. You don't need a GPU at all to run models; you can run them 100% on CPU, 100% on GPU, or a combination of both. I've installed llama3.1:405b-instruct-q6_K, which is a 336GB quantization, and it runs just fine. Only about 0.75 t/s, but otherwise fine. System resources are not the issue.


@igorschlum commented on GitHub (Nov 29, 2024):

@Maltz42 You're right about the fallback. How much RAM do you have on your computer? I have a Mac Station with 192GB and could not run anything larger than Llama3.1:405b-instruct-q2_K.


@Maltz42 commented on GitHub (Nov 29, 2024):

A total of 96GB VRAM plus 256GB of system RAM. q6_K fills it all to the rim, but it runs!


@rick-github commented on GitHub (Nov 30, 2024):

The deepseek architecture is different to most other models, and ollama sometimes misjudges how much of the model can be loaded into VRAM. You can compensate by decreasing the number of layers offloaded (https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650). You can also try enabling fallback memory (https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900) with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1; just be aware of the performance impact mentioned in that comment. Lastly, you can set OLLAMA_GPU_OVERHEAD (https://github.com/ollama/ollama/blob/39e29ae5ddb9ff710c0e28652b61850f458e1205/envconfig/config.go#L237) to reserve some space on the GPUs for llama.cpp to allocate as required, although sometimes it doesn't seem to help.
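
For anyone wanting to try the two environment variables together, here is a minimal Python sketch of building the environment for the server process. It assumes ollama is started by hand rather than by a service manager, and that OLLAMA_GPU_OVERHEAD is given in bytes (consistent with the 536870912 = 512M example elsewhere in this thread). The helper name `serve_env` and its parameters are made up for illustration.

```python
import os

MIB = 1024 * 1024


def serve_env(overhead_mib=0, unified_memory=False, base=None):
    """Return a copy of the environment with the mitigation variables set.

    Hypothetical helper: only the two variable names come from the thread.
    """
    env = dict(os.environ if base is None else base)
    if overhead_mib:
        # Assumption: the value is a byte count, as in the 512M example.
        env["OLLAMA_GPU_OVERHEAD"] = str(overhead_mib * MIB)
    if unified_memory:
        env["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"
    return env


env = serve_env(overhead_mib=512, unified_memory=True)
print(env["OLLAMA_GPU_OVERHEAD"])  # 536870912

# To launch the server with these settings (not executed here):
# import subprocess
# subprocess.run(["ollama", "serve"], env=env)
```

If ollama runs under systemd or another service manager, the same variables would need to go into that service's environment instead.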

@rick-github commented on GitHub (Dec 2, 2024):

Has this been resolved?


@Maltz42 commented on GitHub (Dec 2, 2024):

I'm going to say it has not been resolved. I haven't had a chance to try the mitigations listed above yet, but I have verified that the model still crashes out-of-the-box on v0.4.7.


@rick-github commented on GitHub (Dec 2, 2024):

Serve logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.

@Maltz42 commented on GitHub (Dec 2, 2024):

@rick-github Oh yes, of course... See attached. The log covers from server launch through the crash.

deepseek_crash.log (https://github.com/user-attachments/files/17985039/deepseek_crash.log)

@rick-github commented on GitHub (Dec 3, 2024):

Yes, this looks like ollama being too optimistic about how many layers it can offload. It estimates that it can use [46.8 GiB 46.4 GiB] of [47.3 GiB 47.3 GiB], i.e. a margin of [0.5 GiB, 0.9 GiB]. This comes unstuck when it tries to allocate about 0.8 GiB on GPU0. As mentioned, there are some mitigation techniques:

  1. Set OLLAMA_GPU_OVERHEAD (https://github.com/ollama/ollama/blob/5f8051180e3b9aeafc153f6b5056e7358a939c88/envconfig/config.go#L237) to give llama.cpp a buffer to grow into (e.g., OLLAMA_GPU_OVERHEAD=536870912 to reserve 512M).
  2. Enable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, allowing the GPU to overflow into system RAM. There is a potential performance impact; see https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900.
  3. Reduce the number of layers that ollama thinks it can offload to the GPU; see https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650. Ollama is currently offloading 41 of 61 layers; try setting num_gpu to 35.
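
The margin arithmetic in this comment can be checked with a few lines of Python. This is only a sketch: the planned and total figures are the rounded values quoted from the log, the failed allocation size is the 882903040 bytes from the original error message, and the function name is illustrative.

```python
GIB = 1024 ** 3


def margins_gib(planned_gib, total_gib):
    """Per-GPU headroom: total VRAM minus what ollama planned to use."""
    return [round(t - p, 1) for p, t in zip(planned_gib, total_gib)]


planned = [46.8, 46.4]          # GiB ollama expected to use on GPU0, GPU1
total = [47.3, 47.3]            # GiB available on GPU0, GPU1
failed_alloc_bytes = 882903040  # from "failed to allocate CUDA0 buffer"

margins = margins_gib(planned, total)
failed_gib = failed_alloc_bytes / GIB

print(margins, round(failed_gib, 2))               # [0.5, 0.9] 0.82
print("fits on GPU0:", failed_gib <= margins[0])   # fits on GPU0: False
```

The compute buffer needs roughly 0.82 GiB, but only about 0.5 GiB was left on GPU0, which matches the failure reported in the log.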

@Maltz42 commented on GitHub (Dec 5, 2024):

Yeah, such miscalculations have been a problem for a very long time, to varying degrees. But I haven't seen it happen on an out-of-the-box model before. Usually it happens when increasing the context size. But I don't think Ollama has ever calculated memory usage quite correctly.


@rick-github commented on GitHub (Dec 23, 2024):

Did the mitigations help?


@Maltz42 commented on GitHub (Jan 13, 2025):

While the mitigations work, as expected, this is still broken in 0.5.5 - not sure why this should be closed as completed?


@rick-github commented on GitHub (Jan 13, 2025):

Which mitigations helped?


@Maltz42 commented on GitHub (Jan 13, 2025):

num_gpu 35 did the trick in my case, but that's the only one I tried. I can test the others (or other values of num_gpu) if it would be helpful. I've also seen errors occur with very large context windows, so there's something fundamentally wrong somewhere with the way ollama calculates GPU memory usage I guess.
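
For reference, one way to pass num_gpu per request is through the `options` field of Ollama's `/api/generate` endpoint. The sketch below only builds the request; the model tag `deepseek-v2:236b`, the default host, and the helper name are assumptions for illustration, and the value 35 is simply the one suggested in this thread.

```python
import json
import urllib.request


def generate_request(model, prompt, num_gpu, host="http://localhost:11434"):
    """Build a non-streaming /api/generate request that caps offloaded layers."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # num_gpu is the number of layers to offload to the GPU(s).
        "options": {"num_gpu": num_gpu},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = generate_request("deepseek-v2:236b", "Hello", num_gpu=35)
print(req.full_url)

# To send it (requires a running ollama server):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

The same setting can also be made persistent with a `PARAMETER num_gpu 35` line in a Modelfile.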


@Maltz42 commented on GitHub (Jan 13, 2025):

Actually, after tinkering with it just now, I think there might be more to it. I got the following error after a few back-and-forth responses. The output froze, mid-sentence, for a few seconds and then ollama exited with the following error:

Error: an error was encountered while running the model: unexpected EOF


@rick-github commented on GitHub (Jan 13, 2025):

Can you add logs for that failure?


@Maltz42 commented on GitHub (Jan 13, 2025):

Jan 12 22:09:45 daisy ollama[590872]: llama.cpp:11968: The current context does not support K-shift
Jan 12 22:09:45 daisy ollama[590872]: SIGSEGV: segmentation violation
Jan 12 22:09:45 daisy ollama[590872]: PC=0x7c9ff2824c47 m=0 sigcode=1 addr=0x20ae03fc8
Jan 12 22:09:45 daisy ollama[590872]: signal arrived during cgo execution

There's a lot after that, but those are the first few lines at the moment the output froze - I suspect that's the relevant bit. I've also reproduced it just now.


@rick-github commented on GitHub (Jan 13, 2025):

OK, this is different to the OOM issues. The deepseek class of models don't support shifting the context window when the buffer fills up; see https://github.com/ollama/ollama/issues/5975.
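
For readers unfamiliar with the term, a context shift makes room in a full buffer by discarding some of the oldest tokens and sliding the rest down; the cached keys then have to be repositioned, which appears to be the K-shift step the log says is unsupported here. The following is a toy, token-level Python sketch of the idea only. The names `n_ctx` and `n_keep` and the "drop half of the non-kept tokens" policy are illustrative assumptions, not ollama's actual code.

```python
def shift_context(tokens, n_ctx, n_keep):
    """Toy context shift.

    When the buffer is full, keep the first n_keep tokens, drop the older
    half of the remainder, and slide the newer half down.
    """
    if len(tokens) < n_ctx:
        return tokens  # still room, nothing to do
    n_left = len(tokens) - n_keep
    n_discard = n_left // 2
    return tokens[:n_keep] + tokens[n_keep + n_discard:]


buf = list(range(8))  # a full 8-token buffer
shifted = shift_context(buf, n_ctx=8, n_keep=2)
print(shifted)  # [0, 1, 5, 6, 7]
```

When a model cannot do this shift, generation has nowhere to go once the buffer fills, which fits the freeze and crash after several exchanges described above.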

@rick-github commented on GitHub (Apr 13, 2025):

Deepseek k-shift problem resolved with #9433.

Reference: github-starred/ollama#67089