[GH-ISSUE #8720] Large models crash #52165

Closed
opened 2026-04-28 22:22:44 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @DragonCrafted87 on GitHub (Jan 31, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8720

What is the issue?

[server.log](https://github.com/user-attachments/files/18613534/server.log)
[server-1.log](https://github.com/user-attachments/files/18613537/server-1.log)
[app.log](https://github.com/user-attachments/files/18613538/app.log)
[app-1.log](https://github.com/user-attachments/files/18613536/app-1.log)
[config.json](https://github.com/user-attachments/files/18613535/config.json)

The larger deepseek-r1:32b & codellama:34b models are crashing on boot when there should be enough VRAM to run them; the smaller 14b models run as expected. Full logs from both are attached.

OS

Windows

GPU

AMD

CPU

AMD

Ollama version

ollama version is 0.5.7

GiteaMirror added the bug label 2026-04-28 22:22:44 -05:00

@rick-github commented on GitHub (Jan 31, 2025):

It's not clear from the logs what the cause is; setting `OLLAMA_DEBUG=1` in the [server environment](https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-windows) may show more details.
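For reference, a minimal sketch of one way to do this on Windows (assuming the user-level environment variable approach from the FAQ linked above; the Ollama app needs to be restarted for the server to pick it up):

```powershell
# Persist OLLAMA_DEBUG=1 for the current user, then quit and restart the
# Ollama tray app so the server process sees the new value.
[Environment]::SetEnvironmentVariable("OLLAMA_DEBUG", "1", "User")
```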

However, since the smaller models load fine and the log shows that almost all of the VRAM is being allocated (21.5 GiB out of 23.7 GiB):

```
time=2025-01-30T23:46:35.569-06:00 level=INFO source=memory.go:356 msg="offload to rocm" layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="19.5 GiB" memory.weights.repeating="18.9 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
```

I think the chance that this is an OOM situation is quite good. ollama does its best to calculate the right values, but sometimes it's a bit off.

There are mitigations you can try (a combined sketch follows the list):

  1. Set [`OLLAMA_GPU_OVERHEAD`](https://github.com/ollama/ollama/blob/5f8051180e3b9aeafc153f6b5056e7358a939c88/envconfig/config.go#L237) to give llama.cpp a buffer to grow into (e.g., `OLLAMA_GPU_OVERHEAD=536870912` to reserve 512 MiB).
  2. Enable flash attention by setting [`OLLAMA_FLASH_ATTENTION=1`](https://github.com/ollama/ollama/blob/5f8051180e3b9aeafc153f6b5056e7358a939c88/envconfig/config.go#L236) in the server environment. Flash attention uses memory more efficiently and may reduce memory pressure.
  3. Reduce the number of layers that ollama thinks it can offload to the GPU; see [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650).
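A minimal sketch of the first two mitigations on Windows, plus limiting layer offload per request through the REST API. The model name and `num_gpu` value here are illustrative only, not recommendations from this thread:

```powershell
# Mitigations 1 and 2: persistent user-level environment variables. Quit and
# restart the Ollama app afterwards so the server re-reads them.
[Environment]::SetEnvironmentVariable("OLLAMA_GPU_OVERHEAD", "536870912", "User")  # 512 MiB in bytes
[Environment]::SetEnvironmentVariable("OLLAMA_FLASH_ATTENTION", "1", "User")

# Mitigation 3: cap the number of offloaded layers for a single request by
# passing num_gpu in the request options (model and layer count illustrative).
$body = @{
    model   = "deepseek-r1:32b"
    prompt  = "hello"
    stream  = $false
    options = @{ num_gpu = 48 }
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Method Post -Uri "http://localhost:11434/api/generate" `
    -ContentType "application/json" -Body $body
```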

@DragonCrafted87 commented on GitHub (Jan 31, 2025):

@rick-github thank you for the info. Looks like I'm going to have to go with #3 and fiddle with the layer count until I find what's optimal for each model on my system, as I was able to get codellama:34b to run by forcing the layer count down to 40 instead of the full 49 from the model.

OLLAMA_GPU_OVERHEAD didn't seem to have any effect, even after a full reboot and setting it all the way up to 8 GB to make sure there wasn't any stale env hanging around.
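For reference, a sketch of what an 8 GiB setting would look like, assuming the value is given in bytes as in the 512M example above (the exact value used in the original attempt isn't recorded in this thread):

```powershell
# Hypothetical value: 8 GiB expressed in bytes is 8 * 1024^3 = 8589934592.
[Environment]::SetEnvironmentVariable("OLLAMA_GPU_OVERHEAD", "8589934592", "User")
```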


@rick-github commented on GitHub (Jan 31, 2025):

If setting `num_gpu` worked, then setting `OLLAMA_GPU_OVERHEAD` should have. Can you provide logs from when you set it to 8G?


@DragonCrafted87 commented on GitHub (Feb 1, 2025):

I have a bad habit of deleting log files when I'm poking at stuff like this, so I don't have them anymore. But I just finished trying it again and it looks like user error; I don't know how I managed that, since I know I rebooted this morning to make sure the env vars would be read in when ollama started.

Reference: github-starred/ollama#52165