[GH-ISSUE #9322] Ubuntu 24.10 128G RAM, 2x RTX A4000 ollama run deepseek-r1:70b-llama-distill-fp16 crashes desktop env. #68140

Closed
opened 2026-05-04 12:37:28 -05:00 by GiteaMirror · 8 comments

Originally created by @aloeppert on GitHub (Feb 24, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9322

What is the issue?

Hi, I tried deepseek-r1:70b-llama-distill-fp16 on a computer with 256 GB of RAM, CPU only with no GPUs, and it ran fine (Ubuntu 22.04). No issues there. Then...

I know it is a stretch due to the model's size, but I thought that with the GPUs each having 16 GB, plus the 128 GB for the CPU, it might run split across resources. The system is running Ubuntu 24.10.

I wasn't expecting it to crash my desktop environment. It didn't reset the computer, but when the desktop relaunched automatically, most of the system RAM was still claimed according to htop.

Maybe this is an Ubuntu problem but I thought I'd report it here first.

It is 100% reproducible, so if there are logs I should collect, let me know the procedure.

Relevant log output


OS

Ubuntu 24.10

GPU

2 x RTX A4000, each on 8 PCIe lanes.

CPU

Intel 12900K

Ollama version

0.5.4

GiteaMirror added the bug label 2026-05-04 12:37:28 -05:00

@rick-github commented on GitHub (Feb 24, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues). It's likely that your desktop environment is also using some of the GPU for rendering, and perhaps there's some non-optimal interaction. What's the output of `nvidia-smi`?
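
For reference, on a default Linux install the server logs live in the systemd journal; a minimal sketch for collecting them, assuming the standard `ollama.service` unit:

```shell
# Dump recent ollama server logs from the systemd journal to a file
journalctl -u ollama --no-pager | tail -500 > ollamalog.txt
```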


@aloeppert commented on GitHub (Feb 24, 2025):

It could be. I've got two monitors (one on each card).

```
Mon Feb 24 12:01:40 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A4000 Off | 00000000:01:00.0 On | Off |
| 57% 75C P3 41W / 140W | 957MiB / 16376MiB | 26% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A4000 Off | 00000000:02:00.0 On | Off |
| 50% 70C P0 46W / 140W | 285MiB / 16376MiB | 13% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3224 G /usr/lib/xorg/Xorg 450MiB |
| 0 N/A N/A 3579 G /usr/bin/gnome-shell 358MiB |
| 0 N/A N/A 5539 G ...seed-version=20250224-050149.987000 125MiB |
| 1 N/A N/A 3224 G /usr/lib/xorg/Xorg 271MiB |
+-----------------------------------------------------------------------------------------+
```

This is without any ollama run/crash.


@aloeppert commented on GitHub (Feb 24, 2025):

[ollamalog_postcrash_mem.txt](https://github.com/user-attachments/files/18951406/ollamalog_postcrash_mem.txt)
[ollamalog-trimmed.txt](https://github.com/user-attachments/files/18951407/ollamalog-trimmed.txt)


@alienatedsec commented on GitHub (Feb 27, 2025):

This model needs 141+ GB of VRAM to run efficiently, so you are offloading a massive chunk of the model to RAM and CPU processing. It is unlikely to work on your system.
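
(Rough arithmetic: 70B parameters at fp16 is about 70 × 2 bytes per parameter ≈ 140 GB for the weights alone, before the KV cache and runtime overhead, which is where the 141+ GB figure comes from.)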


@aloeppert commented on GitHub (Feb 27, 2025):

> This model needs 141+ GB of VRAM to run efficiently, so you are offloading a massive chunk of the model to RAM and CPU processing. It is unlikely to work on your system.

"Efficiently" is a judgement call, but I was interested in trying models that push the limits of my system, and I agree it was a stretch. I opened the issue because I think the failure mode should be more graceful than crashing my desktop, not because I expected the model to run.


@alienatedsec commented on GitHub (Feb 27, 2025):

> "Efficiently" is a judgement call, but I was interested in trying models that push the limits of my system, and I agree it was a stretch.

You need to be realistic; the image below shows the `Q4_K_M` quantization:

![Image](https://github.com/user-attachments/assets/e5938ed3-6621-4458-9051-62dace30553f)

Also, the response is not that great on hardware similar to yours:

![Image](https://github.com/user-attachments/assets/63b57e02-e03e-4552-a6ed-c2ccf7b6ddba)

![Image](https://github.com/user-attachments/assets/2f19cada-6a4e-4e07-b7df-8c96b3ae675c)

The response time is acceptable but anything lower than that is not worth it.


@rick-github commented on GitHub (Feb 27, 2025):

```
Feb 24 11:16:31 adl ollama[2534]: time=2025-02-24T11:16:31.470-08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
```

`OLLAMA_LOAD_TIMEOUT=30m` will allow the loads to complete.
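
If ollama runs as the usual systemd service, a sketch of setting that variable via a drop-in override (assuming the default `ollama.service` unit):

```shell
# Open a drop-in override for the ollama service
sudo systemctl edit ollama.service
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_LOAD_TIMEOUT=30m"
# Then apply the change:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```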

```
Feb 24 15:36:32 adl ollama[2521]: time=2025-02-24T15:36:32.318-08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=14 layers.split=7,7 memory.available="[14.5 GiB 15.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="136.3 GiB" memory.required.partial="28.7 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[14.3 GiB 14.3 GiB]" memory.weights.total="128.1 GiB" memory.weights.repeating="126.2 GiB" memory.weights.nonrepeating="2.0 GiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
```

```
Feb 24 11:16:31 adl systemd[1]: ollama.service: A process of this unit has been killed by the OOM killer.
Feb 24 11:16:32 adl systemd[1]: ollama.service: Failed with result 'oom-kill'.
Feb 24 11:16:32 adl systemd[1]: ollama.service: Consumed 32.964s CPU time, 123.9G memory peak.
```

I think this is the crux of the issue. The likely reason the desktop environment crashed is that it was killed by the kernel's OOM (out of memory) killer. 128 GB of RAM plus 32 GB of VRAM is going to be a tight squeeze for this model. You can ease the memory pressure by adding swap.
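
For example, a minimal sketch of adding a swap file on Ubuntu (the 64 GB size here is an assumption; pick what your disk allows, and add an `/etc/fstab` entry if you want it to persist across reboots):

```shell
# Create and enable a 64 GB swap file (assumed size; adjust as needed)
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```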


@aloeppert commented on GitHub (Feb 28, 2025):

Thanks for taking a look. I'm fine to close this.


Reference: github-starred/ollama#68140