[GH-ISSUE #12457] bug: CUDA error during cudaMemcpyPeerAsync #8277

Open
opened 2026-04-12 20:49:18 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @dan-and on GitHub (Sep 30, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12457

What is the issue?

System Setup

Ollama with up-to-date code (#c47154c)
Update: Also tested with Ollama 0.11.5, same effect

Ubuntu 24.04.3 LTS
32 GB Ram
7x Nvidia 3060 12GB with CUDA 13.0

Running gpt-oss 120B with:
OLLAMA_CONTEXT_LENGTH=131072
OLLAMA_NUM_PARALLEL=1
OLLAMA_FLASH_ATTENTION=1
OLLAMA_SCHED_SPREAD=0
OLLAMA_NEW_ESTIMATES=1
OLLAMA_NEW_ENGINE=1

$ ollama ps

NAME ID SIZE PROCESSOR CONTEXT UNTIL
gptoss-120:130b ee4b43a1e01f 72 GB 100% GPU 131072 28 minutes from now

nvidia-smi after a fresh load of the model:

$ nvidia-smi
Tue Sep 30 18:41:41 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:23:00.0 Off | N/A |
| 0% 31C P8 14W / 105W | 11279MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3060 On | 00000000:25:00.0 Off | N/A |
| 0% 29C P8 8W / 105W | 9347MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3060 On | 00000000:26:00.0 Off | N/A |
| 0% 32C P8 10W / 105W | 9591MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 3060 On | 00000000:2E:00.0 Off | N/A |
| 0% 31C P8 3W / 105W | 9347MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce RTX 3060 On | 00000000:2F:00.0 Off | N/A |
| 0% 33C P8 9W / 105W | 9591MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 3060 On | 00000000:30:00.0 Off | N/A |
| 0% 32C P8 9W / 105W | 9345MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 3060 On | 00000000:31:00.0 Off | N/A |
| 0% 31C P8 5W / 105W | 10697MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 6070 C /usr/bin/ollama 11270MiB |
| 1 N/A N/A 6070 C /usr/bin/ollama 9338MiB |
| 2 N/A N/A 6070 C /usr/bin/ollama 9582MiB |
| 3 N/A N/A 6070 C /usr/bin/ollama 9338MiB |
| 4 N/A N/A 6070 C /usr/bin/ollama 9582MiB |
| 5 N/A N/A 6070 C /usr/bin/ollama 9336MiB |
| 6 N/A N/A 6070 C /usr/bin/ollama 10688MiB |
+-----------------------------------------------------------------------------------------+

Forcing it to break:

When running kilocode, the model fills up its context window several times and works fine for a while. Then it comes under memory pressure and runs into a deadlock.

ollama_c47154c.log

Relevant log output

github.com/ollama/ollama/kvcache.(*Causal).defrag
  → ggml.(*Context).ComputeWithNotify (line 766)
    → sync.Mutex.Lock (DEADLOCKED)

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

ollama dev #c47154c
ollama 0.11.5

GiteaMirror added the bug label 2026-04-12 20:49:18 -05:00

@jessegross commented on GitHub (Sep 30, 2025):

Does this happen consistently or just once? If consistent, is it a change from earlier versions?

This doesn't look like it is happening during defrag; that just happens to be the last log message. It was 6 seconds earlier, so probably not even the next token after defrag. It looks like it is crashing with a CUDA error related to a peer copy between GPUs.

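For context (not from the original thread): cudaMemcpyPeerAsync is the CUDA runtime call that copies a buffer residing on one GPU into a buffer on another GPU. Below is a minimal, self-contained sketch of how such a copy is typically issued and checked; the device indices, transfer size, and CHECK macro are illustrative only and are not taken from Ollama/ggml source.

```c
// Hypothetical illustration of a device-to-device peer copy and where a
// cudaMemcpyPeerAsync error would surface. Not Ollama/ggml code.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(1);                                                    \
        }                                                               \
    } while (0)

int main(void) {
    const size_t n = 1 << 20;     /* 1 MiB transfer, arbitrary size */
    int src_dev = 0, dst_dev = 1; /* example device pair */

    void *src = NULL, *dst = NULL;
    cudaStream_t stream;

    /* Allocate a buffer on each device. */
    CHECK(cudaSetDevice(src_dev));
    CHECK(cudaMalloc(&src, n));
    CHECK(cudaSetDevice(dst_dev));
    CHECK(cudaMalloc(&dst, n));

    /* Optionally enable direct P2P access; the copy still works without
     * it, but is then staged through host memory by the driver. */
    int can_access = 0;
    CHECK(cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev));
    if (can_access) {
        cudaError_t e = cudaDeviceEnablePeerAccess(src_dev, 0);
        if (e != cudaSuccess && e != cudaErrorPeerAccessAlreadyEnabled) {
            fprintf(stderr, "enable peer access failed: %s\n",
                    cudaGetErrorString(e));
        }
    }

    CHECK(cudaStreamCreate(&stream));

    /* The call reported as failing in this issue: an async copy from
     * src_dev's buffer to dst_dev's buffer. */
    CHECK(cudaMemcpyPeerAsync(dst, dst_dev, src, src_dev, n, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(src_dev));
    CHECK(cudaFree(src));
    printf("peer copy OK\n");
    return 0;
}
```

An error from this call (or from the later stream synchronize) surfaces as an ordinary cudaError_t, which is what shows up in the log as the CUDA error.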

@dan-and commented on GitHub (Sep 30, 2025):

I have seen it since
OLLAMA_NEW_ESTIMATES=1
OLLAMA_NEW_ENGINE=1
were introduced, but I could never nail it down, as the issue came up very infrequently. I started hunting it now because I am currently building a reranker/crawler, which is why I began using larger context windows. With a tool like kilocode (or Roo, Cline, Charm, OpenCoder) the context window fills up quite often, which turned out to be the best way to trigger it.

However, without the new estimates, the calculation is so conservative that it could not allocate such a large context window.
I tried to replicate the issue with a slightly smaller context window (98k instead of 130k), but that did not trigger it.


@dan-and commented on GitHub (Oct 1, 2025):

Double-checked with Ollama 0.11.5. Same issue here.

ollama_0.11.5.log


@dan-and commented on GitHub (Oct 7, 2025):

Just a short update: I tried to replicate it with only one GPU to make it easier to spot the issue. I haven't been able to replicate it with one, so it must be related to the use of several GPUs. I am busy for a few days, but I will continue building a simpler, more reproducible test environment and will update this issue.


@jessegross commented on GitHub (Oct 7, 2025):

The failing call is cudaMemcpyPeerAsync, so it is likely related to multiple GPUs. To be honest, I've never seen this before, nor other reports of it, so it is possible that you have a hardware fault, especially given the number of GPUs in the machine.

Separately, I don't actually see any errors in the log from 0.11.5.

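Not from the thread, but as a quick way to rule out an obvious P2P/topology problem on a 7-GPU box, one could enumerate which device pairs report peer access (the p2pBandwidthLatencyTest sample shipped with the CUDA toolkit serves a similar purpose). A hypothetical sketch:

```c
// Hypothetical diagnostic: list which GPU pairs report peer (P2P) access.
// Illustrative only; not part of Ollama.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed\n");
        return 1;
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            int ok = 0;
            if (cudaDeviceCanAccessPeer(&ok, i, j) == cudaSuccess) {
                printf("GPU %d -> GPU %d : peer access %s\n",
                       i, j, ok ? "supported" : "NOT supported");
            }
        }
    }
    return 0;
}
```

Even pairs that report no peer access should still work with cudaMemcpyPeer (the driver stages the copy through host memory), so this only helps narrow things down rather than prove a fault.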
Reference: github-starred/ollama#8277