[PR #14620] cuda: graceful OOM fallback when creating events during partial GPU offload #61448

Open
opened 2026-04-29 16:31:58 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14620
Author: @ssam18
Created: 3/4/2026
Status: 🔄 Open

Base: main ← Head: fix/cuda-event-oom-graceful-fallback


📝 Commits (3)

  • e95bc5f cuda: handle graceful OOM when creating events in a partial gpu offload
  • 0f348b2 sync: apply ggml-cuda graceful OOM fallback for event creation to synced file
  • 0b99f3f sync: restore missing comment lines in graceful OOM fallback for event creation

📊 Changes

2 files changed (+49 additions, -1 deletions)

View changed files

llama/patches/0035-ggml-cuda-graceful-oom-fallback-for-event-creation.patch (+41 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu (+8 -1)

📄 Description

When the model is too large to fit entirely on the GPU and some portions are offloaded to the host system, the CUDA host buffer used for transferring data from CPU to GPU can run out of pinned memory on the host. Once that occurs, cudaEventCreateWithFlags() will fail as well. ggml_cuda_host_malloc() handles the same condition with a graceful recovery path, but cudaEventCreateWithFlags() was wrapped in CUDA_CHECK(), a fatal abort macro. As a result, when the CUDA host buffer exhausts its pinned memory, the cudaEventCreateWithFlags() call produces a "CUDA error: out of memory" message and the llama runner terminates with status code 500. The error is reproducible with command-r7b:latest on a system with limited pinned memory resources, as described in #14615.
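For context, the pre-patch call site behaves roughly like the sketch below. This is a simplified stand-in: the real CUDA_CHECK() in ggml-cuda.cu logs more context before aborting, and the helper function name here is hypothetical.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Simplified stand-in for ggml's fatal CUDA_CHECK() macro: abort on any
// CUDA error. The real macro prints additional context before aborting.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_));  \
            abort();                                                        \
        }                                                                   \
    } while (0)

// Hypothetical helper mirroring the pre-patch behavior: when pinned host
// memory is exhausted, cudaEventCreateWithFlags() fails and CUDA_CHECK()
// terminates the whole runner process.
static cudaEvent_t create_event_fatal() {
    cudaEvent_t event = nullptr;
    CUDA_CHECK(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));
    return event;
}
```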

This patch replaces the fatal CUDA_CHECK() with an error-checking routine that resets the CUDA error state, writes a log warning, and returns nullptr, mirroring what ggml_cuda_host_malloc() does a couple hundred lines earlier. Returning nullptr is safe because the GGML backend scheduler null-checks every event before use, so instead of terminating the process, cross-device event synchronization is simply skipped.
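A minimal sketch of that fallback pattern is shown below, assuming a standalone helper; the actual patch lives inside ggml-cuda.cu and uses ggml's own logging macros rather than fprintf.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper illustrating the graceful fallback: on failure, clear
// the sticky CUDA error state, log a warning, and return nullptr instead of
// aborting the process.
static cudaEvent_t create_event_or_null() {
    cudaEvent_t event = nullptr;
    cudaError_t err = cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
    if (err != cudaSuccess) {
        // Reset the error state so subsequent CUDA calls are not poisoned.
        (void) cudaGetLastError();
        fprintf(stderr, "%s: failed to create CUDA event: %s\n", __func__,
                cudaGetErrorString(err));
        // Returning nullptr is safe here: the GGML backend scheduler
        // null-checks events and simply skips cross-device synchronization.
        return nullptr;
    }
    return event;
}
```

With this pattern, exhausting pinned host memory degrades gracefully: the runner keeps going and skips event-based cross-device synchronization rather than aborting with a fatal out-of-memory error.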


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:31:58 -05:00
Reference: github-starred/ollama#61448