[PR #6717] [CLOSED] Improve nvidia GPU discovery error handling #22746

Closed
opened 2026-04-19 16:32:20 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/6717
Author: @dhiltgen
Created: 9/9/2024
Status: Closed

Base: main ← Head: busy_gpu_retry


📝 Commits (3)

  • e1eeea7 Bubble up cuda library error codes with some retries
  • 106f20b When falling back to CPU, don't send GPU flags to the runner
  • dae361f discovery: wire up cuda error strings

📊 Changes

8 files changed (+204 additions, -108 deletions)


📝 discover/gpu.go (+61 -44)
📝 discover/gpu_info_cudart.c (+16 -12)
📝 discover/gpu_info_cudart.h (+2 -1)
📝 discover/gpu_info_nvcuda.c (+96 -26)
📝 discover/gpu_info_nvcuda.h (+4 -2)
📝 discover/gpu_info_nvml.c (+11 -14)
📝 discover/gpu_info_nvml.h (+2 -1)
📝 llm/server.go (+12 -8)

📄 Description

In some cases, the CUDA library may return a status code indicating that we should retry later.

If we get an error, use the applicable CUDA library error-string function to get a human-readable explanation.

Also improve logging during retries in the server subprocess logic.

Fixes #6637
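
The retry-and-report behavior described above can be sketched roughly as follows. This is a minimal illustration only, not the code from this PR: the `status` type, its constants, `errorString`, `initError`, and `initWithRetry` are all hypothetical names, and the real change goes through cgo wrappers around the CUDA libraries rather than a plain Go callback.

```go
package main

import (
	"fmt"
	"time"
)

// status is a hypothetical stand-in for a CUDA library result code.
type status int

const (
	statusSuccess  status = 0
	statusBusy     status = 1 // hypothetical "retry later" code
	statusNoDevice status = 2 // hypothetical permanent failure
)

// errorString is a hypothetical analogue of a library error-string
// function: it maps a status code to a human-readable explanation.
func errorString(s status) string {
	switch s {
	case statusSuccess:
		return "success"
	case statusBusy:
		return "device busy, retry later"
	case statusNoDevice:
		return "no CUDA-capable device"
	default:
		return "unknown error"
	}
}

// initError carries the raw code so callers can log or inspect it.
type initError struct {
	code status
}

func (e *initError) Error() string {
	return fmt.Sprintf("cuda init failed: %s (code %d)", errorString(e.code), int(e.code))
}

// initWithRetry calls initFn up to attempts times. It retries only while
// the library reports the "busy" status; any other failure is returned
// immediately with its error string attached.
func initWithRetry(initFn func() status, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var last status
	for i := 0; i < attempts; i++ {
		last = initFn()
		if last == statusSuccess {
			return nil
		}
		if last != statusBusy {
			break // not retryable
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return &initError{code: last}
}

func main() {
	calls := 0
	// Simulated library: busy on the first two calls, then succeeds.
	err := initWithRetry(func() status {
		calls++
		if calls < 3 {
			return statusBusy
		}
		return statusSuccess
	}, 5, 10*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}
```

Keeping the raw code alongside the message lets the caller decide whether to fall back to CPU or surface the failure, while the log line stays readable.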


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 16:32:20 -05:00
Reference: github-starred/ollama#22746