[PR #15146] discover: retry CUDA probe without visibility filter when CUDA init fails on MIG #61746

Open
opened 2026-04-29 16:46:26 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15146
Author: @Efreh
Created: 3/30/2026
Status: 🔄 Open

Base: main ← Head: fix/mig-cuda-fallback


📝 Commits (2)

  • 933c003 discover: retry CUDA probe without visibility filter when CUDA init fails on MIG
  • 70d6504 Merge branch 'ollama:main' into fix/mig-cuda-fallback

📊 Changes

1 file changed (+40 additions, -22 deletions)

📝 discover/runner.go (+40 -22)

📄 Description

Summary

This change adds a CUDA fallback probe path in discover/GPUDevices for MIG setups.

When the first CUDA probe (with visibility filtering) fails to initialize a device, Ollama now runs a second CUDA probe without visibility filtering.
If the fallback succeeds, initialization continues and a debug log is emitted with the message:
device recovered with secondary unfiltered CUDA probe

Why

On some MIG configurations, the filtered probe fails to initialize a device that the unfiltered probe can initialize.
The fallback runs only when the initial probe fails, so behavior is unchanged wherever the filtered probe already succeeds.

Scope

  • Added helper logic to build the init-validation environment variables (getInitValidationEnvs).
  • Added a secondary unfiltered CUDA probe path on initialization failure.
  • Added debug logging for successful recovery via fallback.

Testing

  • Reproduced on a host with MIG-enabled NVIDIA GPU.
  • Verified that:
    • without fallback, the filtered probe path fails to initialize the device in this scenario;
    • with this change, the fallback probe recovers and device initialization succeeds;
    • non-MIG / already-successful probe paths remain unchanged.

Validation logs (after fix)

time=2026-03-30T11:25:41.430Z level=DEBUG source=server.go:433 msg=subprocess ... CUDA_VISIBLE_DEVICES=GPU-e63c5ad2-45ef-63a2-6d5d-bfb485e52209 GGML_CUDA_INIT=1
time=2026-03-30T11:25:41.675Z level=DEBUG source=runner.go:169 msg="device recovered with secondary unfiltered CUDA probe" id=GPU-e63c5ad2-45ef-63a2-6d5d-bfb485e52209 libdir=/usr/lib/ollama/cuda_v13 pci_id=0000:91:00.0 library=CUDA
time=2026-03-30T11:25:41.675Z level=INFO source=types.go:42 msg="inference compute" id=GPU-e63c5ad2-45ef-63a2-6d5d-bfb485e52209 filter_id="" library=CUDA compute=9.0 name=CUDA0 description="NVIDIA H100 NVL MIG 1g.24gb" libdirs=ollama,cuda_v13 driver=13.1 pci_id=0000:91:00.0 type=discrete total="21.6 GiB" available="21.5 GiB"
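
When checking a server log for this behavior, a small helper like the one below can pull out the devices that went through the fallback. It is only a sketch: `recoveredDeviceIDs` is a hypothetical name, and the pattern assumes the `id=` field directly follows the message, as it does in the log lines above.

```go
package main

import (
	"fmt"
	"regexp"
)

// recoveredRe matches the debug line emitted when the fallback probe
// succeeds and captures the device id that follows the message.
var recoveredRe = regexp.MustCompile(
	`msg="device recovered with secondary unfiltered CUDA probe" id=(\S+)`)

// recoveredDeviceIDs returns the id of every device that the given log
// text reports as recovered by the fallback probe.
func recoveredDeviceIDs(logText string) []string {
	var ids []string
	for _, m := range recoveredRe.FindAllStringSubmatch(logText, -1) {
		ids = append(ids, m[1])
	}
	return ids
}

func main() {
	logText := `time=2026-03-30T11:25:41.675Z level=DEBUG source=runner.go:169 msg="device recovered with secondary unfiltered CUDA probe" id=GPU-e63c5ad2-45ef-63a2-6d5d-bfb485e52209 libdir=/usr/lib/ollama/cuda_v13 pci_id=0000:91:00.0 library=CUDA`
	fmt.Println(recoveredDeviceIDs(logText))
}
```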

## Environment notes

- MIG is enabled and the container is pinned to a MIG device:
    - DOCKER_RESOURCE_GPU=MIG-49c28d85-b102-579c-a21d-90637af716b0
    - NVIDIA_VISIBLE_DEVICES=MIG-49c28d85-b102-579c-a21d-90637af716b0
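
To tell whether a container is in a comparable setup, one option is to look for MIG-style identifiers in `NVIDIA_VISIBLE_DEVICES`. The sketch below assumes that MIG device IDs start with `MIG-`, as they do in the values above; `pinnedMIGDevices` is a hypothetical helper and not part of Ollama.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// pinnedMIGDevices returns the entries of a comma-separated device list
// (such as the value of NVIDIA_VISIBLE_DEVICES) that start with "MIG-".
func pinnedMIGDevices(value string) []string {
	var out []string
	for _, entry := range strings.Split(value, ",") {
		entry = strings.TrimSpace(entry)
		if strings.HasPrefix(entry, "MIG-") {
			out = append(out, entry)
		}
	}
	return out
}

func main() {
	mig := pinnedMIGDevices(os.Getenv("NVIDIA_VISIBLE_DEVICES"))
	if len(mig) == 0 {
		fmt.Println("no MIG device pinned via NVIDIA_VISIBLE_DEVICES")
		return
	}
	fmt.Println("pinned MIG devices:", mig)
}
```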

## Related issues

- Addresses #13800
- Addresses #14031

GiteaMirror added the pull-request label 2026-04-29 16:46:26 -05:00
Reference: github-starred/ollama#61746