[PR #13851] ml: skip init validation for MIG GPU devices #19689

Open
opened 2026-04-16 07:13:34 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13851
Author: @jessiewbailey
Created: 1/23/2026
Status: 🔄 Open

Base: main ← Head: fix-mig-gpu-initialization


📝 Commits (1)

  • bc9aade ml: skip init validation for MIG GPU devices

📊 Changes

2 files changed (+93 additions, -0 deletions)

📝 ml/device.go (+8 -0)
➕ ml/device_test.go (+85 -0)

📄 Description

Summary

Skip GPU init validation for MIG (Multi-Instance GPU) devices to fix GPU detection failures in Kubernetes environments.

Problem

MIG partitions fail the second-pass device validation introduced in v0.13.2:

  1. Initial discovery correctly detects the MIG device via cudaGetDeviceProperties()
  2. The validation subprocess sets CUDA_VISIBLE_DEVICES to the parent GPU UUID (e.g., GPU-7327fca2-...)
  3. For MIG devices, this UUID format doesn't work - CUDA requires MIG-<uuid> or GPU-<uuid>/<gi>/<ci> formats
  4. The subprocess fails with "no CUDA-capable device is detected"
  5. The MIG device is filtered out, forcing CPU-only inference
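
The identifier shapes named in step 3 can be told apart with a simple prefix check. The following is a minimal sketch in Go, not code from this PR: `addressesMIGPartition` is a hypothetical helper, and it relies only on the formats as they are described above.

```go
package main

import "strings"

// addressesMIGPartition is a hypothetical helper (not part of Ollama).
// It reports whether a CUDA_VISIBLE_DEVICES entry has one of the shapes
// that the description above says can address a MIG partition.
func addressesMIGPartition(id string) bool {
	switch {
	case strings.HasPrefix(id, "MIG-"):
		// MIG-<uuid>: names a single MIG instance.
		return true
	case strings.HasPrefix(id, "GPU-") && strings.Count(id, "/") == 2:
		// GPU-<uuid>/<gi>/<ci>: parent GPU plus GPU-instance and
		// compute-instance IDs, as written in the PR description.
		return true
	default:
		// A bare GPU-<uuid> (or anything else) names no partition.
		return false
	}
}
```

A bare parent UUID such as the one set by the validation subprocess returns false here, which is consistent with the failure in step 4: the subprocess is pointed at the parent GPU rather than at the partition.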

Solution

Detect MIG devices by checking for " MIG " in the device description and skip validation for them. This is safe because:

  • MIG is only available on recent enterprise-class NVIDIA GPUs (Ampere and later, e.g., A100/H100)
  • These GPUs have well-supported compute capabilities
  • The validation was designed to catch older GPUs with an unsupported compute capability (CC), which does not apply to MIG hardware
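
The description check can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: `deviceInfo` is a hypothetical stand-in for the real device record, only the CUDA backend is modeled, and the lowercase `needsInitValidation` is a simplified analogue of the `NeedsInitValidation()` method named in this PR.

```go
package main

import "strings"

// deviceInfo is a hypothetical stand-in for the device record. Only the
// two fields this check reads are modeled.
type deviceInfo struct {
	Library     string // backend name, e.g. "CUDA"
	Description string // device name reported during discovery
}

// isMIG reports whether a CUDA device description contains the " MIG "
// marker, including the surrounding spaces.
func isMIG(d deviceInfo) bool {
	return d.Library == "CUDA" && strings.Contains(d.Description, " MIG ")
}

// needsInitValidation assumes, for this sketch only, that every CUDA
// device gets the second-pass validation unless it is a MIG partition.
func needsInitValidation(d deviceInfo) bool {
	if isMIG(d) {
		return false // skip the subprocess check for MIG partitions
	}
	return d.Library == "CUDA"
}
```

With an illustrative description such as "NVIDIA A100-SXM4-40GB MIG 1g.5gb", `needsInitValidation` returns false, while a description without the marker, such as "NVIDIA GeForce RTX 4090", still returns true. Because the match includes the surrounding spaces, a description that merely ends in "MIG" or embeds the letters inside another word would not be treated as a MIG device.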

Changes

  • ml/device.go: Skip validation for CUDA devices with MIG in description
  • x/ml/device.go: Same change (duplicate file)
  • ml/device_test.go: Unit tests for NeedsInitValidation()

Test plan

  • Added unit tests covering MIG and non-MIG device detection
  • Test cases match real MIG device descriptions from the issue

Fixes #13800


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 07:13:34 -05:00
Reference: github-starred/ollama#19689