[PR #14448] discover: increase GPU discovery timeout from 3s to 15s #25215

Open
opened 2026-04-19 18:04:50 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14448
Author: @Colossus14
Created: 2/26/2026
Status: 🔄 Open

Base: main ← Head: increase-discovery-timeout


📝 Commits (1)

  • 81f335c discover: increase GPU discovery timeout from 3s to 15s

📊 Changes

1 file changed (+5 additions, -4 deletions)


📝 discover/runner.go (+5 -4)

📄 Description

Problem

The 3-second timeout for GPU device enumeration in discover/runner.go is too short for integrated GPUs with large shared memory pools.

On AMD Strix Halo APUs (Radeon 8060S, gfx1151) with 111 GiB GTT shared memory, ROCm device discovery routinely takes 5–10+ seconds. When the 3-second timeout fires, the GPU goes undetected on subsequent model loads, causing models to silently fall back to CPU or fail entirely.

This is especially problematic at larger context sizes where memory allocation takes longer:

| Context Size | KV Cache VRAM | Discovery Time | 3s Timeout Result |
|---|---|---|---|
| 65k | 26.2 GiB | ~3–5s | Intermittent failure |
| 131k | 28.2 GiB | ~5–8s | Consistent failure |
| 262k | 32.2 GiB | ~8–12s | Always fails |
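
For reference, here is a minimal, hypothetical sketch of the pattern at issue: device enumeration guarded by a `context.WithTimeout` deadline. The function name `enumerateDevices` and the simulated 8-second delay are illustrative stand-ins, not the actual code in `discover/runner.go`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// enumerateDevices stands in for the real ROCm/CUDA discovery call, which
// on large-GTT iGPUs such as Strix Halo can take 5-10+ seconds.
func enumerateDevices(ctx context.Context) ([]string, error) {
	select {
	case <-time.After(8 * time.Second): // simulated slow iGPU enumeration
		return []string{"gfx1151"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func main() {
	// With the old 3-second deadline, the slow iGPU path always hits
	// ctx.Done() first, so the GPU is reported as absent.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	devices, err := enumerateDevices(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("discovery timed out; falling back to CPU")
		return
	}
	fmt.Println("discovered:", devices)
}
```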

Fix

Raise both the runner-refresh and bootstrap discovery timeouts from 3 seconds to 15 seconds.
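
A sketch of what the change amounts to, assuming the timeouts are named constants; the real `discover/runner.go` may inline the durations, and the identifiers below are hypothetical:

```go
package discover

import "time"

const (
	// Previously 3 * time.Second: too short for ROCm enumeration on
	// iGPUs with large shared-memory pools.
	bootstrapDiscoveryTimeout = 15 * time.Second
	runnerRefreshTimeout      = 15 * time.Second
)
```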

Impact

  • Discrete GPUs: No practical impact — discovery typically completes in ~500ms, well under either timeout
  • iGPUs (AMD ROCm): Prevents false negatives during device enumeration, especially under memory pressure or at large context sizes
  • Worst case: A user with a genuinely dead/missing GPU waits 15 seconds instead of 3 before fallback — negligible compared to multi-minute model loads

Testing

Tested on AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151), Fedora 43, ROCm 6.4.2 with qwen3.5:35b:

  • 131k context (28.2 GiB): 96k prompt tokens, 457 tok/s prefill, 30.9 tok/s generation — ✅
  • 262k context (32.2 GiB): 257k prompt tokens, 265 tok/s prefill, 21.5 tok/s generation — ✅

Both failed consistently with the 3-second timeout and work reliably with 15 seconds.

Related: #14445 (gfx1150/1151 target support for Strix Halo)


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference: github-starred/ollama#25215