[PR #15716] server: fix download stall watchdog not firing when no bytes arrive #77570

Open
opened 2026-05-05 10:14:29 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15716
Author: @alvinttang
Created: 4/20/2026
Status: 🔄 Open

Base: main ← Head: fix/download-stall-watchdog-zero-progress


📝 Commits (1)

  • 3b754dd server: fix download stall watchdog not firing when no bytes arrive

📊 Changes

2 files changed (+75 additions, -5 deletions)


📝 server/download.go (+12 -5)
server/download_test.go (+63 -0)

📄 Description

Summary

Fixes a client-side watchdog bug in downloadChunk that matches the "99% hang" symptom tracked in #1736 and the maintainer-described pattern of "certain parts stall completely and zero data is received from the backend. The connection itself is still healthy so it doesn't trigger a retry".

Root cause

The per-part stall watchdog in server/download.go was gated by two conditions that, together, silently disabled it on the cases where it was most needed:

  1. The initial-byte guard — if !lastUpdated.IsZero() && time.Since(lastUpdated) > 30*time.Second means the stall timer only starts after the first byte is written. A stream that never delivers a byte is never checked.
  2. The reset-to-zero on stall — after firing errPartStalled the watchdog wrote part.lastUpdated = time.Time{}. The outer retry loop then calls downloadChunk again. The new watchdog instance re-reads the zero-value lastUpdated, the guard above fails, and if the new connection also stalls before producing any bytes, no timeout ever fires and the goroutine blocks forever on a dead TCP stream.

In practice this is what users see as the "99% hang": after the first stall is detected and retried, the retry's new stream opens successfully but streams zero bytes (Cloudflare / R2 byte-range stream goes dead, TCP stays open, no FIN). With no watchdog, only Ctrl+C recovers.

Fix

  • Initialize part.lastUpdated = time.Now() at watchdog entry so the stall timer has a reference point even before the first byte arrives.
  • Drop the !lastUpdated.IsZero() guard and the subsequent reset-to-zero block — both were the mechanism of the bug.
  • Extract the magic 30 * time.Second into a package var stallDuration so tests can shorten it. Production default is unchanged.

Diff is 10 lines of code change in server/download.go and 63 lines of new test.

Test plan

New test TestDownloadChunkStallWatchdogFiresWithoutProgress in server/download_test.go:

  • Starts an httptest.Server whose handler writes 206 Partial Content headers, flushes, then blocks on r.Context().Done() — i.e. it accepts the connection but streams zero body bytes. This mirrors the mxyng-described pattern exactly.
  • Calls b.downloadChunk with a short stallDuration (200 ms) and a 5 s outer context deadline.
  • Asserts that downloadChunk returns errPartStalled in under 2 s.

Before the fix: the test returns context deadline exceeded at 5 s (watchdog never fires).
After the fix: the test returns errPartStalled at ~1 s (first ticker tick).

Verification run locally (go 1.26, darwin/arm64):

go test ./server/                                    → ok   4.97s
go test -race ./server/ -run TestDownload -count=5   → ok   (no races)
go vet ./server/                                     → clean
gofmt -l server/download.go server/download_test.go  → clean

(Pre-existing race-detector failures in TestQuantizeModel and TestSchedRequests* are present on unmodified main and are unrelated to this change — confirmed by checking out main and reproducing them.)

Notes on behaviour change

The old watchdog intentionally did not tick until the first byte arrived, which meant a very slow initial-byte server response (e.g. cold CDN origin fetch) would not trigger a stall. With this fix, a 30 s no-data window at the start of a part will now surface as errPartStalled and the outer retry loop will open a fresh connection — which is almost certainly the desired behaviour and matches what users expect when ollama pull looks frozen. The outer retry logic treats errPartStalled as try--; continue, so a genuinely slow (but not dead) server still converges rather than exhausting retries.

Refs #1736


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 10:14:29 -05:00

Reference: github-starred/ollama#77570