[GH-ISSUE #2700] Health check status not invalidated when Newt site goes offline #13016

Closed
opened 2026-05-13 18:33:35 -05:00 by GiteaMirror · 1 comment

Originally created by @strausmann on GitHub (Mar 24, 2026).
Original GitHub issue: https://github.com/fosrl/pangolin/issues/2700

Description

When a Newt agent disconnects (site goes offline), the health check status of all targets routed through that site remains "healthy" in the dashboard. Pangolin correctly detects the site as offline, but does not invalidate the cached health check results for targets on that site.

This causes Pangolin to continue routing traffic to targets through a dead tunnel, resulting in timeouts for users.

Steps to Reproduce

  1. Configure a resource with multiple targets across different sites (e.g., 3 targets via Site A, 1 target via Site B)
  2. Enable health checks on all targets
  3. Verify all targets show "healthy"
  4. Stop the Newt agent on Site A (e.g., docker stop pangolin-newt)
  5. Observe: Site A shows "Offline" in the Sites dashboard
  6. Observe: All targets via Site A still show "healthy" in the resource configuration

Expected Behavior

When a site goes offline, all targets routed through that site should immediately transition to "unhealthy" or "unknown" status. Pangolin should not route traffic to targets on offline sites.

Actual Behavior

  • Site correctly shows "Offline"
  • Target health check status retains the last known value ("healthy")
  • Pangolin continues to route traffic through the dead tunnel
  • Users experience sporadic timeouts (requests randomly hit the dead route)

Root Cause Analysis

Based on log analysis:

  1. Health checks run through the Newt tunnel (Pangolin → WebSocket → Newt → HTTP → target)
  2. When Newt disconnects, no new health check results arrive
  3. The last-known-good status stays in the database and is displayed as current
  4. Additionally, the newt/disconnecting message type throws an exception instead of triggering state cleanup:
    Unsupported message type: newt/disconnecting
  5. Pangolin continues sending health check requests to the disconnected Newt (phantom checks)
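
The exception in point 4 is the kind of error a type-keyed dispatcher raises when no handler is registered for a message type. A minimal sketch of that pattern follows. The `WsMessage` shape, the `handlers` registry, and the `offlineSites` set are hypothetical and are not taken from the Pangolin source; only the error text comes from the log above.

```typescript
// Hypothetical message shape and registry, for illustration only.
type WsMessage = { type: string; siteId: number };
type Handler = (msg: WsMessage) => void;

// Stand-in for whatever state cleanup a disconnect should trigger.
const offlineSites = new Set<number>();

const handlers: Record<string, Handler> = {
    // Registering the type lets a graceful Newt shutdown trigger cleanup
    // instead of falling through to the error in dispatch().
    "newt/disconnecting": (msg) => {
        offlineSites.add(msg.siteId);
    },
};

function dispatch(msg: WsMessage): void {
    const handler = handlers[msg.type];
    if (!handler) {
        // Same wording as the log line quoted in the issue.
        throw new Error(`Unsupported message type: ${msg.type}`);
    }
    handler(msg);
}
```

With the handler registered, `dispatch({ type: "newt/disconnecting", siteId: 7 })` records site 7 as offline, while an unregistered type still fails loudly.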

Environment

  • Pangolin: Enterprise Edition (PostgreSQL)
  • Newt: v1.10.3
  • Setup: 4 targets for Proxmox VE (172.16.50.8:8006) across 4 sites, 1 site taken offline

Suggested Fix

When a Newt disconnect is detected:

  1. Set all target health checks on that site to "unknown" or "unhealthy"
  2. Handle the newt/disconnecting message type (currently throws exception)
  3. Stop sending health check requests to disconnected sites
  4. When Newt reconnects, resume health checks and let them naturally transition back to "healthy"
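
The four steps above can be sketched as a small in-memory state machine. This is an illustration under stated assumptions, not Pangolin's implementation: `HealthState`, `Target`, and every method name are hypothetical, and a real fix would persist the status in the database rather than in memory.

```typescript
type Health = "healthy" | "unhealthy" | "unknown";

interface Target {
    targetId: number;
    siteId: number;
    hcHealth: Health;
}

// Hypothetical in-memory model of the suggested behavior.
class HealthState {
    private readonly targets: Target[];
    private readonly onlineSites = new Set<number>();

    constructor(targets: Target[]) {
        this.targets = targets;
    }

    // Steps 1 and 2: whichever path detects the disconnect calls this.
    onSiteDisconnected(siteId: number): void {
        this.onlineSites.delete(siteId);
        for (const t of this.targets) {
            if (t.siteId === siteId) t.hcHealth = "unknown";
        }
    }

    // Step 4: reconnecting only re-enables checks; the status stays
    // "unknown" until a fresh result arrives.
    onSiteConnected(siteId: number): void {
        this.onlineSites.add(siteId);
    }

    // Step 3: only targets on online sites are eligible for checks.
    targetsToCheck(): Target[] {
        return this.targets.filter((t) => this.onlineSites.has(t.siteId));
    }

    // A result reported by Newt; ignored while the site is offline.
    onHealthResult(targetId: number, health: Health): void {
        const t = this.targets.find((x) => x.targetId === targetId);
        if (t && this.onlineSites.has(t.siteId)) t.hcHealth = health;
    }

    healthOf(targetId: number): Health | undefined {
        return this.targets.find((x) => x.targetId === targetId)?.hcHealth;
    }
}
```

Ignoring results for offline sites is a design choice in this sketch: it keeps a late message from flipping a target back to "healthy" after the disconnect has been recorded.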

@strausmann commented on GitHub (Mar 24, 2026):

Code Analysis — Root Cause Identified

After analyzing the source code, the root cause is clear:

Health Check Flow

Health checks run on the Newt agent (remote), not on the Pangolin server. The Newt performs HTTP checks against the target and reports status back via WebSocket to server/routers/target/handleHealthcheckStatusMessage.ts, which updates targetHealthCheck.hcHealth in the database.

The Bug: Three disconnect paths, none invalidate HC status

When a Newt disconnects, three code paths handle the cleanup — but none of them reset target health check status:

| Code Path | File | What it does | What it misses |
|-----------|------|--------------|----------------|
| Explicit disconnect | `server/routers/newt/handleNewtDisconnectingMessage.ts` | Sets `sites.online = false` | Does NOT touch `targetHealthCheck.hcHealth` |
| Offline checker (ping timeout) | `server/routers/newt/handleNewtPingMessage.ts` (L26-78) | Sets `sites.online = false` | Does NOT touch `targetHealthCheck.hcHealth` |
| WebSocket close | `server/routers/ws/ws.ts` (L376) | Removes client from tracking | Does NOT touch `targetHealthCheck.hcHealth` |

Mitigation in Traefik Config (partial)

`server/lib/traefik/getTraefikConfig.ts` (L500) does filter out targets from offline sites when generating Traefik config — but only if at least one other site for that resource is online. This means:

  • Multi-site resources: Traffic routing is partially protected (offline site targets excluded)
  • Single-site resources: No protection — the stale "healthy" status causes routing to a dead tunnel
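
To make the gap concrete, here is a sketch of the filtering rule as described above. It models the described behavior only and is not the code in `getTraefikConfig.ts`; `RoutedTarget` and `selectTargets` are hypothetical names.

```typescript
// Hypothetical shape: one routable target plus its site's online flag.
interface RoutedTarget {
    targetId: number;
    siteId: number;
    siteOnline: boolean;
}

// Drop targets on offline sites, but only when at least one target
// of the resource sits on an online site.
function selectTargets(targets: RoutedTarget[]): RoutedTarget[] {
    const anyOnline = targets.some((t) => t.siteOnline);
    return anyOnline ? targets.filter((t) => t.siteOnline) : targets;
}

// Multi-site resource: three targets on offline Site A, one on online Site B.
const multiSite: RoutedTarget[] = [
    { targetId: 1, siteId: 1, siteOnline: false },
    { targetId: 2, siteId: 1, siteOnline: false },
    { targetId: 3, siteId: 1, siteOnline: false },
    { targetId: 4, siteId: 2, siteOnline: true },
];

// Single-site resource whose only site is offline.
const singleSite: RoutedTarget[] = [
    { targetId: 5, siteId: 3, siteOnline: false },
];
```

Under this rule `selectTargets(multiSite)` keeps only target 4, while `selectTargets(singleSite)` returns the offline target unchanged, which is the unprotected case.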

The Dashboard Problem

Even with the Traefik mitigation, the dashboard always shows the stale DB value. Users see green "healthy" badges for targets on an offline site, which is misleading.
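
One way to picture what the dashboard should show is to derive the displayed value from both the site state and the stored health value. The sketch below is only an illustration of that idea; `displayedHealth` is a hypothetical helper and not something the source proposes or Pangolin provides.

```typescript
type HcHealth = "healthy" | "unhealthy" | "unknown";

// Hypothetical read-side rule: a stored status is only trusted
// while the site that reported it is online.
function displayedHealth(siteOnline: boolean, storedHealth: HcHealth): HcHealth {
    return siteOnline ? storedHealth : "unknown";
}
```

A read-side rule like this would hide the stale badge, but it leaves the stale value in the database, so it would complement resetting the stored status rather than replace it.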

Suggested Fix

In each of the three disconnect handlers, add a query to reset health check status:

```typescript
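import { eq, inArray } from "drizzle-orm";
// `db`, `targetHealthCheck`, and `targets` come from the project's
// database module (import omitted here).

// After setting sites.online = false, reset every health check on this site:
await db
    .update(targetHealthCheck)
    .set({ hcHealth: "unknown" })
    .where(
        inArray(
            targetHealthCheck.targetId,
            // Subquery: all targets routed through the disconnected site
            db
                .select({ id: targets.targetId })
                .from(targets)
                .where(eq(targets.siteId, siteId))
        )
    );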
```

This ensures targets transition to "unknown" immediately when their Newt disconnects, and naturally recover to "healthy" when the Newt reconnects and health checks resume.

Reference: github-starred/pangolin#13016