[GH-ISSUE #2700] Health check status not invalidated when Newt site goes offline #13016

GiteaMirror · 2026-05-13T18:33:35-05:00

GiteaMirror commented

2026-05-13 18:33:35 -05:00

Originally created by @strausmann on GitHub (Mar 24, 2026).
Original GitHub issue: https://github.com/fosrl/pangolin/issues/2700

Description

When a Newt agent disconnects (site goes offline), the health check status of all targets routed through that site remains "healthy" in the dashboard. Pangolin correctly detects the site as offline, but does not invalidate the cached health check results for targets on that site.

This causes Pangolin to continue routing traffic to targets through a dead tunnel, resulting in timeouts for users.

Steps to Reproduce

1. Configure a resource with multiple targets across different sites (e.g., 3 targets via Site A, 1 target via Site B)
2. Enable health checks on all targets
3. Verify all targets show "healthy"
4. Stop the Newt agent on Site A (e.g., `docker stop pangolin-newt`)
5. Observe: Site A shows "Offline" in the Sites dashboard
6. Observe: All targets via Site A still show "healthy" in the resource configuration

Expected Behavior

When a site goes offline, all targets routed through that site should immediately transition to "unhealthy" or "unknown" status. Pangolin should not route traffic to targets on offline sites.

Actual Behavior

- Site correctly shows "Offline"
- Target health check status retains the last known value ("healthy")
- Pangolin continues to route traffic through the dead tunnel
- Users experience sporadic timeouts (requests randomly hit the dead route)

Root Cause Analysis

Based on log analysis:

1. Health checks run through the Newt tunnel (Pangolin → WebSocket → Newt → HTTP → target)
2. When Newt disconnects, no new health check results arrive
3. The last-known-good status stays in the database and is displayed as current
4. Additionally, the `newt/disconnecting` message type throws an exception instead of triggering state cleanup:
   ```
   Unsupported message type: newt/disconnecting
   ```
5. Pangolin continues sending health check requests to the disconnected Newt (phantom checks)
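
To illustrate the log line quoted in the list above, here is a minimal sketch of a type-keyed message dispatcher. The `WsMessage` shape, the `handlers` registry, and `dispatch` are hypothetical names chosen for illustration and are not taken from the Pangolin source. It only shows how an unregistered message type ends in an exception, and how a registered handler turns the same message into a cleanup trigger.

```typescript
// Hypothetical message shape and registry; illustrative names only.
interface WsMessage {
    type: string;
    data: Record<string, unknown>;
}

type Handler = (msg: WsMessage) => string;

const handlers = new Map<string, Handler>();

// Look up the handler by message type and run it.
function dispatch(msg: WsMessage): string {
    const handler = handlers.get(msg.type);
    if (!handler) {
        // Same wording as the log line quoted in the report.
        throw new Error(`Unsupported message type: ${msg.type}`);
    }
    return handler(msg);
}

// With a handler registered, the disconnect notice can start state cleanup
// instead of raising an error. The return value is a stand-in for real work.
handlers.set(
    "newt/disconnecting",
    (msg) => `cleanup for site ${String(msg.data.siteId)}`
);
```

Any message type without an entry in the registry still throws, so the error itself is a useful signal that a protocol message has no server-side counterpart.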

Environment

- Pangolin: Enterprise Edition (PostgreSQL)
- Newt: v1.10.3
- Setup: 4 targets for Proxmox VE (172.16.50.8:8006) across 4 sites, 1 site taken offline

Suggested Fix

When a Newt disconnect is detected:

1. Set all target health checks on that site to `"unknown"` or `"unhealthy"`
2. Handle the `newt/disconnecting` message type (currently throws an exception)
3. Stop sending health check requests to disconnected sites
4. When Newt reconnects, resume health checks and let them naturally transition back to "healthy"
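
The steps above can be sketched as a small in-memory state model. This is only an illustration of the intended transitions, not Pangolin's implementation: the `Site` and `Target` interfaces and the three functions are hypothetical, and the real state lives in the database.

```typescript
// Hypothetical in-memory model; Pangolin's real schema and handlers differ.
type Health = "healthy" | "unhealthy" | "unknown";

interface Site {
    siteId: number;
    online: boolean;
}

interface Target {
    targetId: number;
    siteId: number;
    hcHealth: Health;
}

// Mark the site offline and invalidate every health result that was
// reported through it. Returns the number of targets that were reset.
function onNewtDisconnect(site: Site, targets: Target[]): number {
    site.online = false;
    let reset = 0;
    for (const t of targets) {
        if (t.siteId === site.siteId && t.hcHealth !== "unknown") {
            t.hcHealth = "unknown";
            reset++;
        }
    }
    return reset;
}

// Only sites with a live Newt connection should receive check requests.
function shouldSendHealthCheck(site: Site): boolean {
    return site.online;
}

// On reconnect, only the site flips back to online. Targets stay "unknown"
// until a fresh health check result arrives and overwrites the status.
function onNewtReconnect(site: Site): void {
    site.online = true;
}
```

Leaving targets at "unknown" on reconnect, rather than restoring the previous value, means a target is only shown as "healthy" again once a check has actually succeeded.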

GiteaMirror closed this issue

2026-05-13 18:33:36 -05:00

GiteaMirror commented

2026-05-13 18:33:38 -05:00

@strausmann commented on GitHub (Mar 24, 2026):

Code Analysis — Root Cause Identified

After analyzing the source code, the root cause is clear:

Health Check Flow

Health checks run on the Newt agent (remote), not on the Pangolin server. The Newt performs HTTP checks against the target and reports status back via WebSocket to server/routers/target/handleHealthcheckStatusMessage.ts, which updates targetHealthCheck.hcHealth in the database.

The Bug: Three disconnect paths, none invalidate HC status

When a Newt disconnects, three code paths handle the cleanup — but none of them reset target health check status:

| Code Path | File | What it does | What it misses |
|-----------|------|--------------|----------------|
| Explicit disconnect | `server/routers/newt/handleNewtDisconnectingMessage.ts` | Sets `sites.online = false` | Does NOT touch `targetHealthCheck.hcHealth` |
| Offline checker (ping timeout) | `server/routers/newt/handleNewtPingMessage.ts` (L26-78) | Sets `sites.online = false` | Does NOT touch `targetHealthCheck.hcHealth` |
| WebSocket close | `server/routers/ws/ws.ts` (L376) | Removes client from tracking | Does NOT touch `targetHealthCheck.hcHealth` |

Mitigation in Traefik Config (partial)

`server/lib/traefik/getTraefikConfig.ts` (L500) does filter out targets from offline sites when generating Traefik config — but only if at least one other site for that resource is online. This means:

- Multi-site resources: Traffic routing is partially protected (offline site targets excluded)
- Single-site resources: No protection; the stale "healthy" status causes routing to a dead tunnel
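
The filtering rule described above can be modelled in a few lines. This is a sketch of the behaviour as described in this comment, not the code in `getTraefikConfig.ts`; `RoutedTarget` and `selectRoutableTargets` are hypothetical names.

```typescript
// Hypothetical view of a target joined with its site's online flag.
interface RoutedTarget {
    targetId: number;
    siteId: number;
    siteOnline: boolean;
}

// Drop targets on offline sites, but only when at least one target of the
// resource sits on an online site. Otherwise the list is returned unchanged.
function selectRoutableTargets(targets: RoutedTarget[]): RoutedTarget[] {
    const anyOnline = targets.some((t) => t.siteOnline);
    return anyOnline ? targets.filter((t) => t.siteOnline) : targets;
}
```

With this rule, a resource whose only site is offline keeps all of its targets in the generated config, which is the single-site gap called out above.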

The Dashboard Problem

Even with the Traefik mitigation, the dashboard always shows the stale DB value. Users see green "healthy" badges for targets on an offline site, which is misleading.
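
One way to picture what the dashboard could show instead is to derive the displayed status from both the stored value and the site's online flag at read time. This is a hypothetical sketch for illustration and is not something Pangolin currently does; `displayedHealth` is an invented helper.

```typescript
type Health = "healthy" | "unhealthy" | "unknown";

// Hypothetical helper: the status shown to the user. A stored result is only
// trusted while the site that reported it is still online.
function displayedHealth(storedHealth: Health, siteOnline: boolean): Health {
    return siteOnline ? storedHealth : "unknown";
}
```

A read-time derivation like this would hide the stale badge, but it leaves the stored value unchanged, so anything else that reads `hcHealth` directly would still see "healthy".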

Suggested Fix

In each of the three disconnect handlers, add a query to reset health check status:

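```typescript
// Assumes drizzle-orm, plus Pangolin's own `db`, `targets`, and `targetHealthCheck` exports.
import { eq, inArray } from "drizzle-orm";

// After setting sites.online = false, reset every health check on the site:
await db
    .update(targetHealthCheck)
    .set({ hcHealth: "unknown" })
    .where(
        inArray(
            targetHealthCheck.targetId,
            // Subquery: IDs of all targets routed through the disconnected site
            db
                .select({ id: targets.targetId })
                .from(targets)
                .where(eq(targets.siteId, siteId))
        )
    );
```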

This ensures targets transition to "unknown" immediately when their Newt disconnects, and naturally recover to "healthy" when the Newt reconnects and health checks resume.

Reference: github-starred/pangolin#13016