[GH-ISSUE #268] Possible memory leak running on debian 12 #2063

Closed
opened 2026-05-03 05:46:30 -05:00 by GiteaMirror · 14 comments

Originally created by @sotima on GitHub (Mar 12, 2026).
Original GitHub issue: https://github.com/fosrl/newt/issues/268

Originally assigned to: @LaurenceJJones on GitHub.

Describe the Bug

Hi there and thanks for the great work you have done!
I am experiencing a problem with memory leakage here. I have Pangolin running on a VPS and the newt client in an LXC container on my Proxmox server.

Over time the LXC container runs out of memory, roughly after one week depending on the memory allocated to the container. Nothing else is running on the LXC - only newt.

Thanks for looking into it!

Environment

  • LXC-Container with Debian 12 under Proxmox 8.4.14
  • Pangolin Version: 1.16.2
  • Gerbil Version: 1.3.0
  • Traefik Version: 1.3.6
  • Newt Version: 1.10.1
  • Olm Version: (if applicable)

To Reproduce

Just install the newt client in an LXC container with Debian 12 (also tried Alpine, same behaviour) under Proxmox and let it run. Run top in the console of the container.

Below is a capture of top running for about 2-3 minutes - observe the values in the RES column, starting at 93408 and ending at 94688:

When I started the test, one browser window was connected through the client. In the middle of the test, I closed the connection. The growth slowed down a bit, but memory still increased.

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
154 newt      20   0 1317720  93408  18944 S   0.0   8.9   0:58.85 newt     
154 newt      20   0 1317720  93408  18944 S   0.3   8.9   0:58.86 newt     
154 newt      20   0 1317720  93536  18944 S   0.0   8.9   0:58.86 newt     
154 newt      20   0 1317720  93536  18944 S   0.3   8.9   0:58.87 newt     
154 newt      20   0 1317720  93536  18944 S   0.0   8.9   0:58.87 newt     
154 newt      20   0 1317720  93664  18944 S   0.3   8.9   0:58.88 newt     
154 newt      20   0 1317720  93664  18944 S   0.3   8.9   0:58.89 newt     
154 newt      20   0 1317720  93664  18944 S   0.0   8.9   0:58.89 newt     
154 newt      20   0 1317720  93664  18944 S   0.3   8.9   0:58.90 newt     
154 newt      20   0 1317720  93792  18944 S   0.0   8.9   0:58.90 newt     
154 newt      20   0 1317720  93792  18944 S   0.3   8.9   0:58.91 newt     
154 newt      20   0 1317720  93792  18944 S   0.0   8.9   0:58.91 newt     
154 newt      20   0 1317720  93920  18944 S   0.3   9.0   0:58.92 newt     
154 newt      20   0 1317720  93920  18944 S   0.0   9.0   0:58.92 newt     
154 newt      20   0 1317720  93920  18944 S   0.0   9.0   0:58.92 newt     
154 newt      20   0 1317720  93920  18944 S   0.3   9.0   0:58.93 newt     
154 newt      20   0 1317720  94048  18944 S   0.0   9.0   0:58.93 newt     
154 newt      20   0 1317720  94048  18944 S   0.3   9.0   0:58.94 newt     
154 newt      20   0 1317720  94048  18944 S   0.0   9.0   0:58.94 newt     
154 newt      20   0 1317720  94048  18944 S   0.3   9.0   0:58.95 newt     
154 newt      20   0 1317720  94176  18944 S   0.3   9.0   0:58.96 newt     
154 newt      20   0 1317720  94176  18944 S   0.0   9.0   0:58.96 newt     
154 newt      20   0 1317720  94176  18944 S   0.3   9.0   0:58.97 newt     
154 newt      20   0 1317720  94304  18944 S   0.0   9.0   0:58.97 newt     
154 newt      20   0 1317720  94304  18944 S   0.3   9.0   0:58.98 newt     
154 newt      20   0 1317720  94304  18944 S   0.0   9.0   0:58.98 newt     
154 newt      20   0 1317720  94304  18944 S   0.0   9.0   0:58.98 newt     
154 newt      20   0 1317720  94432  18944 S   0.3   9.0   0:58.99 newt     
154 newt      20   0 1317720  94432  18944 S   0.0   9.0   0:58.99 newt     
154 newt      20   0 1317720  94432  18944 S   0.3   9.0   0:59.00 newt     
154 newt      20   0 1317720  94432  18944 S   0.0   9.0   0:59.00 newt     
154 newt      20   0 1317720  94432  18944 S   0.3   9.0   0:59.01 newt     
154 newt      20   0 1317720  94432  18944 S   0.0   9.0   0:59.01 newt     
154 newt      20   0 1317720  94432  18944 S   0.3   9.0   0:59.02 newt     
154 newt      20   0 1317720  94432  18944 S   0.0   9.0   0:59.02 newt     
154 newt      20   0 1317720  94432  18944 S   0.0   9.0   0:59.02 newt     
154 newt      20   0 1317720  94432  18944 S   0.3   9.0   0:59.03 newt     
154 newt      20   0 1317720  94432  18944 S   0.3   9.0   0:59.04 newt     
154 newt      20   0 1317720  94432  18944 S   0.0   9.0   0:59.04 newt     
154 newt      20   0 1317720  94432  18944 S   0.3   9.0   0:59.05 newt     
154 newt      20   0 1317720  94432  18944 S   0.0   9.0   0:59.05 newt     
154 newt      20   0 1317720  94432  18944 S   0.3   9.0   0:59.06 newt     
154 newt      20   0 1317720  94560  18944 S   0.0   9.0   0:59.06 newt     
154 newt      20   0 1317720  94560  18944 S   0.3   9.0   0:59.07 newt     
154 newt      20   0 1317720  94560  18944 S   0.0   9.0   0:59.07 newt     
154 newt      20   0 1317720  94688  18944 S   0.0   9.0   0:59.07 newt     
154 newt      20   0 1317720  94688  18944 S   0.3   9.0   0:59.08 newt     
154 newt      20   0 1317720  94688  18944 S   0.0   9.0   0:59.08 newt

Expected Behavior

No increase in memory usage over time.


@LaurenceJJones commented on GitHub (Mar 12, 2026):

> No increase in memory usage over time.

I don't understand this expected behavior; all software's memory usage increases over time and compacts back down once the memory has been garbage collected. Newt acts as a proxy between clients and downstream applications, so it passes byte buffers through itself. Memory will therefore generally increase with usage, and once Go (a garbage-collected runtime) decides an allocated buffer is safe to release, it will do so.

> about 2-3 minutes

This is not a concrete timeline, nor actionable for us to debug as a "leak". If over a day you see memory growing without anything being released, then yes, that is a leak, but so far this is expected memory growth with general usage.

To explain further: even if a client disconnects, Go will not release this memory straight away. This is a caveat of using Go; we have no direct control over memory usage and depend on the GC being "smart" enough.

If you want the technical GC docs: https://go.dev/doc/gc-guide

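As an illustration of that behavior (a standalone sketch, not Newt code), the Go runtime's own counters distinguish heap that is actually in use from idle heap and from heap already returned to the OS; RES in top reflects resident pages, which can still include idle heap the runtime has not yet returned:

// Standalone sketch, not part of Newt: periodically print the Go runtime's
// heap counters. HeapInuse is what the program is actually using, HeapIdle
// is heap held by the runtime but not in use, and HeapReleased is the part
// of that which has already been returned to the OS.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("in use: %d KiB  idle: %d KiB  released to OS: %d KiB  obtained from OS: %d KiB\n",
			m.HeapInuse/1024, m.HeapIdle/1024, m.HeapReleased/1024, m.Sys/1024)
		time.Sleep(10 * time.Second)
	}
}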
<!-- gh-comment-id:4044998363 --> @LaurenceJJones commented on GitHub (Mar 12, 2026): > No increase in memory usage over time. I dont understand this expected behavior, all software increases memory over time and compacts back down once the memory has been garbage collected. Newt acts as a proxy between clients and downstream applications so it passes byte buffers through itself so generally this will increase with usage and then once Go (garbage collected runtime) decides a buffer or memory allocated is safe to release it will do so. > about 2-3 minutes This is not concrete timeline or actionable for us to debug as a "leak", if generally over a day you see consistent memory without _anything_ being released, then yes thats a leak but so far this is expected memory growth with general usage. To further explain even if a client disconnects Go will not release this memory straight away, this is just a caveat of using Go we have no control over memory usage and depend on the GC being "smart" enough. [if you want technical gc docs](https://go.dev/doc/gc-guide)
Author
Owner

@sotima commented on GitHub (Mar 12, 2026):

Ok, you are right concerning the expected behaviour. I should have written: memory shall be de-allocated on a regular basis, preventing the system from running out of memory. Concerning the top results, here is the Proxmox memory graph over three days:

On March 08, 13:00 it started with 38 MB of memory and steadily increased up to 443 MB by March 10, 13:00. I then
restarted the LXC and it went from 74 MB up to 324 MB by 16:00 the next day. Then I increased the memory of the LXC to 1 GB (it had been 512 MB); it started at 40 MB and today at 10:00 it has reached 115 MB.
I cannot see any memory de-allocation in this graph, although I am sure there are no connections during the night.

Image: https://github.com/user-attachments/assets/255572ef-ac0f-4724-b461-645c424ef825

@github-actions[bot] commented on GitHub (Mar 27, 2026):

This issue has been automatically marked as stale due to 14 days of inactivity. It will be closed in 14 days if no further activity occurs.


@strausmann commented on GitHub (Mar 27, 2026):


Additional findings: TCP connection leak with SMTP targets (FIN-WAIT-2 accumulation)

We are experiencing the same issue and have done a detailed root cause analysis. Sharing our findings here as they complement the reports in #268 and #238, and directly relate to PR #277.


Environment:

  • Newt version: fosrl/newt:latest (Newt 1.10.3, image pulled 2026-03-26)
  • Deployment: Docker container, network_mode: bridge, on Linux (Ubuntu 24.04)
  • Setup: 5 nodes running Newt, identical base configuration
  • Only 1 node affected — the one routing SMTP traffic (TCP port 25 and 26)

Metrics (affected node vs. normal nodes after ~35h uptime):

Metric                   Affected Node (node-A)   Normal Nodes (4x, node-B through node-E)
RAM (RSS)                1.98 GiB                 35–41 MiB
Virtual memory           4.8 GiB                  ~200 MiB
CPU usage                234%                     0.2–2.8%
Open file descriptors    3,590                    ~100
Total TCP connections    14,447                   normal
Goroutines / threads     38                       expected
Uptime at measurement    35h                      up to 7 days (no growth)

RAM growth factor: ~51x compared to healthy nodes.


Root cause analysis:

We identified two concurrent issues:

Issue 1: TCP connection leak in the forwarder (primary cause)

node-A routes a mail gateway resource with health checks enabled on TCP port 25 and 26. The TCP Forwarder generates a very high rate of connections — we observed 246 connection log entries to a single TCP target within a 500-line log window.

Each connection goes through the pattern:

TCP Forwarder: Handling connection ...
TCP Forwarder: Successfully connected to 10.x.x.A:25 starting bidirectional copy

These connections accumulate in FIN-WAIT-2 state:

FIN-WAIT-2: 10.x.x.B:60026 -> 10.x.x.A:26
FIN-WAIT-2: 10.x.x.B:47698 -> 10.x.x.A:26
FIN-WAIT-2: 10.x.x.B:51698 -> 10.x.x.A:25
FIN-WAIT-2: 10.x.x.B:33140 -> 10.x.x.A:26
FIN-WAIT-2: 10.x.x.B:52742 -> 10.x.x.A:26
FIN-WAIT-2: 10.x.x.B:56314 -> 10.x.x.A:25

FIN-WAIT-2 explanation: Newt sends FIN (local side closes), but the remote host (mail server or Gerbil tunnel endpoint) never sends the final FIN. These half-closed connections are never cleaned up by the OS, each consuming one file descriptor indefinitely.

With 3,590 open FDs (vs. ~100 on healthy nodes), this confirms the file descriptor leak is driven by accumulated FIN-WAIT-2 TCP connections.

Issue 2: UDP DNS connection leak (amplifying factor, related to PR #277)

We also observed 11+ simultaneous UDP connections to our DNS resolver, each with a separate, non-reused file descriptor:

UDP: fd=86  -> dns-resolver:853
UDP: fd=479 -> dns-resolver:853
UDP: fd=1057 -> dns-resolver:853
UDP: fd=1232 -> dns-resolver:853
UDP: fd=2749 -> dns-resolver:853
... (11 separate FDs for DNS)

This matches exactly the pattern described in PR #277: UDP buffers are allocated without sync.Pool, causing each DNS lookup to allocate a new buffer and connection object without reuse. Under high TCP load (Issue 1), DNS lookups are frequent, multiplying this effect.

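For illustration only (a generic sketch of the buffer-reuse pattern, not the actual contents of PR #277; bufPool, maxPacketSize, and handlePacket are made-up names), reusing read buffers via sync.Pool looks roughly like this:

// Generic sketch of the sync.Pool buffer-reuse pattern, not Newt/PR #277 code:
// read buffers are taken from and returned to a pool instead of being
// allocated per packet, so steady-state allocation stays flat.
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

const maxPacketSize = 65535

var bufPool = sync.Pool{
	// Store a pointer to the slice so Get/Put avoid re-boxing the slice header.
	New: func() any {
		b := make([]byte, maxPacketSize)
		return &b
	},
}

// handlePacket reads one chunk using a pooled buffer instead of a fresh allocation.
func handlePacket(r io.Reader) (int, error) {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp) // return the buffer for reuse

	n, err := r.Read(*bp)
	if err != nil {
		return 0, err
	}
	// ... forward (*bp)[:n] to the target here ...
	return n, nil
}

func main() {
	n, err := handlePacket(bytes.NewReader([]byte("example payload")))
	fmt.Println(n, err)
}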

Correlation: Why only the node with SMTP targets is affected

All 5 nodes run identical Newt configurations. The only difference:

  • node-A (affected): Routes a mail gateway resource. Health checks run on TCP port 25/26. This generates continuous short-lived TCP connections.
  • node-B through node-E (healthy): Route HTTP/HTTPS resources only. No TCP-level health checks on long-lived SMTP connections.

After restarting Newt on node-A, RAM immediately dropped back to ~40 MiB and CPU to <3%. The other nodes have been running continuously for 7+ days without any memory growth.


Key observation: health check behavior on TCP targets

SMTP (port 25/26) connections have a specific characteristic: the server side keeps the connection alive waiting for client commands (e.g., EHLO, QUIT). When Newt's TCP forwarder opens a health-check or probe connection without completing the SMTP handshake, the server holds the connection open. Newt sends FIN locally, but the SMTP server never sends FIN back — resulting in permanent FIN-WAIT-2.

This is not SMTP-specific per se — any TCP target that holds connections open waiting for application-layer data (SMTP, SSH, database ports, etc.) will trigger this pattern.


Workaround (applied until upstream fix):

# Immediate fix: restart Newt
docker restart pangolin-newt

# Containment: set memory limit in docker-compose.yml
deploy:
  resources:
    limits:
      memory: 512m

# Prevent OOM: scheduled daily restart via cron
0 4 * * * root /usr/bin/docker restart pangolin-newt

The memory limit prevents the host from being affected if the leak accelerates; the cron restart keeps RAM in check until a fix is released.


Suggested fix directions:

  1. TCP Forwarder: Implement a connection timeout and explicit FIN-WAIT-2 cleanup — if a connection has been in FIN-WAIT state for longer than N seconds, force close and release the goroutine and FD.
  2. TCP Forwarder: Track open forwarding goroutines with a bounded goroutine pool to prevent unbounded growth.
  3. UDP buffers: Apply sync.Pool as proposed in PR #277 to avoid allocating new buffer objects per DNS query.
  4. Health checks on TCP targets: Consider adding an application-layer close signal (e.g., send QUIT\r\n for SMTP) before closing the TCP connection, to allow the server to send a proper FIN (a minimal sketch follows after this list).
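
To illustrate suggestion 4, a minimal, hypothetical probe (not Newt's actual health-check code; probeSMTP and the target address are made up) that completes an application-layer QUIT before closing could look like this. The SMTP server then replies and sends its own FIN, so the socket does not linger in FIN-WAIT-2:

// Hypothetical SMTP-aware probe sketch, not Newt code: read the greeting,
// send QUIT, and wait for the server's reply/EOF before closing, so the
// remote side closes cleanly instead of leaving us in FIN-WAIT-2.
package main

import (
	"fmt"
	"net"
	"time"
)

func probeSMTP(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Hard deadline for the whole probe so it can never hang.
	_ = conn.SetDeadline(time.Now().Add(10 * time.Second))

	buf := make([]byte, 512)
	if _, err := conn.Read(buf); err != nil { // 220 greeting
		return err
	}
	if _, err := conn.Write([]byte("QUIT\r\n")); err != nil {
		return err
	}
	_, _ = conn.Read(buf) // 221 reply or EOF: the server closes its side
	return nil
}

func main() {
	fmt.Println(probeSMTP("127.0.0.1:25")) // hypothetical target address
}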

References:

  • This issue: #268 (memory leak, same symptoms)
  • Related: #238 (100% CPU after DNS failure — our 234% CPU may be related)
  • Related PR: #277 (UDP sync.Pool — directly observed in our FD dump)
  • Pangolin issue: fosrl/pangolin#2631 (endpoint roaming leak — same bidirectional copy log pattern)

@strausmann commented on GitHub (Mar 27, 2026):

Suggested code fix: TCP connection timeout and half-close in proxy/manager.go

After reading the full source code, here is a concrete analysis and fix proposal for the TCP connection leak.


Root cause in proxy/manager.go (handleTCPProxy)

The current code at the core of the leak (simplified):

go func(tunnelID string, accepted net.Conn) {
    target, err := net.Dial("tcp", targetAddr)  // ← no timeout
    if err != nil { /* ... */ return }

    var wg sync.WaitGroup
    wg.Add(2)

    go func() {
        defer wg.Done()
        _, _ = io.Copy(cw_target, accepted)
        _ = target.Close()
    }()

    go func() {
        defer wg.Done()
        _, _ = io.Copy(cw_accepted, target)
        _ = accepted.Close()
    }()

    wg.Wait()
}(tunnelID, conn)

Three problems:

  1. net.Dial without timeout — If DNS resolution hangs or the target is slow to respond, the goroutine blocks indefinitely.

  2. io.Copy without deadline — If one side stops sending but doesn't close (common with SMTP EHLO/health checks), the io.Copy reading from that side blocks forever. The connection enters FIN-WAIT-2 and the goroutine + FD leak permanently.

  3. Half-close not propagated — When one io.Copy returns (one direction finished), the other side is not signaled. Both goroutines must independently reach EOF or error, which may never happen for long-lived protocols.


Contrast with netstack2/handlers.go

The netstack2 TCP handler already has better practices:

  • tcpConnectTimeout = 5 * time.Second for dial
  • tcpWaitTimeout = 60 * time.Second for half-close
  • TCP keepalive configured via setTCPSocketOptions
  • CloseRead() / CloseWrite() for half-close signaling in unidirectionalStreamTCP

The proxy/manager.go path lacks all of these.


Suggested fix

const (
    proxyDialTimeout   = 10 * time.Second
    proxyHalfCloseWait = 60 * time.Second
)

// Inside the goroutine spawned per accepted connection in handleTCPProxy:

go func(tunnelID string, accepted net.Conn) {
    connStart := time.Now()

    // Fix 1: Use DialTimeout instead of Dial
    target, err := net.DialTimeout("tcp", targetAddr, proxyDialTimeout)
    if err != nil {
        logger.Error("Error connecting to target: %v", err)
        accepted.Close()
        // ... telemetry ...
        return
    }

    entry := pm.getEntry(tunnelID)
    if entry == nil {
        entry = &tunnelEntry{}
    }
    var wg sync.WaitGroup
    wg.Add(2)

    // Fix 2: Half-close aware copy with idle timeout
    // Goroutine 1: accepted -> target (client sends to upstream)
    go func(ent *tunnelEntry) {
        defer wg.Done()
        cw := &countingWriter{ctx: context.Background(), w: target, set: ent.attrInTCP, pm: pm, ent: ent, out: false, proto: "tcp"}
        _, _ = io.Copy(cw, accepted)

        // Half-close: we finished reading from accepted, signal target
        if cr, ok := accepted.(interface{ CloseRead() error }); ok {
            cr.CloseRead()
        }
        if tw, ok := target.(interface{ CloseWrite() error }); ok {
            tw.CloseWrite()
        }
        // Set deadline on target so goroutine 2's io.Copy(accepted, target)
        // unblocks if the remote never sends FIN back
        target.SetReadDeadline(time.Now().Add(proxyHalfCloseWait))
    }(entry)

    // Goroutine 2: target -> accepted (upstream sends to client)
    go func(ent *tunnelEntry) {
        defer wg.Done()
        cw := &countingWriter{ctx: context.Background(), w: accepted, set: ent.attrOutTCP, pm: pm, ent: ent, out: true, proto: "tcp"}
        _, _ = io.Copy(cw, target)

        // Half-close: we finished reading from target, signal accepted
        if cr, ok := target.(interface{ CloseRead() error }); ok {
            cr.CloseRead()
        }
        if aw, ok := accepted.(interface{ CloseWrite() error }); ok {
            aw.CloseWrite()
        }
        // Set deadline on accepted so goroutine 1's io.Copy(target, accepted)
        // unblocks if the client never sends FIN back
        accepted.SetReadDeadline(time.Now().Add(proxyHalfCloseWait))
    }(entry)

    wg.Wait()

    // Fix 3: Explicit close of both sides after both goroutines are done
    target.Close()
    accepted.Close()

    // ... telemetry / session tracking ...
}(tunnelID, conn)

Key changes:

  1. net.DialTimeout — Prevents indefinite blocking on connection setup (10s timeout).

  2. CloseRead() + CloseWrite() after each io.Copy — Sends TCP FIN to the other side when one direction finishes, signaling proper half-close. This is the same pattern used in netstack2/handlers.go (unidirectionalStreamTCP).

  3. SetReadDeadline on the opposite connection — This is the critical detail. When goroutine 1 (accepted→target) finishes, it sets a deadline on target — because goroutine 2 is reading from target. This gives the remote 60 seconds to send its FIN; if it doesn't, the read times out and the goroutine exits. The same logic applies symmetrically for goroutine 2 setting a deadline on accepted. Without this, FIN-WAIT-2 connections accumulate indefinitely.

  4. Explicit Close() after wg.Wait() — Guarantees both connections are fully closed regardless of the copy outcome. The per-goroutine Close() calls in the current code are removed to avoid closing a connection while the other goroutine may still be using it.


Type safety note

CloseWrite() and CloseRead() do not exist on net.Conn. They are available on *net.TCPConn and *gonet.TCPConn. The interface type assertion (interface{ CloseWrite() error }) is the correct Go-idiomatic way to call these — it works for both standard net.TCPConn and gvisor's gonet.TCPConn. If the connection doesn't support half-close, the assertion fails gracefully and we fall through to the deadline-based cleanup.

Race condition safety

  • SetReadDeadline is safe to call from a different goroutine — the Go net.Conn contract explicitly allows concurrent calls to Read, Write, Close, and deadline setters.
  • CloseRead()/CloseWrite() affect only one direction and are safe while the other direction is active.
  • Close() after wg.Wait() is safe because both goroutines have already exited.

Impact assessment

  • Backward compatible — Only adds timeouts to existing code paths, no API changes
  • Matches netstack2 patterns — The netstack2/handlers.go already uses these exact patterns (CloseRead/CloseWrite, tcpWaitTimeout, tcpConnectTimeout)
  • Fixes FIN-WAIT-2 leak — The half-close + deadline combination ensures connections cannot accumulate indefinitely
  • WebSocket safe — WebSocket uses TCP; half-close is transparent to the protocol layer
  • WireGuard/Gerbil safe — Only affects the local proxy hop, not the tunnel itself

Happy to submit this as a PR if the maintainers think this direction is correct. We can test it in our environment with the SMTP workload that triggers the issue reliably.


@LaurenceJJones commented on GitHub (Apr 1, 2026):

Hey @strausmann I get you want to help, but this is not useful.

In short, "Issue 2: UDP DNS connection leak (amplifying factor, related to PR https://github.com/fosrl/newt/pull/277)" is a hallucination: that PR has nothing to do with DNS; its code modifies the proxy to use a buffer pool between the tunnel connection and the application. I do agree with the direction, though, as I already saw the DNS call happening each time, and we should implement a short cache for DNS lookups even if the TTL is only 30 seconds. Currently we issue a DNS lookup per request, which, if the host never changes, causes the strain you see with FDs (the FDs are sockets, not buffers).

I will go through the rest but next time please keep it constructive.


@strausmann commented on GitHub (Apr 1, 2026):

All right, I'll do it. Thanks


@LaurenceJJones commented on GitHub (Apr 1, 2026):

Hey @sotima please update to 1.10.4 so we can debug this on the live system.

We added Go pprof endpoints to the admin HTTP server. Please restart Newt with the admin interface and pprof enabled:

  • admin interface: --metrics-admin-addr 127.0.0.1:2112 or NEWT_ADMIN_ADDR=127.0.0.1:2112
  • pprof: --pprof or NEWT_PPROF_ENABLED=true

Then, after memory has grown, run this from inside the Proxmox guest/container where Newt is running, since it is bound to 127.0.0.1:

curl -o newt.heap.pprof http://127.0.0.1:2112/debug/pprof/heap

If curl is not installed:

wget -O newt.heap.pprof http://127.0.0.1:2112/debug/pprof/heap

If you have issues downloading it while bound to 127.0.0.1, you can bind to 0.0.0.0 instead if that is easier.

Please send the resulting newt.heap.pprof file to laurence at pangolin.net and mention issue #268.

Also @strausmann, if you hit the memory issue as well, don't hesitate to do the same and send the pprof; live environments make it easier to see actual memory issues than our dev/test environments without "real traffic".


@sotima commented on GitHub (Apr 2, 2026):

Hi @LaurenceJJones: thanks for those changes. I am glad I could help. I have updated to 1.10.4 and changed my systemd/system/newt.service to the following:

ExecStart=/usr/local/bin/newt --id <id> --secret <secret> --endpoint <endpoint> --metrics-admin-addr 0.0.0.0:2112

And restarted the container. When I try to curl the newt.heap.pprof shortly after the start, I only get "404 Page not found" as a result. I also tried

ExecStart=/usr/local/bin/newt --id <id> --secret <secret> --endpoint <endpoint> --metrics-admin-addr 127.0.0.1:2112

Is that expected?


@LaurenceJJones commented on GitHub (Apr 2, 2026):

You need both items; you're missing the pprof flag.


@sotima commented on GitHub (Apr 2, 2026):

...oops...
Too early in the morning!
--pprof added, and now it works.
Now I will let it run for a day or two and send you the result...


@sotima commented on GitHub (Apr 3, 2026):

After 27+ hours, a short status update: it seems you have broken it! ...the memory leak, I mean :-D. Since the update to 1.10.4 and activating --pprof (which is only for debugging, I know), memory consumption has stayed rock solid at 96 MB.


@LaurenceJJones commented on GitHub (Apr 3, 2026):

Damn, we should make sure we re-implement the memory leak so we can find out what it was 😉

Great news. Keep it running with the flags, as they don't hurt anything, and if memory spikes again (most likely due to usage rather than a leak), then provide the profile; I'm sure we can optimize wherever the allocation is happening, and #277 will most likely help as well.


@LaurenceJJones commented on GitHub (Apr 7, 2026):

Closing as not planned, since we couldn't pinpoint a cause.
