[GH-ISSUE #268] Possible memory leak running on debian 12 #1448
Originally created by @sotima on GitHub (Mar 12, 2026).
Original GitHub issue: https://github.com/fosrl/newt/issues/268
Originally assigned to: @LaurenceJJones on GitHub.
Describe the Bug
Hi there and thanks for the great work you have done!
I am experiencing a memory leak here. I have Pangolin running on a VPS and the newt client in an LXC container on my Proxmox server.
Over time the LXC container runs out of memory, roughly after one week, depending on the memory allocated to the container. Nothing else is running on the LXC, only newt.
Thanks for looking into it!
Environment
To Reproduce
Just install the newt client in an LXC container with Debian 12 (also tried Alpine, same behaviour) under Proxmox and let it run. Run top in the console of the container.
Below is a screenshot of top running for about 2-3 minutes. Observe the values in the RES column, starting at 93408 and ending at 94688:
When I started the test, one browser window was connected to the client. In the middle of the test I closed the connection. The increase slowed down a bit, but memory still grew.
Expected Behavior
No increase in memory usage over time.
@LaurenceJJones commented on GitHub (Mar 12, 2026):
I don't understand this expected behavior; all software increases memory over time and compacts back down once the memory has been garbage collected. Newt acts as a proxy between clients and downstream applications, so it passes byte buffers through itself. Generally this will increase with usage, and once Go (a garbage-collected runtime) decides a buffer or allocated memory is safe to release, it will do so.
This is not a concrete timeline or actionable for us to debug as a "leak". If over a day you consistently see memory that is never released, then yes, that's a leak, but so far this looks like expected memory growth with general usage.
To further explain: even if a client disconnects, Go will not release this memory straight away. This is just a caveat of using Go; we have no direct control over memory usage and depend on the GC being "smart" enough.
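As a side illustration of that point (not Newt code, just a minimal standalone snippet): the Go runtime tracks live heap separately from memory it has already returned to the OS, so RES in top can sit well above what the process is actually using.

```go
package main

import (
	"fmt"
	"runtime"
)

// Prints a few runtime memory stats. HeapInuse is memory currently
// holding objects; HeapIdle is memory the runtime keeps around for
// reuse; HeapReleased is memory already handed back to the OS. RES in
// top (the process RSS) can stay high even when HeapInuse is low,
// because the runtime retains freed memory for a while before
// releasing it.
func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapInuse:    %d KiB\n", m.HeapInuse/1024)
	fmt.Printf("HeapIdle:     %d KiB\n", m.HeapIdle/1024)
	fmt.Printf("HeapReleased: %d KiB\n", m.HeapReleased/1024)
	fmt.Printf("Sys:          %d KiB\n", m.Sys/1024)
}
```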
If you want the technical GC docs:
@sotima commented on GitHub (Mar 12, 2026):
OK, you are right concerning the expected behaviour. I should have written: memory shall be de-allocated on a regular basis, preventing the system from running out of memory. Concerning the top results, here is the memory graph from Proxmox over three days:
On March 8 at 13:00 it started with 38 MB of memory and steadily increased up to 443 MB by March 10 at 13:00. I then
restarted the LXC and it went from 74 MB up to 324 MB by 16:00 the next day. Then I increased the memory of the LXC to 1 GB (it had been 512 MB); it then started at 40 MB and today at 10:00 it has reached 115 MB.
I cannot see any memory de-allocation in this graph, although during the night I am sure there are no connections.
@github-actions[bot] commented on GitHub (Mar 27, 2026):
This issue has been automatically marked as stale due to 14 days of inactivity. It will be closed in 14 days if no further activity occurs.
@strausmann commented on GitHub (Mar 27, 2026):
Additional findings: TCP connection leak with SMTP targets (FIN-WAIT-2 accumulation)
We are experiencing the same issue and have done a detailed root cause analysis. Sharing our findings here as they complement the reports in #268 and #238, and directly relate to PR #277.
Environment:
- `fosrl/newt:latest` (Newt 1.10.3, image pulled 2026-03-26)
- `network_mode: bridge`, on Linux (Ubuntu 24.04)

Metrics (affected node vs. normal nodes after ~35h uptime):
- RAM growth factor: ~51x compared to healthy nodes.
Root cause analysis:
We identified two concurrent issues:
Issue 1: TCP connection leak in the forwarder (primary cause)
node-A routes a mail gateway resource with health checks enabled on TCP ports 25 and 26. The TCP forwarder generates a very high rate of connections — we observed 246 connection log entries to a single TCP target within a 500-line log window. Each connection goes through the pattern:
These connections accumulate in FIN-WAIT-2 state:
FIN-WAIT-2 explanation: Newt sends FIN (local side closes), but the remote host (mail server or Gerbil tunnel endpoint) never sends the final FIN. These half-closed connections are never cleaned up by the OS, each consuming one file descriptor indefinitely.
With 3,590 open FDs (vs. ~100 on healthy nodes), this confirms the file descriptor leak is driven by accumulated FIN-WAIT-2 TCP connections.
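As a side note, a cheap way to watch this kind of descriptor growth from inside the process itself is to count the entries under /proc/self/fd; a minimal standalone sketch (Linux only, not Newt code):

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// Counts the file descriptors currently open in this process by
// listing /proc/self/fd (Linux only). A steadily growing count,
// combined with ss/netstat showing FIN-WAIT-2 sockets, points at
// connections that are never fully closed.
func main() {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("open file descriptors: %d\n", len(entries))
}
```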
Issue 2: UDP DNS connection leak (amplifying factor, related to PR #277)
We also observed 11+ simultaneous UDP connections to our DNS resolver, each with separate, non-reused file descriptors:
This matches exactly the pattern described in PR #277: UDP buffers are allocated without `sync.Pool`, causing each DNS lookup to allocate a new buffer and connection object without reuse. Under high TCP load (Issue 1), DNS lookups are frequent, multiplying this effect.
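For illustration, the buffer-reuse direction points at the general `sync.Pool` pattern; a minimal sketch with a hypothetical helper (`pooledCopy`), not the actual PR #277 code:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

// bufPool hands out reusable 64 KiB copy buffers so each operation does
// not allocate its own. Entries are *[]byte to avoid re-boxing the
// slice header on every Put.
var bufPool = sync.Pool{
	New: func() any { b := make([]byte, 64*1024); return &b },
}

// pooledCopy is a hypothetical helper: like io.Copy, but using a buffer
// borrowed from bufPool instead of a freshly allocated one.
func pooledCopy(dst io.Writer, src io.Reader) (int64, error) {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp)
	return io.CopyBuffer(dst, src, *bp)
}

func main() {
	var dst bytes.Buffer
	if _, err := pooledCopy(&dst, strings.NewReader("hello")); err != nil {
		panic(err)
	}
	fmt.Println(dst.String())
}
```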
Correlation: Why only the node with SMTP targets is affected

All 5 nodes run identical Newt configurations. The only difference:
After restarting Newt on node-A, RAM immediately dropped back to ~40 MiB and CPU to <3%. The other nodes have been running continuously for 7+ days without any memory growth.
Key observation: health check behavior on TCP targets
SMTP (port 25/26) connections have a specific characteristic: the server side keeps the connection alive waiting for client commands (e.g., `EHLO`, `QUIT`). When Newt's TCP forwarder opens a health-check or probe connection without completing the SMTP handshake, the server holds the connection open. Newt sends FIN locally, but the SMTP server never sends FIN back — resulting in permanent FIN-WAIT-2.

This is not SMTP-specific per se — any TCP target that holds connections open waiting for application-layer data (SMTP, SSH, database ports, etc.) will trigger this pattern.
Workaround (applied until upstream fix):
The memory limit prevents the host from being affected if the leak accelerates; the cron restart keeps RAM in check until a fix is released.
Suggested fix directions:
- Use `sync.Pool` as proposed in PR #277 to avoid allocating new buffer objects per DNS query.
- Send an application-level close (e.g., `QUIT\r\n` for SMTP) before closing the TCP connection, to allow the server to send a proper FIN.

References: #268, #238, PR #277 (the "bidirectional copy" log pattern).

@strausmann commented on GitHub (Mar 27, 2026):
Suggested code fix: TCP connection timeout and half-close in `proxy/manager.go`

After reading the full source code, here is a concrete analysis and fix proposal for the TCP connection leak.

Root cause in `proxy/manager.go` (`handleTCPProxy`)

The current code at the core of the leak (simplified):
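A hedged reconstruction of that shape, based on the problems listed below (illustrative only; the function signature and variable names are assumptions, not the actual `proxy/manager.go` source):

```go
package sketch

import (
	"io"
	"net"
	"sync"
)

// handleTCP sketches the leaky pattern described in this comment (an
// illustrative reconstruction, not the real proxy/manager.go code):
// dial with no timeout, copy both directions with no deadlines, and no
// half-close signaling between the two copy goroutines.
func handleTCP(accepted net.Conn, targetAddr string) {
	target, err := net.Dial("tcp", targetAddr) // problem 1: no dial timeout
	if err != nil {
		accepted.Close()
		return
	}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		io.Copy(target, accepted) // problem 2: blocks forever if accepted goes quiet
	}()
	go func() {
		defer wg.Done()
		// problem 3: nothing signals this copy when the other direction
		// finishes; if target never sends FIN, this read blocks forever.
		io.Copy(accepted, target)
	}()
	wg.Wait()
	accepted.Close()
	target.Close()
}
```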
Three problems:
1. `net.Dial` without timeout — If DNS resolution hangs or the target is slow to respond, the goroutine blocks indefinitely.
2. `io.Copy` without deadline — If one side stops sending but doesn't close (common with SMTP EHLO/health checks), the `io.Copy` reading from that side blocks forever. The connection enters FIN-WAIT-2 and the goroutine + FD leak permanently.
3. Half-close not propagated — When one `io.Copy` returns (one direction finished), the other side is not signaled. Both goroutines must independently reach EOF or error, which may never happen for long-lived protocols.
Contrast with `netstack2/handlers.go`

The netstack2 TCP handler already has better practices:
- `tcpConnectTimeout = 5 * time.Second` for dial
- `tcpWaitTimeout = 60 * time.Second` for half-close
- `setTCPSocketOptions`
- `CloseRead()` / `CloseWrite()` for half-close signaling in `unidirectionalStreamTCP`

The `proxy/manager.go` path lacks all of these.

Suggested fix
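A sketch of that direction (hedged: the helper shape, names, and the 10s/60s values mirror the description below, but this is illustrative, not a drop-in patch for `proxy/manager.go`):

```go
package sketch

import (
	"io"
	"net"
	"sync"
	"time"
)

const (
	connectTimeout = 10 * time.Second // bound on dialing the target
	waitTimeout    = 60 * time.Second // grace period for the remote FIN
)

// Optional half-close capabilities; *net.TCPConn and gvisor's
// gonet.TCPConn both satisfy these.
type closeWriter interface{ CloseWrite() error }
type closeReader interface{ CloseRead() error }

// copyAndSignal copies src -> dst, then signals that this direction is
// done: half-close both sides and bound how long the opposite copy
// (which reads from dst) may keep waiting for the remote FIN.
func copyAndSignal(dst, src net.Conn) {
	io.Copy(dst, src)
	if cw, ok := dst.(closeWriter); ok {
		cw.CloseWrite() // send our FIN: no more data towards dst
	}
	if cr, ok := src.(closeReader); ok {
		cr.CloseRead() // we will not read from src again
	}
	// The other goroutine reads from dst; give the remote waitTimeout
	// to send its FIN, after which that read times out and exits.
	dst.SetReadDeadline(time.Now().Add(waitTimeout))
}

// handleTCPProxy sketches the suggested shape (hypothetical name).
func handleTCPProxy(accepted net.Conn, targetAddr string) {
	target, err := net.DialTimeout("tcp", targetAddr, connectTimeout)
	if err != nil {
		accepted.Close()
		return
	}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); copyAndSignal(target, accepted) }() // accepted -> target
	go func() { defer wg.Done(); copyAndSignal(accepted, target) }() // target -> accepted
	wg.Wait()

	// Both copy goroutines have exited; now it is safe to fully close.
	accepted.Close()
	target.Close()
}
```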
Key changes:
- `net.DialTimeout` — Prevents indefinite blocking on connection setup (10s timeout).
- `CloseRead()` + `CloseWrite()` after each `io.Copy` — Sends TCP FIN to the other side when one direction finishes, signaling proper half-close. This is the same pattern used in `netstack2/handlers.go` (`unidirectionalStreamTCP`).
- `SetReadDeadline` on the opposite connection — This is the critical detail. When goroutine 1 (accepted→target) finishes, it sets a deadline on `target` — because goroutine 2 is reading from `target`. This gives the remote 60 seconds to send its FIN; if it doesn't, the read times out and the goroutine exits. The same logic applies symmetrically for goroutine 2 setting a deadline on `accepted`. Without this, FIN-WAIT-2 connections accumulate indefinitely.
- Explicit `Close()` after `wg.Wait()` — Guarantees both connections are fully closed regardless of the copy outcome. The per-goroutine `Close()` calls in the current code are removed to avoid closing a connection while the other goroutine may still be using it.

Type safety note
`CloseWrite()` and `CloseRead()` do not exist on `net.Conn`. They are available on `*net.TCPConn` and `*gonet.TCPConn`. The interface type assertion (`interface{ CloseWrite() error }`) is the correct Go-idiomatic way to call these — it works for both the standard `net.TCPConn` and gvisor's `gonet.TCPConn`. If the connection doesn't support half-close, the assertion fails gracefully and we fall through to the deadline-based cleanup.

Race condition safety

`SetReadDeadline` is safe to call from a different goroutine — the Go `net.Conn` contract explicitly allows concurrent calls to `Read`, `Write`, `Close`, and the deadline setters. `CloseRead()` / `CloseWrite()` affect only one direction and are safe while the other direction is active. `Close()` after `wg.Wait()` is safe because both goroutines have already exited.

Impact assessment

- `netstack2/handlers.go` already uses these exact patterns (`CloseRead`/`CloseWrite`, `tcpWaitTimeout`, `tcpConnectTimeout`).

Happy to submit this as a PR if the maintainers think this direction is correct. We can test it in our environment with the SMTP workload that triggers the issue reliably.
@LaurenceJJones commented on GitHub (Apr 1, 2026):
Hey @strausmann I get you want to help, but this is not useful.
In short, this part is a hallucination: "Issue 2: UDP DNS connection leak (amplifying factor, related to PR https://github.com/fosrl/newt/pull/277)". It has nothing to do with DNS; the code modifies the proxy to use a buffer pool between the tunnel connection and the application. That said, I agree with the direction, as I already saw the DNS call on each connection: we should implement a short cache for DNS lookups, even if the TTL is only 30 seconds, because currently we issue a DNS query per request, and if the host never changes that causes the strain you see with FDs (the FDs are sockets, not buffers).
I will go through the rest, but next time please keep it constructive.
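A short TTL cache of the kind suggested above could look roughly like this; a minimal sketch with hypothetical names (`dnsCache`, `Lookup`), not Newt code:

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// dnsCache remembers lookup results per host for a short, fixed TTL so
// that frequent dials to the same target do not each hit the resolver.
type dnsCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]cachedEntry
}

type cachedEntry struct {
	addrs   []string
	expires time.Time
}

func newDNSCache(ttl time.Duration) *dnsCache {
	return &dnsCache{ttl: ttl, m: make(map[string]cachedEntry)}
}

// Lookup returns cached addresses if they are still fresh, otherwise
// resolves the host and caches the result for the configured TTL.
func (c *dnsCache) Lookup(host string) ([]string, error) {
	c.mu.Lock()
	e, ok := c.m[host]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.addrs, nil
	}
	addrs, err := net.LookupHost(host)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.m[host] = cachedEntry{addrs: addrs, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addrs, nil
}

func main() {
	c := newDNSCache(30 * time.Second)
	addrs, err := c.Lookup("example.com")
	fmt.Println(addrs, err)
}
```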
@strausmann commented on GitHub (Apr 1, 2026):
All right, I'll do it. Thanks
@LaurenceJJones commented on GitHub (Apr 1, 2026):
Hey @sotima please update to `1.10.4` so we can debug this on the live system.

We added Go `pprof` endpoints to the admin HTTP server. Please restart Newt with the admin interface and pprof enabled:

- `--metrics-admin-addr 127.0.0.1:2112` or `NEWT_ADMIN_ADDR=127.0.0.1:2112`
- `--pprof` or `NEWT_PPROF_ENABLED=true`

Then, after memory has grown, run this from inside the Proxmox guest/container where Newt is running, since it is bound to `127.0.0.1`:

If `curl` is not installed:

If you have issues getting it to download while binding to `127.0.0.1`, you can bind to `0.0.0.0` instead if that is easier.

Please send the resulting `newt.heap.pprof` file to laurence at pangolin.net and mention issue #268.

Also @strausmann, if you have memory issues, don't hesitate to do the same and send the pprof; live environments make it easier to see actual memory issues than our dev/test envs without "real traffic".
@sotima commented on GitHub (Apr 2, 2026):
Hi @LaurenceJJones: thanks for those changes. I am glad I could help. I have updated to 1.10.4 and changed my systemd/system/newt.service to the following:

`ExecStart=/usr/local/bin/newt --id <id> --secret <secret> --endpoint <endpoint> --metrics-admin-addr 0.0.0.0:2112`

And restarted the container. When I try to curl the newt.heap.pprof shortly after the start, I only get "404 Page not found" as a result. I also tried with:

`ExecStart=/usr/local/bin/newt --id <id> --secret <secret> --endpoint <endpoint> --metrics-admin-addr 127.0.0.1:2112`

Is that expected?
@LaurenceJJones commented on GitHub (Apr 2, 2026):
You need both items; you're missing the `--pprof` flag.
@sotima commented on GitHub (Apr 2, 2026):
...oops...
Too early in the morning!
--pprof added, and now it works.
Now I will let it run for a day or two and send you the result...
@sotima commented on GitHub (Apr 3, 2026):
After 27+ hours, a short status update: seems you have broken it! ...the memory leak, I mean :-D. Since the update to 1.10.4 and activating `--pprof` (which is only for debugging, I know), the memory consumption stays rock solid at 96 MB.
@LaurenceJJones commented on GitHub (Apr 3, 2026):
Damn, we should make sure we re-implement the memory leak so we can find what it was 😉
Great news. Keep it running with the flags, as it doesn't hurt, and if it spikes again (most likely due to usage rather than a leak) then provide the profile, as I'm sure we can optimize wherever the allocation is happening; most likely #277 will help also.
@LaurenceJJones commented on GitHub (Apr 7, 2026):
Classing as not planned since we couldn't pinpoint a cause.