[GH-ISSUE #72] Holepunching unreliable if there's network overlap #315

Open
opened 2026-04-23 01:54:54 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @asardaes on GitHub (Jan 2, 2026).
Original GitHub issue: https://github.com/fosrl/olm/issues/72

Originally assigned to: @oschwartz10612 on GitHub.

Describe the Bug

It seems like Olm can't decide between holepunching and relaying when there's network overlap:

INFO: 2026/01/02 12:56:40 Added IPv4 included route: {DestinationAddress:172.16.15.133 SubnetMask:255.255.255.255 GatewayAddress: IsDefault:false}
INFO: 2026/01/02 12:56:40 Adding route to 172.16.15.133/32 via interface olm
INFO: 2026/01/02 12:56:40 Added route for remote subnet: 172.16.15.133/32
INFO: 2026/01/02 12:56:40 Started monitoring for site 3 at 100.90.128.3:50145
INFO: 2026/01/02 12:56:40 Configured peer hGI22xrRZQJQ8weL8Ye2FIutO+vBX6IXJpCUpTqJ4W0=
INFO: 2026/01/02 12:56:40 Started monitoring peer 3
INFO: 2026/01/02 12:56:40 Started holepunch connection monitor
INFO: 2026/01/02 12:56:40 DNS proxy started on 100.96.128.1:53 (tunnelDNS=false)
INFO: 2026/01/02 12:56:40 WireGuard device created.
INFO: 2026/01/02 12:56:40 Starting rapid holepunch test for site 3 at 172.16.15.133:50144 (max 5 attempts, 400ms timeout each)
WARN: 2026/01/02 12:56:42 Rapid test: site 3 holepunch FAILED after 5 attempts, will relay
INFO: 2026/01/02 12:56:42 Rapid test failed for site 3, requesting relay
INFO: 2026/01/02 12:56:42 Sent relay message
INFO: 2026/01/02 12:56:42 Adjusted peer 3 to point to relay!
INFO: 2026/01/02 12:56:45 Holepunch to site 3 (172.16.15.133:50144) is CONNECTED (RTT: 1.033343387s)
INFO: 2026/01/02 12:56:45 Holepunch to site 3 succeeded while relayed, switching to direct connection
INFO: 2026/01/02 12:56:45 Sent unrelay message
INFO: 2026/01/02 12:56:45 Switched peer 3 back to direct connection at 172.16.15.133:50144
WARN: 2026/01/02 12:56:45 WireGuard connection to site 3 is DISCONNECTED
INFO: 2026/01/02 12:56:45 WireGuard connection to site 3 is CONNECTED (RTT: 3.102040645s)
WARN: 2026/01/02 12:56:48 Holepunch to site 3 (172.16.15.133:50144) is DISCONNECTED: timeout waiting for response
WARN: 2026/01/02 12:56:49 WireGuard connection to site 3 is DISCONNECTED
INFO: 2026/01/02 12:56:52 Holepunch to site 3 failed 3 times, triggering relay
INFO: 2026/01/02 12:56:52 Sent relay message
INFO: 2026/01/02 12:56:52 Adjusted peer 3 to point to relay!
INFO: 2026/01/02 12:56:52 WireGuard connection to site 3 is CONNECTED (RTT: 21.876093ms)
INFO: 2026/01/02 12:56:54 Holepunch to site 3 (172.16.15.133:50144) is CONNECTED (RTT: 15.023471ms)
INFO: 2026/01/02 12:56:54 Holepunch to site 3 succeeded while relayed, switching to direct connection
INFO: 2026/01/02 12:56:54 Sent unrelay message
INFO: 2026/01/02 12:56:54 Switched peer 3 back to direct connection at 172.16.15.133:50144
WARN: 2026/01/02 12:56:58 Holepunch to site 3 (172.16.15.133:50144) is DISCONNECTED: timeout waiting for response
WARN: 2026/01/02 12:56:59 WireGuard connection to site 3 is DISCONNECTED
INFO: 2026/01/02 12:57:02 Holepunch to site 3 failed 3 times, triggering relay
INFO: 2026/01/02 12:57:02 Sent relay message
INFO: 2026/01/02 12:57:02 Adjusted peer 3 to point to relay!
INFO: 2026/01/02 12:57:02 WireGuard connection to site 3 is CONNECTED (RTT: 20.854135ms)

172.16.15.133 is the Newt site's private IP; see more below.

Environment

  • OS Type & Version: Debian GNU/Linux 12 (bookworm)
  • Pangolin Version: 1.14.1
  • Gerbil Version: 1.3.0
  • Olm Version: 1.3.0

To Reproduce

I ran a somewhat unusual experiment. I have two VMs on my VPS, both in the same subnet: one for Pangolin and one for a Newt site. Pangolin has a public domain, but I connected the Newt site through the internal subnet by manually adding an entry to its /etc/hosts:

172.16.15.101 my.domain.com

Newt is not running inside a container.

I then defined a private resource in the Newt VM. The client machine is a container running Olm; it is not on the VPS and is not running in host network mode. The Docker daemon on Olm's host has a pretty large IP pool (the TrueNAS default, 172.17.0.0/12). As seen in the logs, the holepunch tried to use the VPS private IP, which obviously cannot work from outside the VPS; but since that address was also valid inside the Olm container's network, the probe looked like it could succeed, even though it never actually did.

Expected Behavior

My guess is that Olm should completely ignore private IP ranges when attempting to holepunch.
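
For illustration only, here is a minimal sketch of such a filter using Go's net/netip package. The candidate handling is hypothetical, not Olm's actual code:

package main

import (
	"fmt"
	"net/netip"
)

// filterCandidates drops private and otherwise non-routable endpoints
// before holepunching. netip.Addr.IsPrivate covers the RFC 1918 ranges
// (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
func filterCandidates(candidates []netip.AddrPort) []netip.AddrPort {
	var out []netip.AddrPort
	for _, c := range candidates {
		a := c.Addr()
		if a.IsPrivate() || a.IsLinkLocalUnicast() || a.IsLoopback() {
			continue // e.g. 172.16.15.133:50144 from the logs above
		}
		out = append(out, c)
	}
	return out
}

func main() {
	cands := []netip.AddrPort{
		netip.MustParseAddrPort("172.16.15.133:50144"), // private, dropped
		netip.MustParseAddrPort("203.0.113.7:50144"),   // public, kept
	}
	fmt.Println(filterCandidates(cands))
}

That said, unconditionally dropping private ranges would also prevent legitimate direct connections between peers on the same LAN, so a filter like this would probably need to be opt-in.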

GiteaMirror added the needs investigating label 2026-04-23 01:54:54 -05:00
Author
Owner

@TerrifiedBug commented on GitHub (Feb 10, 2026):

I ran into this as well.

I have two OLM clients (v1.4.1) connecting via Pangolin. The holepunch was constantly flapping: it would connect direct for about 2–3 seconds, drop, fall back to relay, detect that direct worked again, switch back, drop again. Endless cycle.

Root cause

The root cause in my case was that the site’s public IP was also configured as a Pangolin private resource. OLM added a /32 host route for that IP through the tunnel.

When holepunch failed (which was expected, since no firewall rules were open for it) and fell back to relay, the holepunch monitor kept testing connectivity. However, those probe packets were now being routed through the OLM tunnel itself because of that static route. From OLM’s perspective, the probes appeared to succeed.

OLM then switched back to direct, the real connection immediately died, it fell back to relay again, the probes “succeeded” through the tunnel, and the whole cycle repeated endlessly.
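
One way to guard against this, sketched below with hypothetical names (not Olm's actual API): OLM installs those /32 host routes itself, so before trusting a direct-probe success it could check whether the probe destination is covered by one of its own tunnel routes.

package main

import (
	"fmt"
	"net/netip"
)

// probeTrustworthy reports whether a successful holepunch probe to dst
// can be believed, given the prefixes this client itself routed through
// the tunnel interface. A real monitor would record these prefixes as
// it installs the routes.
func probeTrustworthy(dst netip.Addr, tunnelRoutes []netip.Prefix) bool {
	for _, p := range tunnelRoutes {
		if p.Contains(dst) {
			// A reply from a tunnel-routed prefix may have travelled
			// over the relay path, not the direct path under test.
			return false
		}
	}
	return true
}

func main() {
	// 203.0.113.7 stands in for the site's public IP.
	routes := []netip.Prefix{netip.MustParsePrefix("203.0.113.7/32")}
	fmt.Println(probeTrustworthy(netip.MustParseAddr("203.0.113.7"), routes)) // false
}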

I tried --disable-holepunch (CLI flag), DISABLE_HOLEPUNCH=true (env var), and confirmed via olm -show-config that disable-holepunch = true [file] was saved, but it made no difference. It kept flapping.

I dug into the source and found a related issue. In olm/olm.go, the OnTokenUpdate callback unconditionally starts the holepunch manager:

// line ~464
logger.Info("Starting hole punch for %d exit nodes", len(exitNodes))
if err := o.holePunchManager.StartMultipleExitNodes(hpExitNodes); err != nil {
    logger.Warn("Failed to start hole punch: %v", err)
}

There’s no check for the holepunch config here. The flag only sets "relay": true in the registration message (around line ~416), but the client-side holepunch monitor keeps running regardless. It keeps testing, sees the peer is reachable, calls sendUnRelay(), switches to direct, the connection dies, and the cycle repeats. The disable-holepunch flag really should guard the holepunch manager, because right now it is effectively a no-op on the client side.
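
For illustration, the guard could look something like this (the config field name is an assumption, not Olm's actual API):

// Hypothetical guard around the same call in OnTokenUpdate:
if o.config.DisableHolepunch {
	logger.Info("Holepunch disabled by config, staying on relay")
} else {
	logger.Info("Starting hole punch for %d exit nodes", len(exitNodes))
	if err := o.holePunchManager.StartMultipleExitNodes(hpExitNodes); err != nil {
		logger.Warn("Failed to start hole punch: %v", err)
	}
}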

Workaround

I fixed this by creating a dummy interface on the server with a private IP and pointing the Pangolin private resource at that instead of the public IP:

ip link add dummy0 type dummy
ip addr add 10.99.99.1/32 dev dummy0
ip link set dummy0 up

This stopped OLM from hijacking the public IP route.

Author
Owner

@ercoppa commented on GitHub (Feb 28, 2026):

I can confirm the issue.
