[GH-ISSUE #2134] Abnormal disk IO in version 1.13.1 #6873

Closed
opened 2026-04-25 15:52:19 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @asardaes on GitHub (Dec 21, 2025).
Original GitHub issue: https://github.com/fosrl/pangolin/issues/2134

Describe the Bug

I'm not sure if this is related to #2120, so I figured I would report it to be sure. My Pangolin container hasn't suffered from OOM kills, but it has been lagging significantly; I set the container's memory limit to 320M. When I looked at the disk metrics from my VPS, I saw something like this:

[Image: disk I/O metrics from the VPS, https://github.com/user-attachments/assets/80d0e806-c47c-42f1-b7d8-9a4003630250]

Environment

  • OS Type & Version: Ubuntu 24.04.3 LTS
  • Pangolin Version: 1.13.1 with geolite2 DB
  • Gerbil Version: 1.3.0
  • Traefik Version: 3.6.5
  • Newt Version: 1.7.0
  • Olm Version: 1.2.0

Crowdsec is not running on the VM hosting Pangolin.

To Reproduce

You'll notice a couple of spikes in the screenshot I posted. In this experiment I did the following:

  1. Enable site A, which only has a private resource configured. On this machine Newt runs directly on the VM through systemd, with no container. The disk read spike after this appears to have normalized.
  2. Enable machine X. This is a Docker container that connects using Olm, and it's the machine that gets access to the private resource from site A. Only a very small spike is seen in the screenshot.
  3. Enable site B, which has multiple public resources. This is another Docker container running Newt. This caused the read throughput to skyrocket and stay there until I disabled the site again.

I wonder if the memory constraints on Pangolin's container make it flush caches continuously and reload them from disk.
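
One way to test this hypothesis: under cgroups, a container's page cache counts against its memory limit, so a tight limit can force cached file pages to be evicted and re-read from disk. A minimal sketch for watching the refault counters, assuming cgroup v2 with Docker's systemd cgroup driver and a container named pangolin (both assumptions; the cgroup path layout varies by host):

# Resolve the container's cgroup and watch its memory counters.
CID=$(docker inspect --format '{{.Id}}' pangolin)
STAT=/sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.stat

# workingset_refault_file counts file pages that were evicted and then
# needed again; a steadily climbing value means the cache is thrashing.
grep -E 'workingset_refault_file|pgscan|pgsteal' "$STAT"

If those counters keep climbing while the sites are enabled, that would support the cache-eviction theory.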

Expected Behavior

Sane disk IO.

Author
Owner

@asardaes commented on GitHub (Dec 21, 2025):

I think this isn't related to geo-blocking. I stopped all Newts and then the Pangolin stack, commented out maxmind_db_path in the YAML, and started it again while running iotop. I'm not very familiar with iotop, so I'm not sure whether it includes socket communication, but it reports gigabytes read during bootup?

Total DISK READ:        53.37 M/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:      52.71 M/s | Current DISK WRITE:       0.00 B/s
    PID  PRIO  USER    DISK READ>  DISK WRITE    COMMAND
  85467 be/4 root         10.80 G   1432.00 K traefik traefik --configFile=/etc/traefik/traefik_config.yml
   1120 be/4 root          7.39 G   1012.00 K dockerd -H fd:// --containerd=/run/containerd/containerd.sock
  85108 be/4 root          5.63 G    364.00 K node --enable-source-maps dist/server.mjs
    951 be/4 root          5.41 G    428.00 K containerd
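
For what it's worth, iotop's DISK READ columns come from the kernel's per-task I/O accounting, which counts bytes fetched from the block layer, so socket traffic should not be included. The same counters are exposed in /proc, where comparing rchar (all bytes passed to read(), including page-cache hits and sockets) against read_bytes (bytes actually fetched from storage) separates the two. A quick check, reusing the Traefik PID from the output above:

# rchar counts every read() byte regardless of source; read_bytes only
# counts what really came from the storage layer.
sudo grep -E '^(rchar|read_bytes)' /proc/85467/io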

I only have info-level logs for Pangolin. It needs a couple of minutes to load, presumably because the disk I/O is being throttled, but I don't know how much it's reading or from where.

pangolin   | > ENVIRONMENT=prod node dist/migrations.mjs && ENVIRONMENT=prod NODE_ENV=development node --enable-source-maps dist/server.mjs
pangolin   |
pangolin   | Starting migrations from version 1.13.0
pangolin   | Migrations to run:
pangolin   | All migrations completed successfully
pangolin   | 2025-12-21T15:17:49+00:00 [info]: Pangolin gathers anonymous usage data to help us better understand how the software is used and guide future improvements and feature development. You can find more details, including instructions for opting out of this anonymous data collection, at: https://docs.pangolin.net/telemetry
pangolin   | 2025-12-21T15:17:51+00:00 [info]: API server is running on http://localhost:3000
pangolin   | 2025-12-21T15:17:51+00:00 [info]: Internal server is running on http://localhost:3001
pangolin   | 2025-12-21T15:18:03+00:00 [info]: OpenAPI documentation saved to config/openapi.yaml
pangolin     | 2025-12-21T15:20:26+00:00 [info]: Next.js server is running on http://localhost:3002
pangolin     | 2025-12-21T15:20:28+00:00 [info]: Integration API server is running on http://localhost:3003
pangolin     | 2025-12-21T15:20:34+00:00 [info]: Kicking offline olm client 1 due to inactivity
pangolin     | 2025-12-21T15:21:56+00:00 [info]: Updated exit node with reachableAt to http://gerbil:3003
pangolin     | Error while flushing PostHog [Error [PostHogFetchNetworkError]: Network error while fetching PostHog] {
pangolin     |   error: [Error [TimeoutError]: The operation was aborted due to timeout] {
pangolin     |     code: 23,
pangolin     |     INDEX_SIZE_ERR: 1,
pangolin     |     DOMSTRING_SIZE_ERR: 2,
pangolin     |     HIERARCHY_REQUEST_ERR: 3,
pangolin     |     WRONG_DOCUMENT_ERR: 4,
pangolin     |     INVALID_CHARACTER_ERR: 5,
pangolin     |     NO_DATA_ALLOWED_ERR: 6,
pangolin     |     NO_MODIFICATION_ALLOWED_ERR: 7,
pangolin     |     NOT_FOUND_ERR: 8,
pangolin     |     NOT_SUPPORTED_ERR: 9,
pangolin     |     INUSE_ATTRIBUTE_ERR: 10,
pangolin     |     INVALID_STATE_ERR: 11,
pangolin     |     SYNTAX_ERR: 12,
pangolin     |     INVALID_MODIFICATION_ERR: 13,
pangolin     |     NAMESPACE_ERR: 14,
pangolin     |     INVALID_ACCESS_ERR: 15,
pangolin     |     VALIDATION_ERR: 16,
pangolin     |     TYPE_MISMATCH_ERR: 17,
pangolin     |     SECURITY_ERR: 18,
pangolin     |     NETWORK_ERR: 19,
pangolin     |     ABORT_ERR: 20,
pangolin     |     URL_MISMATCH_ERR: 21,
pangolin     |     QUOTA_EXCEEDED_ERR: 22,
pangolin     |     TIMEOUT_ERR: 23,
pangolin     |     INVALID_NODE_TYPE_ERR: 24,
pangolin     |     DATA_CLONE_ERR: 25
pangolin     |   },
pangolin     |   [cause]: [Error [TimeoutError]: The operation was aborted due to timeout] {
pangolin     |     code: 23,
pangolin     |     INDEX_SIZE_ERR: 1,
pangolin     |     DOMSTRING_SIZE_ERR: 2,
pangolin     |     HIERARCHY_REQUEST_ERR: 3,
pangolin     |     WRONG_DOCUMENT_ERR: 4,
pangolin     |     INVALID_CHARACTER_ERR: 5,
pangolin     |     NO_DATA_ALLOWED_ERR: 6,
pangolin     |     NO_MODIFICATION_ALLOWED_ERR: 7,
pangolin     |     NOT_FOUND_ERR: 8,
pangolin     |     NOT_SUPPORTED_ERR: 9,
pangolin     |     INUSE_ATTRIBUTE_ERR: 10,
pangolin     |     INVALID_STATE_ERR: 11,
pangolin     |     SYNTAX_ERR: 12,
pangolin     |     INVALID_MODIFICATION_ERR: 13,
pangolin     |     NAMESPACE_ERR: 14,
pangolin     |     INVALID_ACCESS_ERR: 15,
pangolin     |     VALIDATION_ERR: 16,
pangolin     |     TYPE_MISMATCH_ERR: 17,
pangolin     |     SECURITY_ERR: 18,
pangolin     |     NETWORK_ERR: 19,
pangolin     |     ABORT_ERR: 20,
pangolin     |     URL_MISMATCH_ERR: 21,
pangolin     |     QUOTA_EXCEEDED_ERR: 22,
pangolin     |     TIMEOUT_ERR: 23,
pangolin     |     INVALID_NODE_TYPE_ERR: 24,
pangolin     |     DATA_CLONE_ERR: 25
pangolin     |   }
pangolin     | }

And naturally, Traefik can't reach Pangolin during this time:

traefik      | {"level":"info","providerName":"letsencrypt.acme","acmeCA":"https://acme-v02.api.letsencrypt.org/directory","time":"2025-12-21T15:18:24Z","message":"Testing certificate renew..."}
traefik      | {"level":"info","time":"2025-12-21T15:18:24Z","message":"Starting provider *http.Provider"}
traefik      | {"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2025-12-21T15:18:50Z","message":"Provider error, retrying in 292.001728ms"}
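
To narrow down where those reads go, per-file attribution is possible without debug logs. A sketch using two stock Ubuntu tools (the sysstat and fatrace packages, both assumed to be installed via apt):

# Per-process read/write rates in kB/s, refreshed every 5 seconds (sysstat).
pidstat -d 5

# System-wide file access events via fanotify, filtered to the noisy
# processes; each line shows which path is being opened or read (fatrace).
sudo fatrace | grep -E 'traefik|node'
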
Author
Owner

@asardaes commented on GitHub (Dec 21, 2025):

Well, after starting the containers one by one, it seems the main culprit is Traefik, but I don't really know why, although there is this warning in its documentation (https://doc.traefik.io/traefik/reference/install-configuration/providers/others/file/):

[Image: warning from the Traefik file provider documentation, https://github.com/user-attachments/assets/db1d76ed-a2e0-450e-9db7-8edaf440c2c1]

So I disabled watching for dynamic_config.yml, but that didn't make much of a difference.
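
For reference, that toggle lives in Traefik's static configuration; with the file provider, watching is on by default. A sketch against the stack's traefik_config.yml (the dynamic config path is assumed from the file names mentioned in this thread):

providers:
  file:
    filename: /etc/traefik/dynamic_config.yml
    watch: false   # defaults to true; stops Traefik from re-reading the file on changes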

I then found an old Traefik issue (https://github.com/traefik/traefik/issues/3341) that mentions memory constraints as the cause, so I gave the container more memory. That helped a bit, I think, but only during periods of inactivity, so maybe the VM is simply too constrained for Traefik.
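
For anyone reproducing this, the memory bump can be expressed directly in the compose file. A sketch with assumed values (the 512m figure is an arbitrary example, not a recommendation from this thread):

services:
  traefik:
    mem_limit: 512m       # hard memory cap honored by docker compose (non-swarm)
    memswap_limit: 512m   # equal to mem_limit, so the container gets no swap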
