[GH-ISSUE #2967] ee-latest (v1.18.1) causes CPU spike after ~60s — CE latest (v1.18.1) works fine #11047


GiteaMirror · 2026-05-06T15:43:57-05:00

GiteaMirror commented

2026-05-06 15:43:57 -05:00

Originally created by @aszurnasirpal on GitHub (May 2, 2026).
Original GitHub issue: https://github.com/fosrl/pangolin/issues/2967

Describe the Bug

CPU spike regression in `ee-latest` v1.18.1 vs `latest` (CE) v1.18.1

Summary

fosrl/pangolin:ee-latest v1.18.1 causes extreme CPU usage (300–400%) after ~60 seconds of runtime. The Community Edition image (fosrl/pangolin:latest) at the same version works normally (~3–15% CPU).

Affected image

fosrl/pangolin:ee-latest
Version: 1.18.1
Image ID: sha256:6b07dae9e13f...
Built: 2026-04-29T23:50:28Z
Git rev: 79541ec7b8fcdbee5d8aaf14635911255241408c

Working image

fosrl/pangolin:latest (Community Edition)
Version: 1.18.1
Built: same day
CPU: 3–15% (normal)

Also working (older EE)

Image ID: sha256:0b7592da0ee5...
Built: ~2026-04-14
CPU: ~3% (normal)

Observed behavior

CPU usage climbs continuously after startup and reaches 300–400%:

Time Container CPU% MEM
11:26:05 pangolin 109% 158.7 MiB / 512 MiB
11:26:19 pangolin 78% 109.9 MiB / 512 MiB
11:26:31 pangolin 52% 121.3 MiB / 512 MiB
11:26:45 pangolin 36% 224.1 MiB / 512 MiB
11:26:57 pangolin 39% 210.0 MiB / 512 MiB
11:27:10 pangolin 15% 245.9 MiB / 512 MiB
11:27:24 pangolin 87% 270.7 MiB / 512 MiB
11:27:39 pangolin 123% 256.5 MiB / 512 MiB
11:27:53 pangolin 380% 244.1 MiB / 512 MiB

Container was restarted multiple times — same pattern reproduced consistently.

Workaround

Revert to fosrl/pangolin:latest (CE).

Notes

- No config changes were made between the two images; only the `image:` tag in docker-compose.yml was changed.
- The regression appears to have been introduced sometime between the April 14 EE build and the April 29 EE build.
- CE and EE are both listed as v1.18.1; I am unsure whether they share the same codebase or whether EE has additional background processes.

Environment

Component	Version
OS	Ubuntu 24.04.4 LTS
Kernel	6.14.0-1017-oracle
Docker	29.4.1
Docker Compose	5.1.3
RAM	1 GB
pangolin config	`mem_limit: 512m`, `mem_reservation: 128m`

To Reproduce

Steps to reproduce

1. In docker-compose.yml, change the `image:` value from fosrl/pangolin:latest to fosrl/pangolin:ee-latest
2. Run docker compose up -d
3. Wait ~60 seconds
4. Monitor with docker stats pangolin

Expected behavior

CPU usage comparable to CE image (~3–15% at idle), as observed with fosrl/pangolin:latest v1.18.1 and the April 14 EE image.

Additional context from release notes:

The v1.18.1 EE release notes mention:

"the server now scrapes in the certificates from Traefik's acme.json file"

My setup has ./config:/app/config mounted into the pangolin container, which means acme.json is accessible at the default path /app/config/letsencrypt/acme.json. This feature is likely active in my setup.

Hypothesis: The CPU spike may be caused by a tight polling loop or crash-retry loop in the new certificate scraping code. The pattern (CPU starting moderate then climbing to 380% over ~90 seconds) is consistent with either an infinite retry loop or a file watcher that fires repeatedly.

Config note:

My config.yml does NOT have enable_acme_cert_sync: true in the flags section.
If this flag defaults to false, the ACME sync feature should not be active.
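To rule out a stray setting, the flag can be checked directly in the config file. The following is a minimal sketch, not an official Pangolin tool: `flag_value` is a hypothetical helper, the path `config/config.yml` in the usage comment is an assumption based on the `./config` mount, and the parsing is deliberately naive (it only understands simple `key: value` lines indented under a top-level `flags:` section).

```shell
#!/bin/bash
# Hypothetical helper: print the value of KEY under the top-level `flags:`
# section of a YAML file, or "unset" when the key is not present there.
flag_value() {
  local file=$1 key=$2
  awk -v key="$key" '
    # A non-indented, non-comment line starts a new top-level section.
    /^[^[:space:]#]/ { in_flags = ($0 ~ /^flags:/) }
    # Inside `flags:`, match an indented "key: value" line.
    in_flags && $1 == key ":" { print $2; found = 1; exit }
    END { if (!found) print "unset" }
  ' "$file"
}

# Usage (path is an assumption):
#   flag_value config/config.yml enable_acme_cert_sync
```

A result of `unset` only confirms that the key is absent from the file; whether the EE image then treats it as `false` is exactly the open question above.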

Questions:

1. Does fosrl/pangolin:ee-latest enable enable_acme_cert_sync by default (ignoring the flag)?
2. The CPU growth pattern (progressive climb over ~90s) is consistent with a goroutine leak in a polling loop. Is there a new background goroutine introduced in EE that runs unconditionally?


GiteaMirror commented

2026-05-06 15:43:58 -05:00

@svillar commented on GitHub (May 2, 2026):

I was about to report the same. I had to switch back to CE because the MainThread process consistently consumes >50% CPU in a relatively idle state, while in the CE version it sits at ~5% under the same usage conditions.

Also, the load stays above 3 all the time. I know that depends on other processes, but with CE it's around 0.5 under the same conditions.

I noticed because the UI is painfully slow: every time you select a submenu on the left bar, it takes ~5s to load.

EDIT: in the logs most of the entries belong to crowdsec BTW


GiteaMirror commented

2026-05-06 15:43:58 -05:00

@aszurnasirpal commented on GitHub (May 2, 2026):

I disabled CrowdSec completely, because I initially thought it was the reason the system became completely sluggish and unresponsive (Oracle Free Tier x86 with low CPU and 1 GB RAM, so a low-end system).

But no, it's definitely something with Pangolin EE.

Anyway, given the context of CrowdSec, here are some of my observations from using it for around a year (but as I said, I compared 1.18.1 CE and EE without CrowdSec, which I removed after the system started behaving oddly, so this reported issue has nothing to do with CrowdSec).

1. Remediation metrics empty in CrowdSec console
Traefik bouncer plugin v1.3.5 did not send usage-metrics to LAPI at all; the metrics table in SQLite had no RC (remediation component) entries. This meant the CrowdSec console showed 0 remediations even though the bouncer was actively blocking. Fixed in v1.4.4. After upgrading to v1.5.1 metrics started appearing (maybe someone had the same issue).
2. Intermittent appsecQuery:unreachable errors in Traefik logs
AppSec was running fine (port 7422 reachable), but under load Traefik logged frequent appsecQuery:unreachable timeout errors. Non-fatal but noisy.
3. SQLite error causing Pangolin instability: context canceled: sql: transaction already committed or rolled back
After enabling CrowdSec, Pangolin's SQLite database started throwing this error repeatedly (related to crowdsec/crowdsec#3338). This caused CPU spikes to 99% and made the application unstable. The only fix was removing CrowdSec from the stack entirely. Without CrowdSec, Pangolin runs at 3–4% CPU / ~110 MB RAM. With CrowdSec, it was hitting 99% CPU.


GiteaMirror commented

2026-05-06 15:43:59 -05:00

@AstralDestiny commented on GitHub (May 2, 2026):

Mmm can you,

#!/bin/bash

CONTAINER_NAME="pangolin"
LOG_FILE="container_$(date '+%Y%m%d_%H%M%S').log"
DURATION=120
THRESHOLD=100
echo "=== Bringing down containers ==="
docker compose down
echo "=== Bringing up containers ==="
docker compose up -d
echo "=== Waiting for $CONTAINER_NAME to be ready ==="
until docker inspect -f '{{.State.Running}}' "$CONTAINER_NAME" 2>/dev/null | grep -q "true"; do
  sleep 1
done
echo "=== Logging to $LOG_FILE for ${DURATION}s (threshold: ${THRESHOLD}%) ==="
while true; do
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  CPU=$(docker stats "$CONTAINER_NAME" --no-stream --format "{{.CPUPerc}}")
  CPU_VAL=${CPU%\%}

  if (( $(awk "BEGIN {print ($CPU_VAL > $THRESHOLD)}") )); then
    echo "[$TIMESTAMP] [CPU]  $CPU  ⚠️  ABOVE ${THRESHOLD}%" >> "$LOG_FILE"
  else
    echo "[$TIMESTAMP] [CPU]  $CPU" >> "$LOG_FILE"
  fi
  sleep 2
done &
CPU_PID=$!

docker logs -f "$CONTAINER_NAME" 2>&1 | while IFS= read -r line; do
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  echo "[$TIMESTAMP] [LOG]  $line"
done >> "$LOG_FILE" &
LOGS_PID=$!

tail -f "$LOG_FILE" &
TAIL_PID=$!

(
  sleep "$DURATION"

  echo "" >> "$LOG_FILE"
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] [INFO] Duration reached (${DURATION}s) — stopping capture." >> "$LOG_FILE"

  READINGS=$(grep '\[CPU\]' "$LOG_FILE" | grep -oP '[\d.]+(?=%)')

  COUNT=$(echo "$READINGS" | wc -l)
  MAX=$(echo "$READINGS" | sort -n | tail -1)
  MIN=$(echo "$READINGS" | sort -n | head -1)
  AVG=$(echo "$READINGS" | awk '{sum+=$1} END {printf "%.2f", sum/NR}')
  SPIKES=$(grep "ABOVE ${THRESHOLD}%" "$LOG_FILE" | wc -l)

  {
    echo ""
    echo "==============================="
    echo "         CPU SUMMARY"
    echo "==============================="
    echo "  Samples taken : $COUNT"
    echo "  Min CPU       : ${MIN}%"
    echo "  Avg CPU       : ${AVG}%"
    echo "  Max CPU       : ${MAX}%"
    echo "  Spikes >$THRESHOLD%  : $SPIKES"
    echo "==============================="
    if (( $(awk "BEGIN {print ($MAX > $THRESHOLD)}") )); then
      echo "  ⚠️  Max exceeded threshold"
    else
      echo "  ✅  Stayed within expected range"
    fi
    echo "==============================="
  } >> "$LOG_FILE"

  kill $CPU_PID $LOGS_PID $TAIL_PID 2>/dev/null
) &
TIMER_PID=$!

trap "kill $CPU_PID $LOGS_PID $TAIL_PID $TIMER_PID 2>/dev/null; exit" INT TERM
wait

Take the contents, put them into the same directory as the Pangolin stack (or a nested folder), name the file watch.sh for example, and execute it via bash watch.sh. You can use sh, but certain functions will break.

Make sure to place pangolin into debug via LOG_LEVEL=debug in the environment flags of the pangolin service please.

Preferably, provide the log it outputs in private (it might contain sensitive info) and the final part.


GiteaMirror commented

2026-05-06 15:43:59 -05:00

@aszurnasirpal commented on GitHub (May 2, 2026):

Sure, will send you the entire log in a moment.

Just summary below:

===============================
         CPU SUMMARY
===============================
  Samples taken : 46
  Min CPU       : 11.75%
  Avg CPU       : 210.12%
  Max CPU       : 808.43%
  Spikes >100%  : 18
===============================
  ⚠️  Max exceeded threshold


GiteaMirror commented

2026-05-06 15:43:59 -05:00

@AstralDestiny commented on GitHub (May 2, 2026):

I'm assuming this is likely attributable to ARM, but the log it dumps would be the most useful part.


GiteaMirror commented

2026-05-06 15:44:00 -05:00

@aszurnasirpal commented on GitHub (May 2, 2026):

Will do, but can you provide an email address where I can send you those logs? I can't find any direct address in your profile.

My Oracle instance is not on ARM; it's x86.


GiteaMirror commented

2026-05-06 15:44:00 -05:00

@AstralDestiny commented on GitHub (May 2, 2026):

You can find me on the Slack or on the Discord. Is it pure x86 or x86_64?

Though I'm honestly not sure of a good method to provide a direct address in my profile.


GiteaMirror commented

2026-05-06 15:44:01 -05:00

@aszurnasirpal commented on GitHub (May 2, 2026):

x86_64 - it's Oracle Free Tier (but not Ampere)

DM sent on Discord ;)


GiteaMirror commented

2026-05-06 15:44:01 -05:00

@aszurnasirpal commented on GitHub (May 2, 2026):

Just for the others to see. I only redacted some non-relevant parts of the logs.

[2026-05-02 18:01:37] [LOG]
[2026-05-02 18:01:37] [LOG]  > @fosrl/pangolin@0.0.0 start
[2026-05-02 18:01:37] [LOG]  > ENVIRONMENT=prod node dist/migrations.mjs && ENVIRONMENT=prod NODE_ENV=development node --enable-source-maps dist/server.mjs
[2026-05-02 18:01:37] [LOG]
[2026-05-02 18:01:37] [LOG]  Starting migrations from version 1.18.0
[2026-05-02 18:01:37] [LOG]  Migrations to run:
[2026-05-02 18:01:37] [LOG]  All migrations completed successfully
[2026-05-02 18:01:37] [LOG]  2026-05-02T16:00:37+00:00 [info]: Analytics usage statistics collection is disabled. If you enable this, you can help us make Pangolin better for everyone. Learn more at: https://docs.pangolin.net/telemetry
[2026-05-02 18:01:37] [LOG]  2026-05-02T16:00:40+00:00 [info]: Dashboard API server is running on http://localhost:3000
[2026-05-02 18:01:37] [LOG]  2026-05-02T16:00:40+00:00 [info]: Internal API server is running on http://localhost:3001
[2026-05-02 18:01:37] [LOG]  2026-05-02T16:00:45+00:00 [info]: OpenAPI documentation saved to config/openapi.yaml
[2026-05-02 18:01:37] [LOG]  2026-05-02T16:01:09+00:00 [info]: Dashboard Web UI server is running on http://localhost:3002
[2026-05-02 18:01:37] [LOG]  2026-05-02T16:01:10+00:00 [info]: Integration API server is running on http://localhost:3003
[2026-05-02 18:01:36] [CPU]  11.75%
[2026-05-02 18:01:42] [CPU]  37.63%
[2026-05-02 18:01:46] [CPU]  21.20%
[2026-05-02 18:01:53] [CPU]  26.06%
[2026-05-02 18:01:56] [CPU]  13.98%
[2026-05-02 18:02:00] [LOG]  2026-05-02T16:01:59+00:00 [info]: Updated exit node reachableAt to http://gerbil:3003
[2026-05-02 18:02:00] [CPU]  44.37%
[2026-05-02 18:02:04] [CPU]  12.49%
[2026-05-02 18:02:09] [LOG]  2026-05-02T16:02:08+00:00 [info]: Marking site 1 offline: newt redacted has no recent ping and no active WebSocket connection
[2026-05-02 18:02:09] [LOG]  2026-05-02T16:02:09+00:00 [info]: Marking health check 31 unhealthy due to site 1 being marked offline
[2026-05-02 18:02:09] [LOG]  2026-05-02T16:02:09+00:00 [info]: Marking site 5 offline: newt redacted has no recent ping and no active WebSocket connection
[2026-05-02 18:02:08] [CPU]  69.05%
[2026-05-02 18:02:12] [CPU]  19.62%
[2026-05-02 18:02:15] [CPU]  276.70%  ⚠️  ABOVE 100%
[2026-05-02 18:02:20] [CPU]  147.44%  ⚠️  ABOVE 100%
[2026-05-02 18:02:26] [CPU]  329.56%  ⚠️  ABOVE 100%
[2026-05-02 18:02:30] [CPU]  418.13%  ⚠️  ABOVE 100%
[2026-05-02 18:02:34] [CPU]  355.90%  ⚠️  ABOVE 100%
[2026-05-02 18:02:39] [CPU]  547.71%  ⚠️  ABOVE 100%
[2026-05-02 18:02:44] [LOG]  2026-05-02T16:02:42+00:00 [info]: Establishing websocket connection
[2026-05-02 18:02:44] [LOG]  2026-05-02T16:02:43+00:00 [info]: Client added to tracking - NEWT ID: redacted, Connection ID: redacted, Total connections: 1, Config version: 0
[2026-05-02 18:02:44] [LOG]  2026-05-02T16:02:43+00:00 [info]: WebSocket connection fully established and ready - NEWT ID: redacted
[2026-05-02 18:02:44] [LOG]  2026-05-02T16:02:44+00:00 [info]: Adding peer with public key redacted to exit node 1
[2026-05-02 18:02:44] [LOG]  2026-05-02T16:02:44+00:00 [info]: Establishing websocket connection
[2026-05-02 18:02:44] [LOG]  2026-05-02T16:02:44+00:00 [info]: Client added to tracking - NEWT ID: redacted, Connection ID: redacted, Total connections: 1, Config version: 0
[2026-05-02 18:02:44] [LOG]  2026-05-02T16:02:44+00:00 [info]: WebSocket connection fully established and ready - NEWT ID: redacted
[2026-05-02 18:02:43] [CPU]  207.67%  ⚠️  ABOVE 100%
[2026-05-02 18:02:48] [LOG]  2026-05-02T16:02:48+00:00 [info]: Handling healthcheck status message
[2026-05-02 18:02:47] [CPU]  301.03%  ⚠️  ABOVE 100%
[2026-05-02 18:02:53] [CPU]  526.33%  ⚠️  ABOVE 100%
[2026-05-02 18:02:55] [LOG]  2026-05-02T16:02:55+00:00 [error]: Error making POST request (can Pangolin see Gerbil HTTP API?) for exit node at http://gerbil:3003 (status: undefined): timeout of 8000ms exceeded
[2026-05-02 18:02:57] [CPU]  211.64%  ⚠️  ABOVE 100%
[2026-05-02 18:03:01] [CPU]  334.33%  ⚠️  ABOVE 100%
[2026-05-02 18:03:05] [CPU]  726.75%  ⚠️  ABOVE 100%
[2026-05-02 18:03:09] [CPU]  436.47%  ⚠️  ABOVE 100%
[2026-05-02 18:03:12] [CPU]  350.15%  ⚠️  ABOVE 100%
[2026-05-02 18:03:16] [CPU]  563.91%  ⚠️  ABOVE 100%
[2026-05-02 18:03:20] [CPU]  808.43%  ⚠️  ABOVE 100%
[2026-05-02 18:03:24] [CPU]  759.86%  ⚠️  ABOVE 100%
[2026-05-02 18:03:28] [CPU]  39.90%
[2026-05-02 18:03:31] [CPU]  267.44%  ⚠️  ABOVE 100%

[2026-05-02 18:03:37] [INFO] Duration reached (120s) — stopping capture.

It's very much in line with the stats that I put in my initial post.
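For anyone comparing captures, the moment of the first spike can be pulled out of the log mechanically. This is a rough sketch that assumes the `[timestamp] [CPU]  value%` line format written by the watch script above; `first_spike` is a hypothetical helper name, and the default threshold of 100 simply mirrors the script's THRESHOLD.

```shell
#!/bin/bash
# Hypothetical helper: print the timestamp of the first [CPU] reading in a
# capture log that exceeds the threshold (default 100). Prints nothing when
# no reading crosses it.
first_spike() {
  local file=$1 threshold=${2:-100}
  LC_ALL=C awk -v t="$threshold" '
    $3 == "[CPU]" {
      v = $4
      sub(/%$/, "", v)              # strip the trailing percent sign
      if (v + 0 > t + 0) {          # numeric comparison
        print $1, $2                # "[YYYY-MM-DD HH:MM:SS]"
        exit
      }
    }
  ' "$file"
}

# Usage (file name is whatever the watch script generated):
#   first_spike container_<timestamp>.log 100
```

Subtracting the container start time from that timestamp gives a comparable "time to first spike" figure across the EE and CE runs.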


GiteaMirror commented

2026-05-06 15:44:01 -05:00

@AstralDestiny commented on GitHub (May 2, 2026):

[2026-05-02 18:02:55] [LOG] 2026-05-02T16:02:55+00:00 [error]: Error making POST request (can Pangolin see Gerbil HTTP API?) for exit node at http://gerbil:3003 (status: undefined): timeout of 8000ms exceeded

Quite concerned for this entry honestly..


GiteaMirror commented

2026-05-06 15:44:02 -05:00

@oschwartz10612 commented on GitHub (May 2, 2026):

Could you run docker stats and see which container is causing this? As @AstralDestiny says, the timeout from gerbil makes me think something is up there, maybe?


GiteaMirror commented

2026-05-06 15:44:02 -05:00

@aszurnasirpal commented on GitHub (May 2, 2026):

Yes, the numbers in the first description of this issue were taken directly from docker stats output; I collected them during the diagnostic session. The CPU spike (reaching 380%) was in the pangolin container, not gerbil.
The gerbil timeout you mention is likely a downstream effect: when pangolin's CPU is pegged at 300–400%, it can't respond to gerbil's health checks or requests in time, causing gerbil to report timeouts (at least that's my assumption).
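To make the per-container comparison explicit, a one-shot docker stats sample can be ranked by CPU. This is only a sketch: `rank_by_cpu` is a hypothetical helper, and it expects `name cpu%` pairs on stdin, such as those produced by the `--format` string in the usage comment.

```shell
#!/bin/bash
# Hypothetical helper: read "name cpu%" pairs on stdin and print them
# sorted by CPU usage, highest first, with the percent sign stripped.
rank_by_cpu() {
  sed 's/%$//' | LC_ALL=C sort -k2,2nr
}

# Usage (requires a running Docker daemon):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | rank_by_cpu
```

If pangolin sits at the top while gerbil stays near idle, that would support the reading that the gerbil timeout is a symptom rather than the cause.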


GiteaMirror commented

2026-05-06 15:44:02 -05:00

@AstralDestiny commented on GitHub (May 2, 2026):

[2026-05-02 12:35:21] [INFO] Duration reached (120s) — stopping capture.

===============================
         CPU SUMMARY
===============================
  Samples taken : 30
  Min CPU       : 5.09%
  Avg CPU       : 9.34%
  Max CPU       : 19.97%
  Spikes >100%  : 0
===============================
  ✅  Stayed within expected range
===============================

That's what I get when I run it, both with and without memory constraints, on EE 1.18.1. I don't get the 8000ms timeout at all, though, and I'm not fully sure what's causing it. The only thing not tested is you removing the IPAM stuff.


GiteaMirror commented

2026-05-06 15:44:03 -05:00

@aszurnasirpal commented on GitHub (May 2, 2026):

For comparison, this is from the "stable" Community Edition (same version) on the same machine, with the same config; only the image is different:

[2026-05-02 21:44:16] [LOG]  
[2026-05-02 21:44:16] [LOG]  > @fosrl/pangolin@0.0.0 start
[2026-05-02 21:44:16] [LOG]  > ENVIRONMENT=prod node dist/migrations.mjs && ENVIRONMENT=prod NODE_ENV=development node --enable-source-maps dist/server.mjs
[2026-05-02 21:44:16] [LOG]  
[2026-05-02 21:44:16] [LOG]  Starting migrations from version 1.18.0
[2026-05-02 21:44:16] [LOG]  Migrations to run: 
[2026-05-02 21:44:16] [LOG]  All migrations completed successfully
[2026-05-02 21:44:16] [LOG]  2026-05-02T19:43:54+00:00 [info]: Analytics usage statistics collection is disabled. If you enable this, you can help us make Pangolin better for everyone. Learn more at: https://docs.pangolin.net/telemetry
[2026-05-02 21:44:16] [LOG]  2026-05-02T19:43:55+00:00 [info]: Dashboard API server is running on http://localhost:3000
[2026-05-02 21:44:16] [LOG]  2026-05-02T19:43:55+00:00 [info]: Internal API server is running on http://localhost:3001
[2026-05-02 21:44:16] [LOG]  2026-05-02T19:44:04+00:00 [info]: OpenAPI documentation saved to config/openapi.yaml
[2026-05-02 21:44:15] [CPU]  35.59%
[2026-05-02 21:44:20] [CPU]  50.27%
[2026-05-02 21:44:24] [CPU]  139.87%  ⚠️  ABOVE 100%
[2026-05-02 21:44:31] [CPU]  78.74%
[2026-05-02 21:44:33] [LOG]  2026-05-02T19:44:32+00:00 [info]: Dashboard Web UI server is running on http://localhost:3002
[2026-05-02 21:44:33] [LOG]  2026-05-02T19:44:32+00:00 [info]: Integration API server is running on http://localhost:3003
[2026-05-02 21:44:33] [LOG]  2026-05-02T19:44:33+00:00 [info]: Marking site 1 offline: newt redacted has no recent ping and no active WebSocket connection
[2026-05-02 21:44:33] [LOG]  2026-05-02T19:44:33+00:00 [info]: Marking site 5 offline: newt redacted has no recent ping and no active WebSocket connection
[2026-05-02 21:44:33] [LOG]  2026-05-02T19:44:33+00:00 [info]: Updated exit node with reachableAt to http://gerbil:3003
[2026-05-02 21:44:34] [CPU]  600.64%  ⚠️  ABOVE 100%
[2026-05-02 21:44:36] [LOG]  2026-05-02T19:44:36+00:00 [info]: Establishing websocket connection
[2026-05-02 21:44:36] [LOG]  2026-05-02T19:44:36+00:00 [info]: Client added to tracking - NEWT ID: redacted, Connection ID: a3406ed8-18be-4fee-84fe-f112837da998, Total connections: 1
[2026-05-02 21:44:36] [LOG]  2026-05-02T19:44:36+00:00 [info]: WebSocket connection established - NEWT ID: redacted
[2026-05-02 21:44:36] [LOG]  2026-05-02T19:44:36+00:00 [info]: Establishing websocket connection
[2026-05-02 21:44:36] [LOG]  2026-05-02T19:44:36+00:00 [info]: Client added to tracking - NEWT ID: redacted, Connection ID: 726afbaf-9bd5-44c4-b451-1ce58b38e03c, Total connections: 1
[2026-05-02 21:44:36] [LOG]  2026-05-02T19:44:36+00:00 [info]: WebSocket connection established - NEWT ID: redacted
[2026-05-02 21:44:36] [LOG]  2026-05-02T19:44:36+00:00 [info]: Adding peer with public key redacted to exit node 1
[2026-05-02 21:44:37] [LOG]  2026-05-02T19:44:37+00:00 [info]: Handling healthcheck status message
[2026-05-02 21:44:37] [LOG]  2026-05-02T19:44:37+00:00 [info]: Exit node request successful: {"method":"POST","url":"http://gerbil:3003/peer","status":"Peer added successfully"}
[2026-05-02 21:44:38] [CPU]  2.72%
[2026-05-02 21:44:41] [LOG]  (node:33) [DEP0169] DeprecationWarning: `url.parse()` behavior is not standardized and prone to errors that have security implications. Use the WHATWG URL API instead. CVEs are not issued for `url.parse()` vulnerabilities.
[2026-05-02 21:44:41] [LOG]  (Use `node --trace-deprecation ...` to show where the warning was created)
[2026-05-02 21:44:41] [CPU]  61.51%
[2026-05-02 21:44:45] [CPU]  82.91%
[2026-05-02 21:44:48] [CPU]  15.42%
[2026-05-02 21:44:52] [CPU]  154.88%  ⚠️  ABOVE 100%
[2026-05-02 21:44:55] [CPU]  332.56%  ⚠️  ABOVE 100%
[2026-05-02 21:44:59] [CPU]  2.91%
[2026-05-02 21:45:02] [CPU]  11.30%
[2026-05-02 21:45:05] [CPU]  3.48%
[2026-05-02 21:45:09] [CPU]  55.31%
[2026-05-02 21:45:13] [CPU]  4.78%
[2026-05-02 21:45:16] [CPU]  34.65%
[2026-05-02 21:45:20] [CPU]  34.01%
[2026-05-02 21:45:24] [CPU]  2.97%
[2026-05-02 21:45:27] [CPU]  2.73%
[2026-05-02 21:45:30] [CPU]  10.83%
[2026-05-02 21:45:33] [CPU]  2.86%
[2026-05-02 21:45:36] [CPU]  3.17%
[2026-05-02 21:45:39] [CPU]  13.32%
[2026-05-02 21:45:43] [CPU]  3.03%
[2026-05-02 21:45:46] [CPU]  3.13%
[2026-05-02 21:45:49] [CPU]  24.21%
[2026-05-02 21:45:53] [CPU]  75.62%
[2026-05-02 21:45:58] [CPU]  2.59%
[2026-05-02 21:46:01] [CPU]  15.48%
[2026-05-02 21:46:05] [CPU]  2.42%
[2026-05-02 21:46:08] [CPU]  2.77%
[2026-05-02 21:46:11] [CPU]  14.92%
[2026-05-02 21:46:14] [CPU]  2.56%

[2026-05-02 21:46:15] [INFO] Duration reached (120s) — stopping capture.

===============================
         CPU SUMMARY
===============================
  Samples taken : 38
  Min CPU       : 2.42%
  Avg CPU       : 60.11%
  Max CPU       : 600.64%
  Spikes >100%  : 4
===============================
  ⚠️  Max exceeded threshold

One important caveat to the results: the average of 60% is inflated by start-up spikes. If only the stabilized phase (after ~21:44:55, the last sample above 100%) were measured, the average would be closer to 10-15%.
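A minimal sketch of how that caveat could be checked, assuming the capture lines keep the `[timestamp] [CPU]  NN.NN%` shape seen in the log above. The function names, the regular expression, and the cutoff are illustrative assumptions and are not part of the capture script used in this thread; the five sample lines are copied from the log.

```python
import re
from datetime import datetime
from statistics import mean

# Matches capture lines such as "[2026-05-02 21:44:59] [CPU]  2.91%".
CPU_LINE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[CPU\]\s+([\d.]+)%"
)


def parse_cpu_samples(log_text):
    """Return (timestamp, cpu_percent) pairs for every [CPU] line."""
    samples = []
    for match in CPU_LINE.finditer(log_text):
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        samples.append((ts, float(match.group(2))))
    return samples


def summarize(samples, after=None, spike_threshold=100.0):
    """Summarize samples, optionally keeping only those taken after `after`."""
    values = [cpu for ts, cpu in samples if after is None or ts > after]
    if not values:
        raise ValueError("no samples in the selected window")
    return {
        "count": len(values),
        "min": min(values),
        "avg": mean(values),
        "max": max(values),
        "spikes": sum(1 for v in values if v > spike_threshold),
    }


# Five lines copied from the capture above; paste the full log for real use.
SAMPLE_LOG = """\
[2026-05-02 21:44:34] [CPU]  600.64%  ⚠️  ABOVE 100%
[2026-05-02 21:44:38] [CPU]  2.72%
[2026-05-02 21:44:55] [CPU]  332.56%  ⚠️  ABOVE 100%
[2026-05-02 21:44:59] [CPU]  2.91%
[2026-05-02 21:45:02] [CPU]  11.30%
"""

if __name__ == "__main__":
    samples = parse_cpu_samples(SAMPLE_LOG)
    # Assumed cutoff: the timestamp of the last sample above 100%.
    cutoff = datetime(2026, 5, 2, 21, 44, 55)
    print("all samples :", summarize(samples))
    print("after cutoff:", summarize(samples, after=cutoff))
```

Comparing the two summaries separates start-up cost from steady-state load, which is the number that matters when comparing the CE and EE images.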

GiteaMirror commented

2026-05-06 15:44:04 -05:00

@AstralDestiny commented on GitHub (May 2, 2026):

[2026-05-02 16:00:17] [INFO] Duration reached (120s) — stopping capture.

===============================
         CPU SUMMARY
===============================
  Samples taken : 40
  Min CPU       : 2.73%
  Avg CPU       : 5.69%
  Max CPU       : 40.06%
  Spikes >100%  : 0
===============================
  ✅  Stayed within expected range
===============================

That's on CE 1.18.1, for me at least.

And EE for comparison:

===============================
         CPU SUMMARY
===============================
  Samples taken : 30
  Min CPU       : 5.09%
  Avg CPU       : 9.34%
  Max CPU       : 19.97%
  Spikes >100%  : 0
===============================
  ✅  Stayed within expected range
===============================

So I'm pretty sure it has something to do with the gerbil timeout, or something more annoying. It could also be an overprovisioned CPU, maybe? Not sure, honestly.

GiteaMirror commented

2026-05-06 15:44:04 -05:00

@aszurnasirpal commented on GitHub (May 3, 2026):

Just for reference, I did try the latest 1.18.2, released tonight. The effect is the same. Community Edition is blazing fast and low on resources; the EE edition is causing massive CPU usage on my system, making it unusable.

Seems that EE is not for me.

GiteaMirror commented

2026-05-06 15:44:04 -05:00

@AstralDestiny commented on GitHub (May 3, 2026):

Not sure yet what's causing that. I think Owen has been looking into it. How much RAM is it using on each, for comparison?

GiteaMirror commented

2026-05-06 15:44:04 -05:00

@aszurnasirpal commented on GitHub (May 3, 2026):

┌─────┬───────────┬──────────────────┐
│     │ EE 1.18.2 │ Community 1.18.2 │
├─────┼───────────┼──────────────────┤
│ CPU │ 315–973%  │ 5%               │
├─────┼───────────┼──────────────────┤
│ RAM │ 260 MiB   │ 155 MiB          │
└─────┴───────────┴──────────────────┘

EE edition:

CONTAINER ID   NAME           CPU %     MEM USAGE / LIMIT   MEM %     NET I/O           BLOCK I/O        PIDS
a7581cadbc9f   pangolin       315.37%   260.3MiB / 512MiB   50.84%    520kB / 1.66MB    2GB / 412MB      23
GiteaMirror commented

2026-05-06 15:44:05 -05:00

@AstralDestiny commented on GitHub (May 3, 2026):

CE

CONTAINER ID   NAME              CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O        PIDS
a1b38225d842   pangolin          2.50%     410.6MiB / 1GiB       40.09%    978kB / 4.1MB   156MB / 61.4kB   23

EE

CONTAINER ID   NAME       CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O      PIDS
73237a77736f   pangolin   6.34%     622.5MiB / 31.04GiB   1.96%     154MB / 284MB   57.3kB / 2GB   23

Though my CE only has a single site, whereas the EE has 3 orgs and a bunch of newts and resources configured.

GiteaMirror commented

2026-05-06 15:44:05 -05:00

@aszurnasirpal commented on GitHub (May 3, 2026):

As I said, I was on EE from the very beginning, and only after the upgrade to v1.18 EE did I face this problem. I have 53 resources configured from one org/site.

GiteaMirror commented

2026-05-06 15:44:05 -05:00

@AstralDestiny commented on GitHub (May 3, 2026):

`mount | grep cgroup`? Curious if you have cgroups v1 on that.

GiteaMirror commented

2026-05-06 15:44:05 -05:00

@aszurnasirpal commented on GitHub (May 3, 2026):

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
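For anyone repeating this check, here is a small sketch of how the `mount | grep cgroup` output could be classified. It assumes the usual `<source> on <target> type <fstype> (<options>)` layout that `mount` prints on Linux; the function name and return labels are made up for illustration.

```python
def cgroup_version(mount_output):
    """Classify `mount | grep cgroup` output as 'v1', 'v2', 'hybrid' or 'unknown'."""
    fs_types = set()
    for line in mount_output.splitlines():
        # Each line looks like: <source> on <target> type <fstype> (<options>)
        parts = line.split()
        if "type" in parts:
            idx = parts.index("type")
            if idx + 1 < len(parts):
                fs_types.add(parts[idx + 1])
    has_v1 = "cgroup" in fs_types    # legacy per-controller hierarchies
    has_v2 = "cgroup2" in fs_types   # unified hierarchy
    if has_v1 and has_v2:
        return "hybrid"
    if has_v2:
        return "v2"
    if has_v1:
        return "v1"
    return "unknown"


if __name__ == "__main__":
    # The line posted in this thread.
    posted = (
        "cgroup2 on /sys/fs/cgroup type cgroup2 "
        "(rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)"
    )
    print(cgroup_version(posted))
```

A single `cgroup2` mount like the one posted here indicates the unified (v2) hierarchy, so cgroups v1 can be ruled out on this host.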

GiteaMirror commented

2026-05-06 15:44:06 -05:00

@svillar commented on GitHub (May 3, 2026):

I got this in the logs, which may be relevant:

pangolin  |  ⨯ Error [AxiosError]: timeout of 10000ms exceeded
pangolin  |     at v.<anonymous> (.next/server/chunks/8752.js:16:13973)
pangolin  |     at Timeout._onTimeout (.next/server/chunks/8752.js:3:171862)
pangolin  |     at bP.request (.next/server/chunks/8752.js:16:26036)
pangolin  |     at async B (.next/server/chunks/7268.js:1:14852) {
pangolin  |   isAxiosError: true,
pangolin  |   code: 'ECONNABORTED',
pangolin  |   config: [Object],
pangolin  |   request: [Writable],
pangolin  |   digest: '3997569878'
pangolin  | }
traefik   | {"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2026-05-03T22:54:53Z","message":"Provider error, retrying in 711.601574ms"}
crowdsec  | time="2026-05-03T22:54:59Z" level=info msg="127.0.0.1 - [Sun, 03 May 2026 22:54:59 UTC] \"POST /v1/watchers/login HTTP/1.1 200 527.893111ms \"crowdsec/v1.7.7-981e6166-docker\" \"" module=lapi
traefik   | {"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2026-05-03T22:55:04Z","message":"Provider error, retrying in 2.303520531s"}
crowdsec  | time="2026-05-03T22:55:07Z" level=info msg="172.18.0.4 - [Sun, 03 May 2026 22:55:07 UTC] \"GET /v1/decisions?ip=XX.XX.XX.XX HTTP/1.1 200 997.886327ms \"Crowdsec-Bouncer-Traefik-Plugin/1.X.X\" \"" module=lapi

GiteaMirror commented

2026-05-06 15:44:07 -05:00

@AstralDestiny commented on GitHub (May 4, 2026):

Mmm, what does `which docker` return? If it returns a path under /snap, uninstall Docker and install it properly. If you want a quick way, run `wget get.docker.com; sh index.html`, which will install Docker without snap.

GiteaMirror commented

2026-05-06 15:44:08 -05:00

@AstralDestiny commented on GitHub (May 4, 2026):

Asking because the latency is a bit too high, so it's either something in Ubuntu or the wrong package. Hoping it's the wrong package.

GiteaMirror commented

2026-05-06 15:44:09 -05:00

@aszurnasirpal commented on GitHub (May 4, 2026):

Mine is `/usr/bin/docker`.

GiteaMirror commented

2026-05-06 15:44:09 -05:00

@AstralDestiny commented on GitHub (May 4, 2026):

Hmm, so not snap. Curious, honestly, as it's behaving oddly. Does your setup differ at all from the base install?

GiteaMirror commented

2026-05-06 15:44:09 -05:00

@aszurnasirpal commented on GitHub (May 4, 2026):

No, nothing significant. I only added the geoipupdate Docker image to Docker Compose to update the MaxMind database.
On this machine, the only other container running is Beszel, and pretty much nothing else.

