mirror of
https://github.com/fosrl/pangolin.git
synced 2026-05-22 09:32:36 -05:00
[GH-ISSUE #2967] ee-latest (v1.18.1) causes CPU spike after ~60s — CE latest (v1.18.1) works fine #17264
Originally created by @aszurnasirpal on GitHub (May 2, 2026).
Original GitHub issue: https://github.com/fosrl/pangolin/issues/2967
Describe the Bug
CPU spike regression in ee-latest v1.18.1 vs latest (CE) v1.18.1
Summary
fosrl/pangolin:ee-latest v1.18.1 causes extreme CPU usage (300–400%) after ~60 seconds of runtime. The Community Edition image (fosrl/pangolin:latest) at the same version works normally (~3–15% CPU).
Affected image
fosrl/pangolin:ee-latest
Version: 1.18.1
Image ID: sha256:6b07dae9e13f...
Built: 2026-04-29T23:50:28Z
Git rev: 79541ec7b8
Working image
fosrl/pangolin:latest (Community Edition)
Version: 1.18.1
Built: same day
CPU: 3–15% (normal)
Also working (older EE)
Image ID: sha256:0b7592da0ee5...
Built: ~2026-04-14
CPU: ~3% (normal)
Observed behavior
CPU usage climbs continuously after startup and reaches 300–400%:
Time Container CPU% MEM
11:26:05 pangolin 109% 158.7 MiB / 512 MiB
11:26:19 pangolin 78% 109.9 MiB / 512 MiB
11:26:31 pangolin 52% 121.3 MiB / 512 MiB
11:26:45 pangolin 36% 224.1 MiB / 512 MiB
11:26:57 pangolin 39% 210.0 MiB / 512 MiB
11:27:10 pangolin 15% 245.9 MiB / 512 MiB
11:27:24 pangolin 87% 270.7 MiB / 512 MiB
11:27:39 pangolin 123% 256.5 MiB / 512 MiB
11:27:53 pangolin 380% 244.1 MiB / 512 MiB
Container was restarted multiple times — same pattern reproduced consistently.
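For anyone who wants to collect a comparable table, a minimal sketch of a sampling helper follows. It assumes the container is named pangolin, as in the report; the function name `sample_stats` and its arguments are illustrative and not part of Pangolin or of the diagnostic script used later in this thread.

```shell
# Print one timestamped "name cpu% mem" line per sample for a container.
# Assumption: hypothetical helper, not part of Pangolin or Docker.
sample_stats() {
  container="$1"   # container name or ID
  interval="$2"    # seconds to wait between samples
  count="$3"       # number of samples to take
  i=0
  while [ "$i" -lt "$count" ]; do
    # --no-stream makes docker stats print a single snapshot and exit.
    line="$(docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemUsage}}' "$container")"
    printf '%s %s\n' "$(date +%H:%M:%S)" "$line"
    sleep "$interval"
    i=$((i + 1))
  done
}
```

Calling `sample_stats pangolin 12 10` would take ten samples roughly 12 seconds apart (plus the time each docker stats call takes), which is close to the spacing in the table above.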
Workaround
Revert to fosrl/pangolin:latest (CE).
Notes
image: tag in docker-compose.yml was changed
Environment
mem_limit: 512m, mem_reservation: 128m
To Reproduce
Steps to reproduce
Change image: in docker-compose.yml from fosrl/pangolin:latest to fosrl/pangolin:ee-latest
docker compose up -d
docker stats pangolin
Expected behavior
CPU usage comparable to CE image (~3–15% at idle), as observed with fosrl/pangolin:latest v1.18.1 and the April 14 EE image.
Additional context from release notes:
The v1.18.1 EE release notes mention:
My setup has ./config:/app/config mounted into the pangolin container, which means acme.json is accessible at the default path /app/config/letsencrypt/acme.json. This feature is likely active in my setup.
Hypothesis: The CPU spike may be caused by a tight polling loop or crash-retry loop in the new certificate scraping code. The pattern (CPU starting moderate then climbing to 380% over ~90 seconds) is consistent with either an infinite retry loop or a file watcher that fires repeatedly.
Config note:
My config.yml does NOT have enable_acme_cert_sync: true in the flags section.
If this flag defaults to false, the ACME sync feature should not be active.
Questions:
Does fosrl/pangolin:ee-latest enable enable_acme_cert_sync by default (ignoring the flag)?
in a polling loop. Is there a new background goroutine introduced in EE that runs unconditionally?
@svillar commented on GitHub (May 2, 2026):
I was about to report the same. I had to switch back to CE because the MainThread process consistently consumes >50% CPU in a relatively idle state, while in the CE version it is ~5% under the same usage conditions.
Also the load stays above 3 all the time. I know that depends on other processes, but when using the CE it's around 0.5 under same conditions.
I noticed because the UI is painfully slow, every time you select a submenu on the left bar it takes ~5s to load.
EDIT: in the logs most of the entries belong to crowdsec BTW
@aszurnasirpal commented on GitHub (May 2, 2026):
I disabled Crowdsec completely, as I thought initially that this was the reason why the system became completely sluggish and not responsive (Oracle Free Tier x86 with low CPU and 1GB Ram, so a low-end system)
But, no, it's definitely something with the Pangolin EE.
Anyway, given the context of CrowdSec, here are some of my observations from using it for around a year (but as I said, I compared 1.18.1 CE and EE without CrowdSec, which I removed after the system started behaving oddly, so this reported issue has nothing to do with CrowdSec).
Remediation metrics empty in CrowdSec console
Traefik bouncer plugin v1.3.5 did not send usage-metrics to LAPI at all — the metrics table in SQLite had no RC (remediation component) entries. This meant the CrowdSec console showed 0 remediations even though the bouncer was actively blocking. Fixed in v1.4.4. After upgrading to v1.5.1 metrics started appearing (maybe someone had the same issue)
Intermittent appsecQuery:unreachable errors in Traefik logs
AppSec was running fine (port 7422 reachable), but under load Traefik logged frequent appsecQuery:unreachable timeout errors. Non-fatal but noisy.
SQLite error causing Pangolin instability: context canceled: sql: transaction already committed or rolled back
After enabling CrowdSec, Pangolin's SQLite database started throwing this error repeatedly (related to crowdsec/crowdsec#3338). This caused CPU spikes to 99% and made the application unstable. The only fix was removing CrowdSec from the stack entirely. Without CrowdSec, Pangolin runs at 3–4% CPU / ~110 MB RAM. With CrowdSec, it was hitting 99% CPU.
@AstralDestiny commented on GitHub (May 2, 2026):
Mmm can you,
take the contents, put them into the same directory as the pangolin stack (or a nested folder), name the file watch.sh for example, and execute it via bash watch.sh. You can use sh, but certain functions will break.
bash watch.sh
Make sure to place pangolin into debug via LOG_LEVEL=debug in the environment flags of the pangolin service please.
Preferably if you can provide the log it outputs in private (It might contain sensitive info) and the final part.
@aszurnasirpal commented on GitHub (May 2, 2026):
Sure, will send you the entire log in a moment.
Just summary below:
@AstralDestiny commented on GitHub (May 2, 2026):
Likely assuming this is attributed to arm but the log it dumps would be the most useful part.
@aszurnasirpal commented on GitHub (May 2, 2026):
will do, but can you provide any email address where I can send you those logs? Can't find any direct address in your profile.
My Oracle instance is not on ARM, it's x86
@AstralDestiny commented on GitHub (May 2, 2026):
Can find me on the slack or on the discord, x86 purely or x86_64?
Though not sure of a good method to provide a direct address in my profile honestly.
@aszurnasirpal commented on GitHub (May 2, 2026):
x86_64 - it's Oracle free tier (but not Ampere)
DM sent on Discord ;)
@aszurnasirpal commented on GitHub (May 2, 2026):
Just for the others to see. Just redacted some non-relevant parts of the logs
It's very in line with the stats that I put in my initial post.
@AstralDestiny commented on GitHub (May 2, 2026):
[2026-05-02 18:02:55] [LOG] 2026-05-02T16:02:55+00:00 [error]: Error making POST request (can Pangolin see Gerbil HTTP API?) for exit node at http://gerbil:3003 (status: undefined): timeout of 8000ms exceeded
Quite concerned for this entry honestly..
@oschwartz10612 commented on GitHub (May 2, 2026):
Could you run docker stats and see what container is causing this? As @AstralDestiny says, the timeout from gerbil makes me think something is up there maybe?
@aszurnasirpal commented on GitHub (May 2, 2026):
Yes, the numbers in the first description of the issue were taken directly from docker stats output; I collected them during the diagnostic session. The CPU spike (reaching 380%) was in the pangolin container, not gerbil.
The gerbil timeout issue you mention is likely a downstream effect: when pangolin's CPU is pegged at 300-400%, it can't respond to gerbil's health checks or requests in time, causing gerbil to report timeouts (at least my assumption).
@AstralDestiny commented on GitHub (May 2, 2026):
[2026-05-02 12:35:21] [INFO] Duration reached (120s) — stopping capture.
For me if I do it. with and without memory constraints on it. That's EE 1.18.1 for me. Though I don't get the 8000ms timeout at all.. and not fully sure what's causing that timeout. I mean short of you removing the ipam stuff is the only thing not tested.
@aszurnasirpal commented on GitHub (May 2, 2026):
For comparison, this is from "stable" community edition (the same version) on the same machine (the same config only image is different)
One important caveat to the results: the average of 60% is inflated by start-up spikes - if only the stabilized phase (after ~21:44:48) was measured, the average would be closer to 10-15%.
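The caveat above (start-up spikes inflating the average) can be checked by computing the mean twice, once over all samples and once after skipping the warm-up samples. This is a minimal sketch: the helper name `mean_after_warmup` is hypothetical, and the sample values in the usage note are made up for illustration and are not the measurements from this run.

```shell
# Mean of CPU% samples read from stdin (one value per line, optional "%"),
# ignoring the first N samples as warm-up.
# Assumption: hypothetical helper for illustration only.
mean_after_warmup() {
  awk -v skip="$1" '
    { gsub(/%/, "", $1) }              # strip a trailing percent sign
    NR > skip { sum += $1; n++ }       # only count samples after warm-up
    END { if (n > 0) printf "%.1f\n", sum / n; else print "n/a" }
  '
}
```

With the made-up samples 120, 90, 70, 12, 10, 14, `mean_after_warmup 0` gives 52.7 while `mean_after_warmup 3` gives 12.0, which is the kind of gap between the overall and the stabilized average described above.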
@AstralDestiny commented on GitHub (May 2, 2026):
[2026-05-02 16:00:17] [INFO] Duration reached (120s) — stopping capture.
On CE 1.18.1 For me at-least.
and EE for comparison.
So pretty sure it's something to do with the gerbil timeout.. or something more annoying.. But could also be an overprovisioned CPU maybe? Not sure honestly.
@aszurnasirpal commented on GitHub (May 3, 2026):
Just for reference. I did try the latest 1.18.2, released tonight. The effect is the same. Community edition is blazing fast and low on resources; the EE edition is causing massive CPU usage on my system, making it unusable.
Seems that EE is not for me.
@AstralDestiny commented on GitHub (May 3, 2026):
Not sure yet what's causing that. I think Owen has been looking into it. How much RAM is it using, compared between the two?
@aszurnasirpal commented on GitHub (May 3, 2026):
ee edition:
@AstralDestiny commented on GitHub (May 3, 2026):
CE
EE
Though my CE only has a single site vs the EE has 3 orgs and a bunch of newts and resources configured.
@aszurnasirpal commented on GitHub (May 3, 2026):
As I said, I was on EE from the very beginning, and only after the upgrade to v1.18 EE did I face this problem. I have 53 resources configured from one org/site
@AstralDestiny commented on GitHub (May 3, 2026):
mount | grep cgroup? Curious if you have cgroups v1 on that.
@aszurnasirpal commented on GitHub (May 3, 2026):
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
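Reading that output by eye works, but the same check can be scripted. The sketch below classifies `mount` output by the filesystem type column; the function name `detect_cgroup` is hypothetical, and it assumes the usual Linux `mount` line layout ("device on mountpoint type fstype (options)").

```shell
# Read `mount` output on stdin and report which cgroup hierarchy is mounted.
# Assumption: hypothetical helper; relies on the fstype being the 5th field.
detect_cgroup() {
  awk '
    $4 == "type" && $5 == "cgroup2" { v2 = 1 }   # unified hierarchy (v2)
    $4 == "type" && $5 == "cgroup"  { v1 = 1 }   # legacy controllers (v1)
    END {
      if (v1 && v2) print "hybrid"
      else if (v2)  print "v2"
      else if (v1)  print "v1"
      else          print "none"
    }
  '
}
```

Run as `mount | detect_cgroup`; the line quoted above would be reported as v2.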
@svillar commented on GitHub (May 3, 2026):
I got this in the logs, maybe relevant
@AstralDestiny commented on GitHub (May 4, 2026):
Mmm
which docker..? If it returns /snap, uninstall docker and install docker properly. If you want it quick, do:
wget get.docker.com; sh index.html
which will install docker not via snap.
@AstralDestiny commented on GitHub (May 4, 2026):
Asking as the latency is a bit too much so either something in ubuntu or the wrong package, hoping wrong package.
@aszurnasirpal commented on GitHub (May 4, 2026):
mine is /usr/bin/docker
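The snap check asked for above can be written as a small classifier over the resolved binary path. This is only a sketch: the function name `docker_install_kind` is hypothetical, and it assumes that a snap-installed docker resolves to a path under /snap.

```shell
# Classify a docker binary path as snap or non-snap.
# Assumption: hypothetical helper; snap installs resolve under /snap.
docker_install_kind() {
  case "$1" in
    "")      echo "missing" ;;    # docker not found on PATH
    /snap/*) echo "snap" ;;       # installed through snap
    *)       echo "non-snap" ;;   # e.g. /usr/bin/docker from a distro or Docker package
  esac
}
```

Run as `docker_install_kind "$(command -v docker)"`; the path /usr/bin/docker reported here would be classified as non-snap.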
@AstralDestiny commented on GitHub (May 4, 2026):
Hmm so not snap.. Mmm curious honestly as it's behaving odd, Are you differing at all from base install?
@aszurnasirpal commented on GitHub (May 4, 2026):
No, nothing significant. I only added the geoipupdate Docker image to Docker Compose to update the MaxMind database.
On this machine, the only other container running is Beszel, and pretty much nothing else.
@oschwartz10612 commented on GitHub (May 7, 2026):
Maybe you can try on 1.18.3 with the new memory leak fixes?
@aszurnasirpal commented on GitHub (May 7, 2026):
Hi,
Yes, there is a significant overall improvement in 1.18.3. The system was finally usable, but I went back to the community edition again.
Even if the CPU performance was way better, it was still high enough to see that things load relatively slowly. Going via the pangolin GUI usually took a few seconds to load "resources", details, etc.
In my opinion, something was definitely fixed, and I see the striking difference (previously the system was almost non-responsive even from the CLI) . Not sure if the overall slowness of the current EE edition is due to the relatively low spec of the Oracle Free Tier, but it's still an absolutely huge difference between how the community edition works and how the EE edition works.
For the community edition, everything loads instantly, with almost zero impact on the system. I think I will stick to it.
Load average between most recent ee edition
and previous EE edition (the one that left my system completely non-responsive). So as you can see, it's lower but still way higher than the community
@svillar commented on GitHub (May 7, 2026):
I haven't seen any significant improvement though. It's true that CPU usage drops to 7-8%, but that only lasts a few seconds before it goes back to high CPU usage and high load. The UI is still extremely slow, no change there. Loading pages takes >5s. Last but not least, with the EE version, my browser tab with the pangolin dashboard frequently shows a "the server had an internal error" page, not the one from the browser, but a specific one from pangolin. That never happens with CE.
@aszurnasirpal commented on GitHub (May 7, 2026):
@svillar, your description matches what I experienced. I got those "server had internal error" messages as well. I guess the only difference between your system and mine is that yours is much more powerful, so the slowness and CPU load on my machine are much more noticeable. But we face the same slow responsiveness (or lack of responsiveness) in pangolin. The same applies to accessing resources served by proxy. Just as an example: in the community edition, I can browse my collection of photos from immich almost as fast as I would load them locally. In the EE edition, it takes seconds to open and load a page, and browsing the collection is a challenge.
@AstralDestiny commented on GitHub (May 7, 2026):
Might just be overprovisioned Oracle or some limits they have in place which then cause CPU spikes..
@svillar commented on GitHub (May 7, 2026):
Mine is hosted at home so nothing to do with Oracle services.
@aszurnasirpal commented on GitHub (May 7, 2026):
In my case, the question remains: why, on the same hardware with the same resources, do both versions behave completely differently?
@AstralDestiny commented on GitHub (May 7, 2026):
What's your os by chance? I only really use debian myself.
@aszurnasirpal commented on GitHub (May 7, 2026):
Ubuntu
@AstralDestiny commented on GitHub (May 7, 2026):
Not sure oracle even ships any debian's