[GH-ISSUE #1806] Issues with key file persistence & exit node lookup #6807

Open
opened 2026-04-25 15:45:27 -05:00 by GiteaMirror · 16 comments

Originally created by @Soitora on GitHub (Nov 3, 2025).
Original GitHub issue: https://github.com/fosrl/pangolin/issues/1806

Originally assigned to: @oschwartz10612 on GitHub.

Describe the Bug

After restarting my Unraid server (no config changes for months), Pangolin/Gerbil failed to start. On boot, Gerbil logs an error about failing to assign an IP due to an invalid CIDR, and fetching the remote config returns a 404 for the gerbil endpoint.

Key log lines:

INFO: 2025/10/31 06:13:10 Fetching remote config from http://pangolin:3001/api/v1/gerbil/get-config
INFO: 2025/10/31 06:13:10 Created WireGuard interface wg0
FATAL: 2025/10/31 06:13:10 Failed to assign IP address: failed to parse IP address: invalid CIDR address:

Manual check of the endpoint:

curl http://pangolin:3001/api/v1/gerbil/get-config

Response:

{"data":null,"success":false,"error":true,"message":"The requests url is not found - /api/v1/gerbil/get-config","status":404,"stack":null}

This happened after a power outage / UPS-triggered shutdown and subsequent server reboot. No configuration, keys, or exit node edits were made by me prior to this failure.

Environment

  • OS Type & Version: Unraid 7.2.0
  • Pangolin Version: latest (1.12.1)
  • Gerbil Version: latest (1.2.2)
  • Traefik Version: latest (3.5.4?)
  • Newt Version: not applicable
  • Olm Version: not applicable

To Reproduce

I cannot reliably reproduce this from scratch, but the failure happened under the following conditions:

  1. Running Pangolin + Gerbil on Unraid for some time with a single exit node configured.
  2. A power outage caused the UPS to signal shutdown and the server to be rebooted.
  3. After reboot, Gerbil fails to start with the "invalid CIDR address:" error, and the /api/v1/gerbil/get-config endpoint returns a 404.

Observed behavior suggests a database state mismatch where the exit node entry and/or sequence cause Gerbil to generate or reference an invalid/empty CIDR.

Expected Behavior

  • Gerbil should either fail gracefully with a clear, actionable error if DB contains inconsistent exit node data, or attempt to recover automatically (e.g., regenerate exit node or correct sequence inconsistencies).
  • The /api/v1/gerbil/get-config endpoint should return appropriate config (not 404) when the service is up.
GiteaMirror added the bug label 2026-04-25 15:45:27 -05:00

@Soitora commented on GitHub (Nov 3, 2025):

I managed to resolve it, finally... (a couple of days later)

Here’s what I did:

  1. Stopped Pangolin and related services (including Gerbil).
  2. Opened the db.sqlite file and:
    • Went to the exitNodes table.
    • Deleted my only exit node entry.
  3. Restarted Pangolin and Gerbil to let a new exit node auto-generate.
  4. Stopped the services again and edited the database once more:
    • Set the new exit node’s exitNodeId to 1 (instead of 2).
    • Updated the sqlite_sequence table so the exitNodes sequence was reset to 1.
  5. Renamed the letsencrypt folder so Pangolin could generate new certificates.
  6. Restarted everything; after about a minute, everything worked again.

This indicates the issue wasn't with my Unraid setup or environment, but likely a mismatch or corruption in the Pangolin database state that caused Gerbil to fail to assign its IP on boot?

Resetting the exit node and certificates forced a clean reinitialization, resolving the invalid CIDR and 404 config issues.
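
Since the steps above edit db.sqlite by hand, it may be worth copying the file before touching it. The following is only a minimal sketch: the function name backup_db and the example path are hypothetical, and it assumes Pangolin and Gerbil are already stopped (step 1) so the database is not being written while it is copied.

```shell
#!/bin/sh
# backup_db DB
# Copy DB to a timestamped sibling file and print the path of the copy.
# Assumption: Pangolin and Gerbil are stopped, so DB is not being written.
backup_db() {
    db=$1
    if [ ! -f "$db" ]; then
        echo "no such database: $db" >&2
        return 1
    fi
    copy="$db.$(date +%Y%m%d-%H%M%S).bak"
    # -p keeps the original timestamps and permissions on the copy.
    cp -p "$db" "$copy" || return 1
    echo "$copy"
}

# Example (hypothetical path, adjust to your stack):
#   backup_db ./config/db/db.sqlite
```

If the manual edits go wrong, the copy can be moved back over db.sqlite while the services are still stopped.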


@oschwartz10612 commented on GitHub (Nov 8, 2025):

@Soitora sorry for this! Glad it's fixed. Thanks for the write-up of the fix.


@Alucard133 commented on GitHub (Nov 19, 2025):

I actually had the same issue today, thanks @Soitora for the explanation on how to fix it!


@PhilRW commented on GitHub (Dec 5, 2025):

This helped me today as well; definitely seems like a bug. I encountered it TWICE today and my only action was running docker-compose down on the stack.


@Soitora commented on GitHub (Dec 6, 2025):

@oschwartz10612 Maybe it should be investigated further if it still happens to people? Even I had it happen to me twice; the first time, I solved it by just starting from scratch.


@Jurrer commented on GitHub (Dec 10, 2025):

Happened to me when I was updating compose. Thanks to you, @Soitora, kind sir, I managed to fix this issue.

I only removed the entry from exitNodes and it worked.


@cantchooseaname8 commented on GitHub (Dec 12, 2025):

Just happened to me as well. I originally had a local-only site without gerbil in my compose. As soon as I updated the compose to include gerbil, it started throwing this error.


@GovSat1 commented on GitHub (Dec 12, 2025):

Same issue here


@oschwartz10612 commented on GitHub (Dec 12, 2025):

What could be happening here is that you are losing or corrupting the key file in the config that gerbil sends to Pangolin. That could be why we are getting a 404.

Could anyone confirm whether this key file was removed, changed, or corrupted, and that is causing this issue?


@okfro commented on GitHub (Dec 14, 2025):

> What could be happening here is that you are losing or corrupting the key file in the config that gerbil sends to Pangolin. That could be why we are getting a 404.
>
> Could anyone confirm whether this key file was removed, changed, or corrupted, and that is causing this issue?

Bingo. I have been having this issue every time I docker compose down && docker compose up -d my stack. At v1.12.x, the SQLfu above would mitigate the problem:

sqlite3 pangolin/db/db.sqlite
delete from exitNodes;
update sqlite_sequence set seq=1 where name='exitNodes';
.quit

Then, starting in v1.13.x, I started getting a different error: "Failed to parse private key: wgtypes: failed to parse base64-encoded key: illegal base64 data at input byte 44". I assumed this was a variation of the same fault: wgtypes doesn't exist in the schema (afaict), but it's obviously a WireGuard error, and I was guessing it was emerging from trying to parse the exit node. So I deleted the exit node. It seems that in v1.13.x, Pangolin no longer recreates the exit node if it is missing, because I then started getting a missing exitNode error. So I manually inserted a record:

INSERT INTO exitNodes VALUES(1,'Exit Node AbCdEF','100.89.128.1/24','pangolin.exam.ple','<your key val>',41820,'http://gerbil:3003',NULL,1,NULL,'gerbil',NULL);

And I was back to the original error. So at that point, I just removed Gerbil from the stack so I could get back up and running.

This morning, seeing @oschwartz10612's comment, I took a look at my ./stack_folder/pangolin/key file. Over the past week, I have been moving all my config content into git, and the key file was included.

  • cat pangolin/key shows: +DpQehbCWGJwlzYa0Cjdt36U77YEHTaQ/PyxjhDtTmg=⏎
  • nano pangolin/key shows: +DpQehbCWGJwlzYa0Cjdt36U77YEHTaQ/PyxjhDtTmg= ^o^n

So the sneaky ⏎ made it into git, and that gives the trailing ^o^n on the key. I removed that, restored gerbil to the stack, restarted and voilà!

I hope this is the end of the problem(s) for me. Thanks to all in this thread for sharing your insights.
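
A base64-encoded WireGuard key is 44 characters long (32 bytes, ending in a single "="), so "illegal base64 data at input byte 44" is consistent with something extra sitting right after the key. The check described above can be sketched as a small function. This is an illustration only: check_wg_key is a hypothetical name, and the assumption that the file must hold exactly the 44 key characters and nothing else comes from this comment's experience, not from Gerbil's documentation.

```shell
#!/bin/sh
# check_wg_key FILE
# Succeed only if FILE holds exactly one 44-character base64 key and
# nothing else (no trailing space or newline).
# Assumption: any extra byte after the key is unwanted, as reported above.
check_wg_key() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "missing: $file"
        return 1
    fi
    # wc -c counts bytes; some implementations pad the number with spaces.
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -ne 44 ]; then
        echo "unexpected size: $size bytes (expected 44)"
        return 1
    fi
    # 43 base64 characters followed by one '=' encode exactly 32 bytes.
    if ! grep -Eq '^[A-Za-z0-9+/]{43}=$' "$file"; then
        echo "not a base64-encoded 32-byte key"
        return 1
    fi
    echo "ok"
}

# Example (path as used in this comment): check_wg_key pangolin/key
```

Running od -c on the key file is another way to see exactly which bytes follow the "=".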


@Soitora commented on GitHub (Dec 17, 2025):

I had a loss of power (although the system shut down safely via the UPS), and this issue came back once I rebooted.
Is there supposed to be a key in the main folders?

[Two screenshots attached]

@oschwartz10612 commented on GitHub (Dec 19, 2025):

If you are using gerbil there should be a file called key in the config dir but it looks like it's not there!? That's strange! Does it come back if you recover your instance?


@Soitora commented on GitHub (Dec 19, 2025):

> If you are using gerbil there should be a file called key in the config dir but it looks like it's not there!? That's strange! Does it come back if you recover your instance?

I checked my weekly backups all the way to late July, and not a single one of them has a file called Key


@adrianipopescu commented on GitHub (Dec 21, 2025):

I fixed this for myself by killing both the pangolin and gerbil containers, removing the newly autogenerated key from gerbil, restarting gerbil so it generates its key, and then bringing pangolin up so it reinserts its row in the db for this gerbil with that key.

now -- it does seem like I need to regenerate newt identities as I'm seeing a lot of

INFO: 2025/12/21 03:31:34 Cleared 0 sessions for WG IP: <ip>
ERROR: 2025/12/21 03:31:53 Failed to decrypt message: failed to decrypt message: chacha20poly1305: message authentication failed
ERROR: 2025/12/21 03:31:54 Failed to decrypt message: failed to decrypt message: chacha20poly1305: message authentication failed
ERROR: 2025/12/21 03:31:55 Failed to decrypt message: failed to decrypt message: chacha20poly1305: message authentication failed
ERROR: 2025/12/21 03:32:27 Failed to decrypt message: failed to decrypt message: chacha20poly1305: message authentication failed
ERROR: 2025/12/21 03:32:29 Failed to decrypt message: failed to decrypt message: chacha20poly1305: message authentication failed

here's an easy approach to read the db, feel free to run deletes and whatnot

# Bind mount uses an absolute host path; older Docker CLIs reject "./..." here.
docker run --rm -it -v "$PWD/pangolin/config/db:/db" alpine:3.20 sh -lc '
  apk add --no-cache sqlite >/dev/null
  sqlite3 /db/db.sqlite "select exitNodeId, reachableAt, publicKey from exitNodes order by exitNodeId;"
'

@farru1998 commented on GitHub (Jan 15, 2026):

In my case as well, the key is somehow getting deleted after a pod restart in a Kubernetes environment, although I have persisted it via a persistent volume claim.

[Screenshot attached]

@oschwartz10612 commented on GitHub (Jan 17, 2026):

You can use the GENERATE_AND_SAVE_KEY_TO env var or --generateAndSaveKeyTo flag on gerbil to define where it writes the key file to. Maybe this will help?

I think we will work on adjusting gerbil to use an ephemeral key in the future.
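
Before pointing GENERATE_AND_SAVE_KEY_TO (or --generateAndSaveKeyTo) at a new location, it may help to confirm that the target directory actually exists and is writable from inside the container or pod. The sketch below assumes nothing about Gerbil itself: check_key_target is a hypothetical helper, and /var/config/key is only an example path.

```shell
#!/bin/sh
# check_key_target PATH
# Report whether the directory that would hold the key file exists and is
# writable. PATH is the value you intend to give GENERATE_AND_SAVE_KEY_TO.
check_key_target() {
    target=$1
    dir=$(dirname "$target")
    if [ ! -d "$dir" ]; then
        echo "directory missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "directory not writable: $dir"
        return 1
    fi
    echo "ok"
}

# Example (hypothetical path):
#   GENERATE_AND_SAVE_KEY_TO=/var/config/key
#   check_key_target "$GENERATE_AND_SAVE_KEY_TO"
```

If the directory is on a mounted volume, checking that the key file is still there after a restart shows whether the mount really persists.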

Reference: github-starred/pangolin#6807