Mirror of https://github.com/open-webui/open-webui.git (synced 2026-05-06 02:48:13 -05:00)
[GH-ISSUE #23134] issue: [Bug] Database migration chat → chat_message OOM-kills on large datasets (PostgreSQL/AlloyDB, ~75 GB) #35420
Originally created by @adesso-pia-vonkolken on GitHub (Mar 27, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23134
Check Existing Issues
Installation Method: Docker
Open WebUI Version: v0.8.5
Ollama Version (if applicable): No response
Operating System: Google Cloud Run Service
Browser (if applicable): No response
Confirmation
Expected Behavior
Environment: v0.8.5 with commit b4f3408 (from PR #21542) cherry-picked.
Upgrading from v0.7.2 to v0.8.5 (or later) should run the Alembic database migration, including the chat → chat_message table migration, successfully without crashing, regardless of dataset size.
Actual Behavior
The built-in Alembic migration hangs and is ultimately terminated with Signal 9 (OOM-kill) during the Add chat_message table migration step. This occurs even when running the migration in a dedicated Cloud Run Job with 34 GB of memory allocated, which we set up specifically to avoid disrupting the production service.
After the Add chat_message table log line, the migration produces no further output; it runs for approximately 15 minutes and is then OOM-killed.
Steps to Reproduce
1. Deploy Open WebUI v0.8.5 with commit b4f3408 (from PR #21542) cherry-picked.
2. Run the built-in Alembic migration against a PostgreSQL/AlloyDB database with a large (~75 GB) chat table.
Logs & Screenshots
INFO:open_webui.internal.db:Starting migrations
INFO:open_webui.internal.db:There is nothing to migrate
INFO:open_webui.env:Running migrations
INFO:alembic.runtime.plugins:setup plugin alembic.autogenerate.schemas
[... alembic plugin setup ...]
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.runtime.migration] Running upgrade a5c220713937 -> b7d2df9e1ab3, Add api_key column to user table if missing
INFO [alembic.runtime.migration] Running upgrade c440947495f3 -> 374d2f66af06, Add prompt history table
INFO [alembic.runtime.migration] Running upgrade 374d2f66af06 -> 8452d01d26d7, Add chat_message table
WARNING Container terminated on signal 9.
Additional Information
We are unable to upgrade to any version beyond v0.7.2 in production due to this issue. The migration fails consistently, blocking all version updates.
@Classic298 commented on GitHub (Mar 27, 2026):
0.8.5 is outdated; newer versions introduced batched processing for this migration. Please read the changelogs and use a newer version with the batched migration.
@Classic298 commented on GitHub (Mar 27, 2026):
0.8.9 and newer (best to just try 0.8.12) brought batched performance improvements to the migration, as per the changelogs.
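For readers following along: "batched processing" here means streaming rows out of the old chat table and flushing chat_message inserts in fixed-size chunks, so peak memory is bounded by the batch size, not the table size. Below is a minimal illustrative sketch of that pattern, using SQLite so it is self-contained; the table names, JSON shape, and function are assumptions for illustration, not the actual Open WebUI migration code.

```python
import json
import sqlite3

BATCH_SIZE = 1000  # flush inserts in fixed-size chunks


def migrate_chats_batched(conn: sqlite3.Connection) -> int:
    """Copy messages out of chat JSON blobs into chat_message rows,
    flushing every BATCH_SIZE rows so memory stays roughly flat."""
    read_cur = conn.cursor()   # iterating a cursor streams rows lazily
    write_cur = conn.cursor()
    batch, total = [], 0
    for chat_id, blob in read_cur.execute("SELECT id, chat FROM chat"):
        for msg in json.loads(blob).get("messages", []):
            batch.append((msg["id"], chat_id, msg.get("content", "")))
        if len(batch) >= BATCH_SIZE:
            write_cur.executemany(
                "INSERT INTO chat_message (id, chat_id, content) VALUES (?, ?, ?)",
                batch,
            )
            total += len(batch)
            batch.clear()
    if batch:  # flush the final partial batch
        write_cur.executemany(
            "INSERT INTO chat_message (id, chat_id, content) VALUES (?, ?, ?)",
            batch,
        )
        total += len(batch)
    conn.commit()
    return total
```

The same shape applies with SQLAlchemy against PostgreSQL, where `execution_options(stream_results=True)` uses a server-side cursor instead of buffering the whole result set on the client.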
@adesso-pia-vonkolken commented on GitHub (Mar 27, 2026):
We already cherry-picked these changes onto v0.8.5 from the dev branch, but they did not help us. As far as I can tell, aside from commit b4f340806a, no further significant optimization changes were made to the migration. Cherry-picking this commit did not resolve the issue for us either.
Commit 06657b8109 fixes an AttributeError that occurs when history/messages fields contain lists instead of dicts; this does not apply to our case, as we did not encounter that error.
@Classic298 commented on GitHub (Mar 27, 2026):
With how much memory exactly are you running OOM? With a 75 GB chat message table, even with batched processing, you are bound to see some memory growth. How much memory did you allocate to Open WebUI specifically?
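To put rough numbers on that expectation: with batching, peak memory depends on the batch size and average row size, not the table's total 75 GB. Back-of-the-envelope arithmetic under assumed values (both inputs are illustrative, not measurements from this deployment):

```python
# Illustrative arithmetic only; both inputs are assumptions, not measurements.
avg_chat_blob_bytes = 200_000      # assume ~200 KB of JSON per chat row
batch_size = 1_000                 # rows held in memory before a flush

peak_batch_bytes = avg_chat_blob_bytes * batch_size
print(f"{peak_batch_bytes / 1e9:.1f} GB per in-flight batch")  # 0.2 GB

# Even with a generous 10x overhead for Python object bloat, that is ~2 GB,
# nowhere near the 34 GB allocated -- which is why a linear ramp to OOM
# suggests batching is not actually in effect.
```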
@Classic298 commented on GitHub (Mar 27, 2026):
And how did you cherry-pick the performance optimizations? How did you deploy them? To be sure you actually have the right changes, you would need to run the newer version I mentioned.
If you modify the file inside the Docker container, the modification might get lost on `up -d`, or however Google Cloud Run handles it.
And did you ensure the cherry-picked edits were still there at runtime? Did you cherry-pick the FULL changes to the file?
@Classic298 commented on GitHub (Mar 27, 2026):
I just had two more agents verify the current code.
It is IMPOSSIBLE to have continuously growing memory, especially 30+ GB (I now found the metric in your issue), during the migration.
You just might not have the changes applied. Your cherry-picks may not have worked, or weren't there at runtime.
Please try, as I said, with the latest version; it must work.
@Classic298 commented on GitHub (Mar 27, 2026):
How exactly did you cherry-pick the changes into your Cloud Run deployment?
To get a cherry-pick running on Cloud Run, you would have needed to: rebuild the container image with the patched file, push it to your registry, and point the Cloud Run Job at the new image.
Specifically: did you update the container image on the Cloud Run Job you're using for migration, or only on the Cloud Run Service?
Did you verify the patched code is actually running at runtime?
Cloud Run is immutable. It runs whatever is baked into the container image. If the image wasn't rebuilt correctly, or if a cached Docker layer was used, or if the job is still pointing at the old image, the fix simply won't be there. The memory profile you're showing (linear ramp to OOM) is exactly what the unpatched code looks like. The patched version streams rows and flushes in batches, so memory stays roughly flat.
The easiest path forward: Rather than cherry-picking onto v0.8.5, just deploy v0.8.12 directly. It includes this fix and several other improvements. That eliminates any risk of an incomplete or conflicting cherry-pick.
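One low-tech way to settle the "is the patched code actually running?" question: add an unmistakable log line at the top of the cherry-picked migration and look for it in the Cloud Run Job logs. This is a hypothetical sketch; the marker string is made up, and the real migration body is elided:

```python
import logging

# Alembic migrations log through loggers under "alembic.*", so this marker
# will appear alongside the "Running upgrade ..." lines shown in the issue.
logger = logging.getLogger("alembic.runtime.migration")

# Hypothetical marker: if this string never shows up in the Cloud Run Job
# logs, the patched migration file is not the one being executed.
PATCH_MARKER = "PATCH-b4f3408: batched chat_message migration active"


def upgrade():
    logger.info(PATCH_MARKER)
    # ... the actual (batched) migration body goes here ...
```

If the marker appears but memory still ramps linearly, the problem is in the patch itself; if it never appears, the deployment is running the old image.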
@adesso-pia-vonkolken commented on GitHub (Mar 27, 2026):
We did exactly the steps that you mentioned.
As far as I could see, the newest image was used, including the cherry-pick; but as you mentioned, Cloud Run Jobs are effectively black boxes with respect to which code actually runs.
We will try using v0.8.12 directly, hopefully this will help.