[GH-ISSUE #23273] bug: Migration `8452d01d26d7` fails on large PostgreSQL datasets due to single-transaction backfill with FK/index maintenance #58605
Originally created by @madnight on GitHub (Mar 31, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23273
[Bug] Migration `8452d01d26d7` fails on large PostgreSQL datasets due to single-transaction backfill with FK/index maintenance
Installation Method
Docker (Cloud Run)
Open WebUI Version
v0.8.12
Operating System
Google Cloud Run + AlloyDB for PostgreSQL (managed, PostgreSQL 15.15)
Expected Behavior
Upgrading from v0.7.2 to v0.8.12 should complete the Alembic migration `8452d01d26d7` ("Add chat_message table") successfully, regardless of dataset size. The migration uses `yield_per=1000`, `stream_results=True`, and `BATCH_SIZE=5000` — which correctly prevents Python-side OOM — but the underlying database transaction is still unbounded.
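For reference, that combination corresponds to SQLAlchemy's server-side streaming pattern — roughly the following shape (a sketch, not the migration's verbatim code; the engine URL is a placeholder):

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://localhost/openwebui")  # placeholder

with engine.connect() as conn:
    result = conn.execution_options(
        stream_results=True,  # server-side cursor on PostgreSQL
        yield_per=1000,       # fetch from that cursor 1000 rows at a time
    ).execute(sa.text("SELECT id, chat FROM chat"))
    for partition in result.partitions():  # each partition holds up to 1000 rows
        for row in partition:
            pass  # process the chat row
```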
Actual Behavior
The migration runs for ~55 minutes and is then terminated by AlloyDB with `OperationalError: server closed the connection unexpectedly`. The database remains on alembic revision `c440947495f3` (pre-migration), with the `chat_message` table partially created but the backfill incomplete and rolled back. During the entire migration window, the application is unusable — all backend requests hang, likely because the service also attempts to run migrations on startup and blocks on the open transaction.
Environment
europe-west3

Root Cause Analysis
PR #21542 (commit `b4f3408`) fixed the original `fetchall()` pathology and reduced Python heap pressure, but it did not make the migration operationally safe for large PostgreSQL datasets. There are three interacting problems:

1. Single unbounded transaction
The upstream `env.py` wraps all migrations in a single transaction (see the sketch after this paragraph). The `_flush_batch()` function uses `conn.begin_nested()` (savepoints), but these are nested within this outer transaction — no data is actually committed to disk until the migration function returns. The migration inserts millions of rows with 7 indexes and an FK constraint in a single transaction, generating enormous WAL volume. While PostgreSQL checkpoints can still advance and recycle old WAL segments during the transaction (the redo pointer and `KeepLogSeg` are not constrained by open user transactions), new WAL is produced far faster than recycling can reclaim space. For our dataset (68 GB, 161k chats), the migration ran for 55 minutes at sustained high write throughput until AlloyDB terminated the connection — likely due to managed-service limits on connection duration, write throughput, or storage pressure.
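For context, the stock Alembic online-migration template has roughly this shape — a sketch of the standard pattern, not a verbatim copy of Open WebUI's `env.py`:

```python
from alembic import context
from sqlalchemy import engine_from_config, pool

def run_migrations_online() -> None:
    # Engine built from the [alembic] section of alembic.ini.
    connectable = engine_from_config(
        context.config.get_section(context.config.config_ini_section),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    with connectable.connect() as connection:
        context.configure(connection=connection)
        # One outer transaction around *all* pending migrations:
        # nothing is durably committed until run_migrations() returns.
        with context.begin_transaction():
            context.run_migrations()
```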
2. Foreign key and indexes created before the backfill
The migration creates the `chat_message` table with a `FOREIGN KEY` constraint on `chat.id`, 4 single-column indexes, and 3 composite indexes — all before inserting any data. This means every insert during the backfill must:

- update all 7 indexes, row by row
- validate the FK by probing the `chat.id` index for every inserted row, adding I/O overhead per row

In contrast, a post-load `CREATE INDEX` uses sorted sequential bulk construction with batched WAL records (`btbuild`), which is orders of magnitude more efficient. PostgreSQL's bulk loading documentation explicitly warns against this pattern: create indexes and add FK constraints after the data load, not before.
3. Server-side cursors invalidated by COMMIT
The migration uses `yield_per=1000` with `stream_results=True`, which on PostgreSQL creates a server-side cursor. Psycopg's server-side cursors are `WITHOUT HOLD` by default, meaning they stop being fetchable after COMMIT. This makes a naive "just add commits inside the loop" patch fundamentally incorrect — the streaming cursor becomes invalid after the first commit. A correct fix needs either keyset pagination (a fresh `SELECT` per chunk, ordered by primary key with a `WHERE id > last_seen_id`) or separate read/write connections with holdable cursor semantics.
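The failure mode is easy to demonstrate in isolation — a minimal sketch using psycopg2 directly (the DSN is a placeholder):

```python
import psycopg2

conn = psycopg2.connect("dbname=openwebui")  # placeholder DSN
cur = conn.cursor(name="stream")  # named cursor = server-side, WITHOUT HOLD
cur.execute("SELECT id FROM chat ORDER BY id")
first_batch = cur.fetchmany(1000)  # works: the portal is still open

conn.commit()  # COMMIT closes all WITHOUT HOLD cursors

cur.fetchmany(1000)  # raises ProgrammingError: the named cursor no longer exists
```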
Why this went undetected
This issue likely went undetected upstream because SQLite is the default backend and does not exhibit the same WAL, FK-trigger, or server-side cursor characteristics as PostgreSQL. Testing with small datasets (a few thousand chats) would also not surface the problem.
Steps to Reproduce

1. Seed a large `chat` table (see "Generating test data" below) and upgrade from v0.7.2 to v0.8.12
2. Migration `8452d01d26d7` starts, streams chat rows, and inserts into `chat_message`
3. After ~55 minutes the server closes the connection and the transaction rolls back
4. `chat_message` is empty, alembic version unchanged

Generating test data
To reproduce at scale, you need to seed the `chat` table with large JSON payloads. An important caveat: PostgreSQL TOAST compression is extremely effective on repetitive text (e.g. Lorem ipsum). Using Lorem ipsum as padding results in ~1 GB on disk instead of the intended 68 GB. Use `os.urandom().hex()` or similar random data for the message content to avoid compression and achieve realistic on-disk sizes. A simple seeding approach uses direct SQL inserts with a size distribution matching a real production dataset; a sketch follows.
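A minimal seeding sketch along those lines — the DSN, row count, and payload sizes are placeholder assumptions, and the column list follows the `chat` model (adjust to the actual schema; the real run used a production-matched size distribution):

```python
import json
import os
import time
import uuid

import psycopg2

conn = psycopg2.connect("dbname=openwebui")  # placeholder DSN
cur = conn.cursor()
now = int(time.time())

for i in range(100_000):  # placeholder row count; scale to the target table size
    # Random hex defeats TOAST compression, so on-disk size tracks payload size.
    messages = [
        {"role": "user" if n % 2 == 0 else "assistant",
         "content": os.urandom(4096).hex()}
        for n in range(8)
    ]
    cur.execute(
        "INSERT INTO chat (id, user_id, title, chat, created_at, updated_at) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (str(uuid.uuid4()), "seed-user", f"seed chat {i}",
         json.dumps({"messages": messages}), now, now),
    )
    if i % 1000 == 999:
        conn.commit()

conn.commit()
conn.close()
```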
Logs
Proposed Fix
The migration needs to be restructured to follow PostgreSQL's bulk loading guidance. The core changes (see the SQL sketch after this list):

- Commit in chunks. Server-side cursors are `WITHOUT HOLD` and become invalid after `COMMIT`, so paginate explicitly with `WHERE id > :last_id ORDER BY id LIMIT :chunk_size`
- Make inserts idempotent (`ON CONFLICT DO NOTHING`) — so the migration can be safely restarted after a mid-run failure
- Create the indexes after the data load (`CREATE INDEX CONCURRENTLY` if the service must remain online, though note this cannot run inside a transaction block)
- Add the FK as `NOT VALID`, then `VALIDATE CONSTRAINT` — `ALTER TABLE ADD CONSTRAINT ... NOT VALID` skips validation of existing rows during creation; a subsequent `VALIDATE CONSTRAINT` checks them without holding an `ACCESS EXCLUSIVE` lock
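In SQL terms, the post-load steps might look like this (the index and constraint names are illustrative, not the migration's actual identifiers):

```sql
-- After the backfill has committed in chunks:
CREATE INDEX CONCURRENTLY chat_message_chat_id_idx
    ON chat_message (chat_id);              -- cannot run inside a transaction block

ALTER TABLE chat_message
    ADD CONSTRAINT chat_message_chat_id_fkey
    FOREIGN KEY (chat_id) REFERENCES chat (id)
    NOT VALID;                              -- skips scanning existing rows

ALTER TABLE chat_message
    VALIDATE CONSTRAINT chat_message_chat_id_fkey;
    -- checks existing rows without an ACCESS EXCLUSIVE lock
```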
Approach 1: Rewrite the migration (recommended upstream fix)
If the backfill remains inside Alembic, the migration should use `op.get_context().autocommit_block()` or a separate connection strategy to commit chunks independently. The backfill loop must not rely on one long-lived streaming cursor if chunk commits are introduced (see root cause #3 above). High-level structure (a sketch follows the list):

1. Create `chat_message` with PK only (no indexes, no FK)
2. Backfill in keyset-paginated chunks (`WHERE id > :last_id ORDER BY id LIMIT :chunk_size`), parse messages, insert with `ON CONFLICT (id) DO NOTHING`, commit after each chunk
3. Create the indexes
4. Add the FK with `ALTER TABLE chat_message ADD CONSTRAINT ... FOREIGN KEY ... NOT VALID`, then `VALIDATE CONSTRAINT` separately
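A condensed sketch of step 2, assuming a hypothetical `extract_messages()` parser and an illustrative `chat_message` column set (the real migration's parsing logic and columns differ):

```python
import json

import sqlalchemy as sa
from alembic import op


def extract_messages(chat_json):
    # Hypothetical parser: the real migration walks the chat JSON structure.
    data = chat_json if isinstance(chat_json, dict) else json.loads(chat_json)
    return data.get("messages") or []


def backfill_chat_messages(chunk_size: int = 5000) -> None:
    conn = op.get_bind()
    last_id = ""
    with op.get_context().autocommit_block():  # statements commit independently
        while True:
            # Fresh SELECT per chunk: keyset pagination, no long-lived cursor.
            rows = conn.execute(
                sa.text("SELECT id, chat FROM chat "
                        "WHERE id > :last_id ORDER BY id LIMIT :n"),
                {"last_id": last_id, "n": chunk_size},
            ).fetchall()
            if not rows:
                break
            params = [
                {"id": f"{chat_id}-{i}",  # illustrative message-id scheme
                 "chat_id": chat_id,
                 "role": msg.get("role"),
                 "content": msg.get("content")}
                for chat_id, chat_json in rows
                for i, msg in enumerate(extract_messages(chat_json))
            ]
            if params:
                # Idempotent insert: safe to re-run after a mid-run crash.
                conn.execute(
                    sa.text("INSERT INTO chat_message (id, chat_id, role, content) "
                            "VALUES (:id, :chat_id, :role, :content) "
                            "ON CONFLICT (id) DO NOTHING"),
                    params,
                )
            last_id = rows[-1][0]
```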
Approach 2: Separate backfill from schema migration
Use Alembic only for schema creation (table + PK). Run the backfill as a dedicated resumable job outside of Alembic, then stamp the revision manually. This is the safest option for very large datasets and follows PostgreSQL's documented bulk-load guidance most closely.
Supporting change: `transaction_per_migration=True` in `env.py`
Adding `transaction_per_migration=True` to `context.configure()` in `env.py` is recommended by Alembic when migrations use autocommit blocks. This is supporting infrastructure — it does not fix the migration itself, but prevents other migrations in the upgrade chain from being affected.
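In `env.py` this is a one-line addition to the existing `configure()` call — a sketch assuming the stock template shape (other arguments elided):

```python
from alembic import context

# Inside run_migrations_online(), with `connection` from connectable.connect():
context.configure(
    connection=connection,
    transaction_per_migration=True,  # each migration gets its own transaction
)
```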
Additional Context

- The backfill generates new WAL and crosses `max_wal_size` thresholds faster than recycling can keep up.
- To quantify the WAL volume, capture `pg_stat_wal.wal_bytes`, `wal_records`, and `wal_fpi` before and after a test run, or use `pg_stat_get_backend_wal()` for the migration backend where available. `pg_stat_wal` is cumulative since the last stats reset, so a before/after delta is required (see the snippet after this list).
- The `chat` column uses `json` (not `jsonb`), as created by migration `242a2047eae0`. This is relevant for any server-side SQL approach: `jsonb_each()` cannot operate on `json` directly — either `json_each()` or an explicit cast to `jsonb` is required, and the cast adds CPU overhead at scale.
- Tuning `BATCH_SIZE` higher or lower does not help — the issue is the combination of the single transaction boundary, FK constraint validation, and index maintenance during the load.
- Raising `maintenance_work_mem` and `max_wal_size` (where the provider allows it) may reduce pain for smaller datasets but does not fix the structural problem in the migration.
- With chunked `COMMIT`/`BEGIN` inside the migration, if the process crashes mid-way, `alembic_version` still points to the pre-migration revision (it is only updated after `upgrade()` returns). A restart would re-run the migration from scratch. The `ON CONFLICT DO NOTHING` clause makes this safe — already-inserted messages are skipped, so the backfill is fully resumable.
- The earlier migration `242a2047eae0_update_chat_table.py` (which converts the `chat` column from `Text` to `JSON`) also reads every row and writes them back in a single transaction. It presumably succeeded when running against smaller datasets but would exhibit the same WAL problem at scale.
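A minimal way to take that delta (PostgreSQL 14+, where `pg_stat_wal` exists):

```sql
-- Before the test run:
SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;

-- ... run the migration ...

-- After the test run; subtract the earlier values to get the migration's WAL cost:
SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;
```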
Methodology
This analysis was produced with the help of GPT 5.4 xhigh, Claude Opus 4.6, and GPT 5.4 Extended Pro (30+ minutes thinking time). Each model was provided with extensive context including the full migration source code (`8452d01d26d7_add_chat_message_table.py`), related issues (#23134, #22568), the batching fix commit (`b4f3408`), PR #21542, live database diagnostics (table sizes, WAL metrics, TOAST analysis), AlloyDB documentation, and PostgreSQL source code (`src/backend/access/transam/xlog.c`, `src/backend/commands/trigger.c`, `src/backend/utils/mmgr/portalmem.c`, `src/backend/executor/execIndexing.c`, `src/backend/access/nbtree/nbtinsert.c`). All claims were verified against the PostgreSQL source tree. All three models independently identified the same root causes: per-row index write amplification and unbounded transaction duration, not the Python-side batching.

@pr-validator-bot commented on GitHub (Mar 31, 2026):
⚠️ Missing Issue Title Prefix
@madnight, your issue title is missing a prefix (e.g., `bug:`, `feat:`, `docs:`). Please update your issue title to include one of the following prefixes:

Example: `bug: Login fails when using special characters in password`

@Classic298 commented on GitHub (Mar 31, 2026):
I want to say this is wrong, because we had many people here who previously ran out of memory on 100 GB+ migrations, with memory usage of 40 GB+. And with the current changes it works for them just fine on PostgreSQL.
@madnight commented on GitHub (Mar 31, 2026):
@Classic298 I suggest that you either attempt to reproduce the issue using the steps provided or enter the Issue Details I have provided into your own Claude Code to verify their plausibility.
@Classic298 commented on GitHub (Apr 14, 2026):
@madnight action wanted on the PR, thanks
@madnight commented on GitHub (Apr 14, 2026):
Hi @Classic298 @tjbck ,
I have some updates. We got the migration working as-is and migrated DEV and PROD to 0.8.12. For it to work, we had to scale up the database to 16 cores and 128 GB of DB memory, which is easily doable in the cloud but might not be doable if someone is operating OWUI on a dedicated server, say a single machine on Hetzner. Depending on the exact number of chat entries and the size of the chat table, the migration for a 50 GB+ chat table still runs somewhere between 1 and 3 hours, during which you basically have downtime for the whole thing. If you have an even bigger table, say in the scale of 200 GB+, I would guesstimate you need at least 265 GB of DB memory and up to 6 hours of migration time with the current migration in 0.8.12.
For our purposes, this workaround is sufficient at this point. Hence, I would invite feedback from the community on whether others with large Postgres instances are seeing similar issues.
@Classic298 commented on GitHub (Apr 14, 2026):
Hm great news for you, but since nobody else is in this issue and WE cannot test this, there is now nobody to test the PR.
@madnight commented on GitHub (Apr 14, 2026):
If nobody else has this issue, then I think it would be fine to close it for now until someone else stumbles upon it in a discussion or a new issue. In that case we already have a deep issue analysis + PR ready and could request testing.