[GH-ISSUE #23273] bug: Migration `8452d01d26d7` fails on large PostgreSQL datasets due to single-transaction backfill with FK/index maintenance #58605
Originally created by @madnight on GitHub (Mar 31, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23273
[Bug] Migration `8452d01d26d7` fails on large PostgreSQL datasets due to single-transaction backfill with FK/index maintenance
Installation Method
Docker (Cloud Run)
Open WebUI Version
v0.8.12
Operating System
Google Cloud Run + AlloyDB for PostgreSQL (managed, PostgreSQL 15.15)
Expected Behavior
Upgrading from v0.7.2 to v0.8.12 should complete the Alembic migration `8452d01d26d7` ("Add chat_message table") successfully, regardless of dataset size. The migration uses `yield_per=1000`, `stream_results=True`, and `BATCH_SIZE=5000` — which correctly prevents Python-side OOM — but the underlying database transaction is still unbounded.
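For reference, that combination corresponds to SQLAlchemy's server-side streaming pattern — roughly the following shape (a sketch, not the migration's verbatim code; the engine URL is a placeholder):

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://localhost/openwebui")  # placeholder

with engine.connect() as conn:
    result = conn.execution_options(
        stream_results=True,  # server-side cursor on PostgreSQL
        yield_per=1000,       # fetch from that cursor 1000 rows at a time
    ).execute(sa.text("SELECT id, chat FROM chat"))
    for partition in result.partitions():  # each partition holds up to 1000 rows
        for row in partition:
            pass  # process the chat row
```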
Actual Behavior
The migration runs for ~55 minutes and is then terminated by AlloyDB with `OperationalError: server closed the connection unexpectedly`. The database remains on alembic revision `c440947495f3` (pre-migration), with the `chat_message` table partially created but the backfill incomplete and rolled back. During the entire migration window, the application is unusable — all backend requests hang, likely because the service also attempts to run migrations on startup and blocks on the open transaction.
Environment
europe-west3

Root Cause Analysis
PR #21542 (commit `b4f3408`) fixed the original `fetchall()` pathology and reduced Python heap pressure, but it did not make the migration operationally safe for large PostgreSQL datasets. There are three interacting problems:

1. Single unbounded transaction
The upstream `env.py` wraps all migrations in a single transaction (see the sketch after this paragraph). The `_flush_batch()` function uses `conn.begin_nested()` (savepoints), but these are nested within this outer transaction — no data is actually committed to disk until the migration function returns. The migration inserts millions of rows with 7 indexes and an FK constraint in a single transaction, generating enormous WAL volume. While PostgreSQL checkpoints can still advance and recycle old WAL segments during the transaction (the redo pointer and `KeepLogSeg` are not constrained by open user transactions), new WAL is produced far faster than recycling can reclaim space. For our dataset (68 GB, 161k chats), the migration ran for 55 minutes at sustained high write throughput until AlloyDB terminated the connection — likely due to managed-service limits on connection duration, write throughput, or storage pressure.
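For context, the stock Alembic online-migration template has roughly this shape — a sketch of the standard pattern, not a verbatim copy of Open WebUI's `env.py`:

```python
from alembic import context
from sqlalchemy import engine_from_config, pool

def run_migrations_online() -> None:
    # Engine built from the [alembic] section of alembic.ini.
    connectable = engine_from_config(
        context.config.get_section(context.config.config_ini_section),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    with connectable.connect() as connection:
        context.configure(connection=connection)
        # One outer transaction around *all* pending migrations:
        # nothing is durably committed until run_migrations() returns.
        with context.begin_transaction():
            context.run_migrations()
```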
2. Foreign key and indexes created before the backfill
The migration creates the `chat_message` table with a `FOREIGN KEY` constraint on `chat.id`, 4 single-column indexes, and 3 composite indexes — all before inserting any data. This means every insert during the backfill must:

- update all 7 indexes, row by row
- validate the FK by probing the `chat.id` index for every inserted row, adding I/O overhead per row

In contrast, a post-load `CREATE INDEX` uses sorted sequential bulk construction with batched WAL records (`btbuild`), which is orders of magnitude more efficient. PostgreSQL's bulk loading documentation explicitly warns against this pattern: create indexes and add FK constraints after the data load, not before.
3. Server-side cursors invalidated by COMMIT
The migration uses `yield_per=1000` with `stream_results=True`, which on PostgreSQL creates a server-side cursor. Psycopg's server-side cursors are `WITHOUT HOLD` by default, meaning they stop being fetchable after COMMIT. This makes a naive "just add commits inside the loop" patch fundamentally incorrect — the streaming cursor becomes invalid after the first commit. A correct fix needs either keyset pagination (a fresh `SELECT` per chunk, ordered by primary key with a `WHERE id > last_seen_id`) or separate read/write connections with holdable cursor semantics.
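The failure mode is easy to demonstrate in isolation — a minimal sketch using psycopg2 directly (the DSN is a placeholder):

```python
import psycopg2

conn = psycopg2.connect("dbname=openwebui")  # placeholder DSN
cur = conn.cursor(name="stream")  # named cursor = server-side, WITHOUT HOLD
cur.execute("SELECT id FROM chat ORDER BY id")
first_batch = cur.fetchmany(1000)  # works: the portal is still open

conn.commit()  # COMMIT closes all WITHOUT HOLD cursors

cur.fetchmany(1000)  # raises ProgrammingError: the named cursor no longer exists
```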
Why this went undetected
This issue likely went undetected upstream because SQLite is the default backend and does not exhibit the same WAL, FK-trigger, or server-side cursor characteristics as PostgreSQL. Testing with small datasets (a few thousand chats) would also not surface the problem.
Steps to Reproduce

1. Seed a large `chat` table (see "Generating test data" below) and upgrade from v0.7.2 to v0.8.12
2. Migration `8452d01d26d7` starts, streams chat rows, and inserts into `chat_message`
3. After ~55 minutes the server closes the connection and the transaction rolls back
4. `chat_message` is empty, alembic version unchanged

Generating test data
To reproduce at scale, you need to seed the `chat` table with large JSON payloads. An important caveat: PostgreSQL TOAST compression is extremely effective on repetitive text (e.g. Lorem ipsum). Using Lorem ipsum as padding results in ~1 GB on disk instead of the intended 68 GB. Use `os.urandom().hex()` or similar random data for the message content to avoid compression and achieve realistic on-disk sizes. A simple seeding approach uses direct SQL inserts with a size distribution matching a real production dataset; a sketch follows.
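A minimal seeding sketch along those lines — the DSN, row count, and payload sizes are placeholder assumptions, and the column list follows the `chat` model (adjust to the actual schema; the real run used a production-matched size distribution):

```python
import json
import os
import time
import uuid

import psycopg2

conn = psycopg2.connect("dbname=openwebui")  # placeholder DSN
cur = conn.cursor()
now = int(time.time())

for i in range(100_000):  # placeholder row count; scale to the target table size
    # Random hex defeats TOAST compression, so on-disk size tracks payload size.
    messages = [
        {"role": "user" if n % 2 == 0 else "assistant",
         "content": os.urandom(4096).hex()}
        for n in range(8)
    ]
    cur.execute(
        "INSERT INTO chat (id, user_id, title, chat, created_at, updated_at) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (str(uuid.uuid4()), "seed-user", f"seed chat {i}",
         json.dumps({"messages": messages}), now, now),
    )
    if i % 1000 == 999:
        conn.commit()

conn.commit()
conn.close()
```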
Logs
Proposed Fix
The migration needs to be restructured to follow PostgreSQL's bulk loading guidance. The core changes (see the SQL sketch after this list):

- Commit in chunks. Server-side cursors are `WITHOUT HOLD` and become invalid after `COMMIT`, so paginate explicitly with `WHERE id > :last_id ORDER BY id LIMIT :chunk_size`
- Make inserts idempotent (`ON CONFLICT DO NOTHING`) — so the migration can be safely restarted after a mid-run failure
- Create the indexes after the data load (`CREATE INDEX CONCURRENTLY` if the service must remain online, though note this cannot run inside a transaction block)
- Add the FK as `NOT VALID`, then `VALIDATE CONSTRAINT` — `ALTER TABLE ADD CONSTRAINT ... NOT VALID` skips validation of existing rows during creation; a subsequent `VALIDATE CONSTRAINT` checks them without holding an `ACCESS EXCLUSIVE` lock
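In SQL terms, the post-load steps might look like this (the index and constraint names are illustrative, not the migration's actual identifiers):

```sql
-- After the backfill has committed in chunks:
CREATE INDEX CONCURRENTLY chat_message_chat_id_idx
    ON chat_message (chat_id);              -- cannot run inside a transaction block

ALTER TABLE chat_message
    ADD CONSTRAINT chat_message_chat_id_fkey
    FOREIGN KEY (chat_id) REFERENCES chat (id)
    NOT VALID;                              -- skips scanning existing rows

ALTER TABLE chat_message
    VALIDATE CONSTRAINT chat_message_chat_id_fkey;
    -- checks existing rows without an ACCESS EXCLUSIVE lock
```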
Approach 1: Rewrite the migration (recommended upstream fix)
If the backfill remains inside Alembic, the migration should use `op.get_context().autocommit_block()` or a separate connection strategy to commit chunks independently. The backfill loop must not rely on one long-lived streaming cursor if chunk commits are introduced (see root cause #3 above). High-level structure (a sketch follows the list):

1. Create `chat_message` with PK only (no indexes, no FK)
2. Backfill in keyset-paginated chunks (`WHERE id > :last_id ORDER BY id LIMIT :chunk_size`), parse messages, insert with `ON CONFLICT (id) DO NOTHING`, commit after each chunk
3. Create the indexes
4. Add the FK with `ALTER TABLE chat_message ADD CONSTRAINT ... FOREIGN KEY ... NOT VALID`, then `VALIDATE CONSTRAINT` separately
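A condensed sketch of step 2, assuming a hypothetical `extract_messages()` parser and an illustrative `chat_message` column set (the real migration's parsing logic and columns differ):

```python
import json

import sqlalchemy as sa
from alembic import op


def extract_messages(chat_json):
    # Hypothetical parser: the real migration walks the chat JSON structure.
    data = chat_json if isinstance(chat_json, dict) else json.loads(chat_json)
    return data.get("messages") or []


def backfill_chat_messages(chunk_size: int = 5000) -> None:
    conn = op.get_bind()
    last_id = ""
    with op.get_context().autocommit_block():  # statements commit independently
        while True:
            # Fresh SELECT per chunk: keyset pagination, no long-lived cursor.
            rows = conn.execute(
                sa.text("SELECT id, chat FROM chat "
                        "WHERE id > :last_id ORDER BY id LIMIT :n"),
                {"last_id": last_id, "n": chunk_size},
            ).fetchall()
            if not rows:
                break
            params = [
                {"id": f"{chat_id}-{i}",  # illustrative message-id scheme
                 "chat_id": chat_id,
                 "role": msg.get("role"),
                 "content": msg.get("content")}
                for chat_id, chat_json in rows
                for i, msg in enumerate(extract_messages(chat_json))
            ]
            if params:
                # Idempotent insert: safe to re-run after a mid-run crash.
                conn.execute(
                    sa.text("INSERT INTO chat_message (id, chat_id, role, content) "
                            "VALUES (:id, :chat_id, :role, :content) "
                            "ON CONFLICT (id) DO NOTHING"),
                    params,
                )
            last_id = rows[-1][0]
```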
Approach 2: Separate backfill from schema migration
Use Alembic only for schema creation (table + PK). Run the backfill as a dedicated resumable job outside of Alembic, then stamp the revision manually. This is the safest option for very large datasets and follows PostgreSQL's documented bulk-load guidance most closely.
Supporting change: `transaction_per_migration=True` in `env.py`
Adding `transaction_per_migration=True` to `context.configure()` in `env.py` is recommended by Alembic when migrations use autocommit blocks. This is supporting infrastructure — it does not fix the migration itself, but prevents other migrations in the upgrade chain from being affected.
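In `env.py` this is a one-line addition to the existing `configure()` call — a sketch assuming the stock template shape (other arguments elided):

```python
from alembic import context

# Inside run_migrations_online(), with `connection` from connectable.connect():
context.configure(
    connection=connection,
    transaction_per_migration=True,  # each migration gets its own transaction
)
```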
Additional Context

- The backfill generates new WAL and crosses `max_wal_size` thresholds faster than recycling can keep up.
- To quantify the WAL volume, capture `pg_stat_wal.wal_bytes`, `wal_records`, and `wal_fpi` before and after a test run, or use `pg_stat_get_backend_wal()` for the migration backend where available. `pg_stat_wal` is cumulative since the last stats reset, so a before/after delta is required (see the snippet after this list).
- The `chat` column uses `json` (not `jsonb`), as created by migration `242a2047eae0`. This is relevant for any server-side SQL approach: `jsonb_each()` cannot operate on `json` directly — either `json_each()` or an explicit cast to `jsonb` is required, and the cast adds CPU overhead at scale.
- Tuning `BATCH_SIZE` higher or lower does not help — the issue is the combination of the single transaction boundary, FK constraint validation, and index maintenance during the load.
- Raising `maintenance_work_mem` and `max_wal_size` (where the provider allows it) may reduce pain for smaller datasets but does not fix the structural problem in the migration.
- With chunked `COMMIT`/`BEGIN` inside the migration, if the process crashes mid-way, `alembic_version` still points to the pre-migration revision (it is only updated after `upgrade()` returns). A restart would re-run the migration from scratch. The `ON CONFLICT DO NOTHING` clause makes this safe — already-inserted messages are skipped, so the backfill is fully resumable.
- The earlier migration `242a2047eae0_update_chat_table.py` (which converts the `chat` column from `Text` to `JSON`) also reads every row and writes them back in a single transaction. It presumably succeeded when running against smaller datasets but would exhibit the same WAL problem at scale.
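A minimal way to take that delta (PostgreSQL 14+, where `pg_stat_wal` exists):

```sql
-- Before the test run:
SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;

-- ... run the migration ...

-- After the test run; subtract the earlier values to get the migration's WAL cost:
SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;
```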
Methodology
This analysis was produced with the help of GPT 5.4 xhigh, Claude Opus 4.6, and GPT 5.4 Extended Pro (30+ minutes thinking time). Each model was provided with extensive context including the full migration source code (`8452d01d26d7_add_chat_message_table.py`), related issues (#23134, #22568), the batching fix commit (`b4f3408`), PR #21542, live database diagnostics (table sizes, WAL metrics, TOAST analysis), AlloyDB documentation, and PostgreSQL source code (`src/backend/access/transam/xlog.c`, `src/backend/commands/trigger.c`, `src/backend/utils/mmgr/portalmem.c`, `src/backend/executor/execIndexing.c`, `src/backend/access/nbtree/nbtinsert.c`). All claims were verified against the PostgreSQL source tree. All three models independently identified the same root causes: per-row index write amplification and unbounded transaction duration, not the Python-side batching.

@pr-validator-bot commented on GitHub (Mar 31, 2026):
⚠️ Missing Issue Title Prefix
@madnight, your issue title is missing a prefix (e.g., `bug:`, `feat:`, `docs:`). Please update your issue title to include one of the following prefixes:

Example: `bug: Login fails when using special characters in password`

@Classic298 commented on GitHub (Mar 31, 2026):
I want to say this is wrong, because we had many people here who previously ran out of memory on 100 GB+ migrations, with memory usage of 40 GB+. And with the current changes it works for them just fine on PostgreSQL.
@madnight commented on GitHub (Mar 31, 2026):
@Classic298 I suggest that you either attempt to reproduce the issue using the steps provided or enter the Issue Details I have provided into your own Claude Code to verify their plausibility.
@Classic298 commented on GitHub (Apr 14, 2026):
@madnight action wanted on the PR, thanks
@madnight commented on GitHub (Apr 14, 2026):
Hi @Classic298 @tjbck ,
I have some updates. We got the migration working as-is and migrated DEV and PROD to 0.8.12. For it to work, we had to scale up the database to 16 cores and 128 GB of DB memory, which is easily doable in the cloud but might not be doable if someone is operating OWUI on a dedicated server, say a single machine on Hetzner. Depending on the exact number of chat entries and the size of the chat table, the migration for a 50 GB+ chat table still runs somewhere between 1 and 3 hours, during which you basically have downtime for the whole thing. If you have an even bigger table, say in the scale of 200 GB+, I would guesstimate you need at least 265 GB of DB memory and up to 6 hours of migration time with the current migration in 0.8.12.
For our purposes, this workaround is sufficient at this point. Hence, I would invite feedback from the community on whether others with large Postgres instances are seeing similar issues.
@Classic298 commented on GitHub (Apr 14, 2026):
Hm great news for you, but since nobody else is in this issue and WE cannot test this, there is now nobody to test the PR.
@madnight commented on GitHub (Apr 14, 2026):
If nobody else has this issue, then I think it would be fine to close it for now until someone else stumbles upon it in a discussion or a new issue. In that case we already have a deep issue analysis + PR ready and could request testing.