[GH-ISSUE #23134] issue: [Bug] Database migration chat → chat_message OOM-kills on large datasets (PostgreSQL/AlloyDB, ~75 GB) #58557

Closed
opened 2026-05-05 23:25:08 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @adesso-pia-vonkolken on GitHub (Mar 27, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23134

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.8.5

Ollama Version (if applicable)

No response

Operating System

Google Cloud Run Service

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

Environment:

  • Open WebUI version: Upgrading from v0.7.2 → v0.8.5 (also tested with cherry-picked commit b4f3408 from PR #21542)
  • Deployment: Google Cloud Run (Cloud Run Job for migration, 34 GB memory)
  • Database: AlloyDB for PostgreSQL (managed PostgreSQL-compatible)
  • Dataset size: ~75 GB of chat data

Upgrading from v0.7.2 to v0.8.5 (or later) should run the Alembic database migration — including the chat → chat_message table migration — successfully without crashing, regardless of dataset size.

Actual Behavior

The built-in Alembic migration hangs and is ultimately terminated with Signal 9 (OOM-kill) during the Add chat_message table migration step. This occurs even when running the migration in a dedicated Cloud Run Job with 34 GB of memory allocated, which we set up specifically to avoid disrupting the production service.
The migration makes no progress after starting the Add chat_message table step and eventually causes an out-of-memory crash.

The migration runs for approximately 15 minutes between the Add chat_message table log line and the OOM-kill, with no further output.

Steps to Reproduce

  1. Run Open WebUI v0.7.2 with a large PostgreSQL/AlloyDB database (~75 GB of chat data)
  2. Attempt to upgrade to v0.8.5 (or apply cherry-picked commit b4f3408 from PR #21542)
  3. Allow the built-in Alembic migration to run
  4. Observe that the migration stalls at the Add chat_message table step and is eventually killed (OOM)

Logs & Screenshots

INFO:open_webui.internal.db:Starting migrations
INFO:open_webui.internal.db:There is nothing to migrate
INFO:open_webui.env:Running migrations
INFO:alembic.runtime.plugins:setup plugin alembic.autogenerate.schemas
[... alembic plugin setup ...]
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.runtime.migration] Running upgrade a5c220713937 -> b7d2df9e1ab3, Add api_key column to user table if missing
INFO [alembic.runtime.migration] Running upgrade c440947495f3 -> 374d2f66af06, Add prompt history table
INFO [alembic.runtime.migration] Running upgrade 374d2f66af06 -> 8452d01d26d7, Add chat_message table
WARNING Container terminated on signal 9.

Image

Additional Information

We are unable to upgrade to any version beyond v0.7.2 in production due to this issue. The migration fails consistently, blocking all version updates.

Originally created by @adesso-pia-vonkolken on GitHub (Mar 27, 2026). Original GitHub issue: https://github.com/open-webui/open-webui/issues/23134 ### Check Existing Issues - [x] I have searched for any existing and/or related issues. - [x] I have searched for any existing and/or related discussions. - [x] I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!). - [x] I am using the latest version of Open WebUI. ### Installation Method Docker ### Open WebUI Version v0.8.5 ### Ollama Version (if applicable) _No response_ ### Operating System Google Cloud Run Service ### Browser (if applicable) _No response_ ### Confirmation - [x] I have read and followed all instructions in `README.md`. - [x] I am using the latest version of **both** Open WebUI and Ollama. - [x] I have included the browser console logs. - [x] I have included the Docker container logs. - [x] I have **provided every relevant configuration, setting, and environment variable used in my setup.** - [x] I have clearly **listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup** (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc). - [x] I have documented **step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation**. My steps: - Start with the initial platform/version/OS and dependencies used, - Specify exact install/launch/configure commands, - List URLs visited, user input (incl. example values/emails/passwords if needed), - Describe all options and toggles enabled or changed, - Include any files or environmental changes, - Identify the expected and actual result at each stage, - Ensure any reasonably skilled user can follow and hit the same issue. ### Expected Behavior Environment: - Open WebUI version: Upgrading from v0.7.2 → v0.8.5 (also tested with cherry-picked commit b4f3408 from PR #21542) - Deployment: Google Cloud Run (Cloud Run Job for migration, 34 GB memory) - Database: AlloyDB for PostgreSQL (managed PostgreSQL-compatible) - Dataset size: ~75 GB of chat data Upgrading from v0.7.2 to v0.8.5 (or later) should run the Alembic database migration — including the chat → chat_message table migration — successfully without crashing, regardless of dataset size. ### Actual Behavior The built-in Alembic migration hangs and is ultimately terminated with Signal 9 (OOM-kill) during the Add chat_message table migration step. This occurs even when running the migration in a dedicated Cloud Run Job with 34 GB of memory allocated, which we set up specifically to avoid disrupting the production service. The migration makes no progress after starting the Add chat_message table step and eventually causes an out-of-memory crash. The migration runs for approximately 15 minutes between the Add chat_message table log line and the OOM-kill, with no further output. ### Steps to Reproduce 1. Run Open WebUI v0.7.2 with a large PostgreSQL/AlloyDB database (~75 GB of chat data) 2. Attempt to upgrade to v0.8.5 (or apply cherry-picked commit b4f3408 from PR #21542) 3. Allow the built-in Alembic migration to run 4. Observe that the migration stalls at the Add chat_message table step and is eventually killed (OOM) ### Logs & Screenshots INFO:open_webui.internal.db:Starting migrations INFO:open_webui.internal.db:There is nothing to migrate INFO:open_webui.env:Running migrations INFO:alembic.runtime.plugins:setup plugin alembic.autogenerate.schemas [... alembic plugin setup ...] INFO [alembic.runtime.migration] Context impl PostgresqlImpl. INFO [alembic.runtime.migration] Will assume transactional DDL. INFO [alembic.runtime.migration] Running upgrade a5c220713937 -> b7d2df9e1ab3, Add api_key column to user table if missing INFO [alembic.runtime.migration] Running upgrade c440947495f3 -> 374d2f66af06, Add prompt history table INFO [alembic.runtime.migration] Running upgrade 374d2f66af06 -> 8452d01d26d7, Add chat_message table WARNING Container terminated on signal 9. <img width="811" height="250" alt="Image" src="https://github.com/user-attachments/assets/e1aeef3d-dbd1-46cf-aa68-44d411bc109b" /> ### Additional Information We are unable to upgrade to any version beyond v0.7.2 in production due to this issue. The migration fails consistently, blocking all version updates.
GiteaMirror added the bug label 2026-05-05 23:25:08 -05:00
Author
Owner

@Classic298 commented on GitHub (Mar 27, 2026):

0.8.5 is outdated, newer versions introduced batched processing for this migration. Please read the changelogs and use the newer version with batched migration

<!-- gh-comment-id:4140947719 --> @Classic298 commented on GitHub (Mar 27, 2026): 0.8.5 is outdated, newer versions introduced batched processing for this migration. Please read the changelogs and use the newer version with batched migration
Author
Owner

@Classic298 commented on GitHub (Mar 27, 2026):

0.8.9 and newer (best to just try 0.8.12) has brought batched perf improvements to the migration as per the changelogs

<!-- gh-comment-id:4140955198 --> @Classic298 commented on GitHub (Mar 27, 2026): 0.8.9 and newer (best to just try 0.8.12) has brought batched perf improvements to the migration as per the changelogs
Author
Owner

@adesso-pia-vonkolken commented on GitHub (Mar 27, 2026):

We already cherry-picked these changes onto v0.8.5 from the dev branch, but they did not help us. As far as I can tell, aside from this commit — b4f340806a — no further significant optimization changes were made to the migration. Cherry-picking this commit did not resolve the issue for us either.
Commit 06657b8109 fixes an AttributeError that occurs when history/messages fields contain lists instead of dicts — this does not apply to our case, as we did not encounter this error.

<!-- gh-comment-id:4140982154 --> @adesso-pia-vonkolken commented on GitHub (Mar 27, 2026): We already cherry-picked these changes onto v0.8.5 from the dev branch, but they did not help us. As far as I can tell, aside from this commit — b4f340806a4a0157bdd258b5acaa89c8a048be07 — no further significant optimization changes were made to the migration. Cherry-picking this commit did not resolve the issue for us either. Commit 06657b81097b17d101699c75089ec2fde004ae87 fixes an AttributeError that occurs when history/messages fields contain lists instead of dicts — this does not apply to our case, as we did not encounter this error.
Author
Owner

@Classic298 commented on GitHub (Mar 27, 2026):

With how much memory exactly are you running oom? With a 75GB large chat message table, even with batched processing, you are bound to have some memory growth. How much memory did you allocate to Open WebUI specifically?

<!-- gh-comment-id:4141040334 --> @Classic298 commented on GitHub (Mar 27, 2026): With how much memory exactly are you running oom? With a 75GB large chat message table, even with batched processing, you are bound to have some memory growth. How much memory did you allocate to **Open WebUI specifically**?
Author
Owner

@Classic298 commented on GitHub (Mar 27, 2026):

And how did you cherry pick the performance optimizations? How did you deploy them? To be sure you actually have the right changes, it'd be needed to actually run the newer version I mentioned.

If you modify the file inside the docker, the modification might get lost on up -d or however Google Cloud Run handles it

And did you ensure the cherry picked edits were still there during runtime? and did you cherry pick the FULL CHANGES to the file?

<!-- gh-comment-id:4141049115 --> @Classic298 commented on GitHub (Mar 27, 2026): And how did you cherry pick the performance optimizations? How did you deploy them? To be sure you actually have the right changes, it'd be needed to actually run the newer version I mentioned. If you modify the file inside the docker, the modification might get lost on up -d or however Google Cloud Run handles it And did you ensure the cherry picked edits were still there during runtime? and did you cherry pick the FULL CHANGES to the file?
Author
Owner

@Classic298 commented on GitHub (Mar 27, 2026):

I just had two more agents verify the current code.

It is IMPOSSIBLE to have continuosly growing memory, especially 30+ Gigabyte (i now found the metric in your issue) during the migration.

You just might not have the changes applied. Your cherry picks may not have worked or weren't there during runtime.

Please try, as i said, with the latest version - it must work.

<!-- gh-comment-id:4141100564 --> @Classic298 commented on GitHub (Mar 27, 2026): I just had two more agents verify the current code. It is IMPOSSIBLE to have continuosly growing memory, especially 30+ Gigabyte (i now found the metric in your issue) during the migration. You just might not have the changes applied. Your cherry picks may not have worked or weren't there during runtime. Please try, as i said, with the latest version - it must work.
Author
Owner

@Classic298 commented on GitHub (Mar 27, 2026):

How exactly did you cherry-pick the changes into your Cloud Run deployment?

To get a cherry-pick running on Cloud Run, you would have needed to:

  • Clone the repo and check out v0.8.5
  • Cherry-pick the commit (and resolve any conflicts)
  • Build a new Docker image
  • Push it to Artifact Registry
  • Deploy that new image to your Cloud Run Job (not just the service)

Specifically — did you update the container image on the Cloud Run Job you're using for migration, or only on the Cloud Run Service?

Did you verify the patched code is actually running at runtime?

Cloud Run is immutable. It runs whatever is baked into the container image. If the image wasn't rebuilt correctly, or if a cached Docker layer was used, or if the job is still pointing at the old image, the fix simply won't be there. The memory profile you're showing (linear ramp to OOM) is exactly what the unpatched code looks like. The patched version streams rows and flushes batches, so memory must stay flat.

The easiest path forward: Rather than cherry-picking onto v0.8.5, just deploy v0.8.12 directly. It includes this fix and several other improvements. That eliminates any risk of an incomplete or conflicting cherry-pick.

<!-- gh-comment-id:4141109506 --> @Classic298 commented on GitHub (Mar 27, 2026): How exactly did you cherry-pick the changes into your Cloud Run deployment? To get a cherry-pick running on Cloud Run, you would have needed to: - Clone the repo and check out v0.8.5 - Cherry-pick the commit (and resolve any conflicts) - Build a new Docker image - Push it to Artifact Registry - Deploy that new image to your Cloud Run Job (not just the service) Specifically — did you update the container image on the Cloud Run Job you're using for migration, or only on the Cloud Run Service? **Did you verify the patched code is actually running at runtime?** Cloud Run is immutable. It runs whatever is baked into the container image. If the image wasn't rebuilt correctly, or if a cached Docker layer was used, or if the job is still pointing at the old image, the fix simply won't be there. The memory profile you're showing (linear ramp to OOM) **is exactly what the unpatched code looks like**. The patched version streams rows and flushes batches, so memory must stay flat. The easiest path forward: Rather than cherry-picking onto v0.8.5, just deploy v0.8.12 directly. It includes this fix and several other improvements. That eliminates any risk of an incomplete or conflicting cherry-pick.
Author
Owner

@adesso-pia-vonkolken commented on GitHub (Mar 27, 2026):

We did exactly the steps that you mentioned:

  • Clone the repo and check out v0.8.5
  • Cherry-pick the commit (and resolve any conflicts)
  • Build a new Docker image
  • Push it to Artifact Registry
  • Deploy that new image to your Cloud Run Job (not just the service - actually we did not deploy to the service as we did not wanted our production environment to break)

As far as I could see the newest image was used including the cherry pick, but as you mentioned Cloud Run Jobs are black boxes regarding the used code.
We will try using v0.8.12 directly, hopefully this will help.

<!-- gh-comment-id:4141201427 --> @adesso-pia-vonkolken commented on GitHub (Mar 27, 2026): We did exactly the steps that you mentioned: - Clone the repo and check out v0.8.5 - Cherry-pick the commit (and resolve any conflicts) - Build a new Docker image - Push it to Artifact Registry - Deploy that new image to your Cloud Run Job (not just the service - actually we did not deploy to the service as we did not wanted our production environment to break) As far as I could see the newest image was used including the cherry pick, but as you mentioned Cloud Run Jobs are black boxes regarding the used code. We will try using v0.8.12 directly, hopefully this will help.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/open-webui#58557