[GH-ISSUE #18933] issue: Critical SQLAlchemy Session Bug: "Could not refresh instance" #34250

Closed
opened 2026-04-25 08:09:57 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @aidenpearce001 on GitHub (Nov 4, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/18933

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Git Clone

Open WebUI Version

v0.6.34

Ollama Version (if applicable)

No response

Operating System

Ubuntu 22.04

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

When creating users via the API endpoint /api/v1/auths/add or creating chats via /api/v1/chats/new, the operations should succeed consistently with near 100% success rate under normal load conditions.

Expected:

  • User creation API: 100% success rate (or 95%+ with network issues)
  • Chat creation API: 100% success rate
  • Database operations should be reliable and deterministic
  • No random "internal error" messages

Actual Behavior

Database write operations fail randomly with 30-60% failure rate, returning:

{
  "detail": "An internal error occurred while adding the user."
}

Observed failure rates across multiple test runs:

  • User creation: 10-60% success rate (average 30-40%)
  • Chat creation: 0-26% success rate (average 20%)
  • Pattern: Non-deterministic - same operation succeeds sometimes, fails other times

Error in logs:

sqlalchemy.exc.InvalidRequestError: Could not refresh instance '<User at 0x7676147c7850>'
sqlalchemy.exc.InvalidRequestError: Could not refresh instance '<Chat at 0x70e8c9f93f90>'

This makes OpenWebUI unreliable for:

  • Multi-user deployments
  • Bulk user provisioning
  • API-driven user management
  • Production environments requiring reliability

Steps to Reproduce

Environment Setup

OpenWebUI Configuration:

Image: ghcr.io/open-webui/open-webui:main
Deployment: Kubernetes (6 replicas with HPA)
Database: PostgreSQL 17.6 via PgPool-II

Environment Variables:
  DATABASE_URL: postgresql://postgres:password@pgpool:5432/openwebui
  DATABASE_POOL_SIZE: "40"
  DATABASE_POOL_MAX_OVERFLOW: "20"
  DATABASE_POOL_RECYCLE: "240"
  DATABASE_POOL_PRE_PING: "true"
  DATABASE_POOL_TIMEOUT: "30"

Database Infrastructure:

PostgreSQL 17.6 HA (Bitnami Helm Chart):
  Replication: Streaming (1 primary + 1 replica)
  Max connections: 300
  Settings:
    - idle_in_transaction_session_timeout: 5min
    - statement_timeout: 1min
    - tcp_keepalives_idle: 60
    - tcp_keepalives_interval: 10

PgPool-II 4.6.3:
  Instances: 2
  Configuration:
    - num_init_children: 100
    - max_pool: 4
    - connection_life_time: 600s (10 minutes)
    - client_idle_limit: 300s (5 minutes)
    - load_balancing_mode: transaction

Reproduction Steps

Test 1: User Creation (Simplest)

# Run this 10 times in succession:
for i in {1..10}; do
  curl -s 'https://your-openwebui-url/api/v1/auths/add' \
    -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $ADMIN_TOKEN" \
    -d "{
      \"name\": \"Test User $i\",
      \"email\": \"test$i@example.com\",
      \"password\": \"TestPass123!\",
      \"role\": \"user\"
    }" | jq '.id // .detail'

  sleep 0.2
done

Expected Result: 10 successful user creations
Actual Result: 3-6 successful, 4-7 failures (30-60% success rate)

Test 2: Concurrent User Creation (Stress)

# Create 10 users concurrently:
for i in {1..10}; do
  curl -s 'https://your-openwebui-url/api/v1/auths/add' \
    -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $ADMIN_TOKEN" \
    -d "{
      \"name\": \"Concurrent User $i\",
      \"email\": \"concurrent$i@example.com\",
      \"password\": \"TestPass123!\",
      \"role\": \"user\"
    }" &
done
wait

Expected Result: 10 successful users
Actual Result: 1-3 successful, 7-9 failures (10-30% success rate)

Test 3: Chat Creation

# After creating a user, try creating multiple chats:
for i in {1..5}; do
  curl -s 'https://your-openwebui-url/api/v1/chats/new' \
    -X POST \
    -H "Authorization: Bearer $USER_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "{
      \"chat\": {
        \"title\": \"Test Chat $i\",
        \"models\": [\"gpt-3.5-turbo\"]
      }
    }" | jq '.id // .detail'

  sleep 0.3
done

Expected Result: 5 chats created
Actual Result: 0-2 chats created, rest fail

Key Observations

  1. Timing-dependent: Same request succeeds/fails randomly
  2. Not load-related: Fails even with delays between requests
  3. Not infrastructure: Happens even after fresh pod restarts
  4. Multiple operations affected: Users, chats, likely all DB writes
  5. Connection pool independent: Fails regardless of pool size (tested 5-40)

Logs & Screenshots

User Creation Failure (from OpenWebUI logs):

2025-11-04 18:52:46.047 | ERROR | open_webui.routers.auths:add_user:781 - Add user error: Could not refresh instance '<User at 0x7676147c7850>'
2025-11-04 18:52:37.342 | ERROR | open_webui.routers.auths:add_user:781 - Add user error: Could not refresh instance '<User at 0x79a3c1890590>'
2025-11-04 18:52:38.553 | ERROR | open_webui.routers.auths:add_user:781 - Add user error: Could not refresh instance '<Auth at 0x79a3c187d550>'
2025-11-04 18:52:43.613 | ERROR | open_webui.routers.auths:add_user:781 - Add user error: Could not refresh instance '<User at 0x79a3c16cf5d0>'

Chat Creation Failure (with full stack trace):

2025-11-04 20:05:20.184690471+01:00 stdout F sqlalchemy.exc.InvalidRequestError: Could not refresh instance '<Chat at 0x70e8c9f93f90>'
2025-11-04 20:05:20.184661044+01:00 stdout F     raise sa_exc.InvalidRequestError(
2025-11-04 20:05:20.184432502+01:00 stdout F   File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
2025-11-04 20:05:20.184367022+01:00 stdout F     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
2025-11-04 20:05:20.184110815+01:00 stdout F   File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
2025-11-04 20:05:20.184014941+01:00 stdout F   File "/usr/local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__

Database Connection Termination (PostgreSQL logs):

2025-11-04 18:57:25.021 GMT [65419] FATAL:  terminating connection due to idle-session timeout
2025-11-04 18:58:15.012 GMT [65740] FATAL:  terminating connection due to idle-session timeout
2025-11-04 18:58:35.000 GMT [65865] FATAL:  terminating connection due to idle-session timeout

(Note: This happened when we tried setting idle_session_timeout to fix the issue - it made it worse)

Related Error (LiteLLM on same infrastructure):

prisma.errors.DataError: Can't reach database server at `postgresql-ha-pgpool.database.svc.cluster.local`:`5432`

Additional Information

Based on extensive testing and log analysis, the issue appears to be in SQLAlchemy session management:

Suspected code pattern (pseudocode from logs):

# In open_webui/routers/auths.py:781 and similar locations
db.add(user)
db.commit()        # ← Transaction commits, may close/detach session
db.refresh(user)   # ← FAILS because session is closed or object is detached

Why it fails:

  1. After commit(), SQLAlchemy may close the session or detach objects
  2. When connections are recycled from the pool, session state becomes invalid
  3. Calling refresh() on a detached object raises InvalidRequestError
  4. The timing is random depending on connection pool state

Infrastructure Changes Tested (All Failed to Fix)

To rule out infrastructure issues, we extensively tested:

1. Connection Pool Optimization:

  • Increased DATABASE_POOL_SIZE from 5 → 40
  • Increased DATABASE_POOL_MAX_OVERFLOW from 10 → 20
  • Set DATABASE_POOL_RECYCLE to 240s (4 minutes)
  • Enabled DATABASE_POOL_PRE_PING to test connections
  • Result: Still 30-60% failure rate

2. PgPool Configuration:

  • Set connection_life_time to 600s (was 0/infinite)
  • Set client_idle_limit to 300s (was 0/infinite)
  • Configured proper health checks
  • Load balancing mode: transaction
  • Result: Still 30-60% failure rate

3. PostgreSQL Timeouts:

  • idle_in_transaction_session_timeout: 5min (killed stuck transactions)
  • statement_timeout: 1min (killed long queries)
  • idle_session_timeout: 10min → Made it WORSE (killed connections mid-use)
  • idle_session_timeout: 30min → Still failed
  • idle_session_timeout: 0 (disabled) → Still failed
  • Result: Configuration doesn't help, proves it's not a timeout issue

4. Scaling Tests:

  • Tested with 2 pods: 20% success
  • Tested with 5 pods: 60% success
  • Tested with 6 pods: 30% success
  • Tested with 10 pods: 50% success
  • Result: Inconsistent, not related to load/capacity

5. Fresh Connection Tests:

  • Restarted OpenWebUI pods
  • Restarted PostgreSQL cluster
  • Cleared all connection pools
  • Tested immediately after restart
  • Result: Still fails on first requests with brand new connections
Originally created by @aidenpearce001 on GitHub (Nov 4, 2025). Original GitHub issue: https://github.com/open-webui/open-webui/issues/18933 ### Check Existing Issues - [x] I have searched for any existing and/or related issues. - [x] I have searched for any existing and/or related discussions. - [x] I am using the latest version of Open WebUI. ### Installation Method Git Clone ### Open WebUI Version v0.6.34 ### Ollama Version (if applicable) _No response_ ### Operating System Ubuntu 22.04 ### Browser (if applicable) _No response_ ### Confirmation - [x] I have read and followed all instructions in `README.md`. - [x] I am using the latest version of **both** Open WebUI and Ollama. - [x] I have included the browser console logs. - [x] I have included the Docker container logs. - [x] I have **provided every relevant configuration, setting, and environment variable used in my setup.** - [x] I have clearly **listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup** (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc). - [x] I have documented **step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation**. My steps: - Start with the initial platform/version/OS and dependencies used, - Specify exact install/launch/configure commands, - List URLs visited, user input (incl. example values/emails/passwords if needed), - Describe all options and toggles enabled or changed, - Include any files or environmental changes, - Identify the expected and actual result at each stage, - Ensure any reasonably skilled user can follow and hit the same issue. ### Expected Behavior When creating users via the API endpoint `/api/v1/auths/add` or creating chats via `/api/v1/chats/new`, the operations should succeed consistently with near 100% success rate under normal load conditions. **Expected:** - User creation API: 100% success rate (or 95%+ with network issues) - Chat creation API: 100% success rate - Database operations should be reliable and deterministic - No random "internal error" messages ### Actual Behavior Database write operations fail randomly with 30-60% failure rate, returning: ```json { "detail": "An internal error occurred while adding the user." } ``` **Observed failure rates across multiple test runs:** - User creation: 10-60% success rate (average 30-40%) - Chat creation: 0-26% success rate (average 20%) - Pattern: Non-deterministic - same operation succeeds sometimes, fails other times **Error in logs:** ``` sqlalchemy.exc.InvalidRequestError: Could not refresh instance '<User at 0x7676147c7850>' sqlalchemy.exc.InvalidRequestError: Could not refresh instance '<Chat at 0x70e8c9f93f90>' ``` This makes OpenWebUI unreliable for: - Multi-user deployments - Bulk user provisioning - API-driven user management - Production environments requiring reliability ### Steps to Reproduce ### Environment Setup **OpenWebUI Configuration:** ```yaml Image: ghcr.io/open-webui/open-webui:main Deployment: Kubernetes (6 replicas with HPA) Database: PostgreSQL 17.6 via PgPool-II Environment Variables: DATABASE_URL: postgresql://postgres:password@pgpool:5432/openwebui DATABASE_POOL_SIZE: "40" DATABASE_POOL_MAX_OVERFLOW: "20" DATABASE_POOL_RECYCLE: "240" DATABASE_POOL_PRE_PING: "true" DATABASE_POOL_TIMEOUT: "30" ``` **Database Infrastructure:** ```yaml PostgreSQL 17.6 HA (Bitnami Helm Chart): Replication: Streaming (1 primary + 1 replica) Max connections: 300 Settings: - idle_in_transaction_session_timeout: 5min - statement_timeout: 1min - tcp_keepalives_idle: 60 - tcp_keepalives_interval: 10 PgPool-II 4.6.3: Instances: 2 Configuration: - num_init_children: 100 - max_pool: 4 - connection_life_time: 600s (10 minutes) - client_idle_limit: 300s (5 minutes) - load_balancing_mode: transaction ``` ### Reproduction Steps **Test 1: User Creation (Simplest)** ```bash # Run this 10 times in succession: for i in {1..10}; do curl -s 'https://your-openwebui-url/api/v1/auths/add' \ -X POST \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $ADMIN_TOKEN" \ -d "{ \"name\": \"Test User $i\", \"email\": \"test$i@example.com\", \"password\": \"TestPass123!\", \"role\": \"user\" }" | jq '.id // .detail' sleep 0.2 done ``` **Expected Result:** 10 successful user creations **Actual Result:** 3-6 successful, 4-7 failures (30-60% success rate) **Test 2: Concurrent User Creation (Stress)** ```bash # Create 10 users concurrently: for i in {1..10}; do curl -s 'https://your-openwebui-url/api/v1/auths/add' \ -X POST \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $ADMIN_TOKEN" \ -d "{ \"name\": \"Concurrent User $i\", \"email\": \"concurrent$i@example.com\", \"password\": \"TestPass123!\", \"role\": \"user\" }" & done wait ``` **Expected Result:** 10 successful users **Actual Result:** 1-3 successful, 7-9 failures (10-30% success rate) **Test 3: Chat Creation** ```bash # After creating a user, try creating multiple chats: for i in {1..5}; do curl -s 'https://your-openwebui-url/api/v1/chats/new' \ -X POST \ -H "Authorization: Bearer $USER_TOKEN" \ -H 'Content-Type: application/json' \ -d "{ \"chat\": { \"title\": \"Test Chat $i\", \"models\": [\"gpt-3.5-turbo\"] } }" | jq '.id // .detail' sleep 0.3 done ``` **Expected Result:** 5 chats created **Actual Result:** 0-2 chats created, rest fail ### Key Observations 1. **Timing-dependent:** Same request succeeds/fails randomly 2. **Not load-related:** Fails even with delays between requests 3. **Not infrastructure:** Happens even after fresh pod restarts 4. **Multiple operations affected:** Users, chats, likely all DB writes 5. **Connection pool independent:** Fails regardless of pool size (tested 5-40) ### Logs & Screenshots **User Creation Failure (from OpenWebUI logs):** ``` 2025-11-04 18:52:46.047 | ERROR | open_webui.routers.auths:add_user:781 - Add user error: Could not refresh instance '<User at 0x7676147c7850>' 2025-11-04 18:52:37.342 | ERROR | open_webui.routers.auths:add_user:781 - Add user error: Could not refresh instance '<User at 0x79a3c1890590>' 2025-11-04 18:52:38.553 | ERROR | open_webui.routers.auths:add_user:781 - Add user error: Could not refresh instance '<Auth at 0x79a3c187d550>' 2025-11-04 18:52:43.613 | ERROR | open_webui.routers.auths:add_user:781 - Add user error: Could not refresh instance '<User at 0x79a3c16cf5d0>' ``` **Chat Creation Failure (with full stack trace):** ``` 2025-11-04 20:05:20.184690471+01:00 stdout F sqlalchemy.exc.InvalidRequestError: Could not refresh instance '<Chat at 0x70e8c9f93f90>' 2025-11-04 20:05:20.184661044+01:00 stdout F raise sa_exc.InvalidRequestError( 2025-11-04 20:05:20.184432502+01:00 stdout F File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app 2025-11-04 20:05:20.184367022+01:00 stdout F await wrap_app_handling_exceptions(app, request)(scope, receive, send) 2025-11-04 20:05:20.184110815+01:00 stdout F File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app 2025-11-04 20:05:20.184014941+01:00 stdout F File "/usr/local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__ ``` **Database Connection Termination (PostgreSQL logs):** ``` 2025-11-04 18:57:25.021 GMT [65419] FATAL: terminating connection due to idle-session timeout 2025-11-04 18:58:15.012 GMT [65740] FATAL: terminating connection due to idle-session timeout 2025-11-04 18:58:35.000 GMT [65865] FATAL: terminating connection due to idle-session timeout ``` *(Note: This happened when we tried setting idle_session_timeout to fix the issue - it made it worse)* **Related Error (LiteLLM on same infrastructure):** ``` prisma.errors.DataError: Can't reach database server at `postgresql-ha-pgpool.database.svc.cluster.local`:`5432` ``` ### Additional Information Based on extensive testing and log analysis, the issue appears to be in SQLAlchemy session management: **Suspected code pattern (pseudocode from logs):** ```python # In open_webui/routers/auths.py:781 and similar locations db.add(user) db.commit() # ← Transaction commits, may close/detach session db.refresh(user) # ← FAILS because session is closed or object is detached ``` **Why it fails:** 1. After `commit()`, SQLAlchemy may close the session or detach objects 2. When connections are recycled from the pool, session state becomes invalid 3. Calling `refresh()` on a detached object raises `InvalidRequestError` 4. The timing is random depending on connection pool state ### Infrastructure Changes Tested (All Failed to Fix) To rule out infrastructure issues, we extensively tested: **1. Connection Pool Optimization:** - ✅ Increased `DATABASE_POOL_SIZE` from 5 → 40 - ✅ Increased `DATABASE_POOL_MAX_OVERFLOW` from 10 → 20 - ✅ Set `DATABASE_POOL_RECYCLE` to 240s (4 minutes) - ✅ Enabled `DATABASE_POOL_PRE_PING` to test connections - **Result:** Still 30-60% failure rate **2. PgPool Configuration:** - ✅ Set `connection_life_time` to 600s (was 0/infinite) - ✅ Set `client_idle_limit` to 300s (was 0/infinite) - ✅ Configured proper health checks - ✅ Load balancing mode: transaction - **Result:** Still 30-60% failure rate **3. PostgreSQL Timeouts:** - ✅ `idle_in_transaction_session_timeout: 5min` (killed stuck transactions) - ✅ `statement_timeout: 1min` (killed long queries) - ✅ `idle_session_timeout: 10min` → Made it WORSE (killed connections mid-use) - ✅ `idle_session_timeout: 30min` → Still failed - ✅ `idle_session_timeout: 0` (disabled) → Still failed - **Result:** Configuration doesn't help, proves it's not a timeout issue **4. Scaling Tests:** - ✅ Tested with 2 pods: 20% success - ✅ Tested with 5 pods: 60% success - ✅ Tested with 6 pods: 30% success - ✅ Tested with 10 pods: 50% success - **Result:** Inconsistent, not related to load/capacity **5. Fresh Connection Tests:** - ✅ Restarted OpenWebUI pods - ✅ Restarted PostgreSQL cluster - ✅ Cleared all connection pools - ✅ Tested immediately after restart - **Result:** Still fails on first requests with brand new connections
GiteaMirror added the bug label 2026-04-25 08:09:57 -05:00
Author
Owner

@tjbck commented on GitHub (Nov 4, 2025):

Unable to reproduce, community input wanted here.

<!-- gh-comment-id:3488362870 --> @tjbck commented on GitHub (Nov 4, 2025): Unable to reproduce, community input wanted here.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/open-webui#34250