[GH-ISSUE #24330] issue: Uvicorn worker crash loop on chat completion when tool servers are unreachable causes CPU exhaustion and system unresponsiveness #58935

GiteaMirror · 2026-05-06T00:29:56-05:00

GiteaMirror commented

2026-05-06 00:29:56 -05:00

Originally created by @zaakiy on GitHub (May 3, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/24330

Check Existing Issues

- [x] I have searched for any existing and/or related issues.
- [x] I have searched for any existing and/or related discussions.
- [x] I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
- [x] I am using the latest version of Open WebUI.

Installation Method

Pip Install

Open WebUI Version

0.9.2

Ollama Version (if applicable)

No response

Operating System

Ubuntu 24.04

Browser (if applicable)

No response

Confirmation

- [x] I have read and followed all instructions in README.md.
- [x] I am using the latest version of both Open WebUI and Ollama.
- [x] I have included the browser console logs.
- [x] I have included the Docker container logs.
- [x] I have provided every relevant configuration, setting, and environment variable used in my setup.
- [x] I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
- [x] I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  - Start with the initial platform/version/OS and dependencies used,
  - Specify exact install/launch/configure commands,
  - List URLs visited, user input (incl. example values/emails/passwords if needed),
  - Describe all options and toggles enabled or changed,
  - Include any files or environmental changes,
  - Identify the expected and actual result at each stage,
  - Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

When a POST /api/chat/completions request is made and one or more configured external tool/MCP servers are unreachable (connection refused or timeout), Open WebUI should log the error gracefully, skip the unavailable server, and continue serving requests normally. Worker processes should not crash, and the supervisor should not enter an unbounded restart loop.
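
One way to get the "log, skip, continue" behaviour described above is to put each server behind its own timeout and exception handler. The following is only a sketch under stated assumptions, not Open WebUI's code: `fetch_specs_isolated`, `fetch_spec`, and `fake_fetch` are hypothetical names, and the fetcher is a stand-in so the example runs without a network.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

log = logging.getLogger(__name__)


async def fetch_specs_isolated(
    urls: List[str],
    fetch_spec: Callable[[str], Awaitable[dict]],
    timeout: float = 5.0,
) -> Dict[str, dict]:
    """Fetch every spec concurrently; skip servers that fail or time out."""

    async def fetch_one(url: str) -> Tuple[str, Optional[dict]]:
        try:
            return url, await asyncio.wait_for(fetch_spec(url), timeout)
        except Exception as exc:  # broad on purpose: one bad server must not escape
            log.warning("Skipping tool server %s: %r", url, exc)
            return url, None

    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return {url: spec for url, spec in results if spec is not None}


# Hypothetical stand-in fetcher for demonstration only (no network involved).
async def fake_fetch(url: str) -> dict:
    if "down" in url:
        raise ConnectionRefusedError(111, "Connect call failed")
    if "slow" in url:
        await asyncio.sleep(10)  # longer than the timeout used below
    return {"openapi": "3.1.0", "server": url}


specs = asyncio.run(
    fetch_specs_isolated(
        ["http://ok:9999", "http://down:9999", "http://slow:9999"],
        fake_fetch,
        timeout=0.05,
    )
)
print(sorted(specs))
```

Only the reachable server ends up in `specs`; the refused and timed-out servers are logged as warnings and dropped, so the caller never sees their exceptions.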

Actual Behavior

When a chat completion request is handled, Open WebUI calls get_tool_servers_data() in open_webui/utils/tools.py to fetch OpenAPI specs from all configured tool servers. If any server is unreachable, an unhandled ConnectionRefusedError or asyncio.TimeoutError propagates out of the worker's lifespan startup context, killing the Uvicorn worker process.

The multiprocess supervisor immediately spawns a replacement worker. That worker also contacts the unreachable tool servers during its own lifespan startup, crashes again, and the cycle repeats indefinitely — a sustained worker crash/restart loop.

Each restart triggers a full cold start: reloading the sentence-transformers/all-MiniLM-L6-v2 BertModel, running Alembic DB migrations, and re-initialising all app dependencies. With 10 workers cycling through this repeatedly, CPU usage held at ~57% on a 16-core machine and RAM peaked at 16.4 GB. After ~15 minutes the OS became fully unresponsive (systemd-logind could not create new SSH sessions, and containerd began logging context deadline exceeded), and a hard reboot was required.

Steps to Reproduce

1. Install Open WebUI v0.9.2 via pip install open-webui on Ubuntu 24.04, running as a systemd service with default Uvicorn multiprocess mode (10 workers).
2. Configure one or more external tool/MCP servers via Admin UI → Settings → Tools (e.g. http://<tailscale-ip>:9999).
3. Make at least one of those tool servers unreachable (e.g. peer goes offline, container stops, stale IP).
4. As any authenticated user, submit a chat message, which triggers POST /api/chat/completions.
5. Open WebUI attempts to fetch OpenAPI specs from all configured tool servers as part of request handling and/or worker lifespan startup.
6. The worker throws an unhandled exception from tools.py:get_tool_server_data (line ~1152) and dies.
7. The Uvicorn supervisor spawns a replacement worker → it also crashes on tool server contact → repeat.
8. CPU and memory climb continuously with each spawn cycle due to BertModel reload and DB migrations on every restart.
9. After ~10–15 minutes the host OS becomes unresponsive and requires a hard reboot.
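
Before submitting the chat message, it can help to confirm that the tool server from step 3 really is unreachable from the Open WebUI host. The probe below is a minimal sketch using only the standard library; `probe` and `unused_local_port` are hypothetical helper names, and the demo targets a local port with no listener rather than a real tool server.

```python
import asyncio
import socket
from typing import Optional


async def probe(host: str, port: int, timeout: float = 2.0) -> Optional[str]:
    """Return None if a TCP connection succeeds, else the exception class name."""
    try:
        _reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
    except (OSError, asyncio.TimeoutError) as exc:
        # Covers connection refused, unreachable host, and timeouts.
        return type(exc).__name__
    writer.close()
    await writer.wait_closed()
    return None


def unused_local_port() -> int:
    """Bind to port 0 to obtain a free port, then release it so nothing listens."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


result = asyncio.run(probe("127.0.0.1", unused_local_port()))
print(result)  # a non-None value means the endpoint is unreachable
```

Pointing `probe` at the configured tool server host and port should report the same class of error that appears in the logs when the server is down.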

Config

ENABLE_API_KEY=true
WEBUI_ENABLE_API_KEY=true
ENABLE_API_KEY_ENDPOINT=true
AWS_SDK_LOAD_CONFIG=0
AWS_CONFIG_FILE=/dev/null
AWS_SHARED_CREDENTIALS_FILE=/dev/null
HOST=0.0.0.0
PORT=3000


# --- Streaming & DB performance fixes ---
# DATABASE_ENABLE_SQLITE_WAL is SQLite-only — no effect with Postgres, removed
ENABLE_REALTIME_CHAT_SAVE=False

# --- Uvicorn workers ---
# Formula: (2 x CPUs) + 1 = 17 for an 8-core machine.
# HOWEVER: 17 workers x ~830MB each = ~14GB RSS — this is the primary cause of the
# 12.7GB memory usage and the slowness (memory pressure + GIL contention between workers).
# For a CPU-bound/async ASGI app like Open WebUI, fewer workers is far better.
# Recommended: 9 workers (CPUs + 1) — allows full CPU saturation with ~7.5GB RAM.
UVICORN_WORKERS=9

# --- anyio thread pool ---
# Controls sync-to-async thread concurrency per worker.
# 9 workers x 8 threads = 72 total sync threads — sufficient headroom without bloat.
THREAD_POOL_SIZE=8

# --- SQLAlchemy connection pool (PostgreSQL) ---
# Without DATABASE_POOL_SIZE set, Open WebUI uses SQLAlchemy DEFAULT QueuePool:
#   pool_size=5, max_overflow=10 → up to 15 connections per worker.
# With 9 workers: up to 9 x 15 = 135 connections. Within max_connections=300 but wasteful.
#
# Explicit pool: pool_size=5 + max_overflow=5 = 10 max per worker.
# 9 workers x 10 = 90 max connections (comfortable headroom).
DATABASE_POOL_SIZE=8
DATABASE_POOL_MAX_OVERFLOW=7
DATABASE_POOL_TIMEOUT=30
# Recycle connections every 10 minutes to avoid stale/zombie connections
DATABASE_POOL_RECYCLE=600

DATABASE_URL=postgresql://openwebui:none-of-your-business@127.0.0.1:5432/openwebui

MODELS_CACHE_TTL=300
ENABLE_QUERIES_CACHE=True
RAG_SYSTEM_CONTEXT=True
CHUNK_MIN_SIZE_TARGET=1000
ENABLE_AUTOCOMPLETE_GENERATION=False
WEBUI_URL=https://ai.kelsiem.com

ENABLE_WEBSOCKET_SUPPORT=true
WEBSOCKET_MANAGER=redis
WEBSOCKET_REDIS_URL=redis://localhost:6380/0  # non-standard Redis port
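
The pool comments in the config describe pool_size=5 with max_overflow=5, while the values actually set are DATABASE_POOL_SIZE=8 and DATABASE_POOL_MAX_OVERFLOW=7. A quick calculation of the worst-case connection count for both, assuming each worker holds its own pool as the comments state (`max_db_connections` is a hypothetical helper name):

```python
def max_db_connections(workers: int, pool_size: int, max_overflow: int) -> int:
    """Upper bound on connections when every worker has its own pool."""
    return workers * (pool_size + max_overflow)


# Values actually set in the config above.
configured = max_db_connections(workers=9, pool_size=8, max_overflow=7)

# Values described in the config comments.
commented = max_db_connections(workers=9, pool_size=5, max_overflow=5)

print(configured, commented)  # 135 90
```

With the values as set, the upper bound is 135 connections rather than the 90 mentioned in the comment; both are below the max_connections=300 the comments refer to.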

Logs & Screenshots

# Unhandled exception in worker — tool server connection failure
2026-05-03 08:00:49 | ERROR | open_webui.utils.tools:get_tool_server_data:1175 -
  Could not fetch tool server spec from http://10.11.1.172:9999/openapi.json
ConnectionRefusedError: [Errno 111] Connect call failed ('10.11.1.172', 9999)

2026-05-03 08:00:49 | ERROR | open_webui.utils.tools:get_tool_servers_data:1247 -
  Failed to connect to http://10.11.1.172:9999 OpenAPI tool server
2026-05-03 08:00:49 | ERROR | open_webui.utils.tools:get_tool_servers_data:1247 -
  Failed to connect to http://100.74.127.39:9999 OpenAPI tool server

# Exception escaping into multiprocessing spawn — worker killed
File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
...
File "open_webui/utils/tools.py", line 1152, in get_tool_server_data
    async with session.get(url, headers=_headers, ssl=...) as response:
TimeoutError

# Uvicorn supervisor crash loop
INFO:  Waiting for child process [2421096]
INFO:  Child process [2421096] died
INFO:  Waiting for child process [2421271]
INFO:  Child process [2421271] died
INFO:  Waiting for child process [2421279]
INFO:  Child process [2421279] died
INFO:  Waiting for child process [2421283]
INFO:  Child process [2421283] died
INFO:  Waiting for child process [2421292]
INFO:  Child process [2421292] died

# BertModel reloaded on every restart cycle (heavy CPU)
BertModel LOAD REPORT from: .../models--sentence-transformers--all-MiniLM-L6-v2/...
2026-05-03 08:16:05 | INFO | open_webui.main:lifespan:659 - Installing external dependencies...
2026-05-03 08:16:15 | INFO | open_webui.main:lifespan:709 - Initializing tool servers...

# systemd resource accounting at shutdown
openwebui.service: Consumed 42min 21.963s CPU time, 16.4G memory peak

# OS symptoms — system becoming unresponsive
sshd: pam_systemd(sshd:session): Failed to create session: Connection timed out
systemd-logind: Failed to start session scope: Message recipient disconnected from message bus without replying
containerd: level=error msg="post event" error="context deadline exceeded"
containerd: level=error msg="forward event" error="context deadline exceeded"

Additional Information

Related issue: #22543 — "Frontend fetch of openapi.json from external tool server has no timeout, causing infinite UI hang on page load" (closed March 2026). That issue fixed the frontend timeout gap, and its author noted "the backend already handles this correctly with AIOHTTP_CLIENT_TIMEOUT_TOOL_SERVER_DATA". However, our issue demonstrates the backend does not handle it safely — a connection error still escapes as an unhandled exception that kills the worker process itself.

Environment:

Open WebUI run as a systemd service via pipx, not Docker
LiteLLM used as model backend (not Ollama)
10 Uvicorn workers (default)
Multiple open-terminal MCP tool servers configured on Tailscale IPs, some of which were offline at time of incident

Suggested fixes:

1. Wrap all tool server connection attempts in get_tool_server_data / get_tool_servers_data in a broad try/except so any individual failure is logged as a warning and skipped, never allowed to propagate as an unhandled exception that can kill a worker.
2. Do not contact tool servers during worker lifespan startup if startup-time failures can crash the worker. Defer to a background task or lazy-load on first request.
3. Add a circuit-breaker or exponential backoff for repeatedly failing tool servers.
4. Cap the Uvicorn supervisor's worker restart rate to prevent unbounded CPU/memory consumption during crash loops.
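
The circuit-breaker and backoff suggestion could look roughly like the sketch below. It is an illustration only: `ServerBreaker` and its methods are hypothetical names, the delays are arbitrary, and the clock is injectable purely so the demo does not have to wait.

```python
import time
from typing import Callable, Dict


class ServerBreaker:
    """Per-server exponential backoff: skip a server until its retry time."""

    def __init__(
        self,
        base_delay: float = 5.0,
        max_delay: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.clock = clock
        self._failures: Dict[str, int] = {}
        self._retry_at: Dict[str, float] = {}

    def allow(self, url: str) -> bool:
        """True if the server may be contacted now."""
        return self.clock() >= self._retry_at.get(url, float("-inf"))

    def record_failure(self, url: str) -> float:
        """Double the wait after each consecutive failure, up to max_delay."""
        n = self._failures.get(url, 0) + 1
        self._failures[url] = n
        delay = min(self.base_delay * 2 ** (n - 1), self.max_delay)
        self._retry_at[url] = self.clock() + delay
        return delay

    def record_success(self, url: str) -> None:
        """Reset the server's state once it responds again."""
        self._failures.pop(url, None)
        self._retry_at.pop(url, None)


# Demo with a fake clock so the backoff can be observed without sleeping.
now = [0.0]
breaker = ServerBreaker(clock=lambda: now[0])
url = "http://10.11.1.172:9999"

delays = [breaker.record_failure(url) for _ in range(3)]
blocked = breaker.allow(url)
now[0] = 25.0
allowed_later = breaker.allow(url)
print(delays, blocked, allowed_later)  # [5.0, 10.0, 20.0] False True
```

A caller would check `allow(url)` before each fetch and call `record_failure` or `record_success` afterwards, so a server that stays offline is contacted less and less often instead of on every request.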

GiteaMirror added the bug label 2026-05-06 00:29:56 -05:00

GiteaMirror commented

2026-05-06 00:30:01 -05:00

@pr-validator-bot commented on GitHub (May 3, 2026):

⚠️ Invalid Issue Title

Hey @zaakiy, please provide a descriptive title for your issue. Titles that are empty, very short (under 10 characters), or generic (like "issue:" or "feat:") make it difficult for volunteer contributors to understand and triage issues.

Please update the title to reflect the content of your issue.

⚠️ Missing Issue Title Prefix

@zaakiy, your issue title is missing a prefix (e.g., bug:, feat:, docs:).

Please update your issue title to include one of the following prefixes:

bug: Bug report or error you've encountered
feat: Feature request or enhancement suggestion
docs: Documentation issue or improvement request
question: Question about usage or functionality
help: Request for help or support

Example: bug: Login fails when using special characters in password

GiteaMirror commented

2026-05-06 00:30:03 -05:00

@owui-terminator[bot] commented on GitHub (May 3, 2026):

🔍 Similar Issues Found

I found some existing issues that might be related. Please check if any of these are duplicates or contain helpful solutions:

#15162 issue: Unavailable direct connection chat with multiple workers due to WebSocket/API routing mismatch
by ShirasawaSama · bug, help wanted
#24042 issue: [BUG] 100% CPU after cancel / streaming completion (anyio cancellation loop)
by tbc3697 · bug
#22525 issue: Active chat task stuck in loading state after server restart — stale task data in Redis cannot be cleared
by ShirasawaSama · bug
#22206 issue: [Critical] Multiple API endpoints load entire dataset into memory at once, causing OOM crash and service unavailability
by ShirasawaSama · bug
#24089 issue: Code execution pyodide: Leaving conversation or tab during code exection results in endless processing without timeout
by TomTheWise · bug

💡 If this is a duplicate, consider closing it and adding details to the existing issue.

This comment was generated automatically. React with 👍 if helpful, 👎 if not.

GiteaMirror commented

2026-05-06 00:30:05 -05:00

@zaakiy commented on GitHub (May 3, 2026):

> 5. #24089 issue: Code execution pyodide: Leaving conversation or tab during code exection results in endless processing without timeout
> by TomTheWise · bug

I wonder if #24089 triggers the behaviour seen in my issue

GiteaMirror commented

2026-05-06 00:30:06 -05:00

@MukundaKatta commented on GitHub (May 3, 2026):

The lifespan startup path is the worst place for any network-dependent init since a single unreachable server bricks every worker. The cleanest fix is moving get_tool_servers_data() out of lifespan entirely and lazy-loading on first chat request, with per-server try/except so one dead server can't poison the rest. The supervisor restart loop is symptom, not cause; even with retry backoff it'll still loop forever if the failing call stays in lifespan.
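
The lazy-loading idea in this comment can be sketched as follows. This is a hypothetical illustration rather than a patch against Open WebUI: `LazyToolServers` and `loader` are made-up names, and the loader stands in for whatever would actually fetch the tool server data.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class LazyToolServers:
    """Load tool server data on first use instead of during startup."""

    def __init__(self, loader: Callable[[], Awaitable[dict]]) -> None:
        self._loader = loader
        self._lock = asyncio.Lock()
        self._data: Optional[dict] = None

    async def get(self) -> dict:
        if self._data is None:
            async with self._lock:
                if self._data is None:  # re-check after waiting for the lock
                    try:
                        self._data = await self._loader()
                    except Exception:
                        # Nothing is cached on failure, so a later request retries.
                        return {}
        return self._data


async def main() -> tuple:
    calls = 0

    async def loader() -> dict:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # simulate a slow fetch
        return {"http://ok:9999": {"openapi": "3.1.0"}}

    servers = LazyToolServers(loader)  # created inside the running event loop
    results = await asyncio.gather(*(servers.get() for _ in range(5)))
    return calls, results


calls, results = asyncio.run(main())
print(calls, len(results))  # 1 5
```

Five concurrent requests trigger a single load, and a failing loader yields an empty result for that request instead of an exception during startup.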

GiteaMirror commented

2026-05-06 00:30:08 -05:00

@zaakiy commented on GitHub (May 3, 2026):

> The lifespan startup path is the worst place for any network-dependent init since a single unreachable server bricks every worker. The cleanest fix is moving get_tool_servers_data() out of lifespan entirely and lazy-loading on first chat request, with per-server try/except so one dead server can't poison the rest. The supervisor restart loop is symptom, not cause; even with retry backoff it'll still loop forever if the failing call stays in lifespan.

I'm not familiar with the code, but your idea makes a LOT of sense to me.

Reference: github-starred/open-webui#58935