mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 10:58:17 -05:00
[GH-ISSUE #24330] issue: Uvicorn worker crash loop on chat completion when tool servers are unreachable causes CPU exhaustion and system unresponsiveness #58935
Originally created by @zaakiy on GitHub (May 3, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/24330
Check Existing Issues
Installation Method
Pip Install
Open WebUI Version
0.9.2
Ollama Version (if applicable)
No response
Operating System
Ubuntu 24.04
Browser (if applicable)
No response
Confirmation
README.md.
Expected Behavior
When a `POST /api/chat/completions` request is made and one or more configured external tool/MCP servers are unreachable (connection refused or timeout), Open WebUI should log the error gracefully, skip the unavailable server, and continue serving requests normally. Worker processes should not crash, and the supervisor should not enter an unbounded restart loop.
Actual Behavior
When a chat completion request is handled, Open WebUI calls `get_tool_servers_data()` in `open_webui/utils/tools.py` to fetch OpenAPI specs from all configured tool servers. If any server is unreachable, an unhandled `ConnectionRefusedError` or `asyncio.TimeoutError` propagates out of the worker's `lifespan` startup context, killing the Uvicorn worker process.
The multiprocess supervisor immediately spawns a replacement worker. That worker also contacts the unreachable tool servers during its own `lifespan` startup, crashes again, and the cycle repeats indefinitely: a sustained worker crash/restart loop.
Each restart triggers a full cold-start: reloading the `sentence-transformers/all-MiniLM-L6-v2` BertModel, running Alembic DB migrations, and re-initialising all app dependencies. With 10 workers cycling through this repeatedly, CPU spiked to ~57% sustained on a 16-core machine and RAM peaked at 16.4 GB. After ~15 minutes the OS became fully unresponsive (`systemd-logind` could not create new SSH sessions, `containerd` began logging `context deadline exceeded`) and a hard reboot was required.
Steps to Reproduce
1. `pip install open-webui` on Ubuntu 24.04, running as a systemd service with default Uvicorn multiprocess mode (10 workers).
2. [...] `http://<tailscale-ip>:9999`).
3. [...] `POST /api/chat/completions`.
4. [...] `lifespan` startup.
5. [...] `tools.py:get_tool_server_data` (line ~1152) and dies.
Config
Logs & Screenshots
Additional Information
Related issue: #22543, "Frontend fetch of openapi.json from external tool server has no timeout, causing infinite UI hang on page load" (closed March 2026). That issue fixed the frontend timeout gap, and its author noted "the backend already handles this correctly with `AIOHTTP_CLIENT_TIMEOUT_TOOL_SERVER_DATA`". However, our issue demonstrates the backend does not handle it safely: a connection error still escapes as an unhandled exception that kills the worker process itself.
Environment:
- `systemd` service via `pipx`, not Docker
- `open-terminal` MCP tool servers configured on Tailscale IPs, some of which were offline at time of incident
Suggested fixes:
1. [...] `get_tool_server_data`/`get_tool_servers_data` in a broad `try/except` so any individual failure is logged as a warning and skipped, never allowed to propagate as an unhandled exception that can kill a worker.
2. [...] `lifespan` startup if startup-time failures can crash the worker. Defer to a background task or lazy-load on first request.
@pr-validator-bot commented on GitHub (May 3, 2026):
⚠️ Invalid Issue Title
Hey @zaakiy, please provide a descriptive title for your issue. Titles that are empty, very short (under 10 characters), or generic (like "issue:" or "feat:") make it difficult for volunteer contributors to understand and triage issues.
Please update the title to reflect the content of your issue.
⚠️ Missing Issue Title Prefix
@zaakiy, your issue title is missing a prefix (e.g., `bug:`, `feat:`, `docs:`).
Please update your issue title to include one of the following prefixes:
Example: `bug: Login fails when using special characters in password`
@owui-terminator[bot] commented on GitHub (May 3, 2026):
🔍 Similar Issues Found
I found some existing issues that might be related. Please check if any of these are duplicates or contain helpful solutions:
#15162 issue: Unavailable direct connection chat with multiple workers due to WebSocket/API routing mismatch
by ShirasawaSama · bug, help wanted
#24042 issue: [BUG] 100% CPU after cancel / streaming completion (anyio cancellation loop)
by tbc3697 · bug
#22525 issue: Active chat task stuck in loading state after server restart — stale task data in Redis cannot be cleared
by ShirasawaSama · bug
#22206 issue: [Critical] Multiple API endpoints load entire dataset into memory at once, causing OOM crash and service unavailability
by ShirasawaSama · bug
#24089 issue: Code execution pyodide: Leaving conversation or tab during code exection results in endless processing without timeout
by TomTheWise · bug
💡 If this is a duplicate, consider closing it and adding details to the existing issue.
This comment was generated automatically. React with 👍 if helpful, 👎 if not.
@zaakiy commented on GitHub (May 3, 2026):
I wonder if #24089 triggers the behaviour seen in my issue
@MukundaKatta commented on GitHub (May 3, 2026):
The lifespan startup path is the worst place for any network-dependent init since a single unreachable server bricks every worker. The cleanest fix is moving `get_tool_servers_data()` out of lifespan entirely and lazy-loading on first chat request, with per-server try/except so one dead server can't poison the rest. The supervisor restart loop is symptom, not cause; even with retry backoff it'll still loop forever if the failing call stays in lifespan.
@zaakiy commented on GitHub (May 3, 2026):
I'm not familiar with the code, but your idea makes a LOT of sense to me.
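For anyone following the thread, the idea discussed above (per-server try/except plus lazy-loading on first request) could look roughly like the sketch below. This is not Open WebUI's actual code: `fetch_one_safely`, `fetch_all`, `LazyToolSpecs`, and the injected `fetch` callable are hypothetical names, the 5-second default timeout is an arbitrary assumption, and the real `get_tool_servers_data()` may be structured quite differently.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Optional

log = logging.getLogger(__name__)

# Hypothetical signature: a coroutine function returning one server's OpenAPI spec.
FetchSpec = Callable[[str], Awaitable[dict[str, Any]]]


async def fetch_one_safely(
    url: str, fetch: FetchSpec, timeout: float
) -> Optional[dict[str, Any]]:
    """Fetch one spec; on any error, log a warning and return None."""
    try:
        return await asyncio.wait_for(fetch(url), timeout=timeout)
    except Exception as exc:  # covers ConnectionRefusedError and timeouts
        log.warning("Tool server %s unavailable, skipping: %r", url, exc)
        return None


async def fetch_all(
    urls: list[str], fetch: FetchSpec, timeout: float = 5.0
) -> dict[str, dict[str, Any]]:
    """Query all servers concurrently and keep only the ones that answered."""
    results = await asyncio.gather(
        *(fetch_one_safely(url, fetch, timeout) for url in urls)
    )
    return {url: spec for url, spec in zip(urls, results) if spec is not None}


class LazyToolSpecs:
    """Loads specs on the first call to get(), not during app startup."""

    def __init__(self, urls: list[str], fetch: FetchSpec, timeout: float = 5.0):
        self._urls = list(urls)
        self._fetch = fetch
        self._timeout = timeout
        self._lock = asyncio.Lock()
        self._specs: Optional[dict[str, dict[str, Any]]] = None

    async def get(self) -> dict[str, dict[str, Any]]:
        async with self._lock:  # concurrent first requests share a single load
            if self._specs is None:
                self._specs = await fetch_all(
                    self._urls, self._fetch, self._timeout
                )
        return self._specs


async def demo() -> dict[str, dict[str, Any]]:
    async def fake_fetch(url: str) -> dict[str, Any]:
        # Stand-in for a real HTTP call; simulates one dead and one slow server.
        if "down" in url:
            raise ConnectionRefusedError(url)
        if "slow" in url:
            await asyncio.sleep(10)
        return {"openapi": "3.1.0", "server": url}

    specs = LazyToolSpecs(
        ["http://up.example", "http://down.example", "http://slow.example"],
        fake_fetch,
        timeout=0.1,
    )
    return await specs.get()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch a dead or slow server only costs a warning and a missing entry, and nothing network-dependent runs during startup. One trade-off of caching the first result is that a server that was offline at first load stays absent until the cache is refreshed, so a real implementation would probably want a refresh or expiry policy.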