[GH-ISSUE #13027] feat: Asynchronous process_chat_payload in chat completion #32318

Closed
opened 2026-04-25 06:12:55 -05:00 by GiteaMirror · 11 comments

Originally created by @tth37 on GitHub (Apr 18, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/13027

Originally assigned to: @tjbck on GitHub.

Check Existing Issues

Related: #13007

Problem Description

The /api/chat/completions endpoint supports two primary modes of operation:

  1. Synchronous (stream=False): Typically invoked via direct HTTP requests, this mode processes the entire request and returns the complete response in a single HTTP transaction.
  2. Asynchronous (stream=True): Primarily used by the frontend UI via WebSocket, this mode is expected to return immediately with a task_id. This task_id allows the frontend to receive status updates, stream the response incrementally via the WebSocket connection, and crucially, enables early stopping of the generation process initiated by the user.

While the asynchronous (stream=True) mode functions as expected for standard chat interactions (returning the task_id promptly), this expected behavior breaks when features requiring substantial pre-processing, such as Web Search or Tool Use, are enabled. Instead of returning immediately, it waits for the process_chat_payload phase (which includes potentially long-running operations like web searches or tool executions) to complete before returning the task_id.
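For reference, a minimal client-side sketch of the asynchronous mode described above; the endpoint and request fields come from the issue, while the response shape (a JSON body carrying the `task_id`) is an assumption for illustration:

```python3
import requests  # any HTTP client works; requests is used for brevity

# Hypothetical call illustrating the asynchronous (stream=True) mode.
resp = requests.post(
    "http://localhost:8080/api/chat/completions",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "model": "llama3",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    timeout=120,
)
# Expected: the response arrives promptly with a task_id the client can use
# to subscribe for updates and to stop generation early (shape assumed).
task_id = resp.json().get("task_id")
```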

![Image](https://github.com/user-attachments/assets/409f8651-c070-42fd-ba4d-62e98fa1eeb0)

This synchronous behavior during the payload processing phase leads to two significant issues (both reported in discussions):

  1. Delayed Early Stopping: The frontend does not receive the task_id until after web search/tool execution finishes. This prevents users from stopping the request during this initial, potentially lengthy (30-60s+) phase.
  2. Network Timeouts: The extended wait time for the endpoint to respond increases the risk of network errors, such as gateway timeouts or client-side request timeouts, degrading the user experience.

Cause Analysis

The chat completion process can be broadly divided into two phases:

  1. process_chat_payload: Handles request preprocessing, including web searches, tool calls, and injecting results into the context for the language model.
  2. process_chat_response: Handles the actual generation of the AI response by the LLM and streams results back via WebSocket.

Currently, process_chat_response is correctly handled asynchronously using create_task, as seen here:

https://github.com/open-webui/open-webui/blob/b8fb4e528dc2629acf68b9a555a59fd0173aaa51/backend/open_webui/utils/middleware.py#L1209-L1210

However, process_chat_payload remains a synchronous step: users have to wait until process_chat_payload finishes before they receive the background task_id. Things get worse when the web search feature is enabled, since it can take 30-60s; during this period the user cannot stop the request early and faces the risk of a connection timeout.
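To make this concrete, here is a simplified sketch of the current control flow (function names are taken from the issue; the import paths and response shape are assumptions, and this is not the verbatim middleware source):

```python3
# Simplified sketch of today's flow; import paths are assumptions.
from open_webui.tasks import create_task
from open_webui.utils.middleware import process_chat_payload, process_chat_response

async def chat_completion(request, form_data, user, metadata, model, tasks):
    # Phase 1 runs inline: web searches / tool calls block the HTTP response here.
    form_data, metadata, events = await process_chat_payload(
        request, form_data, user, metadata, model
    )
    # Dispatch to the model backend (name taken from the issue's sketch).
    response = await chat_completion_handler(request, form_data, user)
    # Only phase 2 is wrapped in a background task, so the client receives
    # the task_id only after the (potentially 30-60s) payload phase completes.
    task_id, _ = create_task(
        process_chat_response(
            request, response, form_data, user, metadata, model, events, tasks
        ),
        id=metadata["chat_id"],
    )
    return {"status": True, "task_id": task_id}  # response shape assumed
```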

Desired Solution

For asynchronous API calls, refactor the chat_completion handler in main.py to make the entire processing pipeline (both payload processing and response generation) asynchronous from the start. This can be achieved by wrapping all time-consuming logic within a single background task created immediately upon receiving the request, as in this proof-of-concept commit: [test_async_chat_completion](https://github.com/tth37/open-webui/commit/316adbb085219b2157f230791b6c5f5765b3c52a)

```python3
async def all_time_consuming_jobs(request, form_data, user, metadata, model, tasks):
    # Phase 1: preprocessing (web search, tool calls) now runs inside the task.
    form_data, metadata, events = await process_chat_payload(
        request, form_data, user, metadata, model
    )
    response = await chat_completion_handler(request, form_data, user)
    # Phase 2: generation; don't create_task inside `process_chat_response`
    await process_chat_response(
        request, response, form_data, user, metadata, model, events, tasks
    )

task_id, _ = create_task(
    all_time_consuming_jobs(request, form_data, user, metadata, model, tasks),
    id=metadata["chat_id"],
)
```
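With that structure, the handler can respond as soon as the task is registered. A hypothetical endpoint wrapper (the route is from the issue; the decorator shape and response schema are assumptions):

```python3
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/api/chat/completions")
async def chat_completion(request: Request):
    # ... resolve form_data, user, metadata, model, tasks as today ...
    task_id, _ = create_task(
        all_time_consuming_jobs(request, form_data, user, metadata, model, tasks),
        id=metadata["chat_id"],
    )
    # Return immediately: the WebSocket connection delivers status updates and
    # streamed tokens, and the task_id enables early stopping.
    return {"status": True, "task_id": task_id}  # response shape assumed
```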

Further Considerations

This simple patch technically works; however, there is likely still a lot of work to be done:

  • Identifying Synchronous/Asynchronous Requests in main.py
  • Error handling: Correct and robust error handling during the two phases (see the sketch after this list)
  • Early Stopping Behavior: The frontend logic of early stopping when web search has not finished
  • etc.
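For the error-handling item, a hypothetical wrapper illustrating one option: surface failures from either phase instead of letting the background task die silently. Notifying the frontend is only stubbed here, since the event channel and payload would have to match the UI's expectations:

```python3
import asyncio
import logging

log = logging.getLogger(__name__)

async def all_time_consuming_jobs_safe(request, form_data, user, metadata, model, tasks):
    try:
        await all_time_consuming_jobs(request, form_data, user, metadata, model, tasks)
    except asyncio.CancelledError:
        # An early stop cancels the task; propagate the cancellation untouched.
        raise
    except Exception:
        # Log the failure; emitting an error event to the frontend would
        # require an event name and shape that match the UI.
        log.exception("chat completion task failed")
```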

@gaby commented on GitHub (Apr 18, 2025):

@tth37 I believe this is fixed by this PR: https://github.com/open-webui/open-webui/pull/12958

The call to do web search was blocking. It will now be async.


@tth37 commented on GitHub (Apr 18, 2025):

> @tth37 I believe this is fixed by this PR: #12958
>
> The call to do web search was blocking. It will now be async.

@gaby Are you sure? 🤔 I've run experiments, and #12958 does not seem to address this issue; /api/chat/completion was still blocked until the search process finished.

In my opinion, await run_in_threadpool(search_web) is still a synchronous operation from the request's point of view. Putting it in a separate thread was intended to prevent search_web from blocking the FastAPI server's IO handling. (Besides, query generation and tool execution are still definitely synchronous.)
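To illustrate the distinction: awaiting run_in_threadpool keeps the event loop responsive for other requests, but the awaiting coroutine, and therefore this HTTP response, still waits for the result; only scheduling the work as a background task lets the handler return first. A self-contained sketch (blocking_search is a stand-in for search_web):

```python3
import asyncio
import time

from starlette.concurrency import run_in_threadpool

def blocking_search(query: str) -> str:
    time.sleep(5)  # stand-in for a 5s web search
    return f"results for {query!r}"

async def handler_threadpool(query: str) -> str:
    # The event loop stays free for *other* requests, but *this* request
    # still waits ~5s before it can return anything.
    return await run_in_threadpool(blocking_search, query)

async def handler_background(query: str) -> asyncio.Task:
    # The work is scheduled and the handler returns immediately; the caller
    # gets a handle (here, the Task itself) instead of the result.
    return asyncio.create_task(run_in_threadpool(blocking_search, query))

async def main():
    print(await handler_threadpool("open-webui"))  # blocks this coroutine ~5s
    task = await handler_background("open-webui")  # returns at once
    print("handler returned; result later:", await task)

asyncio.run(main())
```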


@gaby commented on GitHub (Apr 18, 2025):

@tth37 Yes, but it won't block the asyncio loop; blocking it is a big problem.

You mean /api/chat/completion should return even though the search is not done? If that's the case, then yes, that PR doesn't fix that issue.

I do agree that process_chat_response needs more async. For example, it could use async for, and the function inside should be async.
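As a generic illustration of that suggestion (not Open WebUI's actual code), consuming the model stream with async for yields control to the event loop between chunks, where a plain for over a synchronous iterator would block it:

```python3
import asyncio

async def token_stream():
    # Stand-in for an LLM response stream.
    for token in ["Hello", ",", " ", "world"]:
        await asyncio.sleep(0.1)  # simulated network latency per chunk
        yield token

async def consume():
    # `async for` suspends this coroutine while each chunk is awaited,
    # letting the event loop serve other work in the meantime.
    async for token in token_stream():
        print(token, end="", flush=True)
    print()

asyncio.run(consume())
```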


@tth37 commented on GitHub (Apr 18, 2025):

> You mean /api/chat/completion should return even though the search is not done?

Yes, that's exactly what I mean. I think /api/chat/completion should return as soon as possible (as soon as the background task is created).


@gaby commented on GitHub (Apr 18, 2025):

> > You mean /api/chat/completion should return even though the search is not done?
>
> Yes, that's exactly what I mean. I think /api/chat/completion should return as soon as possible (as soon as the background task is created).

That's going to break things; how is the caller supposed to know there's more data?


@tth37 commented on GitHub (Apr 18, 2025):

There is always a WebSocket connection alive in the background, which is responsible for updating the response status.

![Image](https://github.com/user-attachments/assets/106b41ef-667b-492b-9a88-e7d95df4b890)

For now, /api/chat/completion does indeed return before the AI response is fully generated, but strictly after the search process is finished.
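For context, a sketch of how a client might listen on that background WebSocket connection using python-socketio; the server URL, socket path, and event name here are assumptions for illustration, not Open WebUI's documented interface:

```python3
import asyncio
import socketio  # python-socketio async client

async def main():
    sio = socketio.AsyncClient()

    @sio.on("chat-events")  # event name assumed for illustration
    async def on_chat_event(data):
        print("status update:", data)

    # Socket path assumed; adjust to the server's actual mount point.
    await sio.connect("http://localhost:8080", socketio_path="/ws/socket.io")
    await sio.wait()

asyncio.run(main())
```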


@gaby commented on GitHub (Apr 18, 2025):

That case only applies to the UI. If you use that route via the API, it doesn't use WebSocket; it's just HTTP?


@tth37 commented on GitHub (Apr 18, 2025):

Yes, that's the case where the stream parameter is set to False; in that case the handler falls back to the original response.

https://github.com/open-webui/open-webui/blob/b8fb4e528dc2629acf68b9a555a59fd0173aaa51/backend/open_webui/utils/middleware.py#L2265-L2266

When stream=False, /api/chat/completions is a synchronous API call, and it works fine. This issue targets asynchronous API calls, especially from the UI; the synchronous handler can safely remain unchanged, as in the simplified branch below.
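A simplified sketch of that branching (names from the issue; illustrative, not the verbatim middleware source):

```python3
async def handle_completion(form_data, response):
    # Illustrative branching, not the verbatim middleware source.
    if not form_data.get("stream", False):
        # stream=False: synchronous API call; return the complete response
        # in one HTTP transaction. This path can safely stay as-is.
        return response
    # stream=True (UI path): this is the branch the issue proposes to move
    # into a background task so that the task_id is returned immediately.
    ...
```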


@tth37 commented on GitHub (Apr 18, 2025):

@gaby @rgaricano Thank you for your attention! I've updated the issue to better describe the problem.


@tth37 commented on GitHub (Apr 18, 2025):

You can easily enable a very basic version of this feature by applying a change like this: [test_async_chat_completion](https://github.com/tth37/open-webui/commit/316adbb085219b2157f230791b6c5f5765b3c52a) (yet it only supports asynchronous calls together with the WebSocket).


@tjbck commented on GitHub (Aug 18, 2025):

Addressed with d6f709574e in dev!

Reference: github-starred/open-webui#32318