Mirror of https://github.com/open-webui/open-webui.git (synced 2026-03-11 08:15:00 -05:00)
feat: Start voice call TTS response asynchronously when enough text is available #5313
Originally created by @lee-b on GitHub (May 24, 2025).
Problem Description
During voice calls, there is currently very long latency between finishing speech input and hearing the voice response. If the response is already available and I click the speaker icon, voice output is almost immediate. Similarly, with /no_think, my time to first token is almost immediate, so this is not an AI inference performance issue. However, Open-WebUI seems to wait for the entire text response from the LLM before sending anything to the TTS engine for voice generation.
Desired Solution you'd like
Depending on the current Admin Panel -> Settings -> Audio -> Response Splitting setting, as soon as the first "split" is available, Open-WebUI should send it to the TTS engine and begin playback whilst obtaining the remaining splits.
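The behaviour being requested can be sketched roughly as follows. This is a minimal illustration, not Open-WebUI's actual implementation: the names `stream_tts_chunks` and `speak`, and the punctuation pattern, are all hypothetical, and the real Response Splitting logic may differ.

```python
import re
from typing import Callable, Iterable

# Hypothetical split boundary: whitespace following sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?;:])\s+")

def stream_tts_chunks(tokens: Iterable[str], speak: Callable[[str], None]) -> None:
    """Buffer streamed LLM tokens and hand each completed "split" (a
    punctuation-terminated chunk) to the TTS callback as soon as it is
    available, instead of waiting for the full response to finish."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Every part except the last ends at punctuation and can be spoken now.
        for complete in parts[:-1]:
            if complete.strip():
                speak(complete.strip())
        # The last part may still be mid-sentence; keep buffering it.
        buffer = parts[-1]
    if buffer.strip():
        speak(buffer.strip())
```

With this shape, playback of the first split can begin while later tokens are still streaming from the model, which is exactly the latency reduction the issue asks for.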
Alternatives Considered
No response
Additional Context
No response
@tjbck commented on GitHub (May 24, 2025):
This is already the case.
@Ryderjj89 commented on GitHub (Jul 31, 2025):
This is not the case. If it is, please elaborate. I have Response Splitting set to "Punctuation", and yet I have to wait until the entire text response is complete before I get any voice response back. This makes a "voice call" very awkward: if the model's response is long or in-depth, there is a long period of silence before anything is heard.