Mirror of https://github.com/open-webui/open-webui.git (synced 2026-03-11 08:15:00 -05:00)
feat: Start voice call TTS response asynchronously when enough text is available #5313
Originally created by @lee-b on GitHub (May 24, 2025).
Problem Description
During voice calls, there is currently very long latency between finishing speech input and hearing the voice response. If the response is already available and I click the speaker icon, voice output is almost immediate. Similarly, with /no_think, my time to first token is almost immediate, so this is not an AI inference performance issue. However, Open-WebUI seems to wait for the entire text response from the LLM before sending anything to the TTS engine for voice generation.
Desired Solution you'd like
Depending on the current Admin Panel -> Settings -> Audio -> Response Splitting setting, as soon as the first "split" is available, Open-WebUI should send it to the TTS engine and begin playback whilst obtaining the remaining splits.
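The behaviour being requested can be sketched roughly as follows. This is a minimal illustration, not Open-WebUI's actual implementation: the names `stream_tts_chunks` and `speak`, and the punctuation pattern, are all hypothetical, and the real Response Splitting logic may differ.

```python
import re
from typing import Callable, Iterable

# Hypothetical split boundary: whitespace following sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?;:])\s+")

def stream_tts_chunks(tokens: Iterable[str], speak: Callable[[str], None]) -> None:
    """Buffer streamed LLM tokens and hand each completed "split" (a
    punctuation-terminated chunk) to the TTS callback as soon as it is
    available, instead of waiting for the full response to finish."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Every part except the last ends at punctuation and can be spoken now.
        for complete in parts[:-1]:
            if complete.strip():
                speak(complete.strip())
        # The last part may still be mid-sentence; keep buffering it.
        buffer = parts[-1]
    if buffer.strip():
        speak(buffer.strip())
```

With this shape, playback of the first split can begin while later tokens are still streaming from the model, which is exactly the latency reduction the issue asks for.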
Alternatives Considered
No response
Additional Context
No response
@tjbck commented on GitHub (May 24, 2025):
This is already the case.
@Ryderjj89 commented on GitHub (Jul 31, 2025):
This is not the case. If it is, please elaborate. I have Response Splitting set to "Punctuation", and yet I have to wait until the entire text response is complete before I get any voice response back. This makes a "voice call" very awkward: if the model's response is long or in-depth, there is a long period of silence before anything is heard.