Mirror of https://github.com/open-webui/open-webui.git (synced 2026-05-06 10:58:17 -05:00)
[GH-ISSUE #16457] issue: Voice Mode audio playback does not begin until the assistant finishes generating the entire message #56577
Originally created by @QuantumFlux21 on GitHub (Aug 10, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/16457
Check Existing Issues
Installation Method
Docker
Open WebUI Version
0.6.21
Ollama Version (if applicable)
No response
Operating System
Ubuntu 24.04
Browser (if applicable)
Chrome 138.0.7204.184
Confirmation
I have read and followed all instructions in README.md.
Expected Behavior
In voice mode, when Response splitting is set to Punctuation or Paragraphs and Autoplay is on, audio should start playing during generation.
Completed chunks (for example, sentences) should be sent to the TTS provider as they become available so playback can begin earlier.
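The expected behavior amounts to an incremental splitter: accumulate streamed text deltas and emit a chunk each time a sentence boundary appears, so a TTS request can be fired before the full message exists. A minimal sketch of that idea (hypothetical illustration, not Open WebUI's actual implementation; the class and method names are invented):

```python
import re

# Sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r'([.!?])\s')

class PunctuationChunker:
    """Accumulates streamed deltas; yields completed sentences early."""

    def __init__(self):
        self._buffer = ""

    def feed(self, delta: str) -> list[str]:
        """Add a streamed delta; return any newly completed sentences."""
        self._buffer += delta
        chunks = []
        while True:
            match = _SENTENCE_END.search(self._buffer)
            if not match:
                break
            end = match.end(1)  # keep the punctuation mark in the chunk
            chunks.append(self._buffer[:end].strip())
            self._buffer = self._buffer[end:].lstrip()
        return chunks

    def flush(self) -> str:
        """Return whatever remains once generation finishes."""
        rest, self._buffer = self._buffer.strip(), ""
        return rest
```

Each chunk returned by `feed` would be handed to the TTS provider immediately, instead of waiting for the final delta.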
Actual Behavior
In voice mode, ElevenLabs TTS requests are sent only after the final message text is available.
Audio playback starts only after text generation completes.
The Response splitting setting does not lead to sentence‑by‑sentence playback during generation.
Steps to Reproduce
1. Start Open WebUI with ElevenLabs configured and Autoplay enabled.
2. In Settings > Audio/TTS, select Response splitting: Punctuation.
3. While in voice mode, ask the assistant for a multi-sentence response.
4. Observe that the text streams into the chat UI.
5. Watch the Network panel in the browser devtools to see when ElevenLabs TTS requests are sent.
6. Note that the request is sent to ElevenLabs TTS only after the entire text has been generated.
Logs & Screenshots
Voice Mode
Additional Information
Suggested fix direction:
In Voice Mode, when Response splitting is enabled, send partial chunks to ElevenLabs as they become available, or use the ElevenLabs streaming API so audio can begin mid-generation.
If external providers cannot support this, clarify in the UI that mid‑generation playback is not available for the selected provider while in voice mode.
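The fix direction above can be sketched as a small producer/consumer pipeline: one worker requests TTS audio for each completed sentence while a player consumes the resulting clips in order, so playback begins before generation ends. This is a hypothetical sketch, not Open WebUI's code; `synthesize` stands in for a real provider call and simply fakes an audio payload:

```python
import queue
import threading

def synthesize(sentence: str) -> bytes:
    # Placeholder for a real TTS request (e.g. a POST to the provider's
    # streaming endpoint); here we just fabricate a payload for illustration.
    return f"<audio:{sentence}>".encode()

def pipeline(sentences):
    """Queue audio for each chunk as it completes; play clips in order."""
    audio_q: queue.Queue = queue.Queue()
    played = []

    def player():
        while True:
            clip = audio_q.get()
            if clip is None:      # sentinel: generation finished
                break
            played.append(clip)   # a real player would start output here

    t = threading.Thread(target=player)
    t.start()
    for s in sentences:               # as each chunk completes...
        audio_q.put(synthesize(s))    # ...its audio is queued immediately
    audio_q.put(None)
    t.join()
    return played
```

Because the player thread consumes clips while later sentences are still being synthesized, the first sentence's audio can start as soon as it is ready.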
If you need any logs please let me know and I'll provide them.
@tjbck commented on GitHub (Aug 11, 2025):
That's already the case with our implementation.
@Byrd910 commented on GitHub (Aug 12, 2025):
I am also having this issue. Testing by asking the LLM to "Tell me a scary story" - it generates a story ~350 words long. I am using Orpheus through Orpheus-FastAPI as the TTS engine, set to split on punctuation.
If I start in text mode and have the LLM create the ~350 word story, and then click "Read Aloud", it starts streaming the audio response almost right away (as soon as Orpheus has generated a sentence). In Voice Mode, I can watch the Orpheus logs generate each sentence as it receives them (same behavior as "Read Aloud"), but audio playback doesn't occur until the entire message is generated.
There seems to be some difference in how the "Read Aloud" feature handles the streaming audio vs. "Voice Mode." Note: I'm only testing with the "Tell me a scary story" prompt because it consistently creates longer output; the discrepancy isn't as noticeable unless the output is long enough for there to be a gap while the entire text is generated.
@punithrudrappa commented on GitHub (Feb 24, 2026):
I'm facing the same issue with voice mode with openai compatible Text to Speech endpoint.