Mirror of https://github.com/open-webui/open-webui.git (synced 2026-05-06 10:58:17 -05:00)
[GH-ISSUE #16457] issue: Voice Mode audio playback does not begin until the assistant finishes generating the entire message #56577
Originally created by @QuantumFlux21 on GitHub (Aug 10, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/16457
Check Existing Issues
Installation Method
Docker
Open WebUI Version
0.6.21
Ollama Version (if applicable)
No response
Operating System
Ubuntu 24.04
Browser (if applicable)
Chrome 138.0.7204.184
Confirmation
I have read and followed all instructions in README.md.
Expected Behavior
In voice mode, when Response splitting is set to Punctuation or Paragraphs and Autoplay is on, audio should start playing during generation.
Completed chunks (for example, sentences) should be sent to the TTS provider as they become available so playback can begin earlier.
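The expected behavior amounts to an incremental splitter: accumulate streamed text deltas and emit a chunk each time a sentence boundary appears, so a TTS request can be fired before the full message exists. A minimal sketch of that idea (hypothetical illustration, not Open WebUI's actual implementation; the class and method names are invented):

```python
import re

# Sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r'([.!?])\s')

class PunctuationChunker:
    """Accumulates streamed deltas; yields completed sentences early."""

    def __init__(self):
        self._buffer = ""

    def feed(self, delta: str) -> list[str]:
        """Add a streamed delta; return any newly completed sentences."""
        self._buffer += delta
        chunks = []
        while True:
            match = _SENTENCE_END.search(self._buffer)
            if not match:
                break
            end = match.end(1)  # keep the punctuation mark in the chunk
            chunks.append(self._buffer[:end].strip())
            self._buffer = self._buffer[end:].lstrip()
        return chunks

    def flush(self) -> str:
        """Return whatever remains once generation finishes."""
        rest, self._buffer = self._buffer.strip(), ""
        return rest
```

Each chunk returned by `feed` would be handed to the TTS provider immediately, instead of waiting for the final delta.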
Actual Behavior
In voice mode, ElevenLabs TTS requests are sent only after the final message text is available.
Audio playback starts only after text generation completes.
The Response splitting setting does not lead to sentence‑by‑sentence playback during generation.
Steps to Reproduce
1. Start Open WebUI with ElevenLabs configured and Autoplay enabled.
2. In Settings > Audio/TTS, select Response splitting: Punctuation.
3. While in voice mode, ask the assistant for a multi-sentence response.
4. Observe that the text streams into the chat UI.
5. Watch the Network panel in the browser devtools to see when ElevenLabs TTS requests are sent.
6. Note that the request is sent to ElevenLabs TTS only after the entire text has been generated.
Logs & Screenshots
Voice Mode
Additional Information
Suggested fix direction:
In Voice Mode, when Response splitting is enabled, send partial chunks to ElevenLabs as they become available, or use the ElevenLabs streaming API so audio can begin mid-generation.
If external providers cannot support this, clarify in the UI that mid‑generation playback is not available for the selected provider while in voice mode.
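The fix direction above can be sketched as a small producer/consumer pipeline: one worker requests TTS audio for each completed sentence while a player consumes the resulting clips in order, so playback begins before generation ends. This is a hypothetical sketch, not Open WebUI's code; `synthesize` stands in for a real provider call and simply fakes an audio payload:

```python
import queue
import threading

def synthesize(sentence: str) -> bytes:
    # Placeholder for a real TTS request (e.g. a POST to the provider's
    # streaming endpoint); here we just fabricate a payload for illustration.
    return f"<audio:{sentence}>".encode()

def pipeline(sentences):
    """Queue audio for each chunk as it completes; play clips in order."""
    audio_q: queue.Queue = queue.Queue()
    played = []

    def player():
        while True:
            clip = audio_q.get()
            if clip is None:      # sentinel: generation finished
                break
            played.append(clip)   # a real player would start output here

    t = threading.Thread(target=player)
    t.start()
    for s in sentences:               # as each chunk completes...
        audio_q.put(synthesize(s))    # ...its audio is queued immediately
    audio_q.put(None)
    t.join()
    return played
```

Because the player thread consumes clips while later sentences are still being synthesized, the first sentence's audio can start as soon as it is ready.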
If you need any logs please let me know and I'll provide them.
@tjbck commented on GitHub (Aug 11, 2025):
That's already the case with our implementation.
@Byrd910 commented on GitHub (Aug 12, 2025):
I am also having this issue. Testing by asking the LLM to "Tell me a scary story" - it generates a story ~350 words long. I am using Orpheus through Orpheus-FastAPI as the TTS engine, set to split on punctuation.
If I start in text mode and have the LLM create the ~350 word story, and then click "Read Aloud", it starts streaming the audio response almost right away (as soon as Orpheus has generated a sentence). In Voice Mode, I can watch the Orpheus logs generate each sentence as it receives them (same behavior as "Read Aloud"), but audio playback doesn't occur until the entire message is generated.
There seems to be some difference in how the "Read Aloud" feature handles the streaming audio vs. "Voice Mode." Note: I'm only testing with the "Tell me a scary story" prompt because it consistently creates longer output; the discrepancy isn't as noticeable unless the output is long enough for there to be a gap while the entire text is generated.
@punithrudrappa commented on GitHub (Feb 24, 2026):
I'm facing the same issue with voice mode with openai compatible Text to Speech endpoint.