mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-07 03:18:23 -05:00
[GH-ISSUE #6021] issue: markdown content being duplicated in TTS #52877
Originally created by @nengoxx on GitHub (Oct 8, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/6021
Bug Report
Installation Method
Installed via pip on a virtual environment.
Environment
Open WebUI Version: tested with 0.3.30 & 0.3.32
TTS backend: happens with both Alltalk_tts & openedai_speech
Operating System: Windows 11
Browser (if applicable): Firefox 131.0 (& Fennec on Android)
Confirmation:
Expected Behavior:
The TTS should not repeat any sentence in the call.
Actual Behavior:
In every long sentence in which an asterisk (and maybe other special symbols) appears, the audio is repeated.
Description
Bug Summary:
When using the call functionality, whenever there are asterisks in a sentence (like additional narration in italics), the response audio repeats that sentence, and it can happen to all the sentences, several times.
Reproduction Details
Steps to Reproduce:
Let the LLM respond with italics (as narration of actions or similar) while using the call functionality: narration dialog
I used https://huggingface.co/MarinaraSpaghetti/NemoMix-Unleashed-12B to reproduce it.
Also asking it to role-play while adding this to the system prompt to get responses like in the screenshot: "Balance direct speech with narrative. Respect this markdown format: direct speech, actions."
Logs and Screenshots
Browser Console Logs:
There are no logs about it in the console, and it doesn't send the TTS request to the backend twice either.
Screenshots/Screen Recordings (if applicable):

Additional Information
I'm using streaming for the LLM responses.
The 'Fluidly stream large external response chunks' and punctuation/paragraph splitting settings don't seem to have any effect; the bug still happens.
Maybe it has something to do with the text parsing & audio playback while it streams, since it doesn't request more audio clips than needed.
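To illustrate the suspicion above: if a client re-splits the whole streamed text each time new tokens arrive, a sentence could be queued for TTS more than once. Whether that is what happens here is not established. A minimal Python sketch of a splitter that remembers how much text it has already handed off (all names are hypothetical; this is not Open WebUI's actual code):

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceStreamer:
    """Emit each completed sentence of a growing text exactly once."""

    def __init__(self) -> None:
        self._pos = 0  # offset of the first character not yet emitted

    def feed(self, full_text: str) -> list[str]:
        """Take the whole response so far; return only newly completed sentences."""
        out = []
        for match in _SENTENCE_END.finditer(full_text, self._pos):
            sentence = full_text[self._pos:match.start()].strip()
            if sentence:
                out.append(sentence)
            self._pos = match.end()  # never look at this text again
        return out

    def flush(self, full_text: str) -> list[str]:
        """Call once when streaming ends; also returns the trailing remainder."""
        out = self.feed(full_text)
        rest = full_text[self._pos:].strip()
        if rest:
            out.append(rest)
        self._pos = len(full_text)
        return out
```

Feeding the same text twice returns nothing the second time, so a playback queue built on top of it would not receive repeats from re-parsing alone.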
@tjbck commented on GitHub (Oct 8, 2024):
Could you provide us with a more concrete way to reproduce the issue?
@nengoxx commented on GitHub (Oct 8, 2024):
Sure, I used https://huggingface.co/MarinaraSpaghetti/NemoMix-Unleashed-12B to reproduce it.
Also asking it to role-play while adding this to the system prompt to get responses like in the screenshot: "Balance direct speech with narrative. Respect this markdown format: direct speech, actions."
It started happening without that prompt and with a different model while I was testing some fine-tunes, but they didn't add italics often, or write several paragraphs with italics in them, and it didn't happen all the time.
For example, if the model responded with a single 'action dialog' it didn't seem to happen. So the response might need more than one block of italics and a couple of paragraphs for the bug to happen.
@Simi5599 commented on GitHub (Oct 25, 2024):
I think this was the issue that was fixed in the latest release (0.3.33).
In particular, I was having the same issue because the microphone stayed on even when the model was speaking, causing duplicates.
@nengoxx commented on GitHub (Nov 20, 2024):
It wasn't the open mic issue though; that is a different issue.
I just did a quick test, and it seems that it still happens in the latest version too (v0.4.7).
To clarify a bit, it only seems to happen when the output text is formatted in the specific way it's shown in the screenshot: narration dialog narration dialog...
It doesn't happen while using the regular [narration+"direct speech"] format, or even [narration+"direct speech"].
Also, I just realized that when there are italics (narration+"dialog"), it doesn't repeat sentences, but the TTS often skips the first italic block (not the following ones).
Seems like it has something to do with how the asterisks are handled when parsing the text for the TTS engine.
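One way to test the asterisk theory is to strip the emphasis markers before the text is split and sent to the TTS engine, then check whether the repeats and skips go away. A minimal Python sketch under that assumption (`strip_emphasis` is a hypothetical helper, not part of Open WebUI, and the regex only covers simple single-line `*...*`, `**...**`, and `***...***` spans):

```python
import re

# Group 1: the opening run of 1-3 asterisks; group 2: the emphasized text.
# The inner text must start and end with a non-space character, so a bare
# "2 * 3" is left alone.
_EMPHASIS = re.compile(r"(\*{1,3})(\S(?:.*?\S)?)\1")


def strip_emphasis(text: str) -> str:
    """Remove asterisk emphasis markers but keep the emphasized words."""
    return _EMPHASIS.sub(r"\2", text)


if __name__ == "__main__":
    print(strip_emphasis('*She waves.* "Hello!" *She sits.*'))
```

If the duplication disappears with the markers removed, that would point at the markdown handling rather than at the TTS backend.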
@nengoxx commented on GitHub (Dec 21, 2024):
I came back to this issue to re-test it after enabling verbose logs that show the actual text being requested on alltalk_tts (using Open WebUI 0.4.7). It actually does seem to request the audio twice.
Edit: You can see the logs on both ends. I tried to mess a bit with the backend code, so that's why it's repeated on the open-webui side.

@nengoxx commented on GitHub (Dec 22, 2024):
I 'fixed' the issue where it sends duplicate requests on my local install, but apparently that doesn't solve the issue. I'll try more things later.
This is my quick fix for not sending dupes to the alltalk server. I just added a set to store the requests and check whether a request is already there.
And then inside the /speech route:
This was just for testing. It should also remove the requests after the whole response, in case you wanna trigger the TTS manually again (via the UI button), or in case there are actually legit repeated lines:
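The idea described above can be sketched as follows. This is not the snippet from the original comment; it is a minimal Python illustration with hypothetical names (`SpeechDeduper`, `should_send`, `reset`), and how it would be wired into the /speech route is left as an assumption.

```python
import hashlib


class SpeechDeduper:
    """Remember which TTS payloads were already sent for the current response."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_send(self, text: str, voice: str = "") -> bool:
        """Return True the first time a (voice, text) pair is seen, else False."""
        key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def reset(self) -> None:
        """Forget everything, e.g. once the whole response has been spoken."""
        self._seen.clear()
```

A route handler would call `should_send(...)` before forwarding a request and `reset()` after the response finishes, so that manual replays still work. As noted above, a plain set also drops legitimately repeated lines within a single response.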
And this is to illustrate that it also skips some formatted text:


This is the original text:
In this case it didn't repeat that specific line, but it did repeat a couple of other parts of the response. Not sure I can delve deeper into it since I have no clue about the front-end tbh, hope this helps.
Edit: forgot to add that the repetition (not the skipping) also sometimes happens when emojis are present. It only seems to skip parts of the text where asterisks are involved.
@VanceVagell commented on GitHub (Jun 28, 2025):
I also have frequent TTS repeated audio, I'm using a locally-hosted Kokoro TTS server. For me, it only happens when using Open WebUI's "call" mode. I never get TTS repeats when using the little icon after the response to play it as a one-off (presumably because in that case Open WebUI sends the entire text at once, instead of chunking it?). Unfortunately this makes call mode fairly unusable for me, since it will say the same parts multiple times and it's confusing and hard to follow.
@VanceVagell commented on GitHub (Jul 20, 2025):
This bug really breaks Open WebUI's "call" mode on a smartphone, since it starts repeating itself ad nauseam in the middle of most non-trivial replies.
A workaround is to change the audio chunking setting to "none" (rather than paragraphs or punctuation), but then you hear no response at all until the entire text is done being generated, which can be unreasonably long for a long response.
I took a stab for a couple hours at trying to fix this one, but I'm not very familiar with the code base and was not successful.
On a smartphone, rather than a laptop or desktop, I really think this mode is important when on-the-go and trying to get a quick verbal answer.