[GH-ISSUE #6021] issue: markdown content being duplicated in TTS #29739

Closed
opened 2026-04-25 04:09:55 -05:00 by GiteaMirror · 8 comments

Originally created by @nengoxx on GitHub (Oct 8, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/6021

Bug Report

Installation Method

Installed via pip on a virtual environment.

Environment

  • Open WebUI Version: tested with 0.3.30 & 0.3.32

  • TTS backend: happens with both Alltalk_tts & openedai_speech

  • Operating System: Windows 11

  • Browser (if applicable): Firefox 131.0 (& Fennec on Android)

Confirmation:

  • [x] I have read and followed all the instructions provided in the README.md.
  • [x] I am on the latest version of both Open WebUI and Ollama.
  • [x] I have included the browser console logs.
  • [x] I have included the Docker container logs.
  • [x] I have provided the exact steps to reproduce the bug in the "Steps to Reproduce" section below.

Expected Behavior:

The TTS should not repeat any sentence in the call.

Actual Behavior:

In every long sentence in which an asterisk (and maybe other special symbols) appears, the audio is repeated.

Description

Bug Summary:
When using the call functionality, whenever there are asterisks in a sentence (like additional narration in italics), the response audio repeats that sentence, and it can happen to all the sentences, several times.

Reproduction Details

Steps to Reproduce:

Let the LLM respond with italics (as narration of actions or similar) while using the call functionality: *narration* dialog
I used https://huggingface.co/MarinaraSpaghetti/NemoMix-Unleashed-12B to reproduce it.
Also asking it to role-play while adding this to the system prompt to get responses like in the screenshot: "Balance direct speech with narrative. Respect this markdown format: direct speech, *actions*."

Logs and Screenshots

Browser Console Logs:
There are no logs about it in the console, and it doesn't request the TTS from the backend twice either.

Screenshots/Screen Recordings (if applicable):
![Screenshot 2024-10-08 175734](https://github.com/user-attachments/assets/2e47ac02-0504-45bb-bad2-96283554c804)

Additional Information

I'm using streaming for the LLM responses.
The 'Fluidly stream large external response chunks' setting and punctuation/paragraph splitting don't seem to have any effect; the bug still happens.
Maybe it has something to do with the text parsing & audio playback while it streams, since it doesn't request more audio clips than needed.


@tjbck commented on GitHub (Oct 8, 2024):

Could you provide us with a more concrete way to reproduce the issue?


@nengoxx commented on GitHub (Oct 8, 2024):

Sure, I used https://huggingface.co/MarinaraSpaghetti/NemoMix-Unleashed-12B to reproduce it.
Also asking it to role-play while adding this to the system prompt to get responses like in the screenshot: "Balance direct speech with narrative. Respect this markdown format: direct speech, *actions*."

It started happening without that prompt and with a different model while I was testing some fine-tunes, but they didn't add italics often, or write several paragraphs with italics in them, and it didn't happen all the time.
For example, if the model responded with a single '*action* dialog' it didn't seem to happen. So the response might need more than one block of italics and a couple of paragraphs for the bug to happen.


@Simi5599 commented on GitHub (Oct 25, 2024):

I think this was the issue that was fixed in the latest release (0.3.33).
In particular, I was having the same issue because the microphone stayed on even when the model was speaking, causing duplicates.


@nengoxx commented on GitHub (Nov 20, 2024):

> I think this was the issue that was fixed in the latest release (0.3.33). In particular, I was having the same issue because the microphone stayed on even when the model was speaking, causing duplicates.

It wasn't the open mic issue tho, that is a different issue.

I just did a quick test, and it seems that it still happens in the latest version too (v0.4.7).

To clarify a bit, it only seems to happen when the output text is formatted in the specific way shown in the screenshot: *narration* dialog *narration* dialog...

It doesn't happen while using the regular [narration+"direct speech"] format, or even [*narration*+"direct speech"].
Also, I just realized that if there are italics (*narration*+"dialog"), while it doesn't repeat sentences, the TTS often skips the first italic block, but not the following ones.

Seems like it has something to do with the asterisks when parsing the text for the TTS engine.
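
If the asterisks really are what trips up the parsing, one thing to try is stripping the emphasis markers before the text is split and sent to the TTS engine. The following is only a minimal sketch of that idea, not Open WebUI's actual code: `strip_emphasis` is a hypothetical helper name, and it handles only paired `*`, `**` and `***` on a single line.

```python
import re

# Hypothetical helper, not part of Open WebUI: removes paired asterisk
# emphasis markers (*, ** or ***) and keeps the text between them.
# The inner text must start and end with a non-space character, so a
# free-standing "2 * 3 * 4" is left untouched.
_EMPHASIS = re.compile(r"(\*{1,3})(\S(?:.*?\S)?)\1")


def strip_emphasis(text: str) -> str:
    """Return text with paired asterisk emphasis markers removed."""
    return _EMPHASIS.sub(r"\2", text)


if __name__ == "__main__":
    sample = '*She nods.* "Hello there." *She waves.* "Bye."'
    print(strip_emphasis(sample))
    # prints: She nods. "Hello there." She waves. "Bye."
```

Whether this would change the repeating or skipping behaviour is untested; it only removes one variable (the markers) from what the splitter sees.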


@nengoxx commented on GitHub (Dec 21, 2024):

I came back to this issue to re-test it after enabling verbose logs that show the actual text being requested on alltalk_tts (using Open WebUI 0.4.7). It actually does seem to request the audio twice.

Edit: You can see the logs on both ends. I tried to mess a bit with the backend code, so that's why it's repeated on the open-webui side.
![Capture](https://github.com/user-attachments/assets/f8769600-c505-4511-a21d-f07ee54637e7)


@nengoxx commented on GitHub (Dec 22, 2024):

I 'fixed' the issue where it sends duplicate requests on my local install, but apparently that **doesn't solve the issue**. I'll try more things later.

This is my quick fix for not sending dupes to the alltalk server. I just added a set that records the requests and checked whether each new one is already there.

processed_requests = set()

And then inside the /speech route:

    # Decode the raw request body and use the whole string as the dedupe key
    # (the name var doesn't seem to work well when comparing the body for some reason?)
    input_text = body.decode("utf-8")
    log.info(f"Input text: {input_text}")

    if input_text in processed_requests:
        log.error(f"Duplicate input detected: {input_text}")
        # Serve the placeholder empty.mp3 instead of forwarding the request again
        return FileResponse(SPEECH_CACHE_DIR.joinpath("empty.mp3"))
        # raise HTTPException(status_code=204, detail="Duplicate request skipped.")
    else:
        processed_requests.add(input_text)

This was just for testing. It should also remove the requests after the whole response, in case you want to trigger the TTS manually again (via the UI button), or in case there are actually legit repeated lines:

processed_requests.discard(input_text)
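
The set-based check and the later cleanup described above can be put together as one small, self-contained sketch. This is an illustration only: `SpeechRequestDeduper` and its method names are hypothetical and are not part of Open WebUI or alltalk_tts.

```python
class SpeechRequestDeduper:
    """Hypothetical helper: remembers TTS payloads seen during one response."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, payload: str) -> bool:
        """Return True if payload was already seen; otherwise record it."""
        if payload in self._seen:
            return True
        self._seen.add(payload)
        return False

    def reset(self) -> None:
        """Forget all payloads, e.g. once the whole response has been spoken."""
        self._seen.clear()


if __name__ == "__main__":
    deduper = SpeechRequestDeduper()
    print(deduper.is_duplicate("Hello there."))  # first time: False
    print(deduper.is_duplicate("Hello there."))  # repeated: True
    deduper.reset()
    print(deduper.is_duplicate("Hello there."))  # after reset: False
```

Note that until `reset()` is called, a legitimately repeated line in the same response would also be dropped, which is the limitation mentioned above.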

And this is to illustrate that it **also skips some formatted text**:
![skipped_text](https://github.com/user-attachments/assets/e54b817d-6918-49cb-a22e-949db80009c0)
This is the original text:
![Screenshot 2024-12-22 163636](https://github.com/user-attachments/assets/78220725-1312-4d80-b630-0798ee1f974c)

In this case it didn't repeat that specific line, but it did repeat a couple of other parts of the response. I'm not sure I can delve deeper into it since I have no clue about the front-end tbh. Hope this helps.

Edit: I forgot to add that the repetition (not the skipping) also sometimes happens when emojis are present. It only seems to skip parts of the text where asterisks are involved.


@VanceVagell commented on GitHub (Jun 28, 2025):

I also get frequent repeated TTS audio. I'm using a locally-hosted Kokoro TTS server. For me, it only happens when using Open WebUI's "call" mode. I never get TTS repeats when using the little icon after the response to play it as a one-off (presumably because in that case Open WebUI sends the entire text at once, instead of chunking it?). Unfortunately this makes call mode fairly unusable for me, since it says the same parts multiple times, which is confusing and hard to follow.


@VanceVagell commented on GitHub (Jul 20, 2025):

This bug really breaks Open WebUI's "call" mode on a smartphone, since it starts repeating itself ad nauseam in the middle of most non-trivial replies.

A workaround is to change the audio chunking setting to "none" (rather than paragraphs or punctuation), but then you hear no response at all until the entire text is done being generated, which can be unreasonably long for a long response.

I took a stab for a couple hours at trying to fix this one, but I'm not very familiar with the code base and was not successful.

On a smartphone, rather than a laptop or desktop, I really think this mode is important when on-the-go and trying to get a quick verbal answer.
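
To make the chunking trade-off above concrete: with punctuation chunking, a client has to decide on every streaming update which sentences are complete and not yet spoken. The sketch below shows one way such bookkeeping could work, by counting the sentences already emitted. It is an assumption-laden illustration and not Open WebUI's implementation; `IncrementalChunker`, `feed` and `flush` are hypothetical names, and the sentence rule is deliberately naive.

```python
import re

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class IncrementalChunker:
    """Hypothetical: hands out each complete sentence of a growing text once."""

    def __init__(self) -> None:
        self._emitted = 0  # number of sentences already handed to the TTS

    def feed(self, text_so_far: str) -> list[str]:
        """Return the sentences that became complete since the last call."""
        parts = _SENTENCE_END.split(text_so_far)
        complete = parts[:-1]  # the last part may still be growing
        new = complete[self._emitted:]
        self._emitted = len(complete)
        return new

    def flush(self, final_text: str) -> list[str]:
        """Return whatever is still unspoken once generation has finished."""
        parts = [p for p in _SENTENCE_END.split(final_text) if p.strip()]
        new = parts[self._emitted:]
        self._emitted = len(parts)
        return new


if __name__ == "__main__":
    chunker = IncrementalChunker()
    print(chunker.feed("Hello there. How"))               # ['Hello there.']
    print(chunker.feed("Hello there. How are you? I"))     # ['How are you?']
    print(chunker.flush("Hello there. How are you? I am fine."))  # ['I am fine.']
```

A counter like this only stays correct if the sentence boundaries of the already-received text never change between updates. If the text is transformed between updates (for example, markup being stripped or re-parsed), the count could drift and a sentence could be emitted twice or skipped; that is a general hazard of this approach, not a confirmed diagnosis of this bug.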

Reference: github-starred/open-webui#29739