[GH-ISSUE #1331] feat: Volume, Speech Rate, and Pitch Controls for Text-to-Speech (TTS) Output #27978

Closed
opened 2026-04-25 02:44:56 -05:00 by GiteaMirror · 12 comments

Originally created by @silentoplayz on GitHub (Mar 28, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/1331

Problem Description:
The current version of Open WebUI lacks the necessary customization options for the text-to-speech (TTS) output, including volume control, speech rate adjustment, pitch adjustment, and audio playback functionality for speaking out notifications. These limitations hinder the user experience and accessibility of the text-to-speech (TTS) feature.

Describe the solution you'd like:
I propose the implementation of the following features to enhance the TTS output customization:

  1. A volume control slider to adjust the volume of the TTS output.
  2. A "Speech Rate" slider to adjust the speed of the TTS output.
  3. A "Pitch" slider enabling users to modify the voice pitch of the TTS output.
  4. An option to enable or disable audio playback for speaking out notifications.
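
For reference, the first three items map closely onto properties that the browser's Web Speech API exposes on `SpeechSynthesisUtterance` (`volume`, `rate`, and `pitch`). The sketch below is only an assumption about how slider values could be sanitized and applied, not Open WebUI's actual code; `TTSSettings`, `UtteranceLike`, `clampSettings`, and `applySettings` are hypothetical names introduced for illustration.

```typescript
// Hypothetical settings shape; the names are illustrative, not from Open WebUI.
interface TTSSettings {
  volume: number; // 0 (silent) to 1 (full)
  rate: number; // 0.1 to 10, where 1 is normal speed
  pitch: number; // 0 to 2, where 1 is the default pitch
}

// Minimal structural type matching the relevant SpeechSynthesisUtterance fields,
// so the logic can be exercised outside a browser.
interface UtteranceLike {
  volume: number;
  rate: number;
  pitch: number;
}

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

// Keep slider values inside the ranges the Web Speech API defines.
function clampSettings(s: TTSSettings): TTSSettings {
  return {
    volume: clamp(s.volume, 0, 1),
    rate: clamp(s.rate, 0.1, 10),
    pitch: clamp(s.pitch, 0, 2),
  };
}

// Copy the sanitized values onto an utterance (or any object with the same fields).
function applySettings<T extends UtteranceLike>(utterance: T, s: TTSSettings): T {
  const safe = clampSettings(s);
  utterance.volume = safe.volume;
  utterance.rate = safe.rate;
  utterance.pitch = safe.pitch;
  return utterance;
}
```

In a browser, the intended use would be along the lines of `speechSynthesis.speak(applySettings(new SpeechSynthesisUtterance(text), settings))`. This only covers the browser's built-in voices; audio generated by an external TTS engine would need a different mechanism.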

Alternative solution:
Offer predefined volume, speed, and pitch options instead of sliders for a simpler interface.

Alternatives Considered:
Manually adjusting the device's overall volume, or using third-party applications to manipulate speech output and volume settings, is a workaround. However, it is inconvenient for users, which is why these controls are needed within Open WebUI itself.

Additional Context:
This feature request focuses on improving the text-to-speech (TTS) feature's accessibility and overall user experience. Implementing these requested features, including volume, speed, and pitch adjustments, will significantly enhance user satisfaction and convenience. It's crucial to maintain compatibility with existing features, ensuring this customization suite does not adversely impact any existing functionalities or behaviors.

GiteaMirror added the enhancement, good first issue, help wanted, and core labels 2026-04-25 02:44:57 -05:00

@dannyl1u commented on GitHub (Apr 8, 2024):

I think this would be a good feature. How does this look for the UI?

image

@silentoplayz commented on GitHub (Apr 8, 2024):

That looks good to me @dannyl1u, although, do you think the sliders could take on a similar form as the model advanced parameter sliders? I only ask because I feel that tjbck would step in to ask the same thing eventually or even make the adjustment himself.

Screenshot 2024-03-16 141517

P.S: Thank you for your contributions to Open WebUI!


@dannyl1u commented on GitHub (Apr 8, 2024):

> That looks good to me @dannyl1u, although, do you think the sliders could take on a similar form as the model advanced parameter sliders? I only ask because I feel that tjbck would step in to ask the same thing eventually or even make the adjustment themselves.
>
> Screenshot 2024-03-16 141517
>
> P.S: Thank you for your contributions to Open WebUI!

Yes! Thanks for the suggestion, I forgot those sliders existed 😆 , that's definitely the better UI and I'll reuse that!


@UXVirtual commented on GitHub (Apr 26, 2024):

@dannyl1u another challenge with TTS output I've noticed is that generated markdown code blocks are read aloud.

Making this a toggle option, and stripping the code blocks before the extractSentences() call when it is toggled on, would help with coding-assistant use cases.
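
As a rough illustration of the stripping step, the sketch below removes fenced code blocks from a markdown string before it is handed to sentence extraction. It is a minimal sketch under assumptions: `stripCodeBlocks` is a hypothetical helper, only fenced blocks (backtick or tilde fences) are handled, and nothing here reflects how `extractSentences()` is actually implemented or what arguments it takes.

```typescript
// Hypothetical helper: drop fenced code blocks so they are not read aloud.
// Inline code spans and indented code blocks are left untouched.
function stripCodeBlocks(markdown: string): string {
  const kept: string[] = [];
  // Holds the opening fence marker while we are inside a code block.
  let fence: string | null = null;

  for (const line of markdown.split("\n")) {
    if (fence === null) {
      // An opening fence is three or more backticks or tildes, optionally
      // followed by an info string such as a language name.
      const open = /^ {0,3}(`{3,}|~{3,})/.exec(line);
      if (open) {
        fence = open[1];
      } else {
        kept.push(line);
      }
    } else {
      // A closing fence uses the same character, is at least as long as the
      // opening fence, and has nothing but whitespace after it.
      const close = /^ {0,3}(`{3,}|~{3,})\s*$/.exec(line);
      if (close && close[1][0] === fence[0] && close[1].length >= fence.length) {
        fence = null;
      }
    }
  }
  // If a fence is never closed (for example mid-stream), the remainder is dropped.
  return kept.join("\n");
}
```

With the toggle on, the caller would pass `stripCodeBlocks(content)` to `extractSentences()` instead of the raw content; with it off, the content would go through unchanged.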


@littledot2020 commented on GitHub (May 30, 2024):

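> @dannyl1u another challenge with TTS output I've noticed is that generated markdown code blocks are read aloud.
>
> Making this a toggle option, and stripping the code blocks before the extractSentences() call when it is toggled on, would help with coding-assistant use cases.

I would also like to know how to play back the content as it is formatted after the markdown has been converted.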


@thiswillbeyourgithub commented on GitHub (Aug 25, 2024):

I definitely think there needs to be a button in call mode that makes those options appear when you click it. That way you can adjust the settings without having to exit call mode. Currently, call mode is fairly unusable in my opinion because it cannot be customized on the fly. Also, there definitely needs to be an option to display the output as it is being received from the LLM and read aloud. In many situations the text-to-speech can fail on just a few words, and having to exit call mode just to read the missing snippet is really an issue.


@silentoplayz commented on GitHub (Sep 20, 2024):

Related - https://github.com/open-webui/open-webui/pull/5509 (speech playback speed control for Call mode)

Edit: I just found out that the TTS speed can finally be adjusted!


@Cdddo commented on GitHub (Sep 30, 2024):

The TTS speed and volume should be accessible from the chat view, which is where it is needed most.


@Winor commented on GitHub (Oct 29, 2024):

> The TTS speed and volume should be accessible from the chat view, which is where it is needed most.

I agree, we should have a TTS media player.


@kyunwang commented on GitHub (Jan 13, 2025):

> • A "Speech Rate" slider to adjust the speed of the TTS output.

Started a PR to add these two controls to the chat view, including volume control specifically: #8512

How is that for the UI?

Screenshot 2025-01-12 at 21 01 45
Screenshot 2025-01-12 at 23 17 00


@stefancrain commented on GitHub (Jan 27, 2025):

In the hopes that this helps others who find this issue in the short term, this comment (https://github.com/remsky/Kokoro-FastAPI/issues/75#issuecomment-2605260554) enabled me to configure the TTS speedup I needed.

> ... each user can configure the playback speed of TTS in their own settings menu (Click profile image > Settings > Audio > Speech Playback Speed). ...

I can work on a PR to add the user-based playback speed instructions to the docs for Kokoro-FastAPI-integration (https://docs.openwebui.com/tutorials/text-to-speech/Kokoro-FastAPI-integration), which is my use case. As this change is not specific to that TTS provider, is there a better home for that guide?
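
For readers wondering how a per-user playback speed could apply to audio produced by an external TTS engine, one plausible client-side mechanism is the standard `playbackRate` and `volume` properties on an HTML audio element. The sketch below is an assumption, not a description of how Open WebUI implements the Speech Playback Speed setting; `AudioLike` and `applyPlaybackSettings` are hypothetical names, and the 0.25 to 4 speed range is an arbitrary illustrative choice.

```typescript
// Minimal structural type matching the HTMLMediaElement fields used here,
// so the logic can be exercised without a DOM.
interface AudioLike {
  playbackRate: number;
  volume: number;
  preservesPitch?: boolean;
}

// Hypothetical helper: apply a user's speed and volume preferences to an audio element.
function applyPlaybackSettings<T extends AudioLike>(audio: T, speed: number, volume: number): T {
  // The 0.25 to 4 range is an assumption made for this sketch.
  audio.playbackRate = Math.min(4, Math.max(0.25, speed));
  // HTMLMediaElement.volume only accepts values from 0 to 1.
  audio.volume = Math.min(1, Math.max(0, volume));
  // Keep the voice at its original pitch when the speed changes.
  audio.preservesPitch = true;
  return audio;
}
```

In a browser this could be called as `applyPlaybackSettings(new Audio(url), 1.5, 0.8)` before `play()`. Note that this changes playback speed only; it does not offer an independent pitch control.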


@emory commented on GitHub (Feb 7, 2025):

I don't know if I could get some support for something related to this or if I should create a different enhancement request. It's a different interface for a user to adjust and alter the resulting audio output of a TTS service using a prompt.

I really hate to use this for an example, but it's the best implementation of what I'm curious about implementing or enabling. In the customization options for a Rabbit R1 when the user is logged into Rabbithole [^hole], you can prompt the user interface elements of responses to queries ("make it look like a tricorder"), and you can also prompt the voice.

I have a favorite, so here's what I do: in the Rabbit config prompt input box, I put a prompt describing the type of affect, speech characteristics, or linguistic traits. You can use colloquialisms too; you don't need to be a speech pathologist, especially for basic effects-processing operations like changes to the speed and/or pitch of the output. The result of this prompting is actually really interesting. I have one I use that makes Shimmer sound, in my opinion, like a dead ringer for Dr. Sharon Fieldstone on Apple TV+'s Ted Lasso. I usually use masculine voices so my kids get used to bossing men around, but I struck gold. (Hint: the first thing I prompt is "with a posh accent," which it interprets as, I believe, Transatlantic and/or Upper RP, but like I said, colloquialisms work too.)

So obviously this could also be used to define whatever attributes you want that diverge from the typical: "17% faster tempo, clear enunciation, use alliteration where you can and have fun with it, you're from New Orleans and it's Mardi Gras!"

…rather than having to dial/slide things in?

Am I in the wrong place? I don't want to open unnecessary issues but I might be too late to this party.

//emory

[^hole]: a portal/playground environment where you do irresponsible things in an environment of dubious integrity that you have to assume is completely public.

Reference: github-starred/open-webui#27978