mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-07 19:38:46 -05:00
[GH-ISSUE #1331] feat: Volume, Speech Rate, and Pitch Controls for Text-to-Speech (TTS) Output #51116
Originally created by @silentoplayz on GitHub (Mar 28, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/1331
Problem Description:
The current version of Open WebUI lacks the necessary customization options for the text-to-speech (TTS) output, including volume control, speech rate adjustment, pitch adjustment, and audio playback functionality for speaking out notifications. These limitations hinder the user experience and accessibility of the text-to-speech (TTS) feature.
Describe the solution you'd like:
I propose the implementation of the following features to enhance the TTS output customization:
Alternative solution:
Offer predefined volume, speed, & pitch options instead of a slider for a simpler interface.
Alternatives Considered:
Manually adjusting the device's overall volume or utilizing third-party applications to manipulate speech output and volume settings represents a workaround. However, this solution proves to be inconvenient for users, necessitating the addition of these much-needed features within Open WebUI.
Additional Context:
This feature request focuses on improving the text-to-speech (TTS) feature's accessibility and overall user experience. Implementing these requested features, including volume, speed, and pitch adjustments, will significantly enhance user satisfaction and convenience. It's crucial to maintain compatibility with existing features, ensuring this customization suite does not adversely impact any existing functionalities or behaviors.
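For illustration only, here is a minimal sketch of how the requested volume, rate, and pitch settings (and the simpler "predefined options" alternative) could be modeled. It assumes the browser's Web Speech API as the TTS engine, whose `SpeechSynthesisUtterance` exposes `volume` (0 to 1), `rate` (0.1 to 10), and `pitch` (0 to 2). The names `TtsSettings`, `PRESETS`, `resolveSettings`, and `applySettings` are hypothetical and not taken from the Open WebUI codebase.

```typescript
// Hypothetical settings shape; not part of Open WebUI.
interface TtsSettings {
  volume: number; // 0 to 1
  rate: number; // 0.1 to 10
  pitch: number; // 0 to 2
}

// Ranges and defaults follow the Web Speech API's SpeechSynthesisUtterance.
const LIMITS: { [K in keyof TtsSettings]: [number, number] } = {
  volume: [0, 1],
  rate: [0.1, 10],
  pitch: [0, 2],
};
const DEFAULTS: TtsSettings = { volume: 1, rate: 1, pitch: 1 };

// Hypothetical presets for the "predefined options" alternative.
const PRESETS = {
  quiet: { volume: 0.4 },
  slow: { rate: 0.75 },
  fast: { rate: 1.5 },
  lowPitch: { pitch: 0.7 },
  highPitch: { pitch: 1.4 },
};

function clamp(value: number, [min, max]: [number, number]): number {
  return Math.min(max, Math.max(min, value));
}

// Merge user overrides with the defaults and clamp each value to its range.
function resolveSettings(overrides: Partial<TtsSettings> = {}): TtsSettings {
  return {
    volume: clamp(overrides.volume ?? DEFAULTS.volume, LIMITS.volume),
    rate: clamp(overrides.rate ?? DEFAULTS.rate, LIMITS.rate),
    pitch: clamp(overrides.pitch ?? DEFAULTS.pitch, LIMITS.pitch),
  };
}

// Copy the resolved values onto any object with the three properties.
// In a browser, a SpeechSynthesisUtterance fits this shape.
function applySettings<T extends TtsSettings>(
  target: T,
  overrides: Partial<TtsSettings> = {},
): T {
  return Object.assign(target, resolveSettings(overrides));
}
```

In a browser, one would create `new SpeechSynthesisUtterance(text)`, pass it through `applySettings`, and hand it to `window.speechSynthesis.speak()`. A slider UI would pass raw slider values as overrides, while the preset UI would pass an entry such as `PRESETS.fast`. Other TTS backends would need their own mapping.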
@dannyl1u commented on GitHub (Apr 8, 2024):
I think this would be a good feature. How does this look for the UI?
@silentoplayz commented on GitHub (Apr 8, 2024):
That looks good to me @dannyl1u, although, do you think the sliders could take on a similar form as the model advanced parameter sliders? I only ask because I feel that tjbck would step in to ask the same thing eventually or even make the adjustment himself.
P.S: Thank you for your contributions to Open WebUI!
@dannyl1u commented on GitHub (Apr 8, 2024):
Yes! Thanks for the suggestion, I forgot those sliders existed 😆 , that's definitely the better UI and I'll reuse that!
@UXVirtual commented on GitHub (Apr 26, 2024):
@dannyl1u another challenge with TTS output I've noticed is that generated markdown code blocks are spoken out audibly.
Making this a toggle option, and stripping the code block prior to the extractSentences() call if it is toggled on, would help with coding assistant use-cases.
@littledot2020 commented on GitHub (May 30, 2024):
@thiswillbeyourgithub commented on GitHub (Aug 25, 2024):
I definitely think there needs to be a button that, when clicked, makes those options appear in call mode. This way you can adjust the settings without having to exit call mode. Currently, call mode is fairly unusable in my opinion because of the lack of on-the-fly customizability. Also, there definitely needs to be an option to display the output as it's being read and received from the LLM. In many situations, the text-to-speech can fail on just a few words, and having to exit call mode just to read the missing snippet is really an issue.
@silentoplayz commented on GitHub (Sep 20, 2024):
Related - https://github.com/open-webui/open-webui/pull/5509 (speech playback speed control for Call mode)
Edit: I just found out that the TTS speed can finally be adjusted!
@Cdddo commented on GitHub (Sep 30, 2024):
The TTS speed and volume should be accessible from the chat view, which is where it is needed most.
@Winor commented on GitHub (Oct 29, 2024):
I agree, we should have a tts media player.
@kyunwang commented on GitHub (Jan 13, 2025):
Started a PR to allow for these two controls in chat view including volume control specifically. #8512
How is that for the UI?
@stefancrain commented on GitHub (Jan 27, 2025):
In the hopes that this helps others finding this issue in the short term, this comment enabled me to configure the TTS speedup I needed.
I can work on a PR to add the user-based playback speed instructions to the docs for Kokoro-FastAPI-integration (my use-case). As this change is not specific to that TTS provider, is there a better home for that guide?
@emory commented on GitHub (Feb 7, 2025):
I don't know if I could get some support for something related to this or if I should create a different enhancement request. It's a different interface for a user to adjust and alter the resulting audio output of a TTS service using a prompt.
I really hate to use this for an example, but it's the best implementation of what I'm curious about implementing or enabling. In the customization options for a Rabbit R1 when the user is logged into Rabbithole [^hole], you can prompt the user interface elements of responses to queries ("make it look like a tricorder"), and you can also prompt the voice.
I have a favorite, so here's what I do: in the Rabbit config prompt input box, I put a prompt. Type of affect, speech characteristics or linguistic traits, but you can use colloquialisms too; you don't need to be a speech pathologist, especially for basic effects processing operations like changes to the speed and/or pitch of the output. The result of this prompting is actually really interesting. I have one I use that makes Shimmer sound, in my opinion, like a dead ringer for Dr. Sharon Fieldstone on Apple TV+'s Ted Lasso. I usually use masculine voices so my kids get used to bossing men around, but I struck gold. (hint: the first thing I prompt is "with a posh accent," which it interprets as, I believe, Transatlantic and/or Upper RP, but like I said, colloquialisms work too.)
So obviously this could also be used to define whatever attributes you want that diverge from the typical: "17% faster tempo, clear enunciation, use alliteration where you can and have fun with it, you're from new orleans and it's mardi gras!"
…rather than having to dial/slide things in?
Am I in the wrong place? I don't want to open unnecessary issues but I might be too late to this party.
//emory
[hole]: a portal/playground environment where you do irresponsible things in an environment of dubious integrity that you have to assume is completely public.