feat: backend support for TTS (Bark, etc.) #48

Closed
opened 2025-11-11 14:03:05 -06:00 by GiteaMirror · 32 comments
Owner

Originally created by @oliverbob on GitHub (Nov 22, 2023).

Is it possible to have native support for Bark TTS, or the LangChain version of it, since we already have the microphone prompt?

GiteaMirror added the enhancement label 2025-11-11 14:03:05 -06:00

@tjbck commented on GitHub (Nov 22, 2023):

Hi, thanks for the suggestion. Sounds like an interesting idea; I'll see what I can do about it, but only after I have every previous feature request out of the way. In the meantime, if you could implement a working prototype in Python and provide us with implementation examples, that would be sublime. Thanks.


@walking-octopus commented on GitHub (Dec 2, 2023):

Bark is rather unstable, slow, and overkill for an assistant. [Piper](https://github.com/rhasspy/piper), however, seems fine. It also has [Python support](https://github.com/rhasspy/piper#running-in-python).

I also wonder whether the server or the client should be responsible for TTS... Piper is written in C++, so a WASM port is possible, if desired.


@tjbck commented on GitHub (Dec 5, 2023):

I'll be looking into this in the near future! In the meantime, TTS support has already been implemented with the legacy Web Speech API. Thanks!


@oliverbob commented on GitHub (Dec 26, 2023):

Since we already have the speaker button there, I think we can integrate Piper, since it's lightweight and fast.

The only requirement is that the server has Piper installed via:

`pip install piper-tts`

Directory structure:

```
/flask-piper-app
├── app.py
├── static
│   └── welcome.wav
└── templates
    └── index.html
```

Python:

```python
from flask import Flask, render_template, request, send_file
import subprocess

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/play', methods=['POST'])
def play_text():
    if 'text' in request.form:
        text = request.form['text']

        # Generate the audio file
        generate_audio(text)

        # Return the generated audio file to the client
        return send_file('static/welcome.wav', mimetype='audio/wav', as_attachment=False)

    return render_template('index.html')

def generate_audio(text):
    # Pass the text on stdin instead of interpolating it into a shell
    # command string, which would be vulnerable to shell injection.
    subprocess.run(
        ['piper', '--model', 'en_US-lessac-medium.onnx',
         '--output_file', 'static/welcome.wav'],
        input=text,
        text=True,
        check=True,
    )

if __name__ == '__main__':
    app.run(debug=True, port=5000)
```
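Once the server is running, the `/play` endpoint can be exercised from the command line (port 5000 matches the Flask app above; the text is just a placeholder):

```shell
# Request synthesized speech for a short text and save the WAV response
curl --data-urlencode "text=Hello from Piper" \
     -o welcome.wav \
     http://localhost:5000/play
```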

Here's the HTML (which you can convert to Svelte):

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Flask Piper App</title>
</head>
<body>
    <h4>Welcome to Piper App</h4>
    <p>Click "Play Audio" to hear the synthesized speech.</p>

    <!-- Display the text inside a div for user reference -->
    <div id="displayText">
        This is the text that will be read aloud. You can customize this paragraph.
    </div>

    <form id="textForm" method="post" action="/play">
        <input type="submit" value="Play Audio">
    </form>

    <hr>

    <!-- Audio player to play the generated audio -->
    <audio id="audioPlayer">
        <source id="audioSource" src="" type="audio/wav">
        Your browser does not support the audio element.
    </audio>

    <script>
        // Update the audio source when the form is submitted
        document.getElementById('textForm').addEventListener('submit', function(event) {
            event.preventDefault();
            var text = document.getElementById('displayText').innerText;

            // Make an asynchronous POST request to the /play route
            fetch('/play', {
                method: 'POST',
                headers: {
                    'Content-Type': 'application/x-www-form-urlencoded',
                },
                body: 'text=' + encodeURIComponent(text),
            })
            .then(response => response.blob())
            .then(blob => {
                // Create a Blob URL for the audio source
                var blobUrl = URL.createObjectURL(blob);
                document.getElementById('audioSource').src = blobUrl;

                // Load and play the audio
                document.getElementById('audioPlayer').load();
                document.getElementById('audioPlayer').play();
            })
            .catch(error => console.error('Error:', error));
        });
    </script>
</body>
</html>
```

This way, our model responses will not sound like Stephen Hawking.


@tjbck commented on GitHub (Dec 26, 2023):

I'll actively take a look after #216, but Piper doesn't seem to support macOS. If any of you know any workarounds for this, please let us know. Thanks.

Encountering this issue: https://github.com/rhasspy/piper/issues/203


@tjbck commented on GitHub (Dec 30, 2023):

Let's get the ball rolling on this one! Stay tuned!


@diblasio commented on GitHub (Dec 31, 2023):

If I may, I'd also suggest that this feature have an option to use OpenAI TTS as well, considering there's already a place to input your API key in the UI. Their model sounds more natural for those of us who are attempting to use AI for language learning.
https://platform.openai.com/docs/guides/text-to-speech


@explorigin commented on GitHub (Jan 28, 2024):

Piper will likely support wasm compilation soon which would allow browser-side generation: https://github.com/rhasspy/piper/issues/352


@oliverbob commented on GitHub (Jan 28, 2024):

> Piper will likely support wasm compilation soon which would allow browser-side generation: rhasspy/piper#352

I actually made a pull request that integrated Piper. But I deleted it, since I recall Timothy said it is not well supported on his MacBook, or on Mac in general.

If you want, I can make a Piper integration again, but it would require removing the browser speech-synthesis default, unless someone is kind enough to add a new "Piper button" as a sign that I should put it back. The new speaker icon should differentiate between the browser default and the one used for Piper (I'm not very good at Svelte, but I know quite a lot about JavaScript). The speech would not be browser-generated (not WASM yet): the client would send the prompt response to the server, and the audio generated by Piper on the server would be served back to the browser.

The only downside is that for longer prompts, the rendered audio file will be larger in the most simplified implementation (without a complex compression algorithm).

Let me know, so I can open a new pull request if this is still helpful. Alternatively, we can create a piper branch on this repo for research purposes, so other developers can look at and build on the work. Because, if I'm not mistaken, OpenAI's Whisper server is not free of charge. It's fast, but not free.

Piper is better than Bark: you need a huge GPU to run Bark, and on smaller GPUs it takes hours before Bark can talk back to the user's text prompt. With Piper, a message as long as this comment, with a medium-quality voice, generates between 1 and 5 MB of audio. Piper should be installed where the UI is running; it will then generate the voice on the server in roughly 10 to 30 seconds, sometimes longer, and for longer text it might take a minute. If you run Piper on a GPU, it's as quick as lightning. The only remaining question is how to compress the audio after Piper generates it; I'm sure there are countless developers here who could figure that out on top of the simplest example, since longer text produces more megabytes. Also, the voice model (`--model WHATEVER-medium.onnx`) is quite large (up to 70 MB), so it shouldn't be included in the pull request, but it can be downloaded after starting the Piper Flask server (which can also be included in the Ollama WebUI run script).
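To put rough numbers on that file-size concern: an uncompressed WAV from a Piper medium voice is 16-bit mono PCM at 22,050 Hz (the typical sample rate for `*-medium` voices; treat it as an assumption), so its size is easy to estimate:

```python
SAMPLE_RATE = 22_050   # Hz, typical for Piper *-medium voices (assumption)
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # mono

def wav_size_bytes(seconds: float) -> int:
    """Approximate uncompressed WAV payload size for a given duration."""
    return int(seconds * SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)

# One minute of speech is ~2.6 MB uncompressed, which is why re-encoding
# (e.g. to Opus or MP3) matters for long responses.
print(f"{wav_size_bytes(60) / 1_000_000:.1f} MB per minute")
```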


@tjbck commented on GitHub (Feb 5, 2024):

Backend Piper integration blockers:

- [ ] https://github.com/rhasspy/piper-phonemize/pull/27
- [ ] https://github.com/rhasspy/piper/pull/374


@tjbck commented on GitHub (Feb 6, 2024):

OpenAI TTS support has been added with #656! As for local TTS support, Piper seems promising, so let's wait until they merge the two blocking PRs.


@oliverbob commented on GitHub (Feb 6, 2024):

Thanks Timothy.


@tjbck commented on GitHub (Feb 22, 2024):

The Piper library seems to be unmaintained. Looking for alternatives at the moment; open to suggestions!


@jmtatsch commented on GitHub (Mar 1, 2024):

Piper works well on Mac too, if you build from source and make a tiny change to the CMakeLists.txt 🙈

I am pretty sure @synesthesiam will get around to merging those pull requests, piper seems to be his baby after all.
He is just incredibly busy with all the voice assistant integration for Home Assistant.

I played around with bark.cpp and coqui.ai TTS and both are far too slow to be useful.


@justinh-rahb commented on GitHub (Mar 1, 2024):

I agree; out of the big three projects for local TTS, Piper is probably the best hope we've got. I really don't understand how this particular niche is so devoid of development; it's one of the most asked-for features in any local AI project.


@synesthesiam commented on GitHub (Mar 2, 2024):

Piper is definitely still being maintained! As @jmtatsch said, I've just been busy with other stuff. One thing that's held up development is needing to replace the espeak-ng library due to its license.

I think this niche is fairly devoid of development because very few projects leave the demo stage before the authors are on to the next model/paper. I want Piper to be more of a "boring" technology in the sense that it does a job well without always chasing state-of-the-art.


@justinh-rahb commented on GitHub (Mar 2, 2024):

I very much agree with that part of the unix philosophy: do one thing and do it well. Thanks for the status update @synesthesiam 🙏


@jmtatsch commented on GitHub (Mar 16, 2024):

Can the existing base URL for OpenAI TTS be made configurable?
I found this adapter, https://github.com/matatonic/openedai-speech, which serves an OpenAI TTS API with either Piper or Coqui TTS on the backend.


@lee-b commented on GitHub (Mar 30, 2024):

> Can the existing base url for openai tts be made configurable? I found this adapter https://github.com/matatonic/openedai-speech serving an openai tts api with either piper or coqui TTS the back

This looks very promising. The API seems to work well, and it's a similar Docker-based setup to ollama. I agree: just allowing the OPENAI_BASE_URL for audio to be tweaked would go a long way toward fully local whisper + xtts-v2 with this.


@lee-b commented on GitHub (Mar 31, 2024):

FYI, I made this work with a local openedai-speech (linked above) on my branch, here:

https://github.com/lee-b/open-webui

It currently requires an extra environment variable and uses a custom Dockerfile and runner script to run the thing, but it works. I'll integrate this better if the core team wants to advise on their preferred way to solve some of the issues that I hacked around.


@fraschm1998 commented on GitHub (Apr 3, 2024):

> FYI, I made this work with a local openedai-speech (linked above) on my branch, here:
>
> https://github.com/lee-b/open-webui
>
> It currently requires an extra environment variable and uses a custom docker file and runner script to run the thing, but it works. I'll integrate this better if the core team want to advise on their preferred way to solve some of the issues that I did these things to hack around.

Any way to fix this?

```
Got OPENAI_AUDIO_BASE_URL: http://192.168.10.14:8002/v1
open-webui-two  | ERROR:apps.openai.main:404 Client Error: Not Found for url: http://192.168.10.14:8002/audio/speech
open-webui-two  | Traceback (most recent call last):
open-webui-two  |   File "/app/backend/apps/openai/main.py", line 154, in speech
open-webui-two  |     r.raise_for_status()
open-webui-two  |   File "/usr/local/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
open-webui-two  |     raise HTTPError(http_error_msg, response=self)
open-webui-two  | requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://192.168.10.14:8002/audio/speech
open-webui-two  | INFO:     192.168.10.14:58846 - "POST /openai/api/audio/speech HTTP/1.1" 500 Internal Server Error
open-webui-two  | INFO:     192.168.10.14:53534 - "GET /_app/immutable/nodes/11.76457ae4.js HTTP/1.1" 304 Not Modified
```

Server is running:

```
docker logs openedai-speech-server-1 --follow
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
 > Using model: xtts
INFO:     172.24.0.1:38734 - "POST /v1/audio/speech HTTP/1.1" 200 OK
INFO:     172.24.0.1:41190 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:41196 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:41210 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:41212 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:41220 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:41234 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:41244 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:41252 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:41264 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:41272 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:39624 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:39630 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:39638 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:39652 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:39656 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:39660 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:39674 - "POST /audio/speech HTTP/1.1" 404 Not Found
INFO:     172.24.0.1:39688 - "POST /audio/speech HTTP/1.1" 404 Not Found
```

@fraschm1998 commented on GitHub (Apr 3, 2024):

> FYI, I made this work with a local openedai-speech (linked above) on my branch, here:
>
> https://github.com/lee-b/open-webui
>
> It currently requires an extra environment variable and uses a custom docker file and runner script to run the thing, but it works. I'll integrate this better if the core team want to advise on their preferred way to solve some of the issues that I did these things to hack around.

Fixed with the following, kudos to ChatGPT:

```python
if not base_url.endswith("/"):
    base_url += "/"

speech_url = urljoin(base_url, "audio/speech")
```

> The Python urljoin function is used here to combine base_url with "audio/speech". The urljoin function is designed to intelligently merge two parts of a URL, but its behavior with trailing slashes can sometimes lead to unexpected results. Specifically, if the base URL (base_url) does not end with a slash (/), and the second part begins with one, urljoin might not concatenate the strings in the way you expect, potentially leading to the omission of parts of the path.
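The slash sensitivity of `urljoin` is easy to demonstrate (the host below is just a placeholder):

```python
from urllib.parse import urljoin

# Without a trailing slash, the last path segment of the base is replaced,
# so "/v1" disappears -- the bug seen in the 404 logs above.
assert urljoin("http://host:8002/v1", "audio/speech") == "http://host:8002/audio/speech"

# With a trailing slash, the relative path is appended under "/v1/".
assert urljoin("http://host:8002/v1/", "audio/speech") == "http://host:8002/v1/audio/speech"

# A leading slash on the second argument makes it absolute, dropping "/v1" again.
assert urljoin("http://host:8002/v1/", "/audio/speech") == "http://host:8002/audio/speech"

print("all urljoin cases behave as described")
```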


@jmtatsch commented on GitHub (Apr 4, 2024):

I think it would be best if Open WebUI just let us set a different TTS base URL via an environment variable like OPENAI_TTS_BASE_URL.
That way, users can plug in whatever OpenAI-TTS-compatible server they like, and there are no licensing woes.
And it is very little work, as OpenAI TTS is already implemented and works beautifully 😍
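As a sketch of what such a server-agnostic client would send, here is the request shape of OpenAI's `/v1/audio/speech` endpoint, which adapters like openedai-speech mimic (the base URL, model, voice, and dummy key below are placeholder assumptions):

```python
import json
import urllib.request

def build_speech_request(base_url: str, text: str,
                         model: str = "tts-1", voice: str = "alloy"):
    """Build a POST request for an OpenAI-compatible /audio/speech endpoint."""
    url = base_url.rstrip("/") + "/audio/speech"
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 # Dummy key; local servers typically ignore it.
                 "Authorization": "Bearer sk-111111111"},
        method="POST",
    )

req = build_speech_request("http://localhost:8000/v1", "Hello from a local TTS server")
print(req.full_url)  # http://localhost:8000/v1/audio/speech
```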


@jmtatsch commented on GitHub (Apr 10, 2024):

@tjbck would you be open to the approach taken in https://github.com/lee-b/open-webui, should someone create a pull request?


@hxypqr commented on GitHub (Apr 24, 2024):

Is there a simple way to change the TTS model to my own now? I can't stand the voice of this robot lol.


@jmtatsch commented on GitHub (Apr 24, 2024):

Since cbd18ec you should be able to set your own OpenAI-compatible base URL.


@UXVirtual commented on GitHub (Apr 26, 2024):

In case this helps anyone who is running the open-webui Docker container along with Ollama on the same PC and using openedai-speech, you can use the following configuration:

  • API Base URL: http://host.docker.internal:8000/v1
  • API Key: sk-111111111

host.docker.internal is required since openedai-speech is exposed via localhost on your PC, but open-webui cannot normally access this from within its container.

Note that openedai-speech doesn't need an API key, but setting a dummy one is required because open-webui validates this field.
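Before wiring this into open-webui, you can verify the endpoint directly from the host with a request in OpenAI's /v1/audio/speech format. Note the "tts-1" model and "alloy" voice names are assumptions based on openedai-speech's OpenAI-compatible defaults:

```shell
# From the host PC (not inside the container), so plain localhost works.
# openedai-speech ignores the key, but sending one mirrors the open-webui setup.
curl -s http://localhost:8000/v1/audio/speech \
  -H "Authorization: Bearer sk-111111111" \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello from open-webui!", "voice": "alloy"}' \
  -o speech.mp3
```

If this produces a playable speech.mp3, the only remaining step is swapping localhost for host.docker.internal in open-webui's API Base URL field.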


@jmtatsch commented on GitHub (Apr 29, 2024):

Works wonderfully now.
https://github.com/matatonic/openedai-speech wraps Piper, xtts_v2, and parler-tts, by the way, so there is a good choice of quality and latency trade-offs.


@justinh-rahb commented on GitHub (Apr 29, 2024):

I'll leave it up to @oliverbob to decide to call this issue fixed or not, or I will close it as such in a few days if we don't hear from them.


@ned14 commented on GitHub (Sep 27, 2024):

If you're doing silly things like me, such as running this on a decade-old Haswell-based server, it would feel a lot more natural if the voice started speaking after the first sentence is returned rather than waiting until the entire response has been returned. With Llama 3.2 1B the token output is just about fast enough to outpace the TTS, so this would make everything much smoother.

https://github.com/open-webui/open-webui/issues/478 requested the same thing I noticed. I'm using https://github.com/matatonic/openedai-speech and I can clearly see in its logs that open-webui isn't asking for speech synthesis until the LLM has completed its response. If there were an option to send sentences for speech as soon as it gets them, that would be great.
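The idea requested here can be sketched as a small generator that accumulates streamed LLM tokens and emits each sentence as soon as it is complete, so TTS requests can start early. This is an illustrative sketch, not open-webui's actual code, and the punctuation-based split is a deliberately naive heuristic:

```python
import re

def sentence_chunks(token_stream):
    """Yield complete sentences as soon as they appear in a token stream,
    so speech synthesis can begin before the full LLM response arrives."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush every complete sentence currently sitting in the buffer.
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            sentence, buffer = buffer[:match.end()].strip(), buffer[match.end():]
            yield sentence
    if buffer.strip():  # trailing partial sentence at end of stream
        yield buffer.strip()

# Each yielded sentence would be POSTed to the TTS endpoint and queued
# for playback while the model keeps generating.
tokens = ["Hello ", "there. ", "How are ", "you? ", "Fine"]
for sentence in sentence_chunks(tokens):
    print(sentence)
# -> Hello there.
# -> How are you?
# -> Fine
```

A real implementation would also need to avoid splitting on abbreviations and decimal points, and to overlap synthesis with playback, but the core pipelining idea is this simple.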


@thiswillbeyourgithub commented on GitHub (Sep 27, 2024):

> If you're doing silly things like me, such as running this on a decade-old Haswell-based server, it would feel a lot more natural if the voice started speaking after the first sentence is returned rather than waiting until the entire response has been returned. With Llama 3.2 1B the token output is just about fast enough to outpace the TTS, so this would make everything much smoother.
>
> #478 requested the same thing I noticed. I'm using https://github.com/matatonic/openedai-speech and I can clearly see in its logs that open-webui isn't asking for speech synthesis until the LLM has completed its response. If there were an option to send sentences for speech as soon as it gets them, that would be great.

I completely agree with you. Actually, the code is already there, as call mode already does this. I do think it would merit being available outside of call mode too. (Which is barely usable on vertical screens, including phones, imo.)


@ned14 commented on GitHub (Sep 27, 2024):

> Actually, the code is already there, as call mode already does this. I do think it would merit being available outside of call mode too.

Yes, you're right. My hardware is so slow that the 3B model chokes up the speech synthesis. The 1B model works rather better, though. If I preloaded the voice, I think it might work just fine.

Thanks for the tip.

Reference: github-starred/open-webui#48