feat: audio transcription playground #500

Closed
opened 2025-11-11 14:23:06 -06:00 by GiteaMirror · 20 comments
Owner

Originally created by @g4challenge on GitHub (Mar 19, 2024).

Originally assigned to: @tjbck on GitHub.

Is your feature request related to a problem? Please describe.
I find it challenging when I need to manually transcribe audio content. Whether it’s interviews, meetings, or recorded conversations, having an automated audio transcription feature would significantly improve my workflow.

Describe the solution you’d like
I would like OpenWebUI to include an audio transcription feature. Ideally, it should accept audio files (such as MP3, WAV, or other common formats) and convert them into accurate text transcripts. The transcripts should be time-stamped and easily accessible within the interface.
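To illustrate the time-stamped transcripts requested above, here is a minimal sketch assuming the (start_seconds, end_seconds, text) segment shape that Whisper-style tools commonly return — the segment format is an assumption for illustration, not an Open WebUI API:

```python
def format_timestamp(seconds: float) -> str:
    """Render a position in seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def format_transcript(segments):
    """Turn (start, end, text) segments into time-stamped transcript lines."""
    return "\n".join(
        f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text.strip()}"
        for start, end, text in segments
    )

segments = [
    (0.0, 4.2, "Welcome to the meeting."),
    (4.2, 9.8, "First item: the quarterly report."),
]
print(format_transcript(segments))
```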

Describe alternatives you’ve considered
As an alternative, I've explored third-party Whisper-based transcription services with a UI (https://github.com/chidiwilliams/buzz or https://github.com/jhj0517/Whisper-WebUI), but they often come with installation hurdles, sharing limitations, privacy concerns, and additional cost and effort. Having an integrated solution within OpenWebUI would streamline the process and enhance the overall user experience.

Additional context
Sometimes I participate in remote interviews or attend virtual meetings where audio recordings are essential. Having a built-in transcription feature would save time and effort, allowing me to focus on the content rather than manual transcription tasks. When finished, I would love to have the ability to feed the transcript to an LLM with predefined prompts, e.g. "use the following transcript to create a short, precise summary in bullet points".
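The "predefined prompt" idea could be sketched as wrapping a finished transcript in the standard chat-completions message shape; the prompt text follows the example in this issue, and the message format is the common OpenAI-compatible shape, not a confirmed Open WebUI interface:

```python
# Prompt text taken from the example given in this issue.
SUMMARY_PROMPT = (
    "Use the following transcript to create a short, precise summary "
    "in bullet points."
)

def build_summary_messages(transcript: str):
    """Build a chat-completions message list for summarizing a transcript."""
    return [
        {"role": "system", "content": SUMMARY_PROMPT},
        {"role": "user", "content": transcript},
    ]

messages = build_summary_messages("[00:00:01] Hello everyone ...")
```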

@arjunkrishna commented on GitHub (Apr 26, 2024):

Yes, having audio and video transcription would be a very useful feature.

@arjunkrishna commented on GitHub (Apr 26, 2024):

https://github.com/the-crypt-keeper/tldw

@rexkani commented on GitHub (Oct 21, 2024):

This is one of the main features I was looking for when I installed Open WebUI.

@flefevre commented on GitHub (Oct 23, 2024):

In scientific research, it would be a very good feature to be able to record a meeting, then summarize it and keep it in the workspace. Perhaps it should be compatible with Milvus to store the audio and the notes?

I have used https://github.com/JigsawStack/insanely-fast-whisper-api and
https://github.com/Vaibhavs10/insanely-fast-whisper

@Trapper4888 commented on GitHub (Oct 29, 2024):

To add my 2 cents:
Since Open WebUI already has an integrated Whisper running (with an API option), it feels like a wasted opportunity not to be able to use it directly. The same goes for TTS. I imagine a lot of the code is already there, since both are used behind the scenes.

But I have to acknowledge that Open WebUI is supposed to be a t2t UI, and starting to do STT and TTS may be out of scope and increase complexity. In a perfect world, I would host my own OpenAI-API Whisper Docker container, connect it to the Open WebUI container, and for direct Whisper usage run another container with a proper OpenAI-API-compatible TTS web UI.

Still, it would be very cool to have basic STT and TTS using the microphone and files in Open WebUI.

@hongbo-miao commented on GitHub (Nov 19, 2024):

It would be great to support some common video formats as well, thanks! ☺️

@flefevre commented on GitHub (Dec 6, 2024):

While searching the web I found this project: https://github.com/misbahsy/meetingmind

I wanted to highlight it because they have thought through the user interface to ease the interaction.

Automatic extraction of key information:
Tasks
Decisions
Questions
Insights
Deadlines
Attendees
Follow-ups
Risks
Agenda

The different screenshots are very inspiring.

Hope these elements can help Open WebUI find some key ideas.

@rjmalagon commented on GitHub (Dec 8, 2024):

This is a highly valuable feature.
Tika document text extraction and YouTube transcript extraction already allow working with text from very diverse origins; the latter is more or less an indirect, free Google STT.
A simple audio file upload for direct STT is a good start, but I admit that a more powerful audio transcription tool set, via internal or external tool integration, is a milestone worth waiting on developer time for.

@flefevre commented on GitHub (Dec 10, 2024):

The diarization feature could be important in order to produce summaries with a list of actions assigned to a specific user.
I am just aggregating ideas to help define the perimeter of this highly valuable feature.
Hope it makes sense for the Open WebUI team.
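The diarization idea above could work roughly like this: merge speaker turns (speaker, start, end) from a diarization model with transcript segments by picking the speaker whose turn overlaps each segment the most. The data shapes here are assumptions for illustration; pyannote/WhisperX output differs in detail:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each (start, end, text) segment with the best-overlapping speaker."""
    labeled = []
    for start, end, text in segments:
        best = max(turns, key=lambda t: overlap(start, end, t[1], t[2]), default=None)
        has_overlap = best is not None and overlap(start, end, best[1], best[2]) > 0
        labeled.append((best[0] if has_overlap else "UNKNOWN", text))
    return labeled

turns = [("SPEAKER_00", 0.0, 5.0), ("SPEAKER_01", 5.0, 10.0)]
segments = [(0.5, 4.0, "Let's review the action items."),
            (6.0, 9.0, "I'll take the first one.")]
labeled = assign_speakers(segments, turns)
```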

@lollylan commented on GitHub (Dec 11, 2024):

I would love live transcription from the microphone; this would speed up my workload (transcribing doctor-patient interactions for hands-free documentation) so much. Right now I rely on the Windows 11 voice assistant, but this solution is not good. There are apparently ways of having Whisper listen and transcribe in short intervals (I am a doctor, though, not a programmer, so I cannot implement that myself); this is much better than waiting for the entire interaction to finish before sending it to Whisper. Additionally, if something interrupts the transfer of the transcript file to the server, everything is lost.
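The "short intervals" approach described above amounts to chunking the recording so each piece can be sent to a Whisper endpoint as soon as it is captured, instead of waiting for the whole session. A minimal sketch, with illustrative sample rate and chunk length:

```python
SAMPLE_RATE = 16_000   # Hz; the rate Whisper models typically expect
CHUNK_SECONDS = 30

def chunk_samples(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Yield successive fixed-length windows of a mono PCM sample buffer."""
    step = sample_rate * chunk_seconds
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# 65 seconds of audio -> three chunks: 30 s, 30 s, and a final 5 s remainder
chunks = list(chunk_samples([0] * (SAMPLE_RATE * 65)))
```

Sending each chunk as it completes also limits what is lost if a later transfer fails.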

@T-Herrmann-WI commented on GitHub (Feb 18, 2025):

I also would like to see the audio file transcript feature.

@ALIENvsROBOT commented on GitHub (Feb 20, 2025):

It would also be good if we could make it run in the background: upload the file and the transcription runs in the background. Also, selecting the Whisper model in the playground, and maybe also the language, would be a cool feature.

@flefevre commented on GitHub (Feb 21, 2025):

Recently the project https://github.com/ahmetoner/whisper-asr-webservice integrated WhisperX with a diarization feature as a webservice endpoint.

@T-Herrmann-WI commented on GitHub (Apr 2, 2025):

Dear @tjbck, in what version could the audio file transcript feature be implemented?

@lollylan commented on GitHub (Apr 2, 2025):

Live transcription (or 30-second intervals) from the microphone would be a lifechanger for my use case. I have a project where I enable doctors to transcribe and summarize patient interactions, and this would be invaluable. I use a script I (or ChatGPT) wrote (it can be found here: https://github.com/lollylan/asklaion), but having it in OWUI would be loads better.

@gusman80 commented on GitHub (Apr 2, 2025):

Wouldn't it be possible to upload a file via the Open WebUI API/UI and pass the file data to a custom model that uses a Filter/Functions Python script to call the local Whisper transcription implementation (def inlet)?
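A rough sketch of that Filter idea: an Open WebUI Filter's inlet() runs before the request reaches the model, so it could replace an attached audio file with its transcript. Here `fake_transcribe` is a stand-in for a real call to a local Whisper endpoint, and the exact shape of `body["files"]` is an assumption, not the documented schema:

```python
AUDIO_EXTENSIONS = (".mp3", ".wav", ".m4a", ".ogg")

def fake_transcribe(path: str) -> str:
    # Placeholder for e.g. a faster-whisper or whisper-asr-webservice call.
    return f"<transcript of {path}>"

class Filter:
    def inlet(self, body: dict) -> dict:
        """Append a transcript message for each attached audio file."""
        for f in body.get("files", []):
            name = f.get("name", "")
            if name.lower().endswith(AUDIO_EXTENSIONS):
                body["messages"].append(
                    {"role": "user", "content": fake_transcribe(name)}
                )
        return body

body = {"files": [{"name": "meeting.mp3"}], "messages": []}
result = Filter().inlet(body)
```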

@flefevre commented on GitHub (Apr 2, 2025):

Just to be sure: when you upload an MP3 file, it is picked up by faster-whisper and transcribed automatically.
So for me it is already implemented.

@morbificagent commented on GitHub (Apr 3, 2025):

Not here... if I upload an MP3 with a podcast and ask a question about it, it doesn't know anything about it...

@ALIENvsROBOT commented on GitHub (Apr 3, 2025):

At least audio transcription in the background would also be beneficial for people like researchers.

@tjbck commented on GitHub (May 1, 2025):

Merging this with #5990


Reference: github-starred/open-webui#500