[PR #6894] [CLOSED] Removing Silence from Audio Files for Local OpenAI/Whisper Models to Prevent Hallucinations #8770

Closed
opened 2025-11-11 18:05:34 -06:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/6894
Author: @AliveDedSec
Created: 11/13/2024
Status: Closed

Base: main ← Head: main


📝 Commits (4)

  • fd1df97 Add files via upload
  • 6d077ec Delete backend/open_webui/apps/audio/main.py
  • 7016bf4 Rename main1.py to main.py
  • 100d11f Update main.py

📊 Changes

1 file changed (+47 additions, -12 deletions)

View changed files

📝 backend/open_webui/apps/audio/main.py (+47 -12)

📄 Description

Proposal for Enhancement in Open WebUI: Silence Removal in Audio Files Before Processing with Whisper Model to Improve Speech Recognition Quality

Dear Developers,

In the current version of Open WebUI (v0.3.35), using local Whisper models for continuous real-time communication can lead to issues. When the Call mode (headphones icon to the right of the microphone) is activated and the user steps away from the computer, the model often listens to extended periods of silence. This situation results in Whisper generating random, nonsensical output when interaction resumes, instead of a meaningful response.

To address this issue, I developed code that removes silence from audio files before they are processed by the model. This avoids "hallucinations" and greatly improves speech recognition quality, allowing smooth, meaningful interaction with the Open WebUI voice assistant without unwanted noise or random text. It finally let me communicate in voice assistant mode free of Whisper hallucinations!

Solution Overview:

  1. Silence Removal. The code segments the audio file and removes silent sections in each fragment. It applies a volume threshold and a minimum silence duration to avoid cutting out essential pauses and to improve processing efficiency.
  2. Parallel Processing. To speed up the process, it performs parallel processing on the audio chunks, allowing for efficient handling of longer recordings.
  3. Result Generation. The silence-free audio is then fed into the model, generating a JSON file with the transcription, which delivers highly accurate speech recognition and clear transcription output.
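The first two steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration of the idea, not the code from the PR; the frame size, RMS threshold, and function names are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
import array
import math


def frame_rms(frame):
    """Root-mean-square volume of one frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def remove_silence(samples, rate, frame_ms=30, threshold=500, min_silence_ms=300):
    """Drop any run of frames that stays below `threshold` RMS for at
    least `min_silence_ms`; shorter quiet stretches (natural pauses in
    speech) are kept so the transcript is not distorted."""
    frame_len = int(rate * frame_ms / 1000)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    min_silent_frames = max(1, min_silence_ms // frame_ms)

    # Step 2 (parallel processing): score the frames concurrently.
    with ThreadPoolExecutor() as pool:
        silent = [rms < threshold for rms in pool.map(frame_rms, frames)]

    out = array.array('h')
    quiet_run = []
    for frame, is_quiet in zip(frames, silent):
        if is_quiet:
            quiet_run.append(frame)
            continue
        if len(quiet_run) < min_silent_frames:  # short pause: keep it
            for q in quiet_run:
                out.extend(q)
        quiet_run = []
        out.extend(frame)
    if len(quiet_run) < min_silent_frames:  # trailing short pause
        for q in quiet_run:
            out.extend(q)
    return out
```

With pure-Python frame scoring the thread pool gains little because of the GIL; it only illustrates the structure, which pays off once the scoring is NumPy- or C-backed. For step 3, the cleaned samples would be written back to a WAV buffer and handed to the Whisper model, whose transcription Open WebUI returns as JSON.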

My code was developed specifically for Open WebUI v0.3.35, and I cannot directly submit it to the main development branch, as there are significant differences in code structure that would require adaptation. However, implementing a similar solution in the latest version of WebUI would be extremely beneficial.

Optimization Recommendation: Silence removal could be accelerated using GPU processing, which would offer a significant boost in real-time applications. GPU-based parallel processing would enhance the speed of audio filtering, delivering high-quality and efficient user interaction.
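As a rough illustration of that recommendation, the per-frame volume scoring vectorizes naturally onto a GPU. This hypothetical sketch assumes PyTorch (not something the PR itself uses) and falls back to CPU when no GPU is present:

```python
import torch


def silent_mask(samples: torch.Tensor, frame_len: int, threshold: float) -> torch.Tensor:
    """Score every frame's RMS volume in a few batched tensor ops and
    return a boolean mask of silent frames. `samples` is 1-D float."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = samples.to(device)
    n = (x.numel() // frame_len) * frame_len       # drop the ragged tail
    frames = x[:n].reshape(-1, frame_len)          # (num_frames, frame_len)
    rms = frames.pow(2).mean(dim=1).sqrt()         # one RMS per frame, in parallel
    return (rms < threshold).cpu()
```

The mask would then drive the same keep/drop logic as the CPU version, replacing the per-frame Python loop with a single batched kernel launch.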

Thank you for your hard work!


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-11 18:05:34 -06:00

Reference: github-starred/open-webui#8770