Reducing hallucinations in local Whisper models for OpenWebUI v0.5.1 by removing silence from audio files #3107

Closed
opened 2025-11-11 15:22:46 -06:00 by GiteaMirror · 7 comments
Owner

Originally created by @AliveDedSec on GitHub (Dec 26, 2024).

for OpenWebUI v0.5.1

Dear users, if you are experiencing "hallucinations" with local Whisper speech recognition models, I suggest trying the following workaround, which may improve the situation.

The current implementation records audio even when no sound is coming through the microphone, which causes problems when Whisper models process such files, especially in whisper mode. It also needlessly wears the hard drive with writes, even when the room is completely silent.

The proposed change is not a complete fix, but it attempts to improve speech recognition results. A more elegant approach would be to not record at all when the microphone picks up no sound. To achieve this, the web frontend of OpenWebUI would need a configurable silence threshold so that audio below that threshold is never recorded. Unfortunately, it is not currently known how to do this correctly, and it would be best if the developers implemented this functionality.
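To illustrate the kind of threshold check such a frontend feature would need, here is a minimal standard-library sketch (the function names `rms_dbfs` and `is_silent` are illustrative, not part of OpenWebUI): it computes the RMS level of a 16-bit PCM buffer in dBFS and compares it against a silence threshold like the -40 dB value used later in this thread.

```python
import math
import struct

def rms_dbfs(pcm_bytes: bytes) -> float:
    """Return the RMS level of 16-bit little-endian mono PCM in dBFS.

    Returns -inf for an empty buffer or pure digital silence.
    """
    samples = struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes)
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / 32768)

def is_silent(pcm_bytes: bytes, thresh_db: float = -40.0) -> bool:
    """Decide whether a captured buffer falls below the silence threshold."""
    return rms_dbfs(pcm_bytes) < thresh_db

# A buffer of zeros (silence) vs. a loud square wave
silence = struct.pack("<4h", 0, 0, 0, 0)
loud = struct.pack("<4h", 20000, -20000, 20000, -20000)
print(is_silent(silence))  # True
print(is_silent(loud))     # False
```

A real implementation would run this check on each short capture buffer and simply skip writing buffers that test silent.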

I have created a modified version of the audio.py file that removes silence from audio files for local models before sending them to the speech recognition model.

[audio.py.txt](https://github.com/user-attachments/files/18252628/audio.py.txt)
The attached file is named audio.py.txt; rename it to audio.py before copying it into the OpenWebUI project.

The original file that needs to be modified is located in the project repository at this link: https://github.com/open-webui/open-webui/blob/main/backend/open_webui/routers/audio.py

Instructions for replacing the audio.py file in Docker:

Open a terminal and make sure the Docker container is running (docker ps).

Use the docker cp command to copy the audio.py file into the container:

docker cp /local/path/to/audio.py CONTAINER_NAME:/app/backend/open_webui/routers/

Replace /local/path/to/audio.py with the path to the file on your computer, and CONTAINER_NAME with the name or ID of the container.

For local installation without Docker:

Copy the audio.py file to the backend/open_webui/routers/ folder of your local project repository.

After each copy of the modified file, restart the Docker container for the changes to take effect.

If you encounter errors or want to revert to the original state, simply copy the audio.py file from the official project repository and repeat the copy operation as described above.

Please note that you may have different paths to this file in your projects. Make sure you copy the modified file to the correct location, replacing the original file.

I hope this solution helps you improve your work with Whisper models.

Author
Owner

@AliveDedSec commented on GitHub (Dec 26, 2024):

I have made the following changes to the file:

Imported the necessary modules (in addition to the multiprocessing pool, the code below needs pydub's AudioSegment and split_on_silence):

from multiprocessing import Pool, cpu_count
from pydub import AudioSegment
from pydub.silence import split_on_silence

Added a function remove_silence_from_chunk that removes silence from an audio chunk:

def remove_silence_from_chunk(audio_chunk):
    """Removes silence from an audio chunk."""
    return split_on_silence(
        audio_chunk,
        min_silence_len=500,  # minimum length of silence for splitting
        silence_thresh=-40  # silence loudness threshold
    )

Added a function remove_silence that removes silence from an audio file by splitting it into chunks and processing them in parallel:

def remove_silence(audio):
    """Removes silence from an audio file by splitting it into chunks and processing them in parallel."""
    chunk_size = 500  # chunk size in milliseconds
    chunks = [audio[i:i+chunk_size] for i in range(0, len(audio), chunk_size)]
    
    # Parallel processing of audio chunks
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(remove_silence_from_chunk, chunks)
    
    # Merging chunks without silence
    non_silent_segments = [segment for chunk in results for segment in chunk]
    return sum(non_silent_segments) if non_silent_segments else audio
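Two notes on the merge step above. The final sum() works because pydub's AudioSegment supports being summed starting from 0, concatenating the segments in order. Also, since the audio is pre-cut into fixed 500 ms chunks, a pause that straddles a chunk boundary may be shorter than min_silence_len within each individual chunk and therefore survive removal. The flattening pattern itself can be checked in isolation with plain lists standing in for AudioSegment objects:

```python
# Simulated output of split_on_silence for three chunks: the first chunk
# yielded two speech segments, the second was pure silence (no segments),
# the third yielded one speech segment.
results = [["seg_a1", "seg_a2"], [], ["seg_b1"]]

# Same flattening pattern as in remove_silence, order preserved
non_silent_segments = [segment for chunk in results for segment in chunk]
print(non_silent_segments)  # ['seg_a1', 'seg_a2', 'seg_b1']
```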

Modified the transcribe function to remove silence from the audio file before transcription:

def transcribe(request: Request, file_path):
    print("transcribe", file_path)
    filename = os.path.basename(file_path)
    file_dir = os.path.dirname(file_path)
    id = filename.split(".")[0]

    if request.app.state.config.STT_ENGINE == "":
        if request.app.state.faster_whisper_model is None:
            request.app.state.faster_whisper_model = set_faster_whisper_model(
                request.app.state.config.WHISPER_MODEL
            )

        model = request.app.state.faster_whisper_model
        
        # Load and process the audio file
        audio = AudioSegment.from_file(file_path)
        processed_audio = remove_silence(audio)  # Remove silence from audio
        
        # Save the processed audio to a temporary file
        temp_path = f"{file_dir}/{id}_processed.wav"
        processed_audio.export(temp_path, format="wav")
        
        segments, info = model.transcribe(temp_path, beam_size=5)
        log.info(
            "Detected language '%s' with probability %f"
            % (info.language, info.language_probability)
        )

        transcript = "".join([segment.text for segment in list(segments)])
        data = {"text": transcript.strip()}

        # Save the transcript to a json file
        transcript_file = f"{file_dir}/{id}.json"
        with open(transcript_file, "w") as f:
            json.dump(data, f)

        log.debug(data)
        return data

These changes introduce the functionality of removing silence from audio files before they are sent to the Whisper speech recognition model, which can help reduce "hallucinations" and improve transcription accuracy.
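One caveat: the modified transcribe above writes a temporary `{id}_processed.wav` next to the original but never deletes it. A hedged sketch of one way to guarantee cleanup (the helper name `transcribe_with_cleanup` and its stub arguments are illustrative, not part of the patch):

```python
import os

def transcribe_with_cleanup(model, processed_audio, file_dir, file_id):
    """Export the processed audio, transcribe it, and always delete the
    temporary file afterwards, even if transcription raises.

    `model` and `processed_audio` stand in for the faster-whisper model
    and pydub AudioSegment used above; only the cleanup pattern matters.
    """
    temp_path = os.path.join(file_dir, f"{file_id}_processed.wav")
    processed_audio.export(temp_path, format="wav")
    try:
        segments, info = model.transcribe(temp_path, beam_size=5)
        transcript = "".join(segment.text for segment in segments)
        return {"text": transcript.strip()}
    finally:
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

The try/finally ensures the temporary file is removed even when model.transcribe fails, so repeated recordings do not accumulate on disk.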

Author
Owner

@AliveDedSec commented on GitHub (Dec 26, 2024):

[audio.py.txt](https://github.com/user-attachments/files/18252965/audio.py.txt)

Improved speech recognition by handling empty and short audio fragments

This commit enhances the speech recognition process by:

  • Returning an empty result for empty audio files
  • Discarding audio shorter than 0.5 seconds after silence removal

Use this updated version for better accuracy and fewer hallucinations.

Enhanced handling of empty and short audio fragments in speech recognition

This commit introduces improvements to the speech recognition process to better handle empty and short audio fragments:

  1. Empty audio handling: If the audio is empty after silence removal, the code now returns an empty result without sending it to the model. This avoids unnecessary processing of empty audio files.

  2. Minimum audio duration check: After removing silence, the code checks if the remaining audio duration is above a specified minimum threshold (currently set to 0.5 seconds). If the audio is too short, it is discarded, and an empty result is returned. This prevents the model from processing very short, potentially meaningless audio fragments.

These improvements should enhance the speech recognition accuracy and reduce the occurrence of hallucinations when processing empty or very short audio recordings.
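The duration guard described in point 2 can be sketched in isolation (the helper name `should_transcribe` is illustrative; pydub's `len(audio)` returns milliseconds, which is the value passed in here):

```python
MIN_DURATION_SECONDS = 0.5  # discard anything shorter after silence removal

def should_transcribe(duration_ms: int, min_seconds: float = MIN_DURATION_SECONDS) -> bool:
    """Return False for audio that is empty or too short to be worth
    sending to the model after silence removal."""
    return duration_ms / 1000.0 >= min_seconds

print(should_transcribe(0))     # False: empty after silence removal
print(should_transcribe(300))   # False: 0.3 s, below the 0.5 s floor
print(should_transcribe(1200))  # True: 1.2 s of speech
```

When the guard returns False, the transcribe path would return `{"text": ""}` immediately instead of invoking the model.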

Instructions for adjusting the silence removal parameters

These instructions explain how the main silence removal parameters in the code work and how changing them affects the result, so you can tune the silence removal process to your needs.

Main parameters:

  1. min_silence_len - the minimum duration of a pause in milliseconds that is considered "silence". Pauses shorter than the specified value will not be removed.
  2. silence_thresh - the volume threshold in decibels below which the sound is considered "silence". Audio segments with a volume below this value will be removed.
  3. chunk_size - the size of the segments in milliseconds into which the audio will be divided for parallel processing.
  4. MIN_DURATION_SECONDS - the minimum allowed duration of the audio in seconds after silence removal. If the audio is shorter than the specified value after processing, the function will return an empty result.

Examples and the impact of parameter changes:

  1. Original value: min_silence_len=500 (0.5 sec)
    Your example: min_silence_len=5000 (5 sec)
    Difference: In your example, pauses of 5 seconds or more will be removed, while shorter pauses will remain in the audio. This is useful if you need to keep short pauses.

  2. Original value: silence_thresh=-40 (decibels)
    Your example: silence_thresh=-35 (decibels)
    Difference: In your example, louder segments will be considered "silence" and removed along with it. This may be appropriate if there is constant noise in the audio that needs to be removed.

  3. Original value: chunk_size=500 (0.5 sec)
    Your example: chunk_size=5000 (5 sec)
    Difference: In your example, the audio will be divided into larger segments of 5 seconds for parallel processing. This can speed up processing on powerful processors but may also reduce the accuracy of silence removal.

  4. Original value: MIN_DURATION_SECONDS=0.5 (sec)
    Your example: MIN_DURATION_SECONDS=1.5 (sec)
    Difference: In your example, audio shorter than 1.5 seconds after silence removal will be considered "empty". This is useful for filtering out audio segments that are too short after processing.

Recommendations:

  • Experiment with different parameter values and choose the ones that give the best result for your audio.
  • Increase min_silence_len if you need to keep short pauses.
  • Increase silence_thresh if you need to remove constant noise along with silence.
  • Choose the optimal chunk_size based on the power of your processor and the required quality of silence removal.
  • Increase MIN_DURATION_SECONDS if you need to filter out audio segments that are too short after processing.

Feel free to experiment and find the best parameter values for your case. Good luck with adjusting and processing your audio recordings!
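For reference, the four tunables can be gathered in one place, which makes it easy to log the active configuration while experimenting (the `SILENCE_PARAMS` dict and `describe` helper are illustrative, not part of the patch):

```python
# The four tunables from the instructions above, with their default values.
SILENCE_PARAMS = {
    "min_silence_len": 500,       # ms: pauses shorter than this are kept
    "silence_thresh": -40,        # dBFS: quieter than this counts as silence
    "chunk_size": 500,            # ms: chunk length for parallel processing
    "MIN_DURATION_SECONDS": 0.5,  # s: discard shorter results entirely
}

def describe(params: dict) -> str:
    """Render a one-line summary of the active silence removal settings."""
    return (f"silence<{params['silence_thresh']}dBFS for "
            f">={params['min_silence_len']}ms removed; "
            f"chunks={params['chunk_size']}ms; "
            f"min_result={params['MIN_DURATION_SECONDS']}s")

print(describe(SILENCE_PARAMS))
# silence<-40dBFS for >=500ms removed; chunks=500ms; min_result=0.5s
```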

Author
Owner

@Simi5599 commented on GitHub (Dec 26, 2024):

I am personally interested in this because I often want to use the Call feature but can't because of this.

I'm adding it to my fork to see what happens!

EDIT: Whisper hallucinations are reduced, but in my case I think I need a more "powerful" solution (such as disabling the mic entirely, plus silence removal).

Author
Owner

@AliveDedSec commented on GitHub (Dec 26, 2024):

> I am personally interested in this because I often want to use the Call feature but can't because of this.
>
> I'm adding it to my fork to see what happens!
>
> EDIT: Whisper hallucinations are reduced but in my case I think I need a more "powerful" solution (like disabling the mic at all + silence removal).

Dear friend, thank you so much for your feedback. Did you use the latest version of audio.py.txt, the one from my previous message? I edited it to eliminate hallucinations entirely, and it works great for me.

Do I understand correctly that you additionally need the microphone muted so that OpenWebUI's TTS output from external speakers is not captured? I wrote a small program for my system (with ChatGPT's help) to work around this by muting the microphone while sound is playing through the speakers. Unfortunately, I don't know which operating system you are on or where you run OpenWebUI; there may already be ready-made programs or scripts for your system that mute the microphone while audio plays through external speakers.

But here is an example for my Manjaro Linux system: the source code of that program, which I then compiled into an executable. [turn_off_mic_when_sound_to_speakers.c.txt](https://github.com/user-attachments/files/18254078/turn_off_mic_when_sound_to_speakers.c.txt)

You might want to ask ChatGPT or another similar powerful artificial intelligence to write a similar program for your system so that it works for you.

Here's another thought: it's possible that your modified audio.py was never actually applied to your OpenWebUI project, meaning the running open-webui instance is not using the modified file, is not picking it up correctly, or never had the original replaced.

Author
Owner

@AliveDedSec commented on GitHub (Dec 26, 2024):

I want to add, in case it is useful to someone: I have long been using this excellent model for recognition, simply by entering its name in the Open WebUI interface: Lunaxod/large-v3-turbo-faster-whisper
https://huggingface.co/Lunaxod/large-v3-turbo-faster-whisper/tree/main It is the fastest, lightest, and most accurate model I have come across, and I am very satisfied with it. I mainly use the Russian language.

Author
Owner

@Simi5599 commented on GitHub (Dec 26, 2024):

I just saw your new instructions, and I think I see what I was doing wrong: I was using Whisper through OpenAI's API. My bad!

If I manage to enable Whisper locally, I will give your code a try!

Author
Owner

@AliveDedSec commented on GitHub (Dec 26, 2024):

> I just saw your new instructions, and I think I see what I was doing wrong: I was using Whisper through OpenAI's API. My bad!
>
> If I manage to enable Whisper locally, I will give your code a try!

Oh, now I understand why it didn't work for you! The thing is, I only use local models because I don't have access to paid APIs, so I won't be able to check whether silence removal works with them. I should also mention that I have an NVIDIA graphics card and the CUDA-accelerated version of Open WebUI.

I don't know what graphics card you have, but with CUDA, everything should work fast if you have enough video memory to simultaneously use a local speech recognition model and an AI model.

Installing a local speech-to-text model is quite simple. Here's a brief guide:

  1. Go to http://localhost:3000/admin/settings
  2. Select the "Audio" tab
  3. In the "Speech recognition settings," set the following parameters:
    • Speech recognition system: Whisper (Local)
    • Speech recognition model: in this line, install, for example, the Lunaxod/large-v3-turbo-faster-whisper model or another compatible model
  4. Click the download icon on the right and wait for the selected model to download from the https://huggingface.co repository
  5. After the model is successfully downloaded, don't forget to save the settings

Keep in mind that the Lunaxod/large-v3-turbo-faster-whisper model is 814 MB. For it to work quickly, you need an NVIDIA graphics card with CUDA.

https://huggingface.co/Lunaxod/large-v3-turbo-faster-whisper/tree/main


Reference: github-starred/open-webui#3107