Contributing: here's my audio cleaning code #2296

Closed
opened 2025-11-11 15:04:37 -06:00 by GiteaMirror · 0 comments
Owner

Originally created by @thiswillbeyourgithub on GitHub (Oct 7, 2024).

Not sure opening a Feature Request is okay with you for this, but I don't have the time to do a PR, and I saw you sort of struggled with the audio cleaning :)

(
Edit: forgot to mention the reasons why I think this should run in all situations, and not just to reduce file size:

  1. Reduced costs
  2. Whisper works in 30s sections. If you hesitate on something during the recording, pause to think, and end up not speaking for more than 30s, whisper enters a loop where it just repeats strings even if you start talking again later. For some use cases of openwebui this happens a lot (I use it a lot to help me reason through medical school)
  3. Also, given the feature to send audio files directly, having a cleanup function enabled by default would help a lot, especially for things like conferences where there can be long silences.

)

In a private repo I have a piece of code that might be useful to openwebui. It uses torchaudio to apply [sox](https://en.wikipedia.org/wiki/SoX) commands to audio, cleaning it so that any silence longer than X amount of time gets squished.

That sounds trivial but it's not: pydub's implementation is not scalable ([it has lots of performance issues, especially in the silence-related code, which can take all the CPU, throw OOM, and appear to hang even though it's just super slow](https://github.com/jiaaro/pydub/issues?q=is%3Aissue+is%3Aopen+slow)), so it's not suitable for open-webui in my opinion.

Sox has none of those issues, but it is so complex and unintuitive that it took me a while to get the parameters working. Now (and for over a year) it's been perfect: it can clean hours of audio in seconds. There are some kinks around file formats that may require adding soundfile as a dependency too; the short story is that torchaudio does not support all the same formats as pydub, soundfile, etc. On my Linux setup, installing sox inside the Docker container was as simple as `apt update && apt install sox`.
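For reference, here is a minimal sketch of what that could look like in a Dockerfile (the base image and layer placement are assumptions on my part; adapt to open-webui's actual Dockerfile):

```dockerfile
# Hypothetical base image; only the sox install line matters here.
FROM python:3.11-slim

# Install the sox CLI so the sox-based effects are available in the container.
RUN apt-get update \
    && apt-get install -y --no-install-recommends sox \
    && rm -rf /var/lib/apt/lists/*
```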

Anyway here's the code if you're interested. If you do, I would appreciate being credited with my github username in the commit author :)

The gist is this:

        # sox effects applied when loading a sound
        preprocess_sox_effects: List[List[str]] = [
            # normalize audio
            ["norm"],

            # isolate voice frequencies
            # -2 gives a steeper filter slope
            # ["highpass", "-1", "100"],
            # ["lowpass", "-1", "3000"],
            # remove very high and very low frequencies
            ["highpass", "-2", "50"],
            ["lowpass", "-2", "5000"],

            # max silence should be 1s
            ["silence", "-l", "1", "0", "0.5%", "-1", "1.0", "0.5%"],

            # # remove leading silence
            # ["vad", "-p", "0.2", "-t", "5"],
            # # and trailing silence; this might be unnecessary for split audio
            # ["reverse"],
            # ["vad", "-p", "0.2", "-t", "5"],
            # ["reverse"],

            # add blank sound to help whisper
            ["pad", "0.2@0"],
        ]

But in some situations that was not enough, so I sometimes used an extra processing pass after the first. Maybe use it as a last resort if the audio is still too long.
                                                                                   
        # sox effects when forcing the processing of a sound
        force_preprocess_sox_effects: List[List[str]] = [
            # normalize audio
            ["norm"],

            # filter for voice
            ["highpass", "-2", "50"],
            ["lowpass", "-2", "5000"],

            # max silence should be 1s
            ["silence", "-l", "1", "0", "2%", "-1", "1.0", "2%"],

            # # remove leading silence
            # ["vad", "-p", "0.2", "-t", "5"],
            # # and trailing silence; this might be unnecessary for split audio
            # ["reverse"],
            # ["vad", "-p", "0.2", "-t", "5"],
            # ["reverse"],

            # add blank sound to help whisper
            ["pad", "0.2@0"],
        ]

Notes: I'm not sure the padding is all that useful, and it can technically waste money. Also, the vad effects seemed redundant with the silence command. Not sure if normalizing is useful either. The core of my contribution is those damned silence arguments; modify them at your own risk.
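Since those positional `silence` arguments are the hard-won part, here is a small hypothetical helper (the function name and comments are mine, not from the original code) that builds the effect from named parameters, so each position is at least labeled:

```python
from typing import List

def build_silence_effect(threshold: str = "0.5%",
                         max_silence_s: float = 1.0) -> List[str]:
    """Build the sox 'silence' effect used above from named parameters.

    Interpretation of the positional arguments (per the sox manual, roughly):
    with -l, silence is trimmed down to the allowed duration instead of
    being removed entirely.
    """
    return [
        "silence",
        "-l",                # trim long silences instead of deleting them
        "1",                 # above-periods: act after the 1st period of sound
        "0",                 # keep 0s of silence at the start
        threshold,           # amplitude below this counts as silence
        "-1",                # below-periods: -1 = also trim mid-audio silences
        str(max_silence_s),  # squish any silence longer than this (seconds)
        threshold,           # amplitude threshold for those mid-audio silences
    ]

# Reproduces the two variants used in the effect chains above:
normal = build_silence_effect()                  # 0.5% threshold
forced = build_silence_effect(threshold="2%")    # more aggressive 2% threshold
```

The 2% variant matches the `force_preprocess_sox_effects` chain; a higher threshold treats more low-level noise as silence.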

And the gist of the actual processing is this:

        import soundfile as sf
        import torchaudio
        from pathlib import Path
        from pydub import AudioSegment

        # load from file
        waveform, sample_rate = torchaudio.load(audio_mp3_path)

        # apply the sox effect chain in memory
        waveform, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
            waveform,
            sample_rate,
            shared.preprocess_sox_effects,
        )

        # write to file as wav
        sf.write(str(audio_mp3_path), waveform.numpy().T, sample_rate, format='wav')
        temp = AudioSegment.from_wav(audio_mp3_path)
        new_path = Path(audio_mp3_path).parent / (Path(audio_mp3_path).stem + "_proc" + Path(audio_mp3_path).suffix)
        temp.export(new_path, format="mp3")

I needed those conversions for my code (I'm using gradio), but I'm not sure they're needed here, so you might not have to include soundfile.
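To sketch when the second, more aggressive pass would kick in (the function name and target-duration idea are mine, not from the original code): the forced chain is only worth running if the audio cleaned by `preprocess_sox_effects` is still too long.

```python
def needs_force_pass(num_frames: int, sample_rate: int, target_s: float) -> bool:
    """Return True if the processed audio still exceeds target_s seconds.

    num_frames would come from the processed waveform's shape
    (e.g. waveform.shape[-1] for a torchaudio tensor).
    """
    return num_frames / sample_rate > target_s

# e.g. 10 minutes of 16 kHz audio against a hypothetical 5-minute target:
still_too_long = needs_force_pass(10 * 60 * 16000, 16000, 5 * 60)
```

If this returns True, you would call `apply_effects_tensor` again on the already-processed waveform, this time with `force_preprocess_sox_effects`.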

Edit2: Another implementation can be [found in gradio](https://github.com/gradio-app/gradio/blob/d00e344be39d92185e7d36b8ef784cf46e7a5e98/demo/same-person-or-different/run.py#L30).

Reference: github-starred/open-webui#2296