mirror of
https://github.com/open-webui/open-webui.git
synced 2026-03-11 08:15:00 -05:00
Contributing: here's my audio cleaning code #2296
Originally created by @thiswillbeyourgithub on GitHub (Oct 7, 2024).
I'm not sure a Feature Request is the right place for this, but I don't have the time to do a PR and saw you sort of struggled with the audio cleaning :)
(
Edit: I forgot to mention the reasons why I think this should run in all situations and not just to reduce file size:
)
In a private repo I have a piece of code that might be useful to open-webui. It uses torchaudio to apply sox effects to the audio and cleans it so that any silence longer than a given duration gets squished.
That sounds trivial, but it's not: pydub's implementation does not scale (it has lots of performance issues, ESPECIALLY in the silence-related code, which can take all the CPU, throw OOM, and appear to hang even though it's just super slow), so it's not suitable for open-webui in my opinion.
Sox has none of those issues, but it is so complex and unintuitive that it took me a while to get the parameters working. Now (and for over a year) it's perfect: it can clean hours of audio in seconds. But there are some kinks around file formats that may require adding soundfile as a dependency too. Short story: torchaudio does not support all the same formats as pydub, soundfile, etc. On my Linux machine, installing sox inside the Docker container was as simple as
apt update && apt install sox

Anyway, here's the code if you're interested. If you use it, I would appreciate being credited with my GitHub username in the commit author :)
The gist is this:
Notes: I'm not sure the padding is all that useful, and technically it can waste money. Also, the VAD seemed redundant with the silence commands. I'm not sure normalizing is useful either. The core of my contribution is those damned silence arguments; modify them at your own risk.
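For readers who have not used sox's `silence` effect, here is a minimal sketch of the general idea. It is not the snippet from my repo: it calls the `sox` command-line tool through `subprocess` instead of going through torchaudio, and the function names, the `1%` threshold, and the default durations are all illustrative assumptions, not tuned values.

```python
import shutil
import subprocess


def silence_squish_effect(max_silence_s=1.0, threshold="1%", min_sound_s=0.1):
    """Build arguments for sox's `silence` effect that shorten long pauses.

    All default values here are placeholders chosen for illustration.
    """
    return [
        "silence",
        "-l",  # leave part of each pause intact instead of removing it entirely
        # above-periods = 1: trim leading silence until `min_sound_s`
        # of audio above `threshold` is found
        "1", f"{min_sound_s:g}", threshold,
        # below-periods = -1: a pause is `max_silence_s` of audio below
        # `threshold`; the negative value makes sox restart after each pause,
        # so silences in the middle of the file are handled too
        "-1", f"{max_silence_s:g}", threshold,
    ]


def build_sox_command(src, dst, max_silence_s=1.0, normalize=True):
    """Assemble the full sox command line as a list (no shell involved)."""
    cmd = ["sox", src, dst, *silence_squish_effect(max_silence_s)]
    if normalize:
        cmd.append("norm")  # optional peak normalization
    return cmd


def clean_audio(src, dst, **kwargs):
    """Run sox on `src` and write the cleaned audio to `dst`."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox not found on PATH (e.g. apt install sox)")
    subprocess.run(build_sox_command(src, dst, **kwargs), check=True)
```

Calling `clean_audio("in.wav", "out.wav", max_silence_s=2.0)` would run sox with pauses capped at roughly two seconds. Keeping the argument building separate from the `subprocess` call makes it easy to inspect the command before running it on real audio.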
And the gist of the actual processing is this:
I needed those conversions for my code (I'm using gradio), but I'm not sure they're needed here, so you might not have to include soundfile.
Edit2: Another implementation can be found in gradio