[GH-ISSUE #18621] Call/Voice mode on iOS hard-stops mic capture at ~10–16s, auto-sends partial transcript, discards rest of audio #18657
Originally created by @IN-Neil on GitHub (Oct 26, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/18621
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.6.34 - Hosted on Railway, Docker Template
Ollama Version (if applicable)
No response
Operating System
Client 1: macOS Sequoia 15.1 desktop / Client 2: iPhone iOS 26.0.1
Browser (if applicable)
Client 1: (Zen Browser) Firefox 143.0.1 (aarch64) / Client 2: Arc 1.45.1 (Chromium)
Confirmation
Expected Behavior
Mobile “Call mode” should continuously record voice for as long as the call is active (tens of seconds, minutes, etc.), send that full audio to `/api/v1/audio/transcriptions`, and then submit the full transcript as the user message. In other words: if I speak for 40 seconds or 5 minutes, the backend should receive 40 seconds or 5 minutes of audio, the transcript should include everything, and the assistant should respond after I hang up, not in the middle.
This is important for long-form thinking/chatting and accessibility. Short “push to talk for 10–16 seconds” is not enough.
(Note: this was the default behavior until something changed toward the end of September. The suspects are a late-September Open WebUI call-mode change or a browser auto-update enabling a default utterance cap/endpointing on mobile.)
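To make the expected contract concrete, here is a minimal sketch of that flow: record until hang-up, then upload once and submit the full transcript. This is illustrative only, not Open WebUI's actual client code; the multipart `file` field, the bearer token, and the `hangUp` promise are assumptions.

```js
// Minimal sketch of the EXPECTED flow (not Open WebUI's actual code).
// Assumptions: the endpoint accepts a multipart "file" field and a bearer
// token, and `hangUp` is a promise that resolves when the user ends the call.
async function recordCallAndTranscribe(token, hangUp) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];
  recorder.ondataavailable = (e) => { if (e.data.size > 0) chunks.push(e.data); };
  const stopped = new Promise((resolve) => { recorder.onstop = resolve; });

  recorder.start(1000); // emit a chunk every second; capture keeps running
  await hangUp;         // however long the user talks: 40 seconds, 5 minutes, ...
  recorder.stop();
  await stopped;
  stream.getTracks().forEach((t) => t.stop());

  // One upload containing the ENTIRE call audio.
  const form = new FormData();
  form.append('file', new Blob(chunks, { type: recorder.mimeType }), 'call.webm');
  const res = await fetch('/api/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}` },
    body: form,
  });
  const { text } = await res.json(); // full transcript, nothing discarded
  return text;
}
```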
Actual Behavior
On iPhone (iOS 26.0.1), when I use Call mode through the normal Open WebUI UI:
When I inspect the server (Railway container) after a call, there is only one audio file saved under `/app/backend/data/cache/audio/transcriptions/<UUID>.wav` (plus matching `.mp3` and `.json`). Running `ffprobe` on that `.wav` inside the Railway container shows the file is only ~16–17 seconds long, even if I actually spoke ~25+ seconds on the phone. There is no “second chunk” file for the rest of the utterance. Whisper (local) and Deepgram both produce a transcript that ends mid-sentence at ~16 seconds. So the backend never even receives audio after ~16s. This is reproducible across both STT engines (local Whisper and Deepgram).
This means the cutoff happens before STT, in the frontend/browser-side recording logic of Call mode (likely the `MediaRecorder` path in `CallOverlay.svelte`). It’s not Whisper / VAD / `no_speech_threshold`. It’s not Cloudflare. It’s not Railway. It’s the recorder stopping early on mobile and never restarting. Desktop (Zen/Firefox on macOS) does not have this early cutoff: on desktop I can generate ~40s of audio, and Whisper only truncates because of decoding heuristics, not because the file itself is short. On mobile, the actual captured file is short.
Critical point: iOS itself is capable of long recordings, as proven with a standalone HTTPS test page (details below). So the 16s cutoff is not an iOS hardware/security limit; it appears to come from Call mode’s current implementation.
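If that is what is happening, the general fix pattern is small. Here is a minimal sketch of a restart-on-stop guard, assuming a `userHungUp` flag set by the call-end handler; this is the technique, not a patch against `CallOverlay.svelte`:

```js
// Sketch: keep capture alive when MediaRecorder stops before hang-up.
// `userHungUp` is a hypothetical flag set by the call-end UI handler.
let userHungUp = false;
const chunks = [];

function startRecorder(stream) {
  const recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => { if (e.data.size > 0) chunks.push(e.data); };
  recorder.onstop = () => {
    // On iOS the recorder can stop on its own after ~10-16s. If the call
    // is still live, start a fresh recorder on the same stream instead of
    // treating the partial audio as the whole utterance.
    if (!userHungUp) startRecorder(stream);
  };
  recorder.start(1000); // timesliced, so already-captured audio is never lost
  return recorder;
}
```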
Steps to Reproduce
Deploy Open WebUI on Railway using the `open-webui:main` Docker image. There’s a template called “OpenwebUI with Pipelines”, which is the one I’m using. The container mounts a persistent Railway volume at `/app/backend/data`. I’m using a Cloudflare proxy, but it’s not needed to reproduce; you can hit Railway’s HTTPS endpoint directly.
In Arc for iOS on iPhone (iOS 26.0.1; Safari WebView, UA shown above), open the public URL of that Railway deployment.
Enter Call mode (voice input). Talk continuously for 20+ seconds without pressing stop. Keep talking past 16 seconds. Say something easy to recognize so you know where it cuts (for example: “This is second fifteen, I am still talking after fifteen, now we’re at twenty, now we’re at twenty-five…”).
Let Call mode finish by itself, or wait for the assistant to start answering you mid-thought. Observe that the assistant responds using only the first ~15 seconds of what you said, as if you stopped there.
Immediately SSH into the Railway container (`railway shell` into the `open-webui` service) and inspect the new transcription artifacts. Look at `/app/backend/data/cache/audio/transcriptions/`. You will see a new `<UUID>.wav`, `<UUID>.mp3`, and `<UUID>.json` created for that call. Run `ffprobe` on the new `.wav` inside the container (e.g. `ffprobe /app/backend/data/cache/audio/transcriptions/<UUID>.wav`).
The `ffprobe` output shows duration ~16–17s. There is only that one file for the entire call; there is no second `<UUID2>.wav` or additional chunks. The `.json` transcript in that same folder stops mid-sentence right around the 16s mark. This proves the browser/frontend only ever uploaded ~16 seconds of audio, even though I was still talking.
Repeat the exact same “talk 20+ seconds” test with STT = local Whisper vs. STT = Deepgram. The result is the same, so this is not model-dependent.
Now run a control test outside of Open WebUI to rule out iOS mic limits:
I served a minimal HTTPS HTML page (files attached below) with a self-signed cert, using `navigator.mediaDevices.getUserMedia` + `MediaRecorder.start()` and no artificial 15s cap. I added a wake lock to keep the screen awake. I recorded ~40+ seconds straight on the same iPhone (iOS 26.0.1), then manually stopped. A sketch of the page’s recording logic follows the attachment list.
recording_test.html
server.py
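For reference, the recording logic in `recording_test.html` was essentially the following (a sketch reconstructed from the description above, not the attached file verbatim):

```js
// Control test sketch: plain getUserMedia + MediaRecorder with NO time cap,
// plus a screen wake lock. Recording ends only on a manual stop.
let recorder, wakeLock, t0;
const chunks = [];

async function startTest() {
  wakeLock = await navigator.wakeLock.request('screen'); // keep screen awake
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  recorder = new MediaRecorder(stream);
  t0 = Date.now();
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    const blob = new Blob(chunks, { type: recorder.mimeType });
    console.log(`elapsed ${(Date.now() - t0) / 1000}s, blob ${blob.size} bytes`);
  };
  recorder.start();
}

function stopTest() {
  recorder.stop();     // manual stop only; nothing fires at ~15-16s
  wakeLock?.release();
}
```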
Running this test, the page’s log showed ~44 seconds of continuous capture with no cutoff (full output under Logs & Screenshots below).
So the current call mode logic on mobile browsers ends the recording after ~10–16 seconds, uploads that single chunk, and never spins up another chunk/recorder. Then it treats that first partial transcript as the entire user message and immediately prompts the assistant to respond. The remainder of what is being said after ~16 seconds is never recorded/uploaded at all, so it’s permanently lost.
Expected: continuous capture until the user stops, possibly chunked internally, with the final stitched transcript sent once (see the sketch below).
Actual: one ~16s blob, auto-stop, auto-send, assistant interrupts.
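Even if iOS sometimes forces a recorder to end, the stitched behavior can be approximated: record in segments for as long as the call is live, transcribe each, and send one combined transcript at hang-up. A sketch, with `isCallLive` and `transcribe` as hypothetical helpers:

```js
// Sketch: segment-tolerant capture with ONE stitched transcript at the end.
// `isCallLive` and `transcribe` are hypothetical helpers: a call-state check
// and a wrapper around POST /api/v1/audio/transcriptions, respectively.
async function runSegmentedCall(stream, isCallLive, transcribe) {
  const texts = [];
  while (isCallLive()) {
    const segment = await recordOneSegment(stream); // resolves when recorder stops
    texts.push(await transcribe(segment));
  }
  return texts.join(' '); // full transcript, submitted once after hang-up
}

function recordOneSegment(stream) {
  return new Promise((resolve) => {
    const segChunks = [];
    const rec = new MediaRecorder(stream);
    rec.ondataavailable = (e) => segChunks.push(e.data);
    rec.onstop = () => resolve(new Blob(segChunks, { type: rec.mimeType }));
    rec.start();
    // rec.stop() is triggered by the hang-up handler, or by the platform
    // itself on iOS; either way the segment's audio is kept, not discarded.
  });
}
```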
Logs & Screenshots
After a ~25s spoken test on iPhone, container shows:
/app/backend/data/cache/audio/transcriptions/c9fc6f...13ba.wav
/app/backend/data/cache/audio/transcriptions/c9fc6f...13ba.mp3
/app/backend/data/cache/audio/transcriptions/c9fc6f...13ba.json
Running `ffprobe` inside Railway on that `.wav` reported a duration of ~16.7 seconds. The `.json` text ends mid-sentence exactly at that point: `"text": "Hello. I'm testing speech to text because I like using speech to text, but it often gets cut off, like, a lot."` (I kept talking after this.)
There is no second UUID.wav for seconds 17–25, even though I was audibly still talking.
Browser console (mobile) shows normal getUserMedia prompt and recording start, then Call mode UI “listens,” then assistant answers. I can provide sanitized console output and ffprobe output if needed, but nothing in the logs suggests network failure; it just looks like the recorder stopped.
Mobile control test logs (standalone HTTPS test page, not Open WebUI), iPhone iOS 26.0.1:
Note the ~44s elapsed and ~1.2MB blob. No ~16s hard stop.
Desktop control test logs (Firefox 143.0 on macOS 15.1; the user-agent string reports 10.15, which is a frozen/fake value, not the real OS version):
Note the ~53s elapsed and ~0.8MB blob. Again: no ~16s hard stop.
Additional Information
Start date: this started somewhere in the last week of September; it was working as expected before then. The suspects are a late-September Open WebUI call-mode change or a browser auto-update enabling a default utterance cap/endpointing on mobile.
Installation Method:
Docker image `ghcr.io/open-webui/open-webui:main`, deployed on Railway as a public-facing container (not localhost). Persistent volume mounted at `/app/backend/data`. Accessed from desktop and iPhone (mobile browser) over HTTPS. No Ollama in this flow.
If you need more container logs, console logs (from the CallOverlay component startup/shutdown), or `ffprobe` output from `/app/backend/data/cache/audio/transcriptions/*.wav`, I can provide them.