[GH-ISSUE #18621] issue: Call/Voice mode on iOS hard-stops mic capture at ~10-16secs, auto-sends partial transcript, discards rest of audio #57323

Closed
opened 2026-05-05 20:51:16 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @IN-Neil on GitHub (Oct 26, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/18621

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.6.34 - Hosted on Railway, Docker Template

Ollama Version (if applicable)

No response

Operating System

Client 1: macOS Sequoia 15.1 desktop / Client 2: iPhone iOS 26.0.1

Browser (if applicable)

Client 1: (Zen Browser) Firefox 143.0.1 (aarch64) / Client 2: Arc 1.45.1 (Chromium)

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

Mobile “Call mode” should continuously record voice for as long as the call is active (tens of seconds, minutes, etc.), send that full audio to /api/v1/audio/transcriptions, and then submit the full transcript as the user message.

In other words: if I speak for 40 seconds or 5 minutes, the backend should receive 40 seconds or 5 minutes of audio, the transcript should include everything, and the assistant should respond after I hang up — not in the middle.

This is important for long-form thinking/chatting and accessibility. Short “push to talk for 10–16 seconds” is not enough.

(Note: this was the default behavior until something changed at the end of September. The suspects are a late-September Open WebUI call-mode change, or a browser auto-update enabling a default utterance cap/endpointing on mobile.)
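
For concreteness, the client-side flow I expect looks roughly like this sketch (the "file" form field follows the OpenAI-style transcription contract, which I'm assuming Open WebUI's endpoint mirrors; fullCallBlob and token are placeholders):

    // record until the user hangs up, then upload the whole blob once
    const form = new FormData();
    form.append('file', fullCallBlob, 'call.webm'); // fullCallBlob = the entire recording
    const res = await fetch('/api/v1/audio/transcriptions', {
      method: 'POST',
      headers: { Authorization: `Bearer ${token}` },
      body: form,
    });
    const { text } = await res.json(); // transcript of the entire call, submitted as one user message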

Actual Behavior

On iPhone (iOS 26.0.1), when I use Call mode through the normal Open WebUI UI:

  1. I speak for ~20–30 seconds or longer in one turn.
  2. The assistant only ever hears the first ~10–16 seconds (the exact cutoff point varies, but it always cuts off).
  3. Open WebUI immediately posts that partial transcript into the chat as if I stopped talking, and the assistant starts replying, even though I’m still mid-sentence.

When I inspect the server (Railway container) after a call, there is only one audio file saved under:

/app/backend/data/cache/audio/transcriptions/<UUID>.wav (+ matching .mp3 and .json)

Running ffprobe on that .wav inside the Railway container shows that file is only ~16–17 seconds long, even if I actually spoke ~25+ seconds on the phone. There is no “second chunk” file for the rest of the utterance. Whisper (local) and Deepgram both produce a transcript that ends mid-sentence at ~16 seconds. So the backend never even receives audio after ~16s.

This is reproducible across:

  • using local faster-whisper in the container,
  • using Deepgram API,
  • using Local Whisper (small).

This means the cutoff happens before STT, in the frontend/browser-side recording logic of Call mode (likely the MediaRecorder path in CallOverlay.svelte). It’s not Whisper / VAD / no_speech_threshold. It’s not Cloudflare. It’s not Railway. It’s the recorder stopping early on mobile and never restarting.

Desktop (Zen/Firefox on macOS) does not have this early cutoff. On desktop I can generate ~40s audio and Whisper only truncates because of decoding heuristics, not because the file itself is short. On mobile, the actual captured file is short.

Critical point: iOS itself is capable of long recordings. Proven with a standalone HTTPS test page (details below). So the 16s cutoff is not an iOS hardware/security limit — it appears to be Call mode’s current implementation.
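
To illustrate the suspected failure mode (hypothetical code; I have not traced CallOverlay.svelte line by line): either of these one-shot stops would produce exactly one short blob and nothing after it, which matches what I observe:

    recorder.start();
    // (a) a hard utterance cap that is never restarted:
    setTimeout(() => recorder.stop(), MAX_UTTERANCE_MS); // MAX_UTTERANCE_MS ≈ 10–16s, hypothetical constant
    // (b) or endpointing/VAD firing mid-sentence:
    onSilenceDetected(() => recorder.stop()); // hypothetical hook; stop() is final, no new recorder starts

Either path ends with one ~16s upload and a recorder that never comes back.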

Steps to Reproduce

  1. Deploy Open WebUI on Railway using the open-webui:main Docker image. I'm using the 'OpenwebUI with Pipelines' Railway template. The container mounts a persistent Railway volume at /app/backend/data. I'm behind a Cloudflare proxy, but it's not needed to reproduce; you can hit Railway's HTTPS endpoint directly.

  2. In Arc on iPhone (iOS 26.0.1; the iOS/Safari WebView UA is shown above), open the public URL of that Railway deployment.

    1. I noticed that changing browsers didn't make a difference; I tried the latest Arc, Chrome, and Safari. Same results.
  3. Enter Call mode (voice input). Talk continuously for 20+ seconds without pressing stop. Keep talking past 16 seconds. Say something easy to recognize so you know where it cuts (for example: “This is second fifteen, I am still talking after fifteen, now we’re at twenty, now we’re at twenty-five…”).

  4. Let Call mode finish by itself, or wait for the assistant to start answering you mid-thought. Observe that the assistant responds using only the first ~15 seconds of what you said, as if you stopped there.

  5. Immediately SSH into the Railway container (railway shell into the open-webui service) and inspect new transcription artifacts:

    Look at:

    /app/backend/data/cache/audio/transcriptions/

    You will see a new <UUID>.wav, <UUID>.mp3, <UUID>.json created for that call.

    Run inside the container:

    ffprobe -hide_banner -i /app/backend/data/cache/audio/transcriptions/<UUID>.wav

    ffprobe output shows duration: ~16-17s. There is only that one file for the entire call. There is no second <UUID2>.wav or additional chunks. The .json transcript in that same folder stops mid-sentence right around the 16s mark.

    This proves the browser/frontend only ever uploaded ~16 seconds of audio, even though I was still talking.
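
    As a convenience, you can check every captured file's duration in one pass (ffprobe was already available in my container):

    for f in /app/backend/data/cache/audio/transcriptions/*.wav; do
      printf '%s  ' "$f"
      ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$f"
    done

    In my runs, every file tops out around 16–17 seconds.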

  6. Repeat the exact same “talk 20+ seconds” test using STT = local whisper vs STT = Deepgram. Result is the same. So this is not model-dependent.

  7. Now run a control test outside of Open WebUI to rule out iOS mic limits:

I served a minimal HTML page over HTTPS (files attached below; self-signed cert) using navigator.mediaDevices.getUserMedia + MediaRecorder.start(), with no artificial 15s cap. I added a wake lock to keep the screen awake. I recorded ~40+ seconds straight on the same iPhone (iOS 26.0.1), then manually stopped.

recording_test.html
server.py
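
The core of the test page is essentially the following (a reconstructed sketch; the attached recording_test.html is the authoritative version):

    // request mic + screen wake lock, then record with no timeslice and no cap
    const wakeLock = await navigator.wakeLock.request('screen');
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const rec = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });
    const chunks = [];
    const t0 = performance.now();
    rec.ondataavailable = (e) => chunks.push(e.data);
    rec.onstop = () => {
      const blob = new Blob(chunks, { type: rec.mimeType });
      console.log('[rec] blob size(bytes):', blob.size);
      console.log('[rec] elapsed wall ms:', Math.round(performance.now() - t0));
      stream.getTracks().forEach((t) => t.stop()); // release the mic
      wakeLock.release();
    };
    rec.start(); // runs until rec.stop() is called manually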

The log showed:

- `elapsed wall ms: 44104` (≈44.1 seconds of actual talking time),
- `blob size(bytes): 1218287`,
- iOS UA: `Mozilla/5.0 (iPhone; CPU iPhone OS 26_0_1 like Mac OS X)... CriOS/141.0.7390.96 ...`,
- `FINAL duration(s): NaN` (Safari doesn’t always report duration for `audio/webm;codecs=opus`, so NaN is normal),
- Crucially: no 16-second cutoff. The browser happily recorded ~40+ seconds in one session.

On desktop, the same test produced `elapsed wall ms: ~52881` (≈52.9 seconds), `blob size(bytes): ~815262`, `FINAL duration(s): Infinity`. Again: long recording is fine.

This proves iOS will capture >30s audio in one go under HTTPS, and will not auto-kill at ~16s. The 16s ceiling only appears in Open WebUI Call mode.

So the current call mode logic on mobile browsers ends the recording after ~10–16 seconds, uploads that single chunk, and never spins up another chunk/recorder. Then it treats that first partial transcript as the entire user message and immediately prompts the assistant to respond. The remainder of what is being said after ~16 seconds is never recorded/uploaded at all, so it’s permanently lost.

Expected: continuous capture until user stops, possibly chunked internally, with final stitched transcript sent once.

Actual: one ~16s blob, auto-stop, auto-send, assistant interrupts.
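
For reference, here is a minimal sketch of the restart-on-stop loop that would give the expected continuous capture (hypothetical; startCallRecording and onSegment are names I made up, not Open WebUI's actual API):

    function startCallRecording(stream, onSegment) {
      let active = true;
      let current = null;
      const record = () => {
        const rec = new MediaRecorder(stream);
        current = rec;
        const chunks = [];
        rec.ondataavailable = (e) => { if (e.data.size > 0) chunks.push(e.data); };
        rec.onstop = () => {
          onSegment(new Blob(chunks, { type: rec.mimeType })); // queue this segment for STT
          if (active) record(); // recorder ended before hang-up (OS endpointing?): start a fresh one
        };
        rec.start();
      };
      record();
      return () => { active = false; current?.stop(); }; // call on hang-up; the final onstop flushes the last segment
    }

Segment transcripts would then be stitched in order and submitted as one user message after hang-up, instead of treating the first segment as the whole turn.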

Logs & Screenshots

After a ~25s spoken test on iPhone, container shows:

/app/backend/data/cache/audio/transcriptions/c9fc6f...13ba.wav

/app/backend/data/cache/audio/transcriptions/c9fc6f...13ba.mp3

/app/backend/data/cache/audio/transcriptions/c9fc6f...13ba.json

Running ffprobe inside Railway on that .wav reported duration ~16.7 seconds. The .json text ends mid-sentence exactly at that point:

"text": "Hello. I'm testing speech to text because I like using speech to text, but it often gets cut off, like, a lot." (I kept talking after this)

recording_test.html

There is no second UUID.wav for seconds 17–25, even though I was audibly still talking.

Browser console (mobile) shows the normal getUserMedia prompt and recording start, then the Call mode UI “listens,” then the assistant answers. I can provide sanitized console output and ffprobe output if needed, but nothing in the logs suggests a network failure; it just looks like the recorder stopped.

Mobile control test logs (standalone HTTPS test page, not Open WebUI), iPhone iOS 26.0.1:

[rec] wakeLock acquired
[rec] using mimeType: audio/webm;codecs=opus
[rec] recording started
[rec] manual stop
[rec] chunk size(bytes): 1218287
[rec] FINAL duration(s): NaN
[rec] blob size(bytes): 1218287
[rec] elapsed wall ms: 44104
[rec] UA: Mozilla/5.0 (iPhone; CPU iPhone OS 26_0_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/141.0.7390.96 Mobile/15E148 Safari/604.1
[rec] wakeLock released

Note the ~44s elapsed and ~1.2MB blob. No ~16s hard stop.

Desktop control test logs (Firefox 143.0 on macOS 15.1; the “10.15” in the UA below is Firefox's frozen user-agent string, not the real OS version):

[rec] wakeLock acquired
[rec] using mimeType: audio/webm;codecs=opus
[rec] recording started
[rec] manual stop
[rec] chunk size(bytes): 815262
[rec] FINAL duration(s): Infinity
[rec] blob size(bytes): 815262
[rec] elapsed wall ms: 52881
[rec] UA: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:143.0) Gecko/20100101 Firefox/143.0
[rec] wakeLock released

Note the ~53s elapsed and ~0.8MB blob. Again: no ~16s hard stop.

Additional Information

Start date: this started sometime in the last week of September; it was working as expected before then. The suspects are a late-September Open WebUI call-mode change, or a browser auto-update enabling a default utterance cap/endpointing on mobile.

Installation Method:

Docker image ghcr.io/open-webui/open-webui:main, deployed on Railway as a public-facing container (not localhost). Persistent volume mounted at /app/backend/data. Accessed from desktop and iPhone (mobile browser) over HTTPS.

No Ollama in this flow.

If you need more container logs, console logs (from the CallOverlay component startup/shutdown), or ffprobe output from /app/backend/data/cache/audio/transcriptions/*.wav, I can provide them.

GiteaMirror added the bug label 2026-05-05 20:51:16 -05:00

Reference: github-starred/open-webui#57323