[GH-ISSUE #15807] Feature Request: Native Realtime Voice Chat (Bidirectional Streaming Audio Conversation) #56586

Open
opened 2026-04-29 11:03:51 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @volzb on GitHub (Apr 25, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15807

Feature Request: Native Realtime Voice Chat (Bidirectional Streaming Audio Conversation)

Status

  • File Created: 2026-04-25
  • Upstream Submitted: No — awaiting Ben's approval to submit to ollama/ollama

Feature Description

Requesting native support for realtime bidirectional voice conversation in Ollama — the ability to hold a natural, low-latency spoken dialogue with an LLM, similar to OpenAI's Realtime API. This is distinct from the current audio-input-only multimodal support or external STT→LLM→TTS chaining.

What "Realtime Voice Chat" Means Here

| Capability | Current Ollama | This Feature Request |
|------------|----------------|----------------------|
| Audio file input for multimodal models | ✅ (e.g., Qwen2-Audio) | Not the same thing |
| Speech-to-text (STT) | ❌ No native support | Not sufficient |
| Text-to-speech (TTS) | ❌ No native support | Not sufficient |
| **Streaming audio-in → streaming audio-out** | ❌ Not supported | **This is the ask** |
| Conversational turn-taking with voice activity detection (VAD) | ❌ Not supported | **This is the ask** |
| Low-latency (<500ms) voice response | ❌ Not supported | **This is the ask** |

The desired behavior: a single WebSocket (or SSE) connection where:

  1. Client streams raw audio (e.g., PCM16 @ 24kHz) from the microphone.
  2. Ollama handles speech recognition, LLM inference, and speech synthesis in a continuous pipeline.
  3. Ollama streams synthesized audio back to the client in near real-time.
  4. Turn-taking, interruption handling, and VAD are managed natively or exposed as events.
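
To make the shape of such a session concrete, here is a minimal client sketch in Python using the `websockets` package. The endpoint, event names, and JSON fields follow the proposal further down and are illustrative only; `mic_chunks` and `play_audio` are hypothetical helpers supplied by the caller.

```python
# Illustrative only: the /api/realtime endpoint and event names are proposed
# below and do not exist in Ollama today. `mic_chunks` is a hypothetical async
# generator of raw PCM16 audio; `play_audio` plays raw PCM16 bytes.
import asyncio
import base64
import json

import websockets  # pip install websockets


async def voice_session(mic_chunks, play_audio):
    async with websockets.connect("ws://localhost:11434/api/realtime") as ws:
        # Configure the session once at the start.
        await ws.send(json.dumps({
            "type": "session.init", "model": "llama3.2", "voice": "default",
        }))

        async def send_mic():
            # Stream microphone audio up as base64-encoded chunks.
            async for chunk in mic_chunks():
                await ws.send(json.dumps({
                    "type": "audio.append",
                    "audio": base64.b64encode(chunk).decode("ascii"),
                }))

        async def receive():
            # Play synthesized audio as it arrives; ignore other events here.
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    play_audio(base64.b64decode(event["audio"]))

        await asyncio.gather(send_mic(), receive())
```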

Use Cases

  1. Accessibility: Hands-free, eyes-free interaction for users with motor or vision impairments.
  2. Productivity: Dictate and converse with local models during coding, driving, or manual work.
  3. Education/Language Learning: Practice speaking with a local AI tutor without sending voice data to third parties.
  4. Embedded & Edge Devices: Voice-enabled local assistants on Raspberry Pi, home servers, or offline workstations.
  5. Privacy: Full voice-to-voice AI interaction without cloud audio processing.

Why It Matters

  • Privacy & Sovereignty: Open-source voice models (Sesame, Qwen2-Audio, Whisper) are advancing rapidly. Users want to run them locally, but Ollama only supports audio input for text output — not true voice conversation.
  • Gap vs. OpenAI Realtime API: Cloud providers are pulling ahead in conversational UX. A local-first alternative keeps the open-source ecosystem competitive.
  • Foundation exists: Ollama already runs audio-capable models and has a streaming API. Extending it to handle audio-out and manage the audio pipeline is a natural evolution.
  • Community demand: Multiple issues (see below) show sustained interest, but they are fragmented across STT-only, TTS-only, or model-specific requests.

Related Existing Issues

| Issue | Title | State | Relevance |
|-------|-------|-------|-----------|
| [#1168](https://github.com/ollama/ollama/issues/1168) | Support WhisperForConditionalGeneration | **Open** (63 👍) | Canonical issue for STT support. #7514 was merged here. Covers speech-to-text, but not TTS or bidirectional streaming. |
| [#5424](https://github.com/ollama/ollama/issues/5424) | Supports voice recognition and text-to-speech capabilities | **Open** | Generic request for STT + TTS with extension framework. Not specific to streaming/realtime conversation. |
| [#9804](https://github.com/ollama/ollama/issues/9804) | Sesame family models, Realtime voice mode? | **Open** | Model-specific request for Sesame CSM-1B and realtime voice. Overlaps but narrower in scope. |
| [#11798](https://github.com/ollama/ollama/issues/11798) | Add Audio Input Support for Multimodal Models | **Open** (10 comments) | Audio-in for text-out (e.g., Qwen2-Audio). Already partially supported; not voice-to-voice. |
| ~~#7514~~ | ~~Realtime API like OpenAI~~ | **Closed** (merged into #1168) | Was the closest prior issue to this exact request. Closed by maintainer jmorganca on 2024-12-23. |

Conclusion: No existing open issue specifically covers native bidirectional streaming voice conversation as a first-class Ollama feature. #1168 is the closest but is STT-only. This feature request is broader and distinct enough to warrant its own issue.


Suggested Implementation Approach

The following is a high-level proposal for discussion. Ollama maintainers should define the canonical design.

1. API Surface

Extend the Ollama API with a new realtime endpoint:

```
POST /api/realtime
Upgrade: websocket
```

Client → Server:

  • session.init — model name, voice settings, system prompt
  • audio.append — base64-encoded audio chunks (PCM16, 24kHz)
  • input_audio_buffer.commit — signal end of user turn
  • conversation.item.create — inject text/tools/events
  • session.update — change voice, instructions, or temperature mid-session

Server → Client:

  • conversation.item.created — transcript (user + assistant)
  • response.audio.delta — base64-encoded synthesized audio chunks
  • response.audio.done — end of assistant response
  • response.done — response complete
  • input_audio_buffer.speech_started / speech_stopped — VAD events
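
To illustrate how these events might compose over a single user turn, here is a sketch of the wire traffic written as Python literals. Only the event type names come from the list above; fields such as "audio", "role", and "text" are illustrative assumptions.

```python
# Hypothetical wire traffic for one voice turn, in order. Only the event type
# names come from the proposal above; the field names are assumptions.
turn = [
    # client -> server: microphone audio while the user speaks
    {"type": "audio.append", "audio": "<base64 PCM16 chunk>"},
    {"type": "audio.append", "audio": "<base64 PCM16 chunk>"},

    # server -> client: VAD notices speech beginning and ending
    {"type": "input_audio_buffer.speech_started"},
    {"type": "input_audio_buffer.speech_stopped"},

    # client -> server: end of the user's turn
    {"type": "input_audio_buffer.commit"},

    # server -> client: transcript, synthesized reply, completion
    {"type": "conversation.item.created", "role": "user",
     "text": "what's the weather like on Mars?"},
    {"type": "response.audio.delta", "audio": "<base64 synthesized chunk>"},
    {"type": "response.audio.done"},
    {"type": "response.done"},
]
```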

2. Architecture

```
┌─────────────┐      WebSocket       ┌─────────────────────────────────────┐
│   Client    │ ◄──────────────────► │            Ollama Server            │
│  (Mic/Spk)  │    audio/text/events │                                     │
└─────────────┘                      │  ┌────────┐   ┌──────┐   ┌────────┐ │
                                     │  │  STT   │──►│ LLM  │──►│  TTS   │ │
                                     │  │(local) │   │      │   │(local) │ │
                                     │  └────────┘   └──────┘   └────────┘ │
                                     │      ▲                       │      │
                                     │  VAD / Buffer      Streaming Audio  │
                                     └─────────────────────────────────────┘
```
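
Read as code, the diagram is essentially three streaming stages chained per session. The sketch below shows that shape with async generators; `transcribe`, `generate`, and `synthesize` are hypothetical placeholders standing in for whisper.cpp, the existing Ollama runner, and a local TTS engine.

```python
# Structural sketch of the per-session pipeline in the diagram above.
# transcribe / generate / synthesize are hypothetical stand-ins for
# whisper.cpp, the existing Ollama LLM runner, and a local TTS engine.
from typing import AsyncIterator


async def pipeline(audio_in: AsyncIterator[bytes],
                   transcribe, generate, synthesize) -> AsyncIterator[bytes]:
    """User audio chunks in -> synthesized audio chunks out, for one turn."""
    # 1. STT: fold the user's audio into a transcript once VAD commits the turn.
    transcript = await transcribe(audio_in)

    # 2. LLM: stream text for the reply.
    async for sentence in generate(transcript):
        # 3. TTS: synthesize each piece as soon as it is available, so audio
        #    starts playing before the full reply has been generated.
        async for audio_chunk in synthesize(sentence):
            yield audio_chunk
```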

3. Component Breakdown

| Component | Options / Notes |
|-----------|-----------------|
| **STT Engine** | Whisper (ggml/gguf via whisper.cpp), or native model audio encoder (Qwen2-Audio, etc.) |
| **LLM** | Any Ollama text model; system prompt controls persona and tool use |
| **TTS Engine** | Local options: Sesame CSM-1B, Piper, Coqui TTS, or MeloTTS. Could be model-specific. |
| **VAD** | Silero VAD, webrtcvad, or native model attention. Detects speech start/stop to trigger STT. |
| **Audio Format** | PCM16, 24kHz mono (input); PCM16 or Opus (output). Matches OpenAI Realtime API conventions. |
| **Interruption** | Client sends `conversation.item.truncate` on new speech detection; server cancels in-flight TTS. |
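
As one concrete illustration of the VAD row, here is a sketch using the `webrtcvad` package. Note that webrtcvad only accepts 16-bit mono PCM at 8/16/32/48 kHz in 10/20/30 ms frames, so the proposed 24 kHz input would need to be resampled (e.g., to 16 kHz) before VAD.

```python
# Sketch of turn detection with webrtcvad. Frames must be 16-bit mono PCM at
# 8/16/32/48 kHz and 10/20/30 ms long, so 24 kHz input is resampled first.
import webrtcvad

SAMPLE_RATE = 16000                                 # Hz, after resampling
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2    # 16-bit samples

vad = webrtcvad.Vad(2)                              # aggressiveness 0..3


def speech_flags(frames):
    """Yield (frame, is_speech) pairs for FRAME_BYTES-sized PCM16 frames.

    A server could emit speech_started / speech_stopped events on the
    transitions, and trigger barge-in if the assistant is currently speaking.
    """
    for frame in frames:
        yield frame, vad.is_speech(frame, SAMPLE_RATE)
```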

4. CLI & SDK

```bash
# Start a realtime voice session
ollama realtime llama3.2 --voice default
```

```python
import ollama

with ollama.realtime(model="llama3.2", voice="nova") as session:
    session.on_audio_delta = lambda chunk: play_audio(chunk)
    session.on_transcript = lambda text, speaker: print(f"{speaker}: {text}")
    session.start()  # blocks, streams mic audio
```

5. Incremental Rollout Phases

  1. Phase 1 — STT streaming: Accept streaming audio, emit text transcript events (extends #1168).
  2. Phase 2 — TTS streaming: Add TTS model support; emit audio deltas from text responses.
  3. Phase 3 — Native voice models: Support end-to-end audio-in/audio-out models (e.g., GPT-4o-style native audio).
  4. Phase 4 — Interruptions & VAD: Full conversational turn-taking, barge-in, and voice activity detection.

6. Backwards Compatibility

  • New /api/realtime endpoint; no breaking changes to existing /api/generate or /api/chat.
  • Audio models loaded via standard ollama run / Modelfile mechanism.

Open Questions for Maintainers

  1. Should Ollama bundle a default TTS/STT model, or should users bring their own?
  2. Is the goal to support pipeline STT→LLM→TTS, or native end-to-end audio models (or both)?
  3. What is the preferred transport: WebSocket, SSE, or HTTP/2 bidirectional streams?
  4. Should voice conversations support tools/function calling mid-stream?

References

  • [OpenAI Realtime API](https://platform.openai.com/docs/guides/realtime)
  • [Sesame CSM-1B (Conversational Speech Model)](https://huggingface.co/sesame/csm-1b)
  • [Qwen2-Audio](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct)
  • [whisper.cpp](https://github.com/ggerganov/whisper.cpp)
  • Ollama issues: #1168, #5424, #9804, #11798