[GH-ISSUE #7514] Realtime API like OpenAI (full fledged voice to voice integrations) #30540

Closed
opened 2026-04-22 10:14:49 -05:00 by GiteaMirror · 11 comments
Owner

Originally created by @ryzxxn on GitHub (Nov 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7514

if anyone is working on a realtime api like integration with Ollama, please reach out to me. iam working on a similar integration, and i think feedback from, all the amazing people can greatly impact the quality of this feature, i think its pretty cool what openAI has going for it, and iam also a big fan of running every thing locally... 😄

Originally created by @ryzxxn on GitHub (Nov 5, 2024). Original GitHub issue: https://github.com/ollama/ollama/issues/7514 if anyone is working on a realtime api like integration with Ollama, please reach out to me. iam working on a similar integration, and i think feedback from, all the amazing people can greatly impact the quality of this feature, i think its pretty cool what openAI has going for it, and iam also a big fan of running every thing locally... 😄
GiteaMirror added the feature request label 2026-04-22 10:14:49 -05:00
Author
Owner

@nikhil-swamix commented on GitHub (Nov 6, 2024):

the thing is whisper is not found on ollama repo. so sound models are not supported in GGUF format.
let me know the outline of implementation, as i run a separate python server for STT<->TTS models. its slow btw.

<!-- gh-comment-id:2458823881 --> @nikhil-swamix commented on GitHub (Nov 6, 2024): the thing is whisper is not found on ollama repo. so sound models are not supported in GGUF format. let me know the outline of implementation, as i run a separate python server for STT<->TTS models. its slow btw.
Author
Owner

@ryzxxn commented on GitHub (Nov 6, 2024):

@nikhil-swamix i have something similar going on, a python server continuously processing streamed audio from my microphone,, i tried implementing multiple threads to process these audio segments parallely but it improves the speed a bit but, at the end of the day its not fast enough to have it real time, any idea how openAi is doing this,
are they just, using a lot of compute?

<!-- gh-comment-id:2460326223 --> @ryzxxn commented on GitHub (Nov 6, 2024): @nikhil-swamix i have something similar going on, a python server continuously processing streamed audio from my microphone,, i tried implementing multiple threads to process these audio segments parallely but it improves the speed a bit but, at the end of the day its not fast enough to have it real time, any idea how openAi is doing this, are they just, using a lot of compute?
Author
Owner

@nikhil-swamix commented on GitHub (Nov 6, 2024):

can you give some basic overview on:

  1. which model ur using
  2. a snap of code of audio processing part.
    there are ways to improve it, compute being one of it, but if you may provide more context my answer can be more grounded.
    regards
<!-- gh-comment-id:2460333705 --> @nikhil-swamix commented on GitHub (Nov 6, 2024): can you give some basic overview on: 1. which model ur using 2. a snap of code of audio processing part. there are ways to improve it, compute being one of it, but if you may provide more context my answer can be more grounded. regards
Author
Owner

@ryzxxn commented on GitHub (Nov 6, 2024):

iam currently using whisper-small
https://huggingface.co/openai/whisper-small

import torch
import numpy as np
import sounddevice as sd
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import asyncio

# Set device and precision based on availability of CUDA
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load Whisper model and processor
model_id = "openai/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Initialize the ASR pipeline with the Whisper model, enforcing English language
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Microphone parameters
sample_rate = 16000  # Whisper models prefer 16 kHz input
buffer_duration = 3  # Duration of each audio buffer in seconds (smaller for real-time)

# Asynchronous audio callback function
async def audio_callback(indata, frames, time, status):
    if status:
        print(f"Error: {status}")
    # Convert the audio chunk to a flat array and send to the pipeline
    audio = np.squeeze(indata)  # Flatten the audio data
    # Transcribe the audio asynchronously
    result = await asyncio.to_thread(pipe, audio)
    # Print the transcribed text
    print("Transcription:", result["text"])

# Function to record and transcribe audio in real-time
async def transcribe_from_mic():
    print("Recording audio from microphone...")

    # Start streaming audio from the microphone
    with sd.InputStream(callback=lambda indata, frames, time, status: asyncio.run(audio_callback(indata, frames, time, status)),
                        channels=1, samplerate=sample_rate, blocksize=int(sample_rate * buffer_duration), dtype="float32"):
        print("Press Ctrl+C to stop the transcription.")
        try:
            while True:
                await asyncio.sleep(0.1)  # Keep the stream open for continuous transcription
        except KeyboardInterrupt:
            print("Transcription stopped.")

# Run the transcription function
asyncio.run(transcribe_from_mic())
<!-- gh-comment-id:2460340872 --> @ryzxxn commented on GitHub (Nov 6, 2024): iam currently using whisper-small https://huggingface.co/openai/whisper-small ``` import torch import numpy as np import sounddevice as sd from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline import asyncio # Set device and precision based on availability of CUDA device = "cuda" if torch.cuda.is_available() else "cpu" torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 # Load Whisper model and processor model_id = "openai/whisper-small" model = AutoModelForSpeechSeq2Seq.from_pretrained( model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True ) model.to(device) processor = AutoProcessor.from_pretrained(model_id) # Initialize the ASR pipeline with the Whisper model, enforcing English language pipe = pipeline( "automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor, torch_dtype=torch_dtype, device=device, ) # Microphone parameters sample_rate = 16000 # Whisper models prefer 16 kHz input buffer_duration = 3 # Duration of each audio buffer in seconds (smaller for real-time) # Asynchronous audio callback function async def audio_callback(indata, frames, time, status): if status: print(f"Error: {status}") # Convert the audio chunk to a flat array and send to the pipeline audio = np.squeeze(indata) # Flatten the audio data # Transcribe the audio asynchronously result = await asyncio.to_thread(pipe, audio) # Print the transcribed text print("Transcription:", result["text"]) # Function to record and transcribe audio in real-time async def transcribe_from_mic(): print("Recording audio from microphone...") # Start streaming audio from the microphone with sd.InputStream(callback=lambda indata, frames, time, status: asyncio.run(audio_callback(indata, frames, time, status)), channels=1, samplerate=sample_rate, blocksize=int(sample_rate * buffer_duration), dtype="float32"): print("Press Ctrl+C to stop the transcription.") try: while True: await asyncio.sleep(0.1) # Keep the stream open for continuous transcription except KeyboardInterrupt: print("Transcription stopped.") # Run the transcription function asyncio.run(transcribe_from_mic()) ```
Author
Owner

@ryzxxn commented on GitHub (Nov 6, 2024):

iam currently using whisper-small https://huggingface.co/openai/whisper-small

import torch
import numpy as np
import sounddevice as sd
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import asyncio

# Set device and precision based on availability of CUDA
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load Whisper model and processor
model_id = "openai/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Initialize the ASR pipeline with the Whisper model, enforcing English language
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Microphone parameters
sample_rate = 16000  # Whisper models prefer 16 kHz input
buffer_duration = 3  # Duration of each audio buffer in seconds (smaller for real-time)

# Asynchronous audio callback function
async def audio_callback(indata, frames, time, status):
    if status:
        print(f"Error: {status}")
    # Convert the audio chunk to a flat array and send to the pipeline
    audio = np.squeeze(indata)  # Flatten the audio data
    # Transcribe the audio asynchronously
    result = await asyncio.to_thread(pipe, audio)
    # Print the transcribed text
    print("Transcription:", result["text"])

# Function to record and transcribe audio in real-time
async def transcribe_from_mic():
    print("Recording audio from microphone...")

    # Start streaming audio from the microphone
    with sd.InputStream(callback=lambda indata, frames, time, status: asyncio.run(audio_callback(indata, frames, time, status)),
                        channels=1, samplerate=sample_rate, blocksize=int(sample_rate * buffer_duration), dtype="float32"):
        print("Press Ctrl+C to stop the transcription.")
        try:
            while True:
                await asyncio.sleep(0.1)  # Keep the stream open for continuous transcription
        except KeyboardInterrupt:
            print("Transcription stopped.")

# Run the transcription function
asyncio.run(transcribe_from_mic())

use appropriate torch library based on your system

<!-- gh-comment-id:2460342479 --> @ryzxxn commented on GitHub (Nov 6, 2024): > iam currently using whisper-small https://huggingface.co/openai/whisper-small > > ``` > import torch > import numpy as np > import sounddevice as sd > from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline > import asyncio > > # Set device and precision based on availability of CUDA > device = "cuda" if torch.cuda.is_available() else "cpu" > torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 > > # Load Whisper model and processor > model_id = "openai/whisper-small" > model = AutoModelForSpeechSeq2Seq.from_pretrained( > model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True > ) > model.to(device) > processor = AutoProcessor.from_pretrained(model_id) > > # Initialize the ASR pipeline with the Whisper model, enforcing English language > pipe = pipeline( > "automatic-speech-recognition", > model=model, > tokenizer=processor.tokenizer, > feature_extractor=processor.feature_extractor, > torch_dtype=torch_dtype, > device=device, > ) > > # Microphone parameters > sample_rate = 16000 # Whisper models prefer 16 kHz input > buffer_duration = 3 # Duration of each audio buffer in seconds (smaller for real-time) > > # Asynchronous audio callback function > async def audio_callback(indata, frames, time, status): > if status: > print(f"Error: {status}") > # Convert the audio chunk to a flat array and send to the pipeline > audio = np.squeeze(indata) # Flatten the audio data > # Transcribe the audio asynchronously > result = await asyncio.to_thread(pipe, audio) > # Print the transcribed text > print("Transcription:", result["text"]) > > # Function to record and transcribe audio in real-time > async def transcribe_from_mic(): > print("Recording audio from microphone...") > > # Start streaming audio from the microphone > with sd.InputStream(callback=lambda indata, frames, time, status: asyncio.run(audio_callback(indata, frames, time, status)), > channels=1, samplerate=sample_rate, blocksize=int(sample_rate * buffer_duration), dtype="float32"): > print("Press Ctrl+C to stop the transcription.") > try: > while True: > await asyncio.sleep(0.1) # Keep the stream open for continuous transcription > except KeyboardInterrupt: > print("Transcription stopped.") > > # Run the transcription function > asyncio.run(transcribe_from_mic()) > ``` use appropriate torch library based on your system
Author
Owner

@nikhil-swamix commented on GitHub (Nov 6, 2024):

what type of latency ur getting , and can you share CPU, GPU Specs?
observation: async approach is not doing full justice to hardware utilization...

<!-- gh-comment-id:2460359836 --> @nikhil-swamix commented on GitHub (Nov 6, 2024): what type of latency ur getting , and can you share CPU, GPU Specs? observation: async approach is not doing full justice to hardware utilization...
Author
Owner

@ryzxxn commented on GitHub (Nov 6, 2024):

I'll update you on the output later, for now my specs are a rtx2050 and amd7000 laptop
Also iam segmenting into 3 second buffer which skipped the next 3 seconds of the audio...

<!-- gh-comment-id:2460376194 --> @ryzxxn commented on GitHub (Nov 6, 2024): I'll update you on the output later, for now my specs are a rtx2050 and amd7000 laptop Also iam segmenting into 3 second buffer which skipped the next 3 seconds of the audio...
Author
Owner

@nikhil-swamix commented on GitHub (Nov 6, 2024):

understood, rtx 2050 should be good enough,
do have a look at,
https://github.com/ggerganov/whisper.cpp
and https://github.com/FL33TW00D/whisper-turbo


my current implementation is using groq, so feels fast enough within 3 seconds... whisper.cpp from the founders of llama.cpp (on which ollama is based...) yielded very fast response on test PC RTX 3070. the trick is to work with smaller buffers and concatenate the outputs on multiple threads (max 3). and use the streaming api so each sentence from chatgpt(ollama) server can be spoken out by text to speech in respectable timeframe. will get back on saturday on the issue.
regards

<!-- gh-comment-id:2460395320 --> @nikhil-swamix commented on GitHub (Nov 6, 2024): understood, rtx 2050 should be good enough, do have a look at, https://github.com/ggerganov/whisper.cpp and https://github.com/FL33TW00D/whisper-turbo ----- my current implementation is using groq, so feels fast enough within 3 seconds... whisper.cpp from the founders of llama.cpp (on which ollama is based...) yielded very fast response on test PC RTX 3070. the trick is to work with smaller buffers and concatenate the outputs on multiple threads (max 3). and use the streaming api so each sentence from chatgpt(ollama) server can be spoken out by text to speech in respectable timeframe. will get back on saturday on the issue. regards
Author
Owner

@ryzxxn commented on GitHub (Nov 6, 2024):

Very insightfull, even I found that using small buffers yielded the best result, iam going to keep working on the multiple thread method, I'll update it here if I have any updates,

<!-- gh-comment-id:2460412832 --> @ryzxxn commented on GitHub (Nov 6, 2024): Very insightfull, even I found that using small buffers yielded the best result, iam going to keep working on the multiple thread method, I'll update it here if I have any updates,
Author
Owner

@theboringhumane commented on GitHub (Dec 12, 2024):

Very insightfull, even I found that using small buffers yielded the best result, iam going to keep working on the multiple thread method, I'll update it here if I have any updates,

I'm building something similar https://github.com/fofsinx/echoollama

<!-- gh-comment-id:2537912862 --> @theboringhumane commented on GitHub (Dec 12, 2024): > Very insightfull, even I found that using small buffers yielded the best result, iam going to keep working on the multiple thread method, I'll update it here if I have any updates, I'm building something similar https://github.com/fofsinx/echoollama
Author
Owner

@jmorganca commented on GitHub (Dec 23, 2024):

Merging with https://github.com/ollama/ollama/issues/1168 – thanks for the issue!

<!-- gh-comment-id:2558712697 --> @jmorganca commented on GitHub (Dec 23, 2024): Merging with https://github.com/ollama/ollama/issues/1168 – thanks for the issue!
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/ollama#30540