feat: openai real-time api #2271

Open
opened 2025-11-11 15:03:54 -06:00 by GiteaMirror · 18 comments

Originally created by @stevenbaert on GitHub (Oct 3, 2024).

I haven't found anything on the forum/web yet, but it's an obvious question: is there a way to use the new OpenAI Realtime API in Open WebUI, or are you planning to support it?

Reference:
https://openai.com/index/introducing-the-realtime-api/

🙏

GiteaMirror added the non-core label 2025-11-11 15:03:54 -06:00

@thiswillbeyourgithub commented on GitHub (Oct 5, 2024):

Hi @tjbck, I saw [on Hacker News](https://news.ycombinator.com/item?id=41743327) that [LiveKit's agents](https://github.com/livekit/agents), which were used to build with the OpenAI Realtime API as well as for [Cerebras Voice](https://cerebras.vercel.app/), seem to be open source.

They have tons of demos and code [on their GitHub](https://github.com/livekit/agents?tab=readme-ov-file). I think there must be a [LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni) implementation somewhere that would be a killer feature for Open WebUI!

Edit: here's a particularly interesting demo that connects STT + LLM + TTS: https://github.com/livekit/agents/blob/main/examples/voice-pipeline-agent/minimal_assistant.py

Edit 2: I opened [an issue asking for a LLaMA-Omni demo](https://github.com/livekit/agents/issues/845).

Edit 3: Also, here's OpenAI's Realtime API reference implementation:

https://github.com/openai/openai-realtime-console

> The OpenAI Realtime Console is intended as an inspector and interactive API reference for the OpenAI Realtime API. It comes packaged with two utility libraries: [openai/openai-realtime-api-beta](https://github.com/openai/openai-realtime-api-beta), which acts as a Reference Client (for browser and Node.js), and [/src/lib/wavtools](https://github.com/openai/openai-realtime-console/blob/main/src/lib/wavtools), which allows for simple audio management in the browser.

https://github.com/openai/openai-realtime-api-beta

> This repository contains a reference client, aka sample library, for connecting to OpenAI's Realtime API. This library is in beta and should not be treated as a final implementation. You can use it to easily prototype conversational apps.
>
> The easiest way to start playing with the API right away is the [Realtime Console](https://github.com/openai/openai-realtime-console); it uses the reference client to deliver a fully functional API inspector with examples of voice visualization and more.
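
For anyone who wants the bare mechanics underneath those libraries, here is a minimal sketch of the raw WebSocket exchange in Python (assuming the `websockets` package; the event names follow OpenAI's published Realtime API reference, but treat it as an illustration rather than a production client):

```python
# Minimal sketch: drive the OpenAI Realtime API over a raw WebSocket.
# Event shapes follow OpenAI's published docs; check the current API
# reference before relying on this.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # On websockets >= 13 the kwarg is `additional_headers` instead.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask the server to generate a text-only response.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"],
                         "instructions": "Say hello."},
        }))
        # Stream server events until the response finishes.
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event["type"] == "response.done":
                break

asyncio.run(main())
```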


@InventoCasa commented on GitHub (Oct 10, 2024):

+1 for this!
It would be very nice if Open WebUI supported the new gpt-4o-realtime-preview model!


@aguilarcarboni commented on GitHub (Oct 11, 2024):

Bump! gpt-4o-realtime would be awesome; don't even get me started on Omni, because I'd go crazy. Anybody got time to implement this?


@jbaenaxd commented on GitHub (Oct 14, 2024):

We totally need this, especially for folks in Europe, where ChatGPT hasn't made this service available through its app/web for Plus/Team users without a VPN. Fortunately, it's available through the API, so we Europeans could self-host it to access this great tool.


@spammenotinoz commented on GitHub (Oct 21, 2024):

> I haven't found anything on the forum/web yet, but it's an obvious question: is there a way to use the new OpenAI Realtime API in Open WebUI, or are you planning to support it?
>
> Reference: https://openai.com/index/introducing-the-realtime-api/
>
> 🙏

Careful what you wish for: this model is quite expensive, since you pay for audio input and output.
$100.00 / 1M input tokens
$200.00 / 1M output tokens
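
For scale: OpenAI's launch announcement translated those token rates to roughly $0.06 per minute of audio input and $0.24 per minute of audio output, so a quick back-of-envelope estimate looks like this (a sketch; real billing is per token and depends on how much of the session is speech in each direction):

```python
# Back-of-envelope cost of a realtime voice session at launch pricing.
# Assumes OpenAI's quoted ~$0.06/min audio in, ~$0.24/min audio out;
# treat the result as a rough upper bound, since billing is per token.
AUDIO_IN_PER_MIN = 0.06   # USD, while the user speaks
AUDIO_OUT_PER_MIN = 0.24  # USD, while the model speaks

def session_cost(minutes: float, user_share: float = 0.5) -> float:
    """Cost if the user talks `user_share` of the time, model the rest."""
    return minutes * (user_share * AUDIO_IN_PER_MIN
                      + (1 - user_share) * AUDIO_OUT_PER_MIN)

print(f"10 min, 50/50: ${session_cost(10):.2f}")  # $1.50
print(f"60 min, 50/50: ${session_cost(60):.2f}")  # $9.00
```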


@amiranvarov commented on GitHub (Oct 21, 2024):

+1 for this. It sucks that we can't use it in the EU.


@Fusseldieb commented on GitHub (Oct 27, 2024):

Would indeed be very cool! +1


@odellus commented on GitHub (Dec 18, 2024):

Price has been lowered. Any interest in this now?


@ddwinhzy commented on GitHub (Jan 17, 2025):

+1


@thiswillbeyourgithub commented on GitHub (Jan 24, 2025):

I am absolutely convinced that the future is fully multimodal models; they are getting more popular and smaller all the time. The best recent example I could find is the [MiniCPM-o](https://github.com/OpenBMB/MiniCPM-o) series, which comes in various sizes and is already partially implemented in the llama.cpp and ollama projects. To me, this is a feature that is more and more needed, and quite frankly the future of LLMs, at least for the short and medium term. Honestly, I even think it should be a high priority, along with stability, and in contrast to, for example, making a desktop app.


@Fusseldieb commented on GitHub (Jan 24, 2025):

> To be honest, I even do think that it should be a high priority

I would say so, too.

> along with stability and in contrast to, for example, making a desktop app.

A desktop app is superfluous imo, as all WebKit browsers let you 'install' OI as a shortcut, as if it were a native app, which is sufficient in 99% of cases. I even pinned mine to the taskbar and it opens in less than a second. Also, once 'installed' it gets rid of the address bar and everything, just like a 'native app' would.
Plus, at this point most 'native apps' are just an Electron wrapper, which would only eat RAM for absolutely nothing.
Same goes for mobile, as you can 'install' it there, too.
It's certainly not 'high priority' in any sense imo. I'd even wager to say it isn't even necessary, but that's up to the developer.

![Image](https://github.com/user-attachments/assets/ddb008b5-624d-4a7f-829f-1df472d6fc5c)


@thiswillbeyourgithub commented on GitHub (Jan 24, 2025):

I agree. Personally, for all my friends I created an account, opened the URL in the browser, and used the "add to home screen" feature, and nobody noticed it wasn't a regular app.


@thiswillbeyourgithub commented on GitHub (Feb 11, 2025):

FYI, there is currently a free demo of the Ultravox models at https://demo.ultravox.ai/. It's really interesting, and anyone who hasn't yet tried a realtime model should do so; it opens up a lot of interactivity use cases. Their 70B model is not that expensive either: $0.05/minute, so $3 per hour (though I don't know whether that accrues while you are not speaking). Also, by signing up on their website you get 30 minutes free without even entering a credit card. It's still lacking [quantization](https://github.com/fixie-ai/ultravox/pull/33), though, so it's somewhat out of reach of most consumers for now.

I can't wait to be able to play with these on consumer hardware and deploy them with Open WebUI!


@thiswillbeyourgithub commented on GitHub (Apr 18, 2025):

If anyone wants to see a good implementation of realtime APIs, the devs at Gradio made the great lib [fastrtc](https://github.com/gradio-app/fastrtc), which includes great examples.

For example, [in about 150 lines of code you get the OpenAI Realtime API, UI included!](https://huggingface.co/spaces/fastrtc/talk-to-openai/blob/main/app.py)
Here is [the code for Gemini voice chats](https://huggingface.co/spaces/fastrtc/talk-to-gemini/blob/main/app.py).

I'm not sure what the best way to include that in Open WebUI would be, though. I think you'd have to hardcode OpenAI and Gemini into it, because as far as I'm aware those realtime APIs aren't available anywhere else (no OpenRouter, for example). Alternatives will appear as time goes on, but it's a shame to have to wait who knows how long.

There is a [cookbook](https://fastrtc.org/cookbook/) too.

Edit: after giving it some more thought, I believe realtime models could be implemented as a pipe. We just need a way to plug the websockets into the UI's call mode.
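
To make that concrete: Open WebUI's backend is FastAPI, so one plausible shape is a server-side websocket relay that call mode connects to, keeping the API key off the client. A minimal sketch, assuming the `websockets` package; the `/api/realtime` route and this wiring are hypothetical, not an existing Open WebUI endpoint:

```python
# Hypothetical sketch: a FastAPI websocket relay between Open WebUI's
# call-mode UI and the OpenAI Realtime API. Nothing here is an existing
# Open WebUI route; it only illustrates where a bridge could sit.
import asyncio
import os

import websockets  # pip install websockets
from fastapi import FastAPI, WebSocket

app = FastAPI()
UPSTREAM = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

@app.websocket("/api/realtime")  # hypothetical route
async def relay(client: WebSocket):
    """Pump JSON events both ways between the browser and OpenAI."""
    await client.accept()
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(UPSTREAM, extra_headers=headers) as up:

        async def browser_to_openai():
            while True:
                await up.send(await client.receive_text())

        async def openai_to_browser():
            async for message in up:
                await client.send_text(message)

        # Run both pumps; stop when either side disconnects.
        done, pending = await asyncio.wait(
            [asyncio.create_task(browser_to_openai()),
             asyncio.create_task(openai_to_browser())],
            return_when=asyncio.FIRST_COMPLETED,
        )
        for task in pending:
            task.cancel()
```

A real bridge would also need to handle audio framing (the Realtime API exchanges base64-encoded PCM16 inside JSON events) and authenticate the client connection, but a relay like the above is the core of plugging a websocket into call mode.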


@thiswillbeyourgithub commented on GitHub (Apr 23, 2025):

FYI, an enthusiastic OpenAI employee has made themselves available to help projects that hit issues implementing support for the Realtime API. Look for `sean` in [this HN thread](https://news.ycombinator.com/item?id=43762409).


@Pl8tinium commented on GitHub (May 14, 2025):

I'm very much looking forward to this. I love OpenAI's advanced voice mode and haven't found a similar alternative (I assume it's built with their Realtime API).

The advanced voice mode, and the fact that they have a native app, are the only reasons I keep going back to ChatGPT on my phone. It's just a better UX right now.


@JamesClarke7283 commented on GitHub (Jul 22, 2025):

> > I haven't found anything on the forum/web yet, but it's an obvious question: is there a way to use the new OpenAI Realtime API in Open WebUI, or are you planning to support it?
> > Reference: https://openai.com/index/introducing-the-realtime-api/
> > 🙏
>
> Careful what you wish for: this model is quite expensive, since you pay for audio input and output. $100.00 / 1M input tokens, $200.00 / 1M output tokens

Indeed, I agree: OpenAI is overpriced for its offerings nowadays (it doesn't have first-mover advantage anymore; you are paying more for the brand and ecosystem than for the product itself, and unless your use case just happens to match the product, you aren't getting the best value for money). But, as they say, "a fool and their money are soon parted" (supplemental evidence: ['"Intelligence" vs. Price comparison'](https://artificialanalysis.ai/models/gemini-2-5-pro?models=openai_o3&endpoints=openai_o4-mini%2Copenai_gpt-4-1-nano%2Copenai_gpt-4-1-mini%2Copenai_gpt-4-1%2Copenai_o3%2Copenai_o3-mini-high%2Copenai_gpt-4o-chatgpt-03-25%2Cmistral_magistral-small%2Cmistral_mistral-small-3-2%2Cmistral_magistral-medium%2Cdeepseek_deepseek-r1-05-28%2Canthropic_claude-4-sonnet-thinking%2Canthropic_claude-3-7-sonnet%2Canthropic_claude-4-opus%2Canthropic_claude-3-7-sonnet-thinking%2Canthropic_claude-4-sonnet%2Cgoogle_gemini-2-5-flash-05-20-reasoning_ai-studio%2Cgoogle_gemini-2-5-pro-06-05_ai-studio%2Cgoogle_gemini-2-5-flash-lite-reasoning_ai-studio%2Cfireworks_qwen3-235b-a22b-instruct-reasoning%2Ccerebras_qwen3-235b-a22b-instruct-reasoning%2Ccerebras_qwen3-32b-instruct-reasoning%2Cbaseten_kimi-k2%2Cxai_grok-4#intelligence-vs-price)). It's diminishing returns for most people (including myself); I gave up on OpenAI ages ago.

People are sceptical and don't have the tokens to burn for local testing.

Side rant: training compute isn't the deciding asset for entering frontier status anymore, folks (I know I am preaching to the choir here). My prediction is that 'corporate private/internal data' will be the next moat to fall (to OSINT plus grounded synthetic/augmented data); there is only so much a single company can do long-term to improve an ML-based chatbot before a new paradigm is needed (the elusive 'World Models'?).
For the best-quality ML insight, Hugging Face is currently the platform to be on, I find (as of July 2025); even as a novice, there are open-source 'Spaces' (mini-apps) for almost every use case.

There is a blind-human-evaluation-driven Elo leaderboard here (OpenAI's models may be near the top, but observe how the gaps get smaller and smaller):
https://lmarena.ai/leaderboard

Its data transparency is a little spotty, as only samples of the data get released, infrequently, but it tends to line up with my own preferences (if you factor the 90% CI into your list of models to 'vibe check' for your own use case).
It's also, I think, a catch-22 for gaming benchmarks, as traditional benchmarks and live, blind human-preference evaluations cannot easily be gamed simultaneously (apart from freak incidents like social-engineering attacks on the admins, as Meta did with Llama 4 in the old bait-and-switch, for which it got called out).

Externally sourced evaluations/benchmarks are only meant to help people narrow down a set of models to try, of course (so they don't overwhelm themselves or... their wallet).

From my personal usage and mistakes (spending too much with OpenAI), I find a frontier open-weight model (currently R1 and Qwen), whether via OpenRouter (remote inference) or, better, run locally, is much better value. You can also fine-tune LoRAs more and more easily as a novice now (soon LocalAI will have an OpenAI-compliant training API, after which it's just a question of using an open-source frontend for OpenAI-API-integrated training).

The smaller Qwen3 models (32B/30B-A3B) work well for me, and uncensored/ablated models are typically better for nuanced topics anyway, used responsibly, especially for people for whom biased answers are a productivity burden (think academic or corporate research, knowledge-base curation, anything where post-training bias can skew insights too much).

Bottom line:
If/when Open WebUI implements this feature, it will be support for the OpenAI API 'Realtime' spec, so you would be able to use it with any OpenAI-API inference server that supports 'Realtime', including open-source engines.

For example:
https://github.com/theboringhumane/echoOLlama
or, more definitively, for a self-contained suite, LocalAI has an issue here:
https://github.com/mudler/LocalAI/issues/3714

The end result of having this in inference engines like LocalAI (which is making progress), and then supporting it in the UI, as with Open WebUI's voice mode, would be that you could save money and regain freedom/control in the process.

The Realtime API spec is one of the reasons OpenAI's advanced voice is so smooth and natural (relatively speaking).

The inference server can instead use a chained approach (STT -> LLM -> TTS), which is technically a bit harder, as information can be lost unless you encode the STT intonations/subtitles into text, and the reverse for TTS; a sketch of this chain follows below.
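
As a rough illustration of that chained approach, here is a sketch wiring the three stages through the standard OpenAI-compatible endpoints (any server implementing them, e.g. a local LocalAI instance, could be swapped in via `base_url`; the model names are placeholders):

```python
# Sketch of a chained voice pipeline (STT -> LLM -> TTS) over
# OpenAI-compatible endpoints. Point base_url at any server that
# implements them (e.g. a local LocalAI instance); model names are
# illustrative. Note the caveat above: prosody and tone are lost at
# the STT boundary, since only plain text crosses it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

def voice_turn(audio_path: str, out_path: str = "reply.mp3") -> str:
    # 1. STT: audio in, text out (intonation is discarded here).
    with open(audio_path, "rb") as f:
        text = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # 2. LLM: an ordinary chat completion on the transcript.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": text}],
    ).choices[0].message.content

    # 3. TTS: synthesize the reply back to audio.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply
    )
    speech.write_to_file(out_path)
    return out_path
```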

The end goal is open-weight speech-to-speech (Omni) models, but for now I think it's pretty awesome that you can swap out STT/LLM/TTS to build one with Open WebUI.

A universal REST API standard like the 'OpenAI API spec' just makes convergence much simpler (the inference server is abstracted away; one can use OpenAI, a local server, or someone else's), much like with MCP, but without the [glaring security issues](https://www.trolleyesecurity.com/articles-news-mcp-servers-exposed-without-any-security/) in its core design (i.e. a core defect that transcends any particular software implementation).


@Mcayear commented on GitHub (Sep 25, 2025):

Any progress?

Reference: github-starred/open-webui#2271