According to an announcement on May 7 local time, new models added to OpenAI's API (application programming interface) allow developers to build applications that can chat, translate and transcribe conversations while the user is still speaking. The release is seen as an important step in the race to develop real-time voice AI.
The new model set includes three main products: GPT-Realtime-2, GPT-Realtime-Translate and GPT-Realtime-Whisper.
Of these, GPT-Realtime-2 offers GPT-5-level reasoning, which helps it handle more complex requests and hold more natural conversations with users.
OpenAI said the model can understand the context of a conversation, adapt when a request changes and respond appropriately to each situation.
The second model, GPT-Realtime-Translate, focuses on live speech translation. It supports more than 70 input languages and about 13 output languages.
Notably, the system can translate almost simultaneously with the original speech while preserving the speaker's natural pace and rhythm.
Meanwhile, GPT-Realtime-Whisper is a new model for converting speech to text in real time, transcribing speech while the conversation is taking place.
OpenAI believes voice AI is currently one of the most common ways people interact with software.
However, building real-world voice products remains very complex, because the AI must not only listen and understand but also track context, use the appropriate tools and respond at the right time.
"The new models will take real-time audio beyond simple question-and-answer exchanges to become a voice interface that can listen, reason, translate, transcribe and act during the conversation," OpenAI said on its official blog.
The company expects the new technology to strongly support businesses that want to expand automated customer care services.
Real-time voice AI can also be applied in many other fields, such as education, media, event organization and content creation platforms.
In multilingual countries like India, live translation technology is considered particularly useful. The new models allow several people to speak different languages in the same conversation while hearing translations and following a text transcript in real time.
Prateek Sachan, co-founder and chief technology officer of BolnaAI (a company that develops voice AI platforms for businesses in India), said GPT-Realtime-Translate achieves an error rate 12.5% lower than many other models the company has tested in languages such as Hindi, Tamil and Telugu.
According to Mr. Sachan, OpenAI's new technology is setting a new standard for multilingual voice AI, especially in markets with complex phonetics and many regional accents.
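The article does not say how the 12.5% figure was measured. As a rough illustration only, the sketch below assumes it refers to a relative reduction in word error rate (WER), a common metric for speech systems. The function names and the sample numbers are hypothetical and do not come from OpenAI or BolnaAI.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which the candidate error rate is below the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # Made-up transcripts: one word of six is dropped.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")

    # Made-up error rates, chosen only to show the arithmetic.
    print(f"Relative reduction: {relative_reduction(0.16, 0.14):.1%}")
```

Under this reading, a model with a 14% error rate would be "12.5% lower" than one with 16%, even though the absolute gap is only two percentage points.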