OpenAI’s GPT-4o Advanced Voice is one of the most powerful and potentially important artificial intelligence tools of the year. It lets you have a human-like natural conversation with an AI voice and even interrupt it when it’s speaking too much.
Currently available only to a small number of ChatGPT Plus subscribers, this new way of interacting with the chatbot is expected to roll out widely this fall. The company also plans to launch a vision mode next year that lets the AI see the world through your camera.
What makes Advanced Voice different from the current ChatGPT Voice, or even the newly launched Gemini Live, is that it works speech-to-speech. This means it natively understands what you say, how you say it and the emotional intonation behind your words.
It can also do accents and tell a great story, so I asked Advanced Voice to take me on a time travel adventure. It started with a trip to Ancient Egypt and spoke in the voice of a trader. Not only did it do a great job with the voices, it also proved to be a fun storyteller.
Prompting the adventure with Advanced Voice
Using Advanced Voice isn’t that different from any other artificial intelligence tool in that it starts with a prompt. The difference is that, unlike talking to ChatGPT with text or generating an image with Midjourney, Advanced Voice is prompted by your voice.
At the most basic level, this is simply telling it what you want it to do, but it can also pick up on tone changes in your voice. If you ask it to explain the meaning of life while sounding slightly teary or upset, it will respond in a way that reflects how you sounded.
For this adventure I played it straight, simply starting by asking Advanced Voice: “Now, we’re going to go through a story. Imagine you’re a time traveler. When in history would you go?”
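If you want to tinker with the same idea outside the app, the role-play setup translates neatly into a text prompt. The snippet below is only a rough sketch, not how Advanced Voice itself works (that mode is voice-only inside the ChatGPT app), and it assumes you have the official openai Python package installed and an API key configured in your environment.

```python
# Hypothetical sketch: Advanced Voice is only accessible through the ChatGPT
# app's voice interface, but the same role-play prompt structure can be
# approximated in text with OpenAI's Python SDK (assumes OPENAI_API_KEY is set).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a time traveler guiding the user on an interactive "
                "adventure. Stay in character, voice the people you meet, and "
                "ask the user where to go next."
            ),
        },
        {
            "role": "user",
            "content": (
                "Now, we're going to go through a story. Imagine you're a time "
                "traveler. When in history would you go?"
            ),
        },
    ],
)

# Print the model's opening scene for the adventure
print(response.choices[0].message.content)
```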
Advanced Voice suggested the World’s Fair in Chicago in the 19th century. I asked it to take on the role of a time traveler, but also to talk as the people at the fair would. After a brief stay in Chicago, I asked: “Let’s go somewhere else. Push the button and take me to a new location.” We went to ancient Egypt.
Advanced Voice said: “Picture this: the grand pyramids are being built, and the Nile flows as the lifeblood of a thriving civilization. What are you most curious about in this time and place?”
This is where I asked it about the language, including asking it to speak the words as accurately as possible, based on what we know of how ancient Egyptian sounded.
We then went to a market and finally on to Rome, for a conversation between our Egyptian trader and a Roman citizen, one speaking Egyptian, the other Latin. I even had Advanced Voice use a Yoda voice for a small portion of the adventure, and it gave it a good try.
Final thoughts
Advanced Voice is a brilliant storyteller, able to change emotion levels, reflect the intensity of different scenarios and even take on different accents and voices.
The problem I have with it is the limitations imposed by OpenAI. It could, in theory, generate sound effects to enhance a scene, but it’s been stopped from doing so. It could also adapt its voice further than it does, but again it has been restricted.
The issue is an understandable one: safety. Asking the model to perform those more unpredictable tasks could produce output that breaks OpenAI’s safety guidelines and potentially make Advanced Voice unsafe to release. It’s just frustrating knowing those capabilities are sitting slightly out of reach.
Even without those capabilities, Advanced Voice is still the best interaction I’ve had with AI: real-time conversation, a natural flow where I can interrupt on a whim, and a conversational partner that responds as a human might to my tone and pace.