UPDATE: May 20, 2024, 8:28 p.m. EST: Scarlett Johansson’s publicist issued a statement to multiple news outlets stating that her legal team “wrote two letters to Mr. Altman and OpenAI, setting out what they had done and asking them to detail the exact process by which they created the ‘Sky’ voice. Consequently, OpenAI reluctantly agreed to take down the ‘Sky’ voice.”
Johansson revealed that Altman sought permission to use her voice in September 2023, but she declined.
“In a time when we are all grappling with deepfakes and the protection of our own likeness, our own work, our own identities,” Johansson wrote, “I believe these are questions that deserve absolute clarity. I look forward to resolution in the form of transparency and the passage of appropriate legislation to help ensure that individual rights are protected.”
OpenAI’s new GPT-4o system released last week is receiving a wide range of reactions. But most who watched the multimodal generative AI system in action can at least agree on one thing—its default voice, “Sky,” certainly sounds “flirtier,” for lack of a more comfortable term.
Less than a week after Sky’s upgraded debut, OpenAI announced it is “working to pause” its availability while they “address” the internet’s questions about the voice. Also, it’s totally “not an imitation of Scarlett Johansson” or her performance as the self-aware AI assistant in the 2013 sci-fi film, Her, by the way.
Most users didn’t seem to mind, or even notice, any of Sky’s potential pop culture similarities when OpenAI publicly launched ChatGPT’s new voices late last year. That all changed overnight with GPT-4o’s much more expressive upgrades. One CNN commentator described GPT-4o’s Sky voice as both “mind-blowing” and “creepy,” while Ars Technica says it now sounds “disarmingly lifelike.” Even The Daily Show took notice.
“This is clearly programmed to feed dudes’ egos. You can really tell that a man built this tech,” show correspondent Desi Lydic said last week while describing it as a “horny robot baby voice.”
But it’s the similarities to Her that have dominated much of GPT-4o’s coverage. Released in 2013, Spike Jonze’s film follows a lonely man as he falls in love with (and spoiler: gets his heart broken by) his increasingly self-aware AI phone assistant voiced by Johannsen. OpenAI CTO Mira Murati denied drawing any intentional inspiration from Johannsen’s title character role in an interview last week, and the company’s new blog post doubles down on the claim despite pausing Sky’s availability. According to OpenAI, Sky and ChatGPT’s four other voices were the result of months of careful planning and consideration, any resemblances to Johannsen are purely accidental—nevermind that CEO Sam Altman once said Her is his favorite film of all time—two weeks before launching the ChatGPT voices in September 2023. And simply tweeted ‘her,’ shortly after last week’s GPT-4o launch.
How OpenAI chose its ‘timeless’ chatbot voices
In the May 19 post, OpenAI offered a rundown of how it chose GPT’s voice options. Alongside Sky, the Breeze, Cove, Ember, and Juniper characters were reportedly designed following five months of consultation with actors, talent agencies, industry advisors, and casting directors. The company then settled on a criteria list that each voice needed to fulfill in order to create what they believed would be the most appealing, universal options possible. Among those characteristics were the need for diverse, multilingual actors capable of charismatic, approachable, trust-inspiring performances—also “a voice that feels timeless,” although it’s not exactly clear what might determine such a quality.
[Related: OpenAI dissolved its team dedicated to preventing rogue AI.]
Designers issued a casting call in May 2023 and reportedly received over 400 submissions. Each actor’s entry consisted of prerecorded, scripted ChatGPT responses ranging from “answering questions about mindfulness to brainstorming travel plans, and even engaging in conversations about a user’s day.” The team apparently then interviewed 14 finalists about “the vision for human-AI voice interactions” while discussing the technology’s “capabilities, limitations, and the risks” alongside supposed OpenAI’s ongoing “safeguards.” (No word on if these safeguard chats included the now completely dissolved superalignment team, whose co-lead publicly resigned last week citing concerns over the “company’s core priorities.”)
The final five actors then flew to San Francisco for recording sessions and in-person meetings over June and July of last year, followed by OpenAI publicly launching ChatGPT’s new voices in September 2023. Most users didn’t seem to mind, or even notice, any AI voice’s potential pop culture similarities during the next seven months, but that changed overnight with GPT-4o’s much more expressive upgrades.
With banter and the ability to interpret facial expressions through camera input, GPT’s voices gained newly personable quirks and emotive capabilities—as well as the recent Johannsen comparisons. It’s unclear how long the company plans to spend tweaking Sky’s voice, and what the revamped version will sound like.
OpenAI hasn’t said if potential legal threats were a behind-the-scenes factor, and did not respond to PopSci at the time of writing.