Eerily Realistic AI Voice Demo Sparks Amazement and Discomfort Online
Freeman writes:
In late 2013, the Spike Jonze film Her imagined a future where people would form emotional connections with AI voice assistants. Nearly 12 years later, that fictional premise has edged closer to reality with the release of a new conversational voice model from AI startup Sesame that has left many users both fascinated and unnerved.
"I tried the demo, and it was genuinely startling how human it felt," wrote one Hacker News user who tested the system.
[...]
In late February, Sesame released a demo for the company's new Conversational Speech Model (CSM) that appears to cross what many consider the "uncanny valley" of AI-generated speech.
[...]
"At Sesame, our goal is to achieve 'voice presence'-the magical quality that makes spoken interactions feel real, understood, and valued," writes the company in a blog post.
[...]
Sometimes the model tries too hard to sound like a real human. In one demo posted online by a Reddit user called MetaKnowing, the AI model talks about craving "peanut butter and pickle sandwiches."
[...]
"I've been into AI since I was a child, but this is the first time I've experienced something that made me definitively feel like we had arrived," wrote one Reddit user.
[...]
Many other Reddit threads express similar feelings of surprise, with commenters saying it's "jaw-dropping" or "mind-blowing."
[...]
Mark Hachman, a senior editor at PCWorld, wrote about being deeply unsettled by his interaction with the Sesame voice AI. "Fifteen minutes after 'hanging up' with Sesame's new 'lifelike' AI, and I'm still freaked out," Hachman reported.
[...]
Some have compared Sesame's voice model to OpenAI's Advanced Voice Mode for ChatGPT, saying that Sesame's CSM features more realistic voices; others are pleased that the model in the demo will roleplay angry characters, which ChatGPT refuses to do.
[...]
Under the hood, Sesame's CSM achieves its realism by using two AI models working together (a backbone and a decoder), based on Meta's Llama architecture, that process interleaved text and audio. Sesame trained the model at three sizes, with the largest using 8.3 billion parameters (an 8-billion-parameter backbone plus a 300-million-parameter decoder) on approximately 1 million hours of primarily English audio.
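Neither the story nor Sesame's announcement includes implementation code, but a minimal sketch can illustrate the general shape of a backbone-plus-decoder transformer operating on interleaved text and audio tokens. Everything below (class names, layer counts, vocabulary sizes, and the simple concatenation standing in for true interleaving) is a hypothetical PyTorch illustration, not Sesame's actual architecture:

    # Hypothetical sketch of a backbone + decoder speech model over
    # interleaved text and audio tokens. All names and sizes are
    # illustrative; Sesame has not released this code.
    import torch
    import torch.nn as nn

    class TinyTransformer(nn.Module):
        """Minimal causal transformer stack (stand-in for a Llama-style model)."""
        def __init__(self, dim, layers, heads):
            super().__init__()
            block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.stack = nn.TransformerEncoder(block, num_layers=layers)

        def forward(self, x):
            # Causal mask: each position may only attend to earlier tokens.
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            return self.stack(x, mask=mask)

    class SketchCSM(nn.Module):
        """Hypothetical backbone + decoder pair over interleaved token streams."""
        def __init__(self, text_vocab=32000, audio_vocab=1024, dim=256):
            super().__init__()
            self.text_emb = nn.Embedding(text_vocab, dim)    # text tokens
            self.audio_emb = nn.Embedding(audio_vocab, dim)  # audio codec tokens
            self.backbone = TinyTransformer(dim, layers=4, heads=4)  # large model
            self.decoder = TinyTransformer(dim, layers=1, heads=4)   # small model
            self.to_audio = nn.Linear(dim, audio_vocab)

        def forward(self, text_ids, audio_ids):
            # Real interleaving would alternate text and audio segments; for
            # brevity this sketch concatenates text context, then audio history.
            seq = torch.cat([self.text_emb(text_ids),
                             self.audio_emb(audio_ids)], dim=1)
            hidden = self.backbone(seq)     # model the joint sequence
            refined = self.decoder(hidden)  # refine into audio-code space
            return self.to_audio(refined)   # logits over next audio tokens

    model = SketchCSM()
    logits = model(torch.randint(0, 32000, (1, 12)),  # 12 text tokens
                   torch.randint(0, 1024, (1, 24)))   # 24 audio tokens
    print(logits.shape)  # torch.Size([1, 36, 1024])

At Sesame's reported scale, the backbone would be the 8-billion-parameter Llama-based model and the decoder the 300-million-parameter one; the toy layer counts and dimensions above are only there to keep the sketch small and runnable.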
Read more of this story at SoylentNews.