OK human, take my reservation

Talking to Google Duplex: Google’s human-like phone AI feels revolutionary

Believe the hype—Google's phone-call bot is every bit as impressive as promised.

A picture of a Google Duplex notification
The end result of Duplex. You ask for a reservation, it makes a phone call in the background, and gets back to you with a result.

NEW YORK—Evidently, I didn't walk into a run-of-the-mill press event. Roughly two months after its annual I/O conference, Google this week invited Ars and several other journalists to the THEP Thai Restaurant in New York City. The company bought out the restaurant for the day, cleared away the tables, and built a little presentation area complete with a TV, loudspeaker, and chairs. Next to the TV was a podium with the Thai restaurant's actual phone—not some new company smartphone, the ol' analogue restaurant line.

We all knew what we were getting into. At I/O 2018, Google shocked the world with a demo of "Google Duplex," an AI system for accomplishing real-world tasks over the phone. The short demo felt like the culmination of Google's various voice-recognition and speech-synthesis capabilities: Google's voice bot could call up businesses and make an appointment on your behalf, all while sounding shockingly similar—some would say deceivingly similar—to a human. Its demo even came complete with artificial speech disfluencies like "um" and "uh."

The short, pre-recorded I/O showcase soon set off a firestorm of debate on the Web. People questioned the ethics of an AI that pretended to be human, wiretap laws were called into question, and some even questioned if the demo was faked. Other than promising Duplex would announce itself as a robot in the future, Google had been pretty quiet about the project since the event.

Then all of a sudden, Google said it was ready to talk more about Duplex. Even better, the company would let me talk directly with the infamous AI. So for an afternoon at least, I wasn't Ron Amadeo, Ars Technica Reviews Editor—I was Ron Amadeo, THEP restaurant employee waiting to field "live" phone calls from a bot.

Eventually, the Duplex flow will work something like this. For today, the Google Assistant voice command system didn't work. Starting Duplex happened via a laptop.

Talking to Google Duplex

Unfortunately, Google would not let us record the live interactions this week, but it did provide a video we've embedded below. The robo call in the video is, honestly, perfectly representative of what we experienced. But to allay some of the skepticism out there, let's first outline the specifics of how this demo was set up along with what worked and what didn't.

Ironically, the only thing that wasn't working in our demo was the one thing anyone can try today: the Google Assistant. In a consumer Google Duplex interaction, a user would say something like "OK Google, reserve a table for four at the THEP Thai Restaurant at 6pm." From there, the Google Assistant would fire up Duplex and make the call. But in our demo, the call was never initiated with a voice command. Instead, an engineer in the corner of the room silently punched reservation requirements into his computer, and Duplex then took over and called the business.

(Fortunately, voice activation seems like the least important part of Google Duplex. We know the Google Assistant works. We know it can handle voice commands. We know it can start a call with a named business using Google Maps info.)

The THEP restaurant phone proved to be very much a real, live phone line. At one point in between demos, the phone unexpectedly started ringing. The Google rep quickly shot a "Wait, did you start a call?" question at the engineer in the corner. After he said no, THEP's owner hurriedly jogged over to the phone to speak to a genuine customer.

During the demonstration period, things went much more according to plan. Over the course of the event, we heard several calls, start to finish, handled over a live phone system. To start, a Google rep went around the room and took reservation requirements from the group, things like "What time should the reservation be for?" or "How many people?" Our requirements were punched into a computer, and the phone soon rang. Journalists (err, restaurant employees) could steer the call however they chose. Some put in an effort to confuse Duplex and throw it some curveballs, but this AI worked flawlessly within the very limited scope of a restaurant reservation.

I need to keep my day job

In my group, I took the first phone call from Google Duplex. I walked up to the front of the presentation area, picked up the ringing receiver, and the call started on the phone and over the loudspeaker. Listening to recordings of Duplex is one thing, but participating in a call with Google's phone bot (in front of a live audience, no less) is a totally different experience. Immediately, I realized this was much more than I was expecting: Google PR, Google engineers, restaurant staff, and several other journalists were intently watching and listening to me take this call over the speaker. I was nervous. I've never taken a restaurant reservation in my life, let alone one with an audience and an engineering crew monitoring every utterance. And you know what? I sucked at taking this reservation. And Duplex was fine with it.

Duplex patiently waited for me to awkwardly stumble through my first-ever table reservation while I sloppily wrote down the time and fumbled through a basic back-and-forth about Google's reservation for four people at 7pm on Thursday. Today's Google Assistant requires authoritative, direct, perfect speech in order to process a command. But Duplex handled my clumsy, distracted communication with the casual disinterest of a real person. It waited for me to write down its reservation requirements, and when I asked Duplex to repeat things I didn't catch the first time ("A reservation at what time?"), it did so without incident. When I told this robocaller the initial time it wanted wasn't available, it started negotiating times; it offered an acceptable time range and asked for a reservation somewhere in that time slot. I offered seven o'clock and Google accepted.

From the human end, Duplex's voice is absolutely stunning over the phone. It sounds real most of the time, nailing most of the prosodic features of human speech during normal talking. The bot "ums" and "uhs" when it has to recall something a human might have to think about for a minute. It gives affirmative "mmhmms" if you tell it to hold on a minute. Everything flows together smoothly, making it sound like something a generation better than the current Google Assistant voice.

One of the strangest (and most impressive) parts of Duplex is that there isn't a single "Duplex voice." For every call, Duplex would put on a new, distinct personality. Sometimes Duplex came across as male; sometimes female. Some voices were higher and younger sounding; some were nasally, and some even sounded cute.

As impressive as it is to hear a computer realistically replicate human speech, the model that generates these voices, WaveNet (from Google's DeepMind division), is actually holding back in the human mimicry department. DeepMind's blog has already revealed that WaveNet can generate human mouth sounds if it wants to. On the blog, there are demos of it breathing and making lip smack noises between sentences. Duplex doesn't do any of that yet.

During the I/O keynote, Google played a brief, pre-recorded Duplex call. Given that the recording was missing many of the important chunks of a normal business call, many suspected that the demo was heavily edited. The employees never said the business' name, and Google never gave out important identifying information like a phone number. People also took issue with the lack of disclosure that Duplex was a robot, and the lack of a call-recording disclosure would be a violation of the law in many states. I think the simplest explanation for the I/O demo is that Google's call was edited for privacy and brevity, and it was only meant as a teaser. During our time at THEP Thai, all of these concerns were addressed.

Every single call started with something along the lines of, "Hi, I'm calling to make a reservation. I'm Google's automated booking service, so I'll record the call. Can I book a reservation for..."  This covered both the "I'm a robot" disclosure and the "this call is being recorded" concerns brought up earlier. Google says it's still working on the exact messaging, but the company always intended to disclose that it was a robot recording the call.

Duplex is fine giving out information, but it's designed to give out only the information the bot is authorized to share. In today's demo, Duplex would clearly, slowly spell out the demo caller's phone number or name when asked. It even had good phone etiquette, saying things like, "The name is Ron, that's R, O, N." At one point, the caller's email was asked for and Duplex responded with "I'm afraid I don't have permission to share my client's email."

This spelling out of names and numbers is the one time Duplex really loses the illusion of sounding human. It's almost like WaveNet didn't practice this part of speech at all, and the service drops into a Speak & Spell mode when it needs to rattle off individual characters. The intonation of each letter or number is all over the place, never flowing with normal beginning and ending tones that a human would use.

Looking back, I also take issue with some of the "personalities" Duplex presented. The Google Assistant presents itself as a happy, professional robot assistant with a bit of a fun streak. It can tell the occasional joke, but the Assistant usually speaks with proper language, good enunciation, and a happy, upbeat attitude. In contrast, Duplex is much more casual. Google basically built a secretary AI with Duplex, but it doesn't speak with the practiced confidence of someone accustomed to making reservations—it often sounds like a teenager ordering a pizza. That's not necessarily how I would want to be represented to a business. The casual attitude can sometimes combine with the occasional intonation glitch and come across as annoyed, tired, disinterested, or sarcastic.
