In recent years, we have witnessed dramatic progress in mimicking the primary tool humans use to communicate: speech. Increasingly, our internet-connected devices can recognize speech, translate the intent of the speaker, respond using recorded and synthesized voices, and integrate with services that order a pizza or change the thermostat. When we were pushing buttons to control our devices, we were more likely to think of them as behaving mechanistically, as being levers and pulleys at bottom. No longer. As dramatized by Joaquin Phoenix and Scarlett Johansson in Her, these voice boxes are a whole new soundscape of seduction. This aural interface can create a powerful illusion.
Each component of the trick is key: speech recognition to receive input, natural language processing to translate verbal requests into commands, third-party APIs to execute them, and speech synthesis to respond.
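For readers who think in code, the four components can be pictured as a simple chain of functions. The following is only a toy sketch: every function name is hypothetical, the "audio" is just UTF-8 text standing in for sound, and the keyword matching stands in for real natural language processing. No actual assistant is claimed to work this way.

```python
from dataclasses import dataclass


@dataclass
class Command:
    """A structured request extracted from a spoken sentence."""
    intent: str
    argument: str


def recognize_speech(audio: bytes) -> str:
    # Stand-in for speech recognition: the "audio" here is already text.
    return audio.decode("utf-8")


def parse_intent(transcript: str) -> Command:
    # Stand-in for natural language processing: naive prefix matching.
    text = transcript.lower().strip()
    if text.startswith("order "):
        return Command("order", text[len("order "):])
    if text.startswith("set thermostat to "):
        return Command("set_thermostat", text[len("set thermostat to "):])
    return Command("unknown", text)


def call_service(command: Command) -> str:
    # Stand-in for third-party APIs: no network calls, just canned replies.
    if command.intent == "order":
        return f"Ordered {command.argument}."
    if command.intent == "set_thermostat":
        return f"Thermostat set to {command.argument}."
    return "Here's what I found on the web."


def synthesize_speech(reply: str) -> bytes:
    # Stand-in for speech synthesis: encode the reply text as bytes.
    return reply.encode("utf-8")


def handle_request(audio: bytes) -> bytes:
    """Run the four stages in order: hear, interpret, act, respond."""
    transcript = recognize_speech(audio)
    command = parse_intent(transcript)
    reply = call_service(command)
    return synthesize_speech(reply)
```

Calling `handle_request(b"Order a pizza")` walks the request through all four stages. Anything the toy parser does not anticipate falls through to the familiar web-search reply.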
Speech recognition is the pledge of this illusion: “I hear you.” The first system, Audrey (1952), could only recognize the digits zero through nine, and even then, for best results, only from a particular speaker. It would be decades before speech recognition could pick out the key words in a human sentence spoken clearly by almost anyone. Using machine learning on enormous data sets, speech technology’s pattern recognition can now generalize over many nuances and varieties of human speech.
Before we anthropomorphized these programs as Alexa, Siri, and Cortana, the earliest iterations sounded far from human. To WOPR in WarGames, the dissonant screeching of a dial-up modem establishing a connection was a perfectly adequate “Hello!” But to us, its invitation, “Shall we play a game?”, sounded obviously synthetic and impersonal. Early versions of text-to-speech software sounded equally robotic. The words, pieced together individually, lacked the lilt, cadence, and emphasis of human speech. The same is true if you select this text on macOS, right-click, and choose “Start Speaking.” Progress, but a ways to go.
Success in humanizing speech technology has been achieved mostly by recording vast libraries of spoken words and phrases. A few melodious speakers enjoy full-time jobs giving voice to our assistants, recording sentences and word pairings day after day. Voice packs featuring celebrity voices became popular on navigation devices in the aughts and are now making their way onto the voice assistants from Amazon, Apple, and Google. With a prompt, you can interact with a disembodied version of the once-inimitable Samuel L. Jackson.
There will be synthesized versions of gravelly voices, deep baritones, fast talkers, low talkers, high talkers, yada, yada, yada. Indeed, Amazon has added functions to its Speech Synthesis Markup Language (SSML) to enable Alexa to whisper, emphasize a word, or mimic local slang: sound skeuomorphisms, if you will. Might Siri get a scratchy throat, or Alexa have an occasional bout of laryngitis?
Talking to your Amazon Alexa is cool, but sometimes her responses can be robotic. (Just because she’s an AI doesn’t mean she has to sound like an AI, right?) Amazon hopes to change that by giving developers the ability to hone Alexa’s responses with “a wider range of natural expression.” Christina Bonnington, “Alexa’s Responses Are About to Get More Human” at The Daily Dot
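To make the whispering and emphasis concrete, here is a small, hedged sketch that assembles an SSML string in Python. The `<speak>` and `<emphasis>` elements come from the SSML standard, and `<amazon:effect name="whispered">` is the Alexa extension as I understand it from Amazon's documentation. The helper names (`plain`, `whisper`, `emphasize`, `speak`) are invented for this illustration and are not part of any Amazon library.

```python
from xml.sax.saxutils import escape

# Emphasis levels assumed here: "strong", "moderate", "reduced".
EMPHASIS_LEVELS = {"strong", "moderate", "reduced"}


def plain(text: str) -> str:
    """Escape characters such as & and < so plain text is safe inside SSML."""
    return escape(text)


def whisper(text: str) -> str:
    """Wrap text in Alexa's whispered effect (an Amazon-specific tag)."""
    return f'<amazon:effect name="whispered">{escape(text)}</amazon:effect>'


def emphasize(text: str, level: str = "strong") -> str:
    """Wrap text in a standard SSML emphasis element."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def speak(*fragments: str) -> str:
    """Join already-marked-up fragments inside a single <speak> root."""
    return "<speak>" + " ".join(fragments) + "</speak>"


ssml = speak(
    whisper("I can keep a secret."),
    plain("But this is"),
    emphasize("really"),
    plain("important."),
)
```

The resulting string is what a skill would hand to the assistant in place of plain text. The markup changes how the words are voiced and nothing about what is understood.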
None of these quirks and idiosyncrasies of human speech are essential to the services voice assistants provide. Our desire to echo our humanity shapes our inventions here too. More recordings are made to better capture the human resonance, endless halls of servers are built to do the math, and ever more lines of code are written, all to make our assistants better conversationalists.
But even with untold hours of research, anticipating questions, and preprogramming responses, the limits of these systems are easily discovered. The most basic commands are often mistranslated. Even after requesting your favorite song or podcast for the umpteenth time, your assistant doesn’t remember. The canned jokes get a chuckle, but don’t evince a sense of humor. The many gaps in programming leave these digital assistants grasping at the Web for what humans have said on the matter. “Here’s what I found on the web …” Our talking boxes take our requests, run their routines, and send out results but understand nothing, whether or not they’re made in China.