Speaking fast and slow: Voice tech needs a better script
The value of a technology is often linked to its incremental increases in speed and processor power, but when it comes to the developing world of Voice, things might not be so clear cut.
The fastest oral route from A to B is not always best, so how will Voice designers and conversational copywriters reconcile this apparent contradiction?
The value placed on the speed of our technology is evident in Apple's yearly fanfare to welcome the new iteration of iOS and then, a few months later, a new generation of products. If you are a willing slave to Apple's dear-delights, you'll know this means your shiny and expensive tech will soon be asked to give back the gold medals and move to second place.
The preceding months will involve an internal battle to justify replacing things that do not need replacing: firstly to yourself and then to your wife, who says she doesn't recognise the importance of a 0.5x increase in GPU until you remind her she'll get the hand-me-down device and/or a reward to be named later.
The Nuances of Spoken Communication
But let's get back to Voice. Speech is our primary form of communication and starts with a complete focus on function over form. Phoebe (the younger of the two amazing women I share my world with) started with single words customarily accompanied by a pointing finger.
At the earliest stages the same single word, "yum", was used to express a desire for anything remotely edible and, on rare occasions, a cuddle from a besotted father. This simplified form of communication was tremendously effective, particularly as her vocabulary increased to differentiate berries from bananas.
A very similar story seems to be unfolding when we compare this kind of development to that of Voice assistants and Voice tech in general. We need only look at the avant-garde Irish TikTok-ist Tadhg Fleming and his seminal work "Alepsa" (watch here) to see how impressive the technology has become in terms of understanding what we are slurring in its general direction. Yet this highlights only one half of our progressively Voice-enabled world: the input.
The development of language beyond our formative years sees an explosion of nuances that go way beyond vocabulary. They include non-verbal things like facial cues and, of course, the vital role played by our hands as we wildly gesticulate to hold someone's attention and explain more clearly just how big that fish was.
Upon reflection, many of our spoken interactions are far from efficient in terms of speed and clarity of meaning. Even the best orators amongst us (I'm thinking Stephen Fry) are plagued by speech disfluency, which is said to make up a startling 20% of spoken language. Disfluency refers to, umm the, ahh meaningless, hmmm noises that litter any normal conversation.
But are they really meaningless? A conversation without these cues feels, well, robotic. In fact, the absence of these elements is the easiest way to detect that you are talking to a robot.
As with changes to cadence and the pace of delivery, these cues mean that what's in the script and what is being communicated can be poles apart. Anyone forced to sit through a school play will understand how many different ways there are to butcher Mr Shakespeare's work.
Challenges in Developing Natural-Sounding Conversations
Now this variety in interpretation is a problem for technology. Our entire world runs on code, which has been designed to remove all ambiguity through the ruthless pursuit of efficiency.
Yet the way our use of language has evolved breaks every fundamental rule of coding. Single words have multiple meanings, and minor changes in pace can alter the entire focus of a sentence. This is eloquently demonstrated by Ismo Leikola, who, while learning English, noted the critical role of understanding the various and contradictory meanings of the word shit (watch here).
This is going to present several challenges as Voice technology develops beyond a relationship of master and slave. Our current interactions with Voice are predominantly instructional: "Turn off lights", "Open blinds" or some squealed and distorted version of "plee Meecool Boobies".
Instructional speech is relatively free from the revelries of fluid conversation, making it easier to decipher, and easier to respond to. Our Voice assistants typically respond by parroting back a version of the request - "Turning off the lights, opening the blinds and now playing Michael Bublé". This is still an impressive demonstration of our technical prowess, but it's hardly a conversation.
There have been efforts to create more natural, human speech. Google Duplex demonstrated a frighteningly accurate approximation of booking a haircut over the phone. This was only a demo but showed how the inclusion of the umms and ahhs was enough to fool the unknowing human on the other end of the phone.
Interestingly, despite our innate drive for perfection and technical development, this demo did not go down well at all. Something about it was unsettling and came a little too close to the sentience of Skynet. As a result, everyone started screaming before running home to unplug anything that looked remotely threatening.
It appears there will need to be a deal struck before we are comfortable allowing the machines to take on our idiosyncrasies, which brings us back to efficiency.
Our verbal communications are incredibly efficient, but not in a linear way. We can communicate, through a combination of the words and their delivery, something well beyond the script alone. Something that goes even further than form and can reach feeling. That move from function towards form, and as far as feeling, is currently well beyond the most powerful of computers.
Advanced verbal communication is a highly prized skill amongst humans, and rightly so. The vast majority of us can talk, but anyone subjected to Chris Lomas* telling them about the time he scored a hat-trick for the 15th time knows that delivery can make or break a story.
For Voice designers and conversational copywriters, the next frontier is to move beyond our current transactional experiences with Voice and towards something more conversational. In the pursuit of this aim, it is going to be critical to recognise the depth of what we are able to communicate through our voices beyond words.
To borrow the work of Ismo Leikola, Voice technology has the potential to be the shit. Yet a change in emphasis, a slight alteration in pace or a misplaced word could quickly turn it into something else.