Going Beyond Listening: The Importance of Understanding in Conversational AI
Speech recognition is never 100% right. That’s simply a fact. If you’ve ever tried the speech-to-text feature when composing a text message, then you know how inaccurate voice transcription can be. It may get most of the words right, but rarely all of them. Although we may have entered the golden age of AI, and AI technologies have become part of our everyday lives, machines still fall short when it comes to mastering human language. That creates a problem for contact center leaders who want to implement AI for self-service while maintaining a customer experience (CX) that is as good as or better than speaking to a human.
Check out this real conversation between a stranded driver and a AAA virtual agent, and see what the ASR (automatic speech recognition) actually transcribed:
Why is speech recognition technology so difficult?
Human speech varies tremendously, and so do the conditions that degrade audio quality. It’s hard enough for speech-to-text engines to get the audio right when you’re composing a text message or speaking to a home device, where the audio is captured in high definition. When someone dials a phone number, the signal is reduced to a low-fidelity 8 kHz waveform, which cuts out more than half of the highs and lows and makes it even harder for ASR engines to get it right. And that’s before poor conditions such as background noise, distance from the microphone, phone line noise, or a weak cell connection further degrade the call quality.
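To get a feel for how much of the signal a phone call throws away, here’s a minimal sketch that band-limits studio-quality audio to the classic 300-3400 Hz telephone band and resamples it to 8 kHz. It assumes SciPy and NumPy are installed, uses placeholder file names, and illustrates narrowband telephony in general rather than any particular vendor’s audio pipeline.

```python
# Minimal sketch: simulate narrowband telephony audio (illustrative only).
# Assumes SciPy and NumPy are installed; file names are placeholders, and the
# input is assumed to be a mono recording.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt, resample_poly

def to_narrowband(samples: np.ndarray, rate: int) -> tuple[np.ndarray, int]:
    """Band-limit audio to the ~300-3400 Hz telephone band and resample to 8 kHz."""
    # 4th-order Butterworth band-pass, roughly matching the classic phone channel.
    sos = butter(4, [300, 3400], btype="bandpass", fs=rate, output="sos")
    filtered = sosfilt(sos, samples.astype(np.float64))
    # Downsample to the 8 kHz rate used on most phone lines.
    narrow = resample_poly(filtered, up=8000, down=rate)
    return narrow, 8000

rate, samples = wavfile.read("studio_quality.wav")          # e.g. 44.1 kHz source
narrow, narrow_rate = to_narrowband(samples, rate)
wavfile.write("telephone_quality.wav", narrow_rate,
              np.clip(narrow, -32768, 32767).astype(np.int16))
```

Listening to the two files side by side makes it obvious how much of the signal the phone channel strips out before an ASR engine ever hears it.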
Because of the complexity of human speech and the audio degradation over telephony, even the best speech recognition engines are not enough on their own to accurately hear the customer and ensure that the virtual agent delivers an appropriate response. A great deal of custom work in Natural Language Understanding (NLU) is required to augment the accuracy of speech recognition, and there is no such thing as a “one size fits all” NLU engine that achieves the highest accuracy possible. The engine must be tailored by NLU developers question by question, across every automated interaction, to account for all possible responses or outcomes and weigh them against expected intents. This is the only way to elevate the experience with AI to be on par with a human.
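To make that concrete, here’s a minimal sketch of question-by-question matching. It assumes a hypothetical dialog with a couple of prompts and uses plain string similarity as a stand-in for a production NLU engine; none of it is SmartAction’s actual implementation.

```python
# Illustrative only: per-question expected answers, with plain string similarity
# standing in for a production NLU engine. Dialog states and answers are hypothetical.
from difflib import SequenceMatcher

EXPECTED_ANSWERS = {
    "ask_service_type": ["tow", "flat tire", "battery", "lockout", "fuel"],
    "ask_vehicle": ["ford f150 truck", "ford f250 truck", "honda civic", "toyota camry"],
}

def best_match(state: str, transcript: str) -> tuple[str, float]:
    """Return the expected answer closest to the ASR transcript for this question."""
    transcript = transcript.lower()
    scored = [
        (answer, SequenceMatcher(None, transcript, answer).ratio())
        for answer in EXPECTED_ANSWERS[state]
    ]
    return max(scored, key=lambda pair: pair[1])

answer, score = best_match("ask_service_type", "flat tire on the highway")
print(answer, round(score, 2))   # "flat tire" wins with a score around 0.55
```

The key point is that each question carries its own narrow set of expected answers, so the transcript only has to beat a handful of candidates rather than the entire language.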
What you say versus what the ASR hears
The AAA call (in the video above) takes place under the worst possible conditions: on speakerphone, outdoors, and with poor cell connectivity. The caller also has a regional accent and is frequently interrupted by crosstalk.
What’s interesting to note is that during vehicle capture, the speech recognition engine did not hear the caller say “Ford F250 truck.” Instead, it quite literally transcribed “Aboard have to fifty truck,” since the “F” in Ford was not audible in the caller’s response. With most other speech recognition engines, which produce statistically based transcriptions without the associated NLU stitching to match outcomes to predicted intents, this would likely have resulted in a failed call. Some ASR engines do come with a contextual NLU engine that corrects words within the context of a sentence, but that’s a far cry from pattern-matching language acoustic models against expected answers.
So how did our virtual agent know what the caller meant to say?
The SmartAction approach to recognition + cognition
SmartAction augments its proprietary speech recognition with a highly customized NLU engine that can take any response heard from a customer and run multiple hypotheses to see how closely the language acoustic models match up with what we expected to hear.
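One way to picture the multiple-hypothesis step: score every transcription hypothesis in the ASR’s n-best list against every expected answer for the current question and keep the best pairing. The sketch below uses simple string similarity in place of acoustic-model matching, and the hypotheses and answers are invented, so treat it as an illustration of the idea rather than the engine itself.

```python
# Illustrative n-best matching: string similarity stands in for the acoustic and
# phonetic scoring described above. Hypotheses and answers are hypothetical.
from difflib import SequenceMatcher
from itertools import product

def pick_intent(nbest: list[str], expected: list[str]) -> tuple[str, float]:
    """Return the expected answer that best explains any hypothesis in the n-best list."""
    best_answer, best_score = "", 0.0
    for hypothesis, answer in product(nbest, expected):
        score = SequenceMatcher(None, hypothesis.lower(), answer.lower()).ratio()
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer, best_score

nbest = ["yes the battery is dead", "yes the better he is dead", "guess the battery is dead"]
print(pick_intent(nbest, ["battery", "flat tire", "tow", "lockout"]))   # picks "battery"
```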
Customer service interactions almost always have a narrow scope of expected responses that can be accounted for. This fact enables a very special kind of NLU that can be custom built and tuned against customer-specific and domain-specific criteria for every question and answer in a given interaction. Even when a speech rec transcription is wrong, it’s usually within 5-15 degrees of separation from the intended response. “Aboard have to fifty truck” might be a little more than 15 degrees of separation from “Ford F250 truck,” but our NLU engine was able to properly match the intent even though the hypothesis score was only at 50% confidence.
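Here’s a rough illustration of why the garbled transcript still lands on the right vehicle. If each expected answer is expanded into the way it is actually spoken (treating “F250” as “f two fifty” is an assumption of this sketch, as is the vehicle list), even a generic similarity measure ranks the F250 ahead of every other candidate.

```python
# Illustrative only: expand each expected vehicle into the way it is spoken
# (treating "F250" as "f two fifty" is an assumption of this sketch) and score
# the garbled transcript against every expansion.
from difflib import SequenceMatcher

SPOKEN_VEHICLES = {
    "Ford F150 truck": "ford f one fifty truck",
    "Ford F250 truck": "ford f two fifty truck",
    "Chevy Silverado truck": "chevy silverado truck",
    "Honda Civic": "honda civic",
}

transcript = "aboard have to fifty truck"   # what the ASR actually heard

ranked = sorted(
    ((label, SequenceMatcher(None, transcript, spoken).ratio())
     for label, spoken in SPOKEN_VEHICLES.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for label, score in ranked:
    print(f"{label}: {score:.2f}")
# "Ford F250 truck" comes out on top, ahead of the F150 and everything else,
# even though the transcript never contains the word "Ford".
```

The point isn’t the specific similarity function; it’s that the set of expected answers is small enough that even a badly garbled transcript has one clear nearest neighbor.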
So even when the speech-to-text transcription is wrong, we can still extrapolate what was intended with a high level of accuracy. This is ultimately the power of an NLU engine that can be customized and tuned for every question and expected answer. While this approach is time-consuming for developers and not highly scalable, it is the only way to dial up the customer experience with AI to be as frictionless and accurate as possible.
Moreover, since speech rec is rarely 100% right but often close, this kind of NLU can match low probability confidence scores to the right answer as long as the degrees of separation between what was actually heard and what was predicted aren’t too far apart.
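In practice that usually comes down to a decision rule that combines a confidence floor with a margin over the runner-up: a low score can still be accepted when no other expected answer comes close. A simplified version, with thresholds invented purely for illustration, might look like this:

```python
# Illustrative accept / re-prompt rule; both thresholds are invented for this sketch.
def decide(scored: list[tuple[str, float]], floor: float = 0.45, margin: float = 0.05):
    """Accept the top-scoring expected answer only if it is plausible and clearly ahead."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    best_answer, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score >= floor and best_score - runner_up_score >= margin:
        return best_answer    # confident enough to continue the dialog
    return None               # too ambiguous: re-prompt or hand off to a live agent
```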
This is also why it’s important during the CX design process to ask questions in a way that elicits short, succinct responses rather than long, run-on sentences, which dilute the power of this kind of NLU approach and only increase the chance that the ASR gets something wrong.
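As a simple illustration of that design principle, compare an open-ended prompt with a constrained one and the expected-answer set each implies; the wording below is hypothetical.

```python
# Hypothetical prompt designs: the constrained wording keeps the set of expected
# answers, and therefore the NLU matching space, small.
OPEN_ENDED = {
    "prompt": "Tell me what happened.",
    "expected_answers": None,   # effectively unbounded; invites long, run-on replies
}
CONSTRAINED = {
    "prompt": "Do you need a tow, a jump start, a tire change, fuel, or a lockout?",
    "expected_answers": ["tow", "jump start", "tire change", "fuel", "lockout"],
}
```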
So, while we can’t be all things to all people without custom development, we aren’t trying to be. What’s more important is that we hone our understanding to deliver the most frictionless experience possible to customers.