How the Art of Listening Matters with Conversational AI
Join our most densely packed informational webinar, where we discuss the secret sauce behind great speech technology in the contact center: how NLU augments speech recognition for the highest accuracy possible, since speech-to-text is never 100% right.
This webinar will answer:
- How NLU elevates the AI self-service experience to human-like levels
- How and where speech-to-text falls short and requires NLU augmentation
- How speech rec and NLU interplay as recognition and cognition
- Real-world examples of speech-to-text getting it wrong but NLU making it right
Watch This Webinar
Brian Morin
Chief Marketing Officer,
SmartAction
Helena Chen
Director of Product Marketing,
SmartAction
How the Art of Listening Matters with Conversational AI
On-Demand Webinar
Helena Chen: Hello, and welcome. Thank you so much for joining us for How the Art of Listening Matters with Conversational AI. My name is Helena, I am your host and moderator for today, and here with me is conversational AI expert and SmartAction CMO, Brian Morin. We have a very informative webinar planned for you, so let’s get right to it, and if you have any questions along the way, feel free to type them into the chat box and we’ll answer as many as we can during the Q&A session. And with that, I will go ahead and pass this over to Brian.
Brian Morin: Yeah, thanks for that, Helena. And we do like this to be as interactive as possible, so please just chime into the Q&A box or chat box. Let us know you’re listening, interact. We’ll handle what we can as this webinar goes along, and once we do reach the bottom of the hour, we will get to a full Q&A.
So on screen, this is just a very short introduction on SmartAction. If you don’t know anything about us, or why we have any credibility on the topic today, let me just keep this super short while we still have folks coming into the webinar. We deliver AI-powered virtual agents as a service, and so that means that we deliver the full conversational AI technology stack for self-service. It’s turnkey. It’s omni-channel. All of our customers use our voice self-service module. Most of them combine that with what we do over SMS and chat to offer self-service there as well, including the very latest rich web chat with point-and-click buttons, image carousels, form fields, the works.
But we don’t just deliver the technology, throw it over the fence, and wish you good luck. We actually bundle it with end-to-end CX services, and that means everything: the design, the build, the ongoing operation. So really, at the end of the day, we step in more as a partner than just a technology provider. Conversations with machines are really complex, and they require experts in the field and ongoing care and feeding to do really well.
So we would like to think our approach is working for us. We’re currently the top-ranked conversational self-service solution on Gartner Peer Insights, as ranked by customer reviews, 4.8 out of five. In fact, we just added two more five-star reviews in the last week. So that’s the last plug I’ll give. If you are interested in our street cred, those reviews are probably a good place to start. I know you didn’t join just to get a SmartAction commercial today, so I’ll throw it over to Helena to get us started into what is, at least within a 30-minute session, about as deep as somebody can take you into the latest AI technologies and the tools and techniques that are driving frictionless experiences with AI today. So, Helena.
Helena Chen: Thanks, Brian. So as with any human-to-human conversation, there’s hearing, and then there’s the understanding. So these are two different functions, and it’s the same thing with AI. And that’s why good speech technology will have two AI engines, one for recognition, which converts the utterances to letters and words, and another engine for cognition, to piece together the right words and extract the intent.
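To make that two-engine split concrete, here is a minimal sketch of the pipeline as just described: a recognition step that turns audio into words, and a cognition step that turns words into an intent. The function names, the stubbed transcription, and the keyword-based intent lookup are illustrative assumptions, not any vendor's actual engine.

```python
# A schematic sketch of the recognition/cognition split described above.
# Both functions are stand-ins: a real ASR engine consumes audio, and a
# real NLU engine is far more sophisticated than keyword matching.

def recognize(audio_bytes: bytes) -> str:
    """Recognition: convert utterance audio into words (speech-to-text)."""
    # Placeholder: a real engine would run acoustic + language models here.
    return "i need a jumpstart"

def understand(transcript: str) -> str:
    """Cognition: piece the words together and extract the caller's intent."""
    keywords_to_intent = {
        "jumpstart": "DEAD_BATTERY",
        "locked": "LOCKED_OUT",
        "fuel": "OUT_OF_FUEL",
    }
    for keyword, intent in keywords_to_intent.items():
        if keyword in transcript.lower():
            return intent
    return "UNKNOWN"

# Two engines, one experience: hearing first, then understanding.
print(understand(recognize(b"...caller audio...")))  # -> DEAD_BATTERY
```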
So let’s take a closer look at the recognition piece of this and talk about how that happens. So, on screen is a standard machine learning approach to recognizing speech and it’s the same approach that we use and the same approach used by other ASR engines. You always start with the training dataset, and then you label that dataset, which is the most time consuming part of the process because you’re getting into the taxonomy of the vocabulary and you’re classifying that training data. And then lastly, with step three, that data gets fed into the machine learning engine and that can take anywhere from hours to days to train. And Brian, you know that [crosstalk 00:04:14].
Brian Morin: Yeah. In our case, days to train. Yeah.
Helena Chen: Right. Yeah, it comes down to the data.
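As an aside, here is a schematic sketch of the three-step process Helena outlined: collect a training set, label it, then feed it to a machine learning engine. The record structure and the stubbed training call are assumptions for illustration only; real ASR training is far heavier than this.

```python
# Step 1: collect audio; Step 2: label it (the slow, human part);
# Step 3: feed the labeled set to the training engine for hours or days.
from dataclasses import dataclass

@dataclass
class LabeledUtterance:
    audio_path: str      # step 1: raw telephony audio
    transcript: str      # step 2: human-verified label
    domain: str          # step 2: taxonomy / classification tag

training_set = [
    LabeledUtterance("calls/0001.wav", "i need a tow truck", "roadside"),
    LabeledUtterance("calls/0002.wav", "what time do you close", "retail"),
]

def train_asr_model(examples):
    """Step 3 (stub): a real engine would fit acoustic and language models here."""
    print(f"training on {len(examples)} labeled utterances...")
    return "model-v1"

model = train_asr_model(training_set)
```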
Brian Morin: So, it does come down to the data, and in fact, the audio that we use in our training set, just to pull back the curtain a little bit on what this looks like for industry-leading platforms like ours: our training set is a little bit of everything, I think, like most others. The cornerstone of that is the domain-specific telephony audio from our clients for specific use cases that we’ve captured over the years across verticals, but it also includes every TED Talk, it includes every UN speech, which is one of the reasons why we believe we uniquely handle accents as well as we do. It includes certified pre-labeled audio that we purchased. That’s how we actually got some of our start in this several years ago. And we use the same publicly available corpora that everyone else uses.
Now, I know that as an audience listening in, what you want to hear is, “Well hey, how do these industry leaders stack up against each other?” When we benchmark ourselves on the use cases that we support against other leading ASR platforms, we consistently outperform, but that’s not meant as a SmartAction commercial. To be fair, what we’re doing actually includes the addition of our NLU technology that’s been tuned to those specific use cases, and that makes a huge difference.
So it’s really just to have a frank conversation that at the end of the day, in the area of ASR by itself, the speech-to-text portion, there’s frankly a lot more parity among leading platforms than there’s ever been. The degrees of separation are not what they were four years ago. A lot of that has to do with the fact that everyone just has so much more data now, the same approach to data labeling, using a lot of the same machine learning engines and tools. And so, you do eventually reach a point of diminishing returns where more data really doesn’t help that much. That’s why Siri or Google, as good as they are, really haven’t made any visible improvements in accuracy since 2017.
Now, what does help tremendously that we’re setting the stage for in this conversation is the NLU engine piece of this to augment the accuracy that you get from ASR. That’s the real difference in speech technology today. We’ll get to that data and what I mean on that subject in just a minute, but let’s continue to dig in a little bit more here just on the ASR front of this because it is the front end where everyone starts their experience.
So if we were to take a look at the speech rec landscape, while we do love our own speech rec engine, we’re not married to it by any means. There are certain use cases we’ve noticed where Google outperforms ours, particularly wide-aperture use cases. So we can opt to use theirs when appropriate, and I will share some examples in just a minute on how you would mix and match and choose the right engine for the right use case. But at the end of the day, we’re a CX company, so our priority is delivering the best experience for the customer. So we do like to mix and match tools in our conversational AI technology stack to complement our own.
But the big shift, as you can see here on screen, is how speech rec has gone from highly specialized to highly commoditized, just in the last four years. A lot of that has to do with the democratization of machine learning. It has to do with public access to big data. And the number of speech rec players that we track is now upwards of 30, far beyond the handful of players just four years ago. Now, not all of them do real-time speech-to-text for self-service, but they’re at least doing some kind of transcription service.
And I think this one is also maybe the most interesting to note: every single one of them claims a word error rate below 10%. And you may hear IBM Watson and Google fighting over tenths of a percent. And we would say, never, ever believe somebody’s word error rate number. It’s a marketing number that, frankly, is totally BS, because it’s scoring itself against the very corpus or dataset that was used to train the model, so it’s not representative by any means of real-world accuracy. So that’s why, even though they publish those rates, that’s certainly not what your experience is. So if anyone tries to sell you on their word error rate, take it with a grain of salt. What matters is how it performs out in the wild.
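For readers who want the definition behind those marketing numbers, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR output, divided by the number of reference words. A minimal sketch, with hypothetical example phrases:

```python
# Minimal word error rate (WER) sketch: edit distance between the reference
# transcript and the ASR hypothesis, divided by the reference word count.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One wrong word out of five is already a 20% WER on this utterance, which is
# why a sub-10% marketing number can still feel inaccurate in practice.
print(word_error_rate("i need a jump start", "i need a dump start"))  # 0.2
```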
Helena Chen: Right. Yeah, and the truth of the matter is that it’s not all that difficult to create your own speech rec engine if you have access to mountains of audio and can label that data. In fact, there are tons of pre-labeled audio datasets available to the public, and that’s how more than 30 of those vendors got their start. So there’s a huge open source community, and every speech rec vendor has access to those same datasets as part of their training model, in addition to whatever else they have that’s domain-specific, that they’ve labeled themselves.
Brian Morin: Right. So then, let’s talk about what happens in the real world. Let’s talk about real-world ASR word error rate. Now, on screen I’m giving an example of our own, which we believe represents a true benchmark. Now, I know what I’m about to say and show might sound like I’m trying to give a SmartAction commercial here. I’m actually, here in a minute, trying to get into the difference between what is considered real-world and what we would consider out in the wild, real calls from real callers, and ultimately why NLU is so important for the latter.
But before we get there, you have to start here, with ASR only in a controlled environment. The numbers on screen are representative of ASR only. There is no NLU augmentation. That’s why we believe it’s a true test of ASR. So as you can see, the numbers on screen are on par with human-level accuracy, as far as what you could reasonably expect from a human trying to transcribe the same dataset. So, while these are absolutely real-world numbers, this test has controls. This shows real-world ASR accuracy in a controlled environment. These are our internal numbers from our own QA team. They came from a team of five people who were testing our name capture application, which is a highly complex capture, because it involves capturing first name and then last name, letter by letter, as fast as someone can say them. And in our view, this is the best way to test the accuracy of a speech rec engine, or just the ASR portion, because NLU cannot get activated in this use case. You’re purely testing speech-to-text transcription.
So while these calls in this test are over the telephone, they are coming from five people, none of whom have thick accents. They’re calling from a controlled environment with no ambient noise issues. So, let’s talk about what’s different when you consider audio that’s not only real-world, but what we would call out in the wild. These are real callers, accents, background noise, wind, speakerphone, crosstalk.
Helena Chen: Okay, so the first piece that can’t be overlooked is what happens when someone calls a customer service number and that audio traverses an outdated telephony infrastructure. So even though audio starts out in high definition at the device level, whether that’s your phone or your home device, the moment you dial a phone number, that audio is reduced to low fidelity, a mere 8 kHz, which cuts out more than half of the highs and the lows. And that’s what makes it so hard for speech rec technology at the contact center level.
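For a sense of scale (this is standard audio engineering, not SmartAction-specific data): by the Nyquist limit, a given sample rate can only represent frequencies up to half of itself, which is why 8 kHz telephony audio loses so much compared to wideband audio.

```python
# Nyquist limit: a sample rate captures frequency content only up to half of
# itself, so narrowband telephony (8 kHz) tops out around 4 kHz, while
# wideband "HD voice" (16 kHz) preserves content up to about 8 kHz.
for label, sample_rate_hz in [("narrowband telephony", 8_000),
                              ("wideband / HD voice", 16_000)]:
    print(f"{label}: {sample_rate_hz} Hz sampling -> usable audio up to ~{sample_rate_hz // 2} Hz")
```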
Brian Morin: Right. Well, and if that weren’t bad enough, now noise has been introduced. So you don’t notice this at the device level if you’re talking directly into your phone to text someone or you’re speaking to Alexa. That’s a high-def experience. But you’ll notice that even in that high-def experience, when you try to text somebody using your voice, I don’t know about you, but at least for me, Siri or Google invariably gets a couple of those words wrong that I end up having to correct.
And so this is why transcription-based ASR engines at the contact center level just don’t deliver a good enough customer experience: those same engines that can’t be fully accurate in high def are, once the audio has traveled over the phone line and hit the contact center where that technology is activated, 50% less accurate than they were at the device level. And so you have to have AI that is purpose-built to solve this kind of problem, and you have to design your conversations with this challenge in mind, because there’s no way you can deliver a good experience at the contact center level without a powerful NLU engine to augment what you are getting from your speech rec. And Helena, you’re going to tell us a little bit about that.
Helena Chen: Yeah. Brian, you were mentioning some of the factors that impact the accuracy of speech rec, and on screen are all the pieces factored into that. So here at SmartAction, we see 20% of incoming calls significantly impacted by noise. And we mean a lot of noise, to the point where it would be difficult for even a human to understand, and a human would likely have to re-ask some of the questions to recognize the right answer.
Brian Morin: Right.
Helena Chen: So, even though low fidelity is bad enough, it’s exacerbated by those issues. And so when we talk about calls in the wild, this is what we’re referring to.
Brian Morin: Right.
Helena Chen: And so, while speech rec has come a long way, the problem with ASR is that it’s just never perfect. It invariably gets one or two words wrong, and ASR is just not good enough on its own to overcome low fidelity and real-world audio. So as you can see on screen, it’s not quite good enough to help get their ship back.
Brian Morin: Alright, so I’m the one who put that in there. Maybe a little too much dad humor for this morning.
Helena Chen: So Brian, because speech-to-text is never 100% accurate, I think the question is, can anything be done about it to improve the experience?
Brian Morin: Well, Helena, the answer is yes. So this is the real secret sauce to delivering a great customer experience. And so when we’re sitting down with folks, this is a large part of the education exercise, because everyone is so familiar with ASR speech-to-text: “Hey, we have our own AI engine.” What they’re least familiar with is the difference-making capabilities of NLU and how varied these engines are.
So, most ASR engines today are augmented by some form of what we would call contextual NLU. An example would be the word “for” in a sentence. It’s up to the NLU engine to look at that word in the context of the sentence and determine, “Hey, is this the number four, or is this the word for, F-O-R?” And so, you’ve all seen contextual NLU in real time on your phone when speaking with Siri or Google, as it changes the word that was transcribed to the word you actually meant once it can see it in the context of the sentence.
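A toy illustration of that disambiguation idea, deciding between "four" and "for" from the neighboring words. Real contextual NLU scores alternatives with statistical language models; this hand-written rule is only meant to make the concept tangible.

```python
# Toy contextual disambiguation: is the token "for"/"four" a number or a
# preposition? Real engines score alternatives with a language model; this
# hand-written rule is purely illustrative.
NUMBER_WORDS = {"one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "zero", "oh"}

def disambiguate_for(tokens, index):
    """Return 'four' if the surrounding context looks numeric, else 'for'."""
    neighbors = tokens[max(0, index - 1):index] + tokens[index + 1:index + 2]
    looks_numeric = any(t in NUMBER_WORDS or t.isdigit() for t in neighbors)
    return "four" if looks_numeric else "for"

tokens = "my pin is four two nine".split()
print(disambiguate_for(tokens, 3))   # 'four' -- neighbors are number words
tokens = "i am calling for a tow".split()
print(disambiguate_for(tokens, 3))   # 'for'  -- neighbors are ordinary words
```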
When you’re trying to have speech technology, though, that has to be all things to all people, contextual NLU is as far as you can go. But in the arena of customer service, and here’s the big takeaway, where you can actually predict how a caller might respond to a question, you can go many leaps beyond contextual NLU and do something highly tailored that drives up the accuracy far beyond anything that the best ASR can give, and we’re going to show some real-world examples of that in a minute. But the form of really advanced NLU that we’re talking about in this case is the kind that can be fine-tuned for specific questions that have specific answers.
In AI-powered customer service, most questions have a limited number of grammars that the caller will respond with. So if you know what those grammars are, you can narrow the aperture of what you’re listening for to just those grammars, or, and this is what’s most important, anything that sounds even remotely similar to one of those grammars. And so what this means is that even when speech rec transcription is wrong, because remember, speech rec is never 100% right, the NLU engine can match the language and acoustic models against the patterns it’s actually listening for, and run hypotheses to find the closest match. And so that means getting it right even when speech-to-text transcription gets it wrong.
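A minimal sketch of that narrow-aperture idea, which also underlies the AAA example coming up: score a possibly wrong transcription against a small set of expected grammars and take the closest match above a confidence threshold. The intent list, threshold, and plain string similarity (standing in for acoustic models) are assumptions, not SmartAction's actual engine.

```python
# Hypothetical narrow-aperture matching: compare the (possibly wrong)
# transcription against a small set of expected grammars and pick the
# closest intent, rejecting anything below a confidence threshold.
from difflib import SequenceMatcher

EXPECTED_INTENTS = {
    "flat tire": "FLAT_TIRE",
    "out of fuel": "OUT_OF_FUEL",
    "dead battery": "DEAD_BATTERY",
    "locked out": "LOCKED_OUT",
}

def match_intent(transcription: str, threshold: float = 0.5):
    """Return (intent, confidence) for the closest expected grammar."""
    transcription = transcription.lower().strip()
    best_intent, best_score = None, 0.0
    for grammar, intent in EXPECTED_INTENTS.items():
        score = SequenceMatcher(None, transcription, grammar).ratio()
        if score > best_score:
            best_intent, best_score = intent, score
    if best_score < threshold:
        return None, best_score          # hand off or re-prompt
    return best_intent, best_score

# Even if speech-to-text hears "lock doubt", the closest match is LOCKED_OUT.
print(match_intent("lock doubt"))
```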
We have a couple really good examples I’ll show you in just a minute, but when speech-to-text gets something wrong, I think as you know, it’s usually not horrifically wrong. It’s usually within five to 20 degrees of separation from the word that you actually intended to say. So from a technology standpoint, as long as you know what you’re listening for, you can have developers do highly-customized work across every question and answer in a given interaction to match grammars to intents. And this is something you can only do in customer service because you have the ability to ask questions in such a way where you can expect certain answers.
Now, I will be the first one to raise my hand and admit that this is not a very scalable approach. It’s not an all-things-to-all-people approach, because you do have to tailor the engine for every client, every interaction, every question, and that’s why it can also take up to six weeks in the build process, because you do have to account for every possible utterance to a given question and then manually tune. Now, the good news, at least in our case, is that we support over 100 use cases across more than 100 different clients, so we’ve already done most of this heavy lifting on common self-service use cases. But if you’re trying to do this on your own, because you see the need to augment your own ASR engine since it’s just not good enough on its own, this is where you do have to find that team of developers that acts as, kind of, the human in the loop, so to speak.
So, while the initial build of a new application can be a little arduous, customizing specific questions and specific answers, it really is the only approach that can deliver a really good experience over telephony in limited-grammar use cases.
Helena Chen: Right, and Brian, something worth mentioning is that this approach works really well when you have a narrow scope of expected intents. So on the next slide, we have an example of this in action for AAA Emergency Roadside Assistance, and in that case, when you try to find out in natural language what’s wrong with the vehicle, it’s always one of eight different intents, or eight reasons, and those can be things like a flat tire, out of fuel, dead battery, or you’re locked out, to name a few. And since those are the only things that we’re listening for, we’re matching everything that’s said against those eight intents and finding the highest confidence score, even when the speech-to-text transcription doesn’t line up perfectly.
But if it’s one of 100 things that could be wrong, then this approach doesn’t work very well, because there’s not a limited range of intents to pattern match against. So if we run into a wide-aperture use case, that’s where we’ll opt for a tool like Google, because a statistical ASR with contextual NLU will outperform SmartAction in that instance.
Brian Morin: And that’s where, on your end, you have to examine your use cases. You have to look at them interaction by interaction, question by question, and devise a plan for which tool will do the best job for that use case, and not even just use case by use case; sometimes, even in a single interaction, you may have one wide-aperture question that would perform better with one tool, and then another, limited-range response that would highly benefit from a custom NLU approach.
Helena Chen: Right. And so one of the keys has nothing to do with the technology itself, but how you design the conversation flow, because the trick is to ask questions in such a way to get a narrow scope of responses that you can predict.
Brian Morin: Yeah, that’s a really good point. So Helena, you mentioned AAA. I’m going to play one of their calls just to show what this looks like in action. Now, as you will hear, these emergency roadside assistance calls are often in the worst conditions possible. Callers are outside their vehicle, in the wind with traffic noise, on speakerphone, no less, and there is no way we could deliver the accuracy that we do for them without this custom-tailored NLU approach that we’re talking about. So I’m going to play this call, it’s not very long, and then we’ll come back to the play-by-play of what the speech rec did and what the NLU did.
SPEAKER 1: Okay, let’s get started with your service request. I see the phone number you’re calling from is 300-0134. Is this the best number to reach you?
Customer: Yes.
SPEAKER 1: Please say or key in the full 16 digits that appear on the front of your membership card beginning with the four.
Customer 2: Put the number [crosstalk 00:24:08].
Customer: Four, two, nine.
SPEAKER 1: In order to receive service, the service vehicle driver will need to see your photo ID. Will you have that available?
Customer: Yes.
SPEAKER 1: Are you in need of service at your home address?
Customer: Yes.
SPEAKER 1: Please confirm we’re sending a service vehicle to 92 [Donokon 00:24:47] Drive. Is this correct?
Customer 2: Yes.
Brian Morin: She chimes in. So they’re on speakerphone.
SPEAKER 1: In a few words, what service do you require today? You can say things like, I need a jumpstart, locked out, or out of fuel.
Customer: I need to get into my vehicle. It’s locked.
Customer 2: [crosstalk 00:25:09] Locked out. Locked out.
SPEAKER 1: Is there a child or pet locked inside your vehicle?
Customer: No.
Brian Morin: Okay, so this is coming up-
SPEAKER 1: Will you require a replacement key because your key is lost, damaged or broken?
Customer 2: Yes.
Customer: Yes.
SPEAKER 1: Okay, please note that you may be responsible for an extra charge for this service. Your driver will let you know of any additional charges.
Brian Morin: So, listen to this vehicle capture portion next.
SPEAKER 1: Is it the 1996 red Jeep Cherokee listed on your account?
Customer: What’d she say?
Customer 2: No, it’s not [inaudible 00:25:53].
Customer: No.
SPEAKER 1: What’s the color of your vehicle?
Customer: Green.
SPEAKER 1: What’s the year of your vehicle?
Customer: ’97.
SPEAKER 1: Now, please say the make and model of your vehicle. For example, Ford Mustang.
Brian Morin: Yep.
Customer: Ford F-250 truck.
SPEAKER 1: Let me confirm that I heard you correctly. You said Ford F-250.
Customer: Yes.
SPEAKER 1: Would you like to receive status updates regarding your service work?
Brian Morin: Okay, so I’m going to go ahead and stop it there. They’re just finishing out the call. Now, what you don’t see here in this call is what the speech-to-text transcription got wrong. In fact, you can see it on screen: it transcribed very literally what it heard. So, if you recall during the call, when he was trying to say the word Ford, the F was not even audible, and so what we heard is also what the speech-to-text heard, which was ord. Now, he clearly said F-250, but the speech rec engine didn’t identify that as a word, and so here’s what it did. It took the word ord and transcribed it as aboard. That’s the closest word it could find. And then the F that he said, speech rec does not see that as a word, so it transcribed the F as have, and the 250 was transcribed as to, T-O, and then the word fifty.
So this is one of those where the speech-to-text was definitely more than five to 20 degrees of separation off, but the NLU engine was able to match up the acoustics of what it heard against the multiple hypotheses that it ran against the intents it’s listening for, and there was one confidence score that stood out as the highest. So even though it was only a 50% confidence, you can see how the NLU did all this work and corrected it in milliseconds. So, I’ll play just one other, this is a much shorter example. I’m going to play one more example and just ask you if you can make out what the caller said here.
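As a rough, hypothetical illustration of what happened on that call: score the garbled transcript against a known catalog of makes and models, then branch on confidence to accept silently, confirm explicitly (as the virtual agent did with "Ford F-250"), or re-ask. The catalog, thresholds, and text similarity used here are assumptions; the real engine works on acoustic hypotheses.

```python
# Hypothetical sketch: match a garbled transcript against a small catalog of
# expected vehicle models, then branch on confidence. Thresholds and the use
# of plain text similarity (instead of acoustic models) are illustrative.
from difflib import SequenceMatcher

CATALOG = {
    "ford f-250": "ford f two fifty",    # numerals spelled out for comparison
    "ford mustang": "ford mustang",
    "jeep cherokee": "jeep cherokee",
    "toyota camry": "toyota camry",
}

def best_vehicle_match(transcript: str):
    scored = [(SequenceMatcher(None, transcript.lower(), spoken).ratio(), label)
              for label, spoken in CATALOG.items()]
    return max(scored)

score, label = best_vehicle_match("aboard have to fifty")  # what ASR "heard"
if score >= 0.85:
    print(f"accept: {label}")
elif score >= 0.45:
    print(f"confirm: 'Let me confirm that I heard you correctly. You said {label}?'")
else:
    print("re-ask: 'Sorry, please say the make and model of your vehicle.'")
```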
IVA: Thank you for calling DSW, Designer Shoe Warehouse. You’ve reached Shoephoria. Am I speaking with Holly Sims?
Holly Sims: Yes.
IVA: To verify your account, please tell me the zip code listed on your account.
Holly Sims: 98516.
IVA: Great. I was able to find your account. What would you like to do today?
Brian Morin: Here’s what I want you to hear.
Holly Sims: Address change. Address change.
IVA: Are you calling about order number 304?
Holly Sims: I think so.
IVA: Do you want to change your billing address or your shipping address?
Holly Sims: Shipping address.
IVA: What’s the five-digit zip code of the new address?
Holly Sims: 98516.
IVA: Okay. From the zip code, I already know your city and state. Please say just the first line of your shipping address such as 1234 Main Street.
Holly Sims: 7838 Martin Way East, Unit 17.
IVA: Now I’ll confirm each part of your address. Was the house number 7838?
Holly Sims: Yes.
Brian Morin: Okay, so we’re just going through the confirmations. No need to show the rest of the confirmation. The point that I wanted to call out is that when we asked the caller what she was calling about, she said, “Dress change.” The A was not even emphasized in what she said. In fact, the A was almost entirely silent. So guess what the speech-to-text engine transcribed. It transcribed exactly what it heard, which was dress change. And our NLU was able to pick up on that and find the closest matching acoustic model, which was the address intent. And the same was actually true when we were trying to find out which address we were talking about. She said a not very clear, “Shipping,” which speech-to-text heard as sipping. But again, the NLU was able to take that and match it.
Helena Chen: Right, and if you were using standard speech rec with a standard contextual NLU, this would have been a failed call, and I think that’s why it’s important in the arena of customer service to go a step beyond and tailor that NLU to your specific use case and the grammars that your customers are actually using.
So earlier, we had mentioned wide-aperture use cases as the domain for transcription engines like Google, when you can’t predict how someone might answer, but on screen is address capture, and address capture is a very wide-aperture use case. It’s a common use case we support here at SmartAction for a number of clients, like DSW, as you just heard, and also Choice Hotels.
So you’d think this would be one where we would opt to use Google, but it’s actually one where our approach significantly outperforms Google, and it’s not because we have a better ASR. It really has to do with the custom NLU stitching together what was heard with the patterns that we’re listening for. So if we capture the zip code first, then we can do a data dip on all the street names in that zip code to pattern match against, and that’s what really drives high accuracy. And if we didn’t do that data dip, then we would default to a transcription ASR engine, like Google’s.
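A hypothetical sketch of that zip-first approach: capture the zip code, do the data dip to pull the street names for that zip, and then match the spoken street only against those candidates. The lookup table and the similarity scoring are invented for illustration.

```python
# Illustrative only: a real system would query an address database keyed by
# zip code; here a tiny in-memory table stands in for that data dip.
from difflib import SequenceMatcher

STREETS_BY_ZIP = {
    "98516": ["martin way east", "pacific avenue", "sleater kinney road"],
}

def match_street(zip_code: str, heard: str):
    """Narrow the aperture to streets in the caller's zip, then fuzzy match."""
    candidates = STREETS_BY_ZIP.get(zip_code, [])
    if not candidates:
        return None, 0.0   # no data dip possible: fall back to plain transcription
    score, street = max((SequenceMatcher(None, heard.lower(), s).ratio(), s)
                        for s in candidates)
    return street, score

# Even if ASR hears "martian way east", the zip-constrained match recovers it.
print(match_street("98516", "martian way east"))
```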
And you heard that example of vehicle capture that we do for AAA Emergency Roadside Assistance, and we also do this for auto dealerships, so customers can schedule the service they want. Vehicle capture is not a narrow scope. The aperture of vehicle makes and models is so wide that you would typically have to rely on ASR transcription only, but since we can pattern match against a database of makes and models, we’re really able to drive up that accuracy, as you can see.
Brian Morin: Yep. That’s what really drives it up.
Helena Chen: Mm-hmm (affirmative).
Brian Morin: So we’re right here near the end of the webinar. We just have a couple of quick slides left to show you. We will be getting into Q&A, so if you have questions, I’ve seen a couple come in, start teeing those up, and we will be going into a Q&A shortly. And so Helena, I know you’re going to show us a couple of examples here on screen of real-world, in-the-wild accuracy.
Helena Chen: Yes. So on screen are two examples of accuracy rates against highly complex captures that are not just real-world but, as you mentioned, in the wild. And this is not a limited dataset; it’s millions of real callers with real audio. So you can see on the left, the accuracy for State Farm policy numbers. It’s a 13-character alphanumeric policy number, and that’s about as complex as you can get for speech rec over telephony.
It mainly just underscores how important the NLU is, because if we didn’t use that piece, this would be 30 to 40% less accurate. But with the NLU, we know that the eighth character we hear is a letter and not a number, and the same goes for the 13th. As far as we know, we are the only voice self-service company in the industry doing alphanumeric capture, and we do it really well, but that’s only because the NLU engine takes advantage of that prediction and the pattern matching.
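A hedged sketch of how that positional knowledge can be used: the constraint that the 8th and 13th characters must be letters follows Helena's description, while the digit-to-letter confusion table is an assumption for illustration, not State Farm's real scheme.

```python
# Illustrative positional correction for a 13-character capture where
# positions 8 and 13 must be letters. If ASR transcribes a digit in a letter
# slot, swap in an acoustically confusable letter (table is assumed).
LETTER_POSITIONS = {8, 13}                      # 1-indexed, per the description above
DIGIT_TO_LETTER = {"8": "A", "1": "I", "0": "O", "3": "E"}   # assumed confusions

def correct_policy_number(raw: str) -> str:
    corrected = []
    for position, char in enumerate(raw.upper(), start=1):
        if position in LETTER_POSITIONS and char.isdigit():
            corrected.append(DIGIT_TO_LETTER.get(char, char))
        else:
            corrected.append(char)
    return "".join(corrected)

# ASR heard digits where the format says letters belong -> positions 8 and 13 fixed.
print(correct_policy_number("4562917834521"))   # -> 4562917A3452I
```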
Brian Morin: Yeah, and I should say, I know we’re saying that with a little bit of self-promotion, but it’s not to say that another vendor couldn’t deliver the same accuracy levels. What it requires is the same depth of tailoring of the NLU engine for the interaction to be domain- and customer-specific.
Helena Chen: Exactly. And yeah, so while the State Farm policy number is much more complex than the vehicle capture on the right, you can see that the accuracy rate on alphanumeric is actually better than the vehicle capture.
Brian Morin: Yeah.
Helena Chen: And that’s mainly because the callers on the right are usually calling under the worst conditions possible, so outside of their vehicle, on speakerphone, with wind and traffic noise, and I think the takeaway here is that these are use cases you couldn’t even try to support if you relied on ASR transcription only.
Brian Morin: Right. So here, just in closing, this is an example of the impact of the custom NLU work that has to be done, not just before you get an application live, but after an application goes live, as well. So on screen, this is Choice Hotels. They’re the second largest hotel franchisor in the world, with over 7,000 locations. They’re the umbrella for nearly every budget chain out there. They shared this data last month at an event with CCW to showcase how conversational AI requires care and feeding after go-live and the impact that has on containment and CX.
So, to explain what you’re looking at, our AI handles reservations for them, and we gather a bunch of information from the caller, things like check-in and check-out dates, city, occupancy, a few things. And when we rolled out the application and went live, you can see this was back at the end of last year, nearly 8% of callers were getting transferred to live agents out of confusion. The AI simply wasn’t understanding the intent from 8% of callers.
So even though we had already done all this work prior to the rollout, tailoring grammars for the NLU engine and training the ASR engine, you can never predict all the ways that real customers are going to interact with AI until they actually do. And you never know what words, frankly, speech rec is even going to struggle with until you look at the data.
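A hypothetical sketch of that post-go-live care-and-feeding loop: log the utterances the NLU could not match with enough confidence, have a human review them, and fold the missed phrasings back into the grammar. The data structures, threshold, and sample phrasing are illustrative.

```python
# Illustrative post-go-live tuning loop: capture what real callers said when
# the AI got confused, then fold the missed phrasings back into the grammar.
REVIEW_THRESHOLD = 0.6
unmatched_log = []
GRAMMAR = {"book a room": "NEW_RESERVATION", "cancel my reservation": "CANCEL"}

def record_if_confused(utterance: str, best_intent: str, confidence: float) -> None:
    """Called in production whenever the NLU's best guess is low confidence."""
    if confidence < REVIEW_THRESHOLD:
        unmatched_log.append({"heard": utterance, "guess": best_intent,
                              "confidence": confidence})

def apply_human_review(utterance: str, correct_intent: str) -> None:
    """A human reviews the log and maps the new phrasing to the right intent."""
    GRAMMAR[utterance.lower()] = correct_intent

# Example: a real caller phrasing the team did not anticipate before launch.
record_if_confused("i need a place to stay tonight", "NEW_RESERVATION", 0.41)
for entry in unmatched_log:
    apply_human_review(entry["heard"], "NEW_RESERVATION")
print(GRAMMAR)
```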
Helena Chen: Right, and you can actually see how that team was working after go-live to isolate those issues and tweak that NLU engine to capture those unexpected intents. And so you can see that, over time, the rate at which the virtual agent passes the call to a live agent because of confusion falls all the way to less than 3%. And this is another slide that Brian added here.
Brian Morin: Yeah. More of Brian’s dad humor. I’m sorry.
Helena Chen: So as frustrated as Donald Duck is here, you don’t want that with any of your customers. And the best way to delight them is not just with world-class ASR. As good as machines have become, they still fall short, so that’s why you need humans who specialize in NLU to really drive up the accuracy and the experience.
Brian Morin: Excellent. So, thanks, Helena. We are going to move over to a full Q&A and handle the questions that have come in. One of the questions that came in is, “Hey, am I going to get this presentation?” So, the answer is yes, you will get the presentation. We’ll send it over to you. The deck will be sent over. Once the on-demand gets rendered, we will send over the on-demand so you can share that with stakeholders, as well. If you are interested in next steps with SmartAction, you can see the contact info on screen, info@smartaction.com. Most start either with a demo, just to see and hear and try what we’re doing in use cases that we support in your vertical already with other players adjacent to you, or with what we call a free AI-Readiness Assessment. We can do quick back-of-the-napkin math on ROI and identify how likely a candidate you are for AI automation disruption and what that would ultimately mean.
So with that, we will move over to Q&A, so go ahead and keep your questions coming in. And let’s see, this one came from what looks like an anonymous attendee. Do we ever get negative customer feedback about the resolution length with AI? He or she has found that customer feedback is often more critical of a lengthy automated process versus a human one, so do we find that callers are frustrated with the seemingly slower pace of the conversation?
So, the pace of the conversation is something that is really, really important when you’re getting, not just into the science of AI, but into the art of your conversations and how that relates to the speed of the conversation you’re trying to complete. At least on our side, we can control the speed of the conversation and tune that. There are examples of this on our website. A good example would be Electrolux. They have brands like Frigidaire and Westinghouse, and they use our AI for capturing intent when somebody calls in, authenticating them, and then taking them through a product registration process, which means capturing a product number, or serial number, or a model number.
When they did the analysis between AI and humans, they found that humans were completing that process, from authentication through capturing the product registration, in two minutes. And after six months, they shared these findings with execs in the know, actually at an event in Marina del Rey. It is on our website. And they found that the same experience with AI was two minutes and seven seconds. So it was seven seconds longer with AI.
We can share plenty of examples where, with AI, it’s actually shorter, because it doesn’t have that up-front pleasantry exchange that happens with a human, and also because you’re trying to coach individuals into communicating with your AI in short, succinct sentences rather than long descriptions. So maybe I went a little long on that answer, but it varies.
So, one question that I’ve seen here is, as we’ve seen brands transition to conversational AI, what are the common or typical challenges? That’s a pretty big question, so how can I answer that one in less than 30 minutes? I guess I would say that some will think purely in terms of ROI, perhaps, and they want to automate everything, irrespective of CX, and you don’t want to automate everything. You only want to automate the interactions where you know that AI is going to perform as well as or better than what you would see from a live agent.
So this doesn’t mean that you bucket some interactions for humans and some interactions for AI. It means that you look at all your interactions and you think in terms of symbiosis. How can you define that symbiotic relationship between AI and agents, asking how much of an interaction should belong with AI, or how much of it should start with AI, and when and for what reason it should be handed to a human, either to finish the call or to handle the exceptions that can occur within a conversation.
So that’s kind of part of what goes into our service, from a consultative standpoint: sitting down with organizations and digging into that, looking at each interaction and asking how much of it really should be with AI and how much should be human. How do we get you the most ROI, while at the same time delivering the highest level of CX back to your customers?
So let’s see what else we have here. Well, some of this is specific just to SmartAction, so I’ll tackle a couple of those, and then we’ll get to some questions that are not just related to SmartAction. The question here is whether we can talk about our implementation, our process: how long does a typical rollout take? We have a team of CX designers that typically kicks off these engagements with the conversation design, and we’ve seen that process take anywhere from a week on up. Frankly, with some organizations, trying to get a conversation flow with AI through their organization can be like trying to pass a healthcare bill through Congress. So, it really depends on how quickly you can get through the design process with our CX design team, but once it does get handed over to development, from the time it goes into development to the time we’re going live, that’s typically a six-to-eight-week process for us.
So another question, and this is a good question, not just about us, but this would be true for a lot of other conversational AI platforms: “Does the virtual agent query your data in real time to do its job? How does that work?” And the answer is, it does. Whenever somebody calls in, we’ll try to do a reverse lookup. We can take the phone number that has come in and, even before we answer, or just as we’re answering that call, we’ve done that lookup on the phone number, we’ve done that data dip, and we’re pulling 10 to 15 data points. We’re just holding them short term, in RAM, so that we can use those data points, essentially, as the business rules and everything we need to access in order to deliver that experience to the customer. And all of that happens in milliseconds. Other platform approaches are doing the same thing.
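A hedged sketch of that up-front data dip: key a lookup off the caller's phone number as the call arrives, pull a handful of data points from a CRM, and hold them in memory for the life of the call. The field names, stubbed CRM, and cache are assumptions for illustration.

```python
# Illustrative reverse lookup at call start: fetch caller data points once,
# hold them in memory for the duration of the call, and drive the dialog and
# business rules from that cache. The CRM "database" here is a stub.
CRM_STUB = {
    "+15553000134": {"name": "Holly Sims", "zip": "98516",
                     "open_order": "304", "membership": "active"},
}
call_cache = {}   # short-lived, per-call memory

def on_incoming_call(caller_id: str) -> dict:
    """Data dip on the caller's number before (or as) the call is answered."""
    profile = CRM_STUB.get(caller_id, {})
    call_cache[caller_id] = profile
    return profile

profile = on_incoming_call("+15553000134")
greeting = f"Am I speaking with {profile.get('name', 'the account holder')}?"
print(greeting)
```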
So a question here, this one from Alexander: what kind of education do we believe is good for business people to increase awareness about AI and machine learning? That’s a pretty tough question. We can really only speak through the lens of having done this with more than 100 different organizations, helping them. This usually starts with either their contact center team, or their IT team, or even ops teams, and then they’re rolling that up to stakeholders who aren’t necessarily as technically adept.
So in most cases, it really is about trying to identify and solve a business problem. We try to identify what that business problem is up front, because we’ve found that when it comes to the quote, unquote, business people in the organization, what they want to understand is that a business problem is being solved and what the time to ROI is once that problem is solved. And so, we usually see organizations trying to tackle two challenges at the same time. One is, how do we reduce costs in the contact center, but at the same time, how do we improve the CX for our customers?
And we’re not here to say that AI is better than humans. We can certainly put it in use cases where it performs on par with a human, but what essentially gives the better experience is that customers don’t have to wait on hold, they can choose their channel of choice, and they can self-serve immediately, and that’s where we’ve seen a lot of our customers get an immediate big bump in CSAT and NPS, just because they’re able to serve their customers more quickly.
So, this is from another anonymous attendee. They understand that the solution we’re talking about seems to be aimed at larger enterprises. Where does this fit in, and how would you sell to potential small to medium-sized clients? Good question. We actually have that full range within our own umbrella. Clearly, we’re more visible with our marquee Fortune 500 clients, but we have a lot of clients with no more than 50 to 100 live agents. They would meet what we would call a minimum threshold for reaching a fast ROI, and that’s having about 25,000 minutes a month that can be serviced by AI automation. You’d want to look at that as the threshold. Do you have at least 25,000 minutes a month that could be serviced by AI? If the answer is yes, then you should really consider it low-hanging fruit worth all the work that’s required to get started.
Now, if you’re in that category where you’re under that threshold, where you have fewer than 50 live agents, we would recommend attempting to tackle it on a DIY platform, because in most of those cases, it’s just not worth the cost of a SaaS platform and the full-time headcount that’s required to run it. And so, for a lot of those organizations, it’s still better just to use humans over voice and then deflect as much as they can digitally. Good questions.
We’re right here about 10 minutes before the top of the hour. I don’t see any other questions that have come in, unless I’ve missed anything, Helena.
Helena Chen: No, it looks like you got all of them.
Brian Morin: Okay. Well, I think what we can do is give just a little time back to everybody. As mentioned, we will be sending over the on-demand, so you will have that to share with stakeholders, and of course, we’re happy to engage one-on-one with any questions or demo experiences that you would like yourself. Helena, any closing remarks on your end before we close out?
Helena Chen: No, I think I’m good. Thank you all for attending, and enjoy the 10 minutes we’re giving back to you.
Brian Morin: Yes. [inaudible 00:50:04] Go be fruitful with it. We thank everyone for joining, and as far as anything incentive-based for attendees, our hosts will be following up with you about that. So thank you for everybody’s time, and we look forward to chatting with you soon. Thank you.
Helena Chen: Thank you.