Speech Recognition and Natural Dialogue with Virtual Agents

By James Hammerton

Today, many call centers employ speech applications where the customer speaks to a computer rather than to an agent. The benefits of doing so are well known: providing 24/7 coverage, reducing costs, and lessening the burden on call center agents by handling simple tasks automatically. Rather than replacing human agents, speech applications complement them, handling the more mundane tasks while leaving humans to deal with more complicated tasks or with situations where a customer insists on talking to an agent.

Typically, the speech applications employed today are rigid, menu-driven applications, asking users for one piece of information at a time. However, recent advances in speech-recognition technology and dialogue systems support more natural forms of dialogue. This enables callers to conduct their business more quickly, increases the range of calls that can be handled automatically, and makes the speech applications easier to use.

Natural dialogue in speech applications: Supporting natural dialogue poses a number of challenges. The grammars, which define the utterances that can be understood by the system, have to be larger, taking more effort to develop. Larger grammars also increase the risk of error as the system has more options from which to choose, and there is a greater risk that a caller’s utterance will be ambiguous, requiring intelligent handling to resolve the ambiguity. Once an application does more than ask for a single piece of information at a time, there is also a greater risk that it will follow a completely inappropriate path because of errors. These challenges have been addressed by the development of a range of new technologies.

Statistical Language Models (SLMs): Used both in conjunction with, and in place of, traditional grammars, SLMs compute the probability of a word occurring, usually based on the two previous words. This information is used to help decide what the caller has said. SLMs are effective at improving speech-recognition performance but require large numbers of transcribed utterances to provide the necessary statistics.
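
To make the idea concrete, here is a minimal sketch of a trigram SLM in Python, assuming a toy corpus of transcribed utterances and plain relative-frequency estimates; a production system would use far larger corpora and smoothing.

    from collections import defaultdict

    # Toy trigram model: P(word | two previous words), estimated from
    # transcribed utterances. The corpus below is an illustrative placeholder.
    corpus = [
        "i want to check my balance",
        "i want to pay my bill",
        "check my balance please",
    ]

    trigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)

    for utterance in corpus:
        words = ["<s>", "<s>"] + utterance.split() + ["</s>"]
        for i in range(2, len(words)):
            trigram_counts[(words[i - 2], words[i - 1], words[i])] += 1
            bigram_counts[(words[i - 2], words[i - 1])] += 1

    def trigram_prob(w1, w2, w3):
        """Relative-frequency estimate of P(w3 | w1, w2); 0.0 if the history is unseen."""
        history = bigram_counts.get((w1, w2), 0)
        return trigram_counts.get((w1, w2, w3), 0) / history if history else 0.0

    print(trigram_prob("want", "to", "check"))  # 0.5 on this toy corpus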

Auto-generation of grammars: This reduces the burden on grammar writers. Although grammar writing cannot be totally automated, it is possible, for example, to provide grammars that have slots in them so that they merely need customization, rather than every single phrase being written from scratch.
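
As a rough illustration of the slot idea (using a made-up template notation rather than any particular grammar standard), only the slot values below would need to be written for a new application; the skeleton is reused:

    # Hypothetical slot-based grammar template: the skeleton is shared across
    # applications and only the slot values need to be supplied.
    TEMPLATE = "i (would like|want) to <action> my <item>"

    def expand_template(template, slots):
        """Fill each <slot> with an alternation of the supplied phrases."""
        grammar = template
        for name, values in slots.items():
            grammar = grammar.replace("<" + name + ">", "(" + "|".join(values) + ")")
        return grammar

    print(expand_template(TEMPLATE, {
        "action": ["check", "pay"],
        "item": ["balance", "bill"],
    }))
    # -> i (would like|want) to (check|pay) my (balance|bill)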

Backing-off: The recognition system first tries a grammar tailored to the current application prompt, and if it doesn’t get a match, it then backs off to a grammar covering a wider context. This allows narrow-coverage grammars to be exploited for their good recognition performance while enabling wide-coverage grammars to be exploited for their flexibility.
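
In outline, the back-off strategy might look like the following sketch, where recognize is a stand-in for a real recognizer call and is assumed to return None when nothing in the grammar matches with sufficient confidence:

    def recognize_with_backoff(audio, grammars, recognize):
        """Try each grammar in turn, narrowest (prompt-specific) first."""
        for grammar in grammars:
            result = recognize(audio, grammar)
            if result is not None:
                return result, grammar
        return None, None  # nothing matched; fall back to error handling

    # Demo with a stand-in recognizer that only "hears" words in the grammar.
    def fake_recognize(audio, grammar):
        return audio if audio in grammar else None

    grammars = [{"yes", "no"},                       # narrow, prompt-specific
                {"yes", "no", "help", "operator"}]   # wider coverage
    print(recognize_with_backoff("operator", grammars, fake_recognize))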

N-best recognition results: A speech recognizer returns several best guesses of what was said, and the application chooses the guess that best matches the current context.
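
For example, in a sketch along these lines (the values and scores are invented), the application might skip a higher-scoring guess that makes no sense for the field being filled:

    # N-best sketch: the recognizer returns several guesses with confidence
    # scores, and the application keeps the best one that fits the current
    # context (here, the set of values the dialogue is expecting).
    def pick_from_nbest(nbest, expected_values):
        """nbest is a list of (hypothesis, confidence), best first."""
        for hypothesis, confidence in nbest:
            if hypothesis in expected_values:
                return hypothesis, confidence
        return None, 0.0

    nbest = [("dublin", 0.62), ("london", 0.58), ("lisbon", 0.31)]
    print(pick_from_nbest(nbest, {"london", "lisbon"}))  # -> ('london', 0.58)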

Look-ahead capabilities: The dialogue engine is aware of the future paths an application can take and uses this information to match user input to both current and future fields as needed.
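
A minimal sketch of the look-ahead idea, assuming the grammar parse has already mapped the caller's words to named slots: anything the caller volunteers for an upcoming field is kept rather than asked for again.

    # Look-ahead sketch: the dialogue engine knows which fields the current
    # prompt asks for and which come later, so extra information volunteered
    # by the caller ("to London on Friday") can fill future fields too.
    def assign_values(recognized, current_fields, future_fields):
        """recognized maps slot names to values, e.g. from the grammar parse."""
        return {field: recognized[field]
                for field in current_fields + future_fields
                if field in recognized}

    recognized = {"destination": "london", "travel_day": "friday"}
    print(assign_values(recognized, ["destination"], ["travel_day", "ticket_type"]))
    # -> {'destination': 'london', 'travel_day': 'friday'}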

Value confirmation: The dialogue engine is capable of providing implicit and/or explicit confirmation of what a caller has said. With implicit confirmation, the system tells the caller what values it recognized, before giving the next prompt. With explicit confirmation, the system asks the caller whether the recognized values are correct before using them. Implicit confirmation allows the caller to move on quickly if the values are correct or to provide correction at the risk of having to undo some work. Explicit confirmation slows things down, but the application won’t need to undo actions taken because of an error. Judicious use of both types of confirmation helps ensure that what the caller thinks is happening matches what the application is doing.
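
One plausible policy, sketched below with invented thresholds rather than any product's actual rules, is to confirm explicitly when confidence is low or the pending action cannot easily be undone, and implicitly otherwise:

    def confirmation_prompt(field, value, confidence, reversible, threshold=0.7):
        """Choose between explicit and implicit confirmation of a recognized value."""
        if confidence < threshold or not reversible:
            # Explicit: wait for a yes/no before acting on the value.
            return "You said " + value + " for the " + field + ". Is that correct?"
        # Implicit: echo the value and move straight on to the next prompt.
        return "OK, " + value + ". "

    print(confirmation_prompt("payment amount", "200 euros", 0.65, reversible=False))
    print(confirmation_prompt("destination", "London", 0.92, reversible=True))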

Coping with errors: An important aspect of any speech system is its error handling. Recognition errors will occur, because it’s not possible for speech recognition to cope with all possible sources of error, such as strong accents, noisy lines, background noise, coughing, repeated words, and the use of words that don’t occur in the system’s vocabulary.

With more natural dialogue, error handling is doubly important to minimize the risk of the system taking the wrong path. For this reason, various strategies are often employed for minimizing errors. Examples include acoustic disambiguation where an application will ask the user, “Did you say X, Y or Z?” when it can’t find a clear winner in the n-best results; acoustic verification where a dialogue system will ask the user, “Did you say X?” when the recognition confidence is low; and semantic disambiguation where the system asks, “Did you mean X or Y?” if the input can be interpreted in more than one way.
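
These strategies can be combined into a single clarification step; the sketch below chooses among them using invented thresholds and a hypothetical meanings table mapping an ambiguous phrase to its possible interpretations:

    def clarification_prompt(nbest, meanings, confidence_threshold=0.6, margin=0.1):
        """nbest: list of (hypothesis, confidence), best first.
        meanings: maps a hypothesis to the set of interpretations it allows."""
        top, top_conf = nbest[0]
        # Semantic disambiguation: the top hypothesis itself is ambiguous.
        if len(meanings.get(top, {top})) > 1:
            return "Did you mean " + " or ".join(sorted(meanings[top])) + "?"
        # Acoustic disambiguation: no clear winner among the top hypotheses.
        rivals = [h for h, c in nbest[1:] if top_conf - c < margin]
        if rivals:
            return "Did you say " + ", ".join([top] + rivals) + "?"
        # Acoustic verification: a clear winner, but with low confidence.
        if top_conf < confidence_threshold:
            return "Did you say " + top + "?"
        return None  # confident and unambiguous; no clarification needed

    print(clarification_prompt([("austin", 0.55), ("boston", 0.50)], {}))
    # -> Did you say austin, boston?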

Handing over to a human is still sometimes necessary: A common problem with speech systems is that callers find themselves trapped when things go wrong and are unable to get to a human without starting over and risking being stuck again. By keeping track of variables such as recognition confidences, how often a caller has been re-prompted, and whether (or how often) the caller has tried to correct the system, a speech application can monitor how well the dialogue is going. Should things go badly, it can transfer the call to a human agent. A good system should also enable callers to get to a person quickly whenever they need to. Ideally, the human agent will also receive information about what the caller was trying to do.
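
A rough sketch of such monitoring, with invented counters and thresholds that a real application would tune per call flow:

    class DialogueMonitor:
        """Track how well the dialogue is going and decide when to hand over."""

        def __init__(self, max_strikes=3):
            self.strikes = 0
            self.max_strikes = max_strikes

        def record_turn(self, confidence, reprompted, caller_corrected):
            if confidence < 0.5:
                self.strikes += 1
            if reprompted:
                self.strikes += 1
            if caller_corrected:
                self.strikes += 1

        def should_transfer(self, caller_asked_for_agent=False):
            # Transfer on explicit request, or when the dialogue is going badly.
            return caller_asked_for_agent or self.strikes >= self.max_strikes

    monitor = DialogueMonitor()
    monitor.record_turn(confidence=0.40, reprompted=True, caller_corrected=False)
    monitor.record_turn(confidence=0.45, reprompted=False, caller_corrected=True)
    print(monitor.should_transfer())  # -> True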

Handing the call over to a person is still necessary at times because, no matter how good the recognition or dialogue engines are, they may not be able to cope with a strong accent, a bad line, or a caller using words that are not in the system’s vocabulary, and the dialogue will then break down. This will leave callers frustrated unless they can speak to a human who can deal with the situation.

Conclusion: Menu-driven speech applications are well established, and it is now possible to improve these applications by allowing more flexible forms of dialogue. This can lead to a better caller experience, make speech applications easier to use, and further reduce the burden on human contact-center agents. However, adequate error-handling mechanisms and the ability to transfer to a human agent must be provided, or callers will become frustrated.

Dr. James Hammerton is a natural language processing and artificial intelligence specialist at Graham Technology and the principal researcher behind the dialogue engine in Graham Technology’s agent247 technology, an extension to the company’s business-process modeling system, GT-X7, which enables business processes to be driven via speech or textual input from the user.

[From Connection Magazine May 2007]