A Call for Speech Recognition

By Dan Cropp

My father never liked computers. When asked why, his answer was one heard often about technology: “It’s not user-friendly.” Dad wanted his computer to be simple to use, like his telephone. When I think about automated phone systems and speech recognition tools, I see the wisdom in my father’s words.

Many people dislike automated phone systems. Some gripe about going through menu after menu, only to be sent back to the beginning. Others complain that handset keypads can’t be used with such systems. The user usually has to put the phone to their ear to listen to the prompts, move the phone to read the digits and press the right one, and then put the phone back to their ear to listen for the next prompt. Many automated systems don’t allow time for these gymnastics, so a call can take several attempts to complete.

I can’t use my cell phone with such systems. The keypad is so small and hard to hit that I’ve stopped trying to use it during a call. To be fair, capturing digits has long been the only way to reliably retrieve information from callers.

However, speech recognition is rapidly altering this reality. Speech recognition generally falls into three categories: speaker verification, speaker-dependent recognition, and speaker-independent recognition.

Speaker verification is used to verify that a particular person is calling. It’s typically used for security purposes to match a voice to a previously recorded voice.

Speaker-dependent recognition describes a system that must be trained to recognize an individual voice. Once trained, the system can recognize what the person is saying, word for word. This type of recognition is typically found in transcription environments.

Speaker-independent recognition is used in environments where anyone might call in and the system must be able to recognize any voice. This type of speech recognition is the most common in the telephone world. Speaker-independent recognition is typically command-based. At any time, there is a limited set of commands and phrases the system expects to hear.

For example, a voicemail system might ask the caller to verbalize an action for a message they just listened to: “What would you like to do next?” “Delete this message?” “Play the next message?” “Logout?” and so on.
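To make the idea of a limited command set concrete, here is a minimal Python sketch. It assumes the recognition engine has already turned the caller's speech into text, which is a simplification, since a real engine compares the audio itself against the active phrase list. The phrase list, the action names, and the function names are all hypothetical and chosen only for illustration.

```python
import string

# Hypothetical phrase list for the single prompt "What would you like to do next?"
# Each phrase the system expects to hear maps to one action.
VOICEMAIL_COMMANDS = {
    "delete this message": "DELETE",
    "play the next message": "NEXT",
    "logout": "LOGOUT",
    "log out": "LOGOUT",
}

# Translation table that removes punctuation such as "?" and "."
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(utterance):
    """Lowercase the text, drop punctuation, and collapse extra spaces."""
    return " ".join(utterance.lower().translate(_PUNCTUATION).split())


def match_command(utterance, grammar=VOICEMAIL_COMMANDS):
    """Return the action for a recognized phrase, or None if it is not expected."""
    return grammar.get(normalize(utterance))
```

Anything outside the list, such as "Order a pizza," returns None, and the system would simply prompt the caller again. Keeping the list short at each step is what lets the system work with any voice.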

Speech recognition began hitting the mainstream about a decade ago. It came with lots of promises, but it didn’t really deliver. I guess Star Trek led to unreasonable expectations for such new technology. An early speech recognition system required each user to spend hours training it to recognize just that one voice. Even then, it had at best a 97 percent chance of being correct.

I once asked every speech recognition vendor at COMDEX if their technology would work in a phone system. The vendors all politely pointed out that a phone call presents many challenges they hadn’t yet solved.

About five years ago, speech recognition capabilities started appearing in automated phone systems. These were little more than DTMF systems that had been modified to allow users to say digits instead of pressing them. These early systems were fun to try, but offered no significant benefits considering their high costs.

Last week, my home Internet connection wasn’t working. I called my Internet service provider, and an automated voice said there would be a thirty-minute wait before talking to a support technician. The voice asked if I would like to try the automated support system while I waited.

I had work to do, my wife had my car, and I needed to connect to the network at my office, so I said, “Yes.” The system asked me to describe the problem. “I can’t connect to the Internet,” I replied. The system recognized what I said and began asking questions about my hardware.

After a few questions, the system prompted me to unplug the modem. I did and said, “The modem is unplugged.” The system said, “You must now wait sixty seconds and plug it back in. To help you, we will let you know when this time has expired.” After exactly a minute, the system said, “Please plug the modem back in.” I did this and said so. The system told me to wait for the lights on the modem to stop flashing, then power up my computer, and try connecting to the Internet.

I anxiously opened the Web browser on my laptop. Much to my surprise, I was online. I could connect to my office and get my work done. Within a few minutes, I went from a disgruntled customer to one singing the praises of the automated support system.

Besides making calls easier for callers, there’s another compelling reason to consider adopting speech recognition. Numerous studies show that using a cell phone while driving is unsafe. Fifteen states and many municipalities have enacted restrictions on cell phone use while driving. Many more laws affecting cell phone use while at the wheel are in the works. New York now prohibits drivers from using cell phones unless they are hands-free devices. California will begin requiring drivers to use hands-free phones in 2008.

Daily commute times have increased in recent years and will continue to increase. While commuting, many of us use cell phones to keep in touch with family and friends and to get work done during our drive time. These impending cell phone laws will force vendors to add speech recognition to their phone systems or risk losing business.

Still not convinced that speech recognition is worth looking at? Then consider how speech recognition could improve a basic voicemail system. With speech recognition, a voicemail system could do anything we’ve become accustomed to doing with email systems.

Obviously, callers would be able to issue simple voice commands, such as replying, saving, deleting, and navigating through messages. But voicemail systems could be made to support advanced commands, such as “Find all messages from John Smith,” “Are there any other messages from this person?” and “Play all messages recorded yesterday.” A speech recognition-enabled messaging system could be tied directly into multiple email accounts and share the user’s contact list with the voice messaging system, and so much more.

Imagine being able to call a single number to retrieve all of your messages, regardless of their origin. Text messages could be played back using text-to-speech conversion. Voice messages, including text and voice attachments, could be played back over the phone.

Such a system could easily allow you to issue a command, such as “Reply,” for an email message. Your voice would be recorded and then either attached to an email as an audio file or converted to text using a speech-to-text engine.

Another benefit to speech recognition systems is the ability to support multiple languages. A speech recognition system could prompt callers to say their preferred language: “English,” “Español,” or others. This requires the system to recognize just one word up front. From this point on, the system would prompt only in the user’s selected language and, more importantly, it would recognize phrases and pronunciations specific to that language. Multilingual systems require more thought and input from someone familiar with your system and the languages your callers speak.

There are some things that speaker-independent recognition does not do well. At any given time, it uses a list of words and phrases to compare the detected speech against. If a list contains multiple words that sound alike, speech recognition has trouble detecting the differences. For example, a list checking for “Bye” or “Dye” has trouble with the distinction. So do people. Often, it’s better to look for phrases instead: “Good-bye,” “Joseph Dye.”
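One way to catch this kind of trouble before callers do is to screen a phrase list for entries that are too much alike. The Python sketch below is only a rough illustration: it compares spelling, which is a crude stand-in for how words actually sound, and the 0.6 threshold and the function name are assumptions made up for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def confusable_pairs(phrases, threshold=0.6):
    """Return pairs of phrases whose spelling is similar enough to be risky.

    Spelling similarity is only a rough proxy for sounding alike, and the
    threshold is an arbitrary value picked for illustration.
    """
    flagged = []
    for first, second in combinations(phrases, 2):
        score = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if score >= threshold:
            flagged.append((first, second))
    return flagged
```

Run against a list containing “Bye” and “Dye,” the check flags the pair. Run against “Good-bye” and “Joseph Dye,” it flags nothing, which matches the advice to listen for longer phrases.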

Speech recognition also struggles with the alphabet. The alphabet consists of short syllables, and it’s difficult to recognize these over the phone. We’ve become accustomed to saying a letter and a word that starts with that letter: “D as in Dan, A as in Adam, N as in Nancy.” However, there are ways around this problem. One is to have the system ask the caller to say and spell the name: “Dan, D A N.” In my experience, this works sometimes, but not every time.
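The “letter as in word” habit is easy to take advantage of in software, because the word is far easier to recognize than the letter. The Python sketch below assumes the spoken phrase has already been converted to text and simply trusts the first letter of each word. The function name and the pattern are hypothetical, written only to illustrate the idea.

```python
import re

# Matches pieces such as "D as in Dan": one letter, the words "as in", then a word.
_LETTER_AS_IN_WORD = re.compile(r"\b([A-Za-z]) as in ([A-Za-z]+)", re.IGNORECASE)


def spell_from_phrases(transcript):
    """Rebuild a spelled name from "X as in Word" pieces.

    The word is more reliable than the lone letter, so the first letter of
    the word is used even when the letter itself was misheard.
    """
    pieces = _LETTER_AS_IN_WORD.findall(transcript)
    return "".join(word[0].upper() for _letter, word in pieces)
```

Given “D as in Dan, A as in Adam, N as in Nancy,” this returns “DAN.” If the engine mishears the first letter as “B” but still catches “Dan,” the result is unchanged.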

The human brain and voice are capable of so many combinations that speech recognition will never be able to totally replace live phone agents. But just as computers were made to handle redundant tasks and improve human productivity, speech recognition can do the same thing for phone systems.

Speech recognition isn’t a fit for every scenario. It really comes down to what is needed to handle each call. Speech recognition should be viewed in the same light as any new technology. When looking at speech recognition systems, make sure they are flexible and able to meet your needs. Take time to “kick the tires” and “look under the hood” of the system. Some processes may need to be tweaked after the system is up and running.

The first question is, “Will it be an improvement?” If your answer is “Yes,” then ask yourself if the benefits are worth the cost of the resources required, the setup work, and the training time.

If your agents gather lots of personal information from callers, items like phone number, city, state, and postal code can easily be handled by speech recognition. Items such as names and addresses are more difficult, so it may be better to have a live agent handle those.

Speech recognition pricing depends on several factors: number of voice ports, types of recognition, and available languages. The type of recognition depends on the maximum number of phrases that need to be recognized at any given time. For example, the system could ask a caller to say the name of a person in a large directory, solicit a “yes/no” response, or require a keypad digit.

The number of speech recognition licenses needed is determined by the number of simultaneous calls to be handled using speech recognition. If ten callers will be asked questions at the same time, ten speech recognition licenses would be needed.
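If you have call records, the busiest moment can be counted directly. The Python sketch below is a rough illustration that assumes each call is stored as a start time and an end time in the same units, such as minutes past midnight. The function name and the data layout are hypothetical.

```python
def peak_concurrent_calls(calls):
    """Return the largest number of calls in progress at the same moment.

    `calls` is a list of (start, end) pairs. The peak is the number of
    licenses needed to cover that traffic.
    """
    events = []
    for start, end in calls:
        events.append((start, 1))   # a call begins and takes a license
        events.append((end, -1))    # a call ends and frees a license

    # Sorting puts an ending (-1) ahead of a beginning (+1) at the same time,
    # so a license freed at minute 5 can be reused by a call starting at minute 5.
    events.sort()

    active = 0
    peak = 0
    for _time, change in events:
        active += change
        peak = max(peak, active)
    return peak
```

Three calls running from minutes 0 to 10, 5 to 15, and 12 to 20 never overlap more than two at a time, so two licenses would cover them.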

If you haven’t tried speech recognition for a few years, it’s time for another look. Automated phone systems have undergone big changes. Dad is probably saying, “It’s about time they made it easy!”

Dan Cropp, a senior software engineer at Amtelco, is the primary designer of Amtelco’s “Just Say It” speech recognition and interactive voice response products. He was the principal engineer in the migration of Amtelco’s Infinity Telephone Agent software from the DOS platform to the Microsoft Windows operating system in the mid-1990s.

[From Connection Magazine May 2007]