Cars that talk and listen

Eric Soule
Sensory Inc.
Santa Clara, Calif.

Edited by Sherri Koucky

My college roommate had an '81 Datsun 200SX that could talk. At appropriate times, a sultry female voice would offer reminders such as "fuel is low," "lights are on," and "door is ajar." In a way, Datsun was ahead of its time. Not only have speech technologies matured enough to become useful, but automotive applications have emerged in the past 20 years that make speech man-machine interfaces (MMIs) essential. Speech recognition and text-to-speech (TTS) offer ideal complementary solutions to more traditional visual and tactile man-machine interfaces such as displays, buttons, and knobs.

While at least 27 states in 2000 were considering legislation banning cell phones in cars, recent data appears to indicate there may be no correlation between automotive fatalities and wireless cell-phone growth. The debate over the safety impact of cell phones will no doubt rage for years to come. Regardless of where these debates lead, the need for hands-free communication within the car will accelerate. Today, there is a growing trend toward complex and feature-rich entertainment, navigation, and telematics systems, which compete for the driver's attention, including visual, auditory, and cognitive mindshare.

It's possible Datsun's speech playback developers thought they were starting a trend in driver interfaces. The problem was that the verbal warnings could have been more simply performed by an idiot light on the instrument cluster. But what if your car, hot-synched to your handheld device, could remind you to pick up groceries on the way home, complete with a shopping list? Or if your car could talk you through the navigation of an unfamiliar neighborhood? The point is, it's time to think about how speech technologies can enhance the quality of our lives by keeping us safer in cars, while letting us be more productive.

Navigation systems
Navigation systems using GPS technology have already migrated down to the "near-luxury" car segment and, in the future, will be available in any car likely to be used for road trips. While most systems today are fairly good at calculating directions, getting information in and out of them remains challenging and distracting to the driver. Using a keypad to enter addresses is challenging for passengers and impossible for drivers unless they pull over. Voice-recognition technology offers a natural solution for this issue because it lets a driver keep eyes on the road and hands on the steering wheel. But once the system is told where to go, what is the best way for it to point out the optimal route? Here again, most current-generation systems require looking at a map on a video display and/or reading written directions. Text-to-speech and compressed-audio playback technologies offer a far less distracting interface, communicating with the driver as if the system were a passenger holding a map and reading directions.

Entertainment systems
Digital multimedia in the car means access to potentially thousands of song tracks (such as MP3s stored on a hard disk) or hundreds of stations (using satellite-based radio). The problem, again, is how to sort through this large database of music or videos conveniently and safely while driving. Clearly, just being able to say the name of the song, artist, or genre of music offers a quicker and simpler form of interface than buttons or displays. Here, speech technologies offer significant advantages.

Telematics
Telematics is a trendy and vague buzzword that generally includes any type of service that enables mobile communication in a vehicle. It has evolved from crash notification to include Internet access and e-mail. Because nearly 70% of all wireless airtime is traced to people driving around, carriers have a strong motivation to serve this mobile market.

But the problems with retrieving information from the Web or responding to e-mails are magnified in the car. There is vigorous debate as to whether the car is an appropriate place to be performing these tasks at all, but assuming they can be done efficiently and safely, speech recognition and TTS are certain to play big roles.

Telematics gets really exciting when portable devices, namely cell phones and PDAs, become seamlessly integrated with our cars, homes, and offices. People use devices differently depending on their environment.

For example, cell-phone users may prefer to dial using the keypad in the office, but with voice in the car. It would be convenient if a PDA, dropped into a cradle in the car, could dial any person in a contact database on utterance of their name. Similarly, digital-music players and other personal devices may be more conveniently operated by voice when used in the car, and with buttons and displays in other environments.

General command and control
Command and control is the category with the greatest misapplications of voice-recognition technology. While it's certainly possible to create systems that accurately let drivers control windows, mirrors, seats, and climate-control functions, the old adage "just because you can, doesn't mean you should" applies perfectly here.

Buttons and knobs are quicker and easier to use than voice commands for these functions (and simpler than now-trendy touch screens, for that matter). But modern cars boast so many features that certain non-time-critical functions, such as setting the cruise control or accessing trip computer information, are better served via speech.

In fact, many features go unused simply because people are either unable or unwilling to figure out the controls. What if it were possible to just say "set cruise control to 65" or "turn on the interior lights"?
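
As a rough illustration of how spoken commands like these might be mapped to vehicle functions once the recognizer has produced text, here is a minimal Python sketch. The pattern list, the action names (`set_cruise`, `interior_lights`), and the `parse_command` helper are all hypothetical and are not taken from any shipping system.

```python
import re

# Hypothetical command patterns; the phrasings and action names are illustrative only.
COMMANDS = [
    (re.compile(r"^set cruise control to (\d{1,3})$"), "set_cruise"),
    (re.compile(r"^turn (on|off) the interior lights$"), "interior_lights"),
]


def parse_command(utterance):
    """Map recognized text to an (action, argument) pair, or None if nothing matches."""
    text = " ".join(utterance.lower().split())  # normalize case and spacing
    for pattern, action in COMMANDS:
        match = pattern.match(text)
        if match:
            return action, match.group(1)
    return None


print(parse_command("Set cruise control to 65"))
print(parse_command("turn on the interior lights"))
print(parse_command("open the sunroof"))
```

A small, fixed set of patterns like this keeps the active vocabulary short, which suits the limited command sets the article describes for in-car systems.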

Voice is such a natural fit as a man-machine interface that it is difficult to argue it will never become prevalent in cars. However, several factors could impede the adoption of speech-enabled functions. Because of the complexity of voice-enabled features, products that are introduced with poor speech interfaces will slow consumer acceptance. Not only must voice systems be well designed, but consumers also have to get used to a new way of doing things in their cars.

The situation is similar to the challenges faced by makers of telephone-answering machines 20 years ago, when many people would hang up rather than speak to a message machine. There are other factors to address before speech technologies enjoy widespread adoption.

There are no widely adopted standards for measuring speech-recognition accuracy, mainly because it's a challenging task with many variables. Regardless of what vendor spec sheets say, developers need to evaluate recognition accuracy using their own hardware in the same conditions the user will encounter.

Variations in environmental acoustics, speaker accents, and background noise all degrade accuracy. Weather conditions, rolling down the windows, a passing truck, and kids in the back seat can all wreak havoc on accuracy unless systems have robust software and careful system design.
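
One way developers might score their own in-vehicle test recordings is word error rate: the number of word substitutions, insertions, and deletions needed to turn the recognizer's output into the reference transcript, divided by the length of the reference. The Python sketch below assumes that metric and groups results by test condition. The function names and the condition labels are illustrative, not part of any standard test procedure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def wer_by_condition(trials):
    """Average word error rate per condition; trials are (condition, reference, hypothesis)."""
    scores = {}
    for condition, reference, hypothesis in trials:
        scores.setdefault(condition, []).append(word_error_rate(reference, hypothesis))
    return {condition: sum(v) / len(v) for condition, v in scores.items()}


# Made-up trials, for illustration only.
trials = [
    ("windows up", "set cruise control to 65", "set cruise control to 65"),
    ("windows down", "set cruise control to 65", "set cruise control to 55"),
]
print(wer_by_condition(trials))
```

Breaking results out by condition (windows down, passing traffic, different speakers) shows where a recognizer degrades, which a single headline accuracy figure hides.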

Dialog design
Many designers of speech-recognition systems argue that proper dialog design, i.e., the scripting of the conversation that takes place between the user and the device, is more critical than the absolute accuracy of the speech recognizer. People expect Star Trek features and levels of performance, and current technology is not there yet. The types of prompts, grammar design (how the recognizer parses the user's response), and voice-menu structure all greatly affect the likelihood that the user can quickly and successfully complete the task at hand.
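
To make the role of dialog design concrete, the following Python sketch shows one possible prompting policy: act on a confident result, ask for confirmation on a doubtful one, and fall back to the manual controls after repeated failures. The thresholds, the attempt limit, and the prompt wording are assumptions chosen for illustration. It also assumes the recognizer reports a confidence score between 0 and 1, which not every engine does.

```python
ACCEPT_THRESHOLD = 0.80   # assumed: at or above this, act without asking
CONFIRM_THRESHOLD = 0.50  # assumed: at or above this, ask the user to confirm
MAX_ATTEMPTS = 3          # assumed: stop reprompting after this many tries


def next_prompt(command, confidence, attempt):
    """Return (dialog_state, spoken_prompt) for one recognition result."""
    if confidence >= ACCEPT_THRESHOLD:
        return "execute", f"OK, {command}."
    if confidence >= CONFIRM_THRESHOLD:
        return "confirm", f"Did you say {command}?"
    if attempt < MAX_ATTEMPTS:
        return "reprompt", "Sorry, please repeat the command."
    return "fallback", "Please use the manual controls."


print(next_prompt("set cruise control to 65", 0.92, 1))
print(next_prompt("set cruise control to 65", 0.61, 1))
print(next_prompt("set cruise control to 65", 0.20, 3))
```

A policy like this lets a well-scripted dialog recover from recognition errors, which is the sense in which dialog design can matter more than raw recognizer accuracy.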

A voice interface represents a serial form of communication, and this fact must be considered when designing speech recognition into a car environment. A visual scan of the instrumentation lets the driver quickly determine speed, engine rpm, fuel level, and possible problems. Hearing this information spoken may free the driver's eyes, but it's a slower and less-convenient way to convey the same information that can be taken in with a single glance.

Displays also have the advantage of persistence, in contrast with the transitory nature of speech. Think of the difference between transcribing the contents of a voice-mail message, or someone's phone number as they speak it, compared with seeing the same information already written down. For some kinds of in-car communication, it will always be preferable to offer information that can be digested visually rather than audibly, and vice versa.

Hey, where's my (speech) engine?
While the user doesn't care where the speech-recognition software runs, the system designer needs to. Next-generation systems will use one or more of the following architectures: embedded, server-based, or distributed. With an embedded system, recognition runs inside the car. Latency is low because no voice data has to travel to a central server, so users perceive real-time response.

Also, an embedded system does not require the car to be connected. Someday cars may be connected to a network all the time and wherever they travel, but that's not today's reality. When recognition processing occurs inside the car, the driver has full voice-activation functionality even when wireless communication is not available. Plus, there's no need to subscribe to a service with an embedded system. Consumers like services but hate paying for them, especially when they don't need to.

With an embedded system, developers also have complete control over the audio path, from microphone to speech processor. This is an advantage over solutions that must rely on potentially noisy network connections and equipment-to-equipment variability.

Embedded solutions also present disadvantages, notably that computing horsepower and memory come at a premium when located inside the vehicle. As such, cost constraints have forced developers of current-generation systems to limit vocabulary sizes to about 100 words. But, as processor, DSP, and memory prices continue to drop, and as speech-recognition algorithms improve, vocabulary sizes will quickly increase to thousands of words, enabling applications such as natural voice entry of addresses into navigation systems.

Another approach is to perform all speech-recognition processing on a central server connected to a telephone switch. This provides virtually unlimited computing horsepower. Systems can recognize large vocabularies even with very thin clients inside the car. With a central server, the information that the driver is trying to access, such as a map, is easier to update than it would be in individual cars. While difficult or impossible to do in the car, dialog design can easily be altered on a server using VoiceXML (VXML). Also, the system works with any phone, and doesn't require any special hardware.

However, as with every rose, there are thorns. Server-based speech recognition won't work if you're not connected, and it's subject to noisy channel problems, different microphones, and equipment variability.

In the context of speech-recognition and TTS systems, distributed processing involves performing a small amount of pre- or postprocessing on the client, with the compute-intensive functions taking place on the server. For speech recognition, this means doing the signal-conditioning and feature-extraction functions in the car and transmitting a small data packet to the server to be recognized. With TTS, low-data-rate information goes to the car where a low-cost processor turns it into speech output. This approach avoids the noisy voice-channel problem, and it needs no separate audio and data channels. It also reduces the need for computing horsepower on both client and server.
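
The client-side half of this split can be sketched in a few lines of Python. The example below cuts audio into short overlapping frames, reduces each frame to a single log-energy value, and packs the result for transmission. Log energy is only a stand-in here; a real front end would send a richer feature vector per frame, and the sample rate, frame sizes, and function names are assumptions for illustration rather than values from any standard.

```python
import math
import struct

SAMPLE_RATE = 8000  # assumed: telephone-band audio, samples per second
FRAME_LEN = 200     # assumed: 25-ms analysis frame
FRAME_STEP = 80     # assumed: 10-ms step between frames


def log_energy_features(samples):
    """Reduce raw samples to one log-energy value per frame (a stand-in feature)."""
    features = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_STEP):
        frame = samples[start:start + FRAME_LEN]
        energy = sum(s * s for s in frame) / FRAME_LEN
        features.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return features


def pack_features(features):
    """Serialize features as little-endian 32-bit floats for sending to a server."""
    return struct.pack(f"<{len(features)}f", *features)


# One second of a synthetic 440-Hz tone in place of microphone input.
audio = [int(1000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
         for n in range(SAMPLE_RATE)]
packet = pack_features(log_energy_features(audio))
raw_bytes = len(audio) * 2  # the same audio as 16-bit samples
print(len(packet), "feature bytes versus", raw_bytes, "raw audio bytes")
```

Even this toy feature shrinks one second of audio from thousands of bytes to a few hundred, which is why the extracted features can travel over a data channel instead of a voice channel.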

Disadvantages of distributed processing include the need for "infrastructure." Because special hardware is required on the client side and, at a minimum, new software on the server side, the distributed approach demands a substantial commitment from developers and service providers. Standards are also required for interoperability between different cars and systems, and these standards are just starting to emerge. A leading contender is the Aurora standard (www.etsi.org).

Current technology would have given the Datsun engineers in the '80s the ability to turn their idea into a valuable feature. When the car said "fuel is low," it could then follow up by asking, "Would you like to locate the nearest gas station?" Now if only the car could fill its own gas tank.