Google Voice improves speech-to-text using the power of “thinking” computers

Vaughn Highfield July 24, 2015

Listening to a voicemail takes around 30 seconds on average. The end result is usually a trivial “it’s mum, call me back please” message or, even worse, a pocket dial. Enter Google Voice – a service that offers full text transcripts of voicemails, saving you time. Or it would, if it weren’t so prone to errors. Thankfully, Google says it has cut transcription errors by 49%, thanks to deep neural networks (DNNs).


Drawing on Voice users’ voicemails for research, Google has upgraded its systems with the snappily named long short-term memory (LSTM) deep recurrent neural networks. The improvements mean Google Voice should no longer deliver nonsensical transcripts of voicemails. Shame it’s only available in the US and through Project Fi.

An in-depth research paper from Google clearly shows the reason it started using LSTM – the old system of keyword spotting just wasn’t cutting the mustard.

“[DNN] was shown to significantly outperform a baseline keyword-filler system,” states the paper. “[DNN] is appealing for our task because it can be implemented very efficiently to run in real-time on devices and power consumption can be easily adjusted by changing the number of parameters in the DNN.”

However, the DNN solution was far from perfect, as recognition degrades “significantly when speech is corrupted by noise, or when the distance between the speaker and the microphone increases.” The idea behind using DNN technology was to help Google Now understand and select the sections of audio that contained a voice. In the testing phase, Google would add artificial noise to speech tracks, forcing its systems to listen more carefully to what was being said. To combat quiet speech, DNN allowed Google’s systems to select and boost near-inaudible sections of audio.
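For the curious, the noise-injection idea can be sketched in a few lines of Python. This is only an illustration, not Google's pipeline: the function names (`rms`, `add_noise`), the use of Gaussian noise, the 10 dB signal-to-noise ratio and the sine tone standing in for a speech track are all assumptions made for the example.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def add_noise(speech, snr_db, rng):
    """Mix Gaussian noise into speech at the requested signal-to-noise ratio.

    Hypothetical helper: returns (noisy_signal, scaled_noise).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    # Scale the noise so that 20 * log10(rms(speech) / rms(noise)) == snr_db.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    scale = target_noise_rms / rms(noise)
    scaled = [n * scale for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled)]
    return noisy, scaled

# A 440 Hz tone sampled at 8 kHz stands in for a clean speech track.
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noisy, noise = add_noise(speech, snr_db=10, rng=random.Random(0))

# Check that the mix really sits at the requested SNR.
measured_snr = 20 * math.log10(rms(speech) / rms(noise))
```

Lowering `snr_db` makes the noise louder relative to the voice, which is the knob a training set would turn to produce progressively harder examples.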

It’s all certainly very interesting and incredible technology, but I’m sure you’re reading this and wondering “what the hell is an LSTM or DNN, and how is any of it making Google Voice better?” Well, if you want to know how all of Google’s speech processing works, the company has been kind enough to provide some incredibly dense white papers that detail everything.

In layman’s terms, LSTM is a form of “thinking” for neural networks. It’s a type of recurrent neural network (RNN) architecture that’s well suited to learning from and classifying sequences. Like other RNNs, it learns about the world by collecting data and gradually builds a better picture of its environment. This is exactly what Google wants its voicemail-transcription technology to do – transcribe more accurately by recognising the sounds and speech patterns of callers. But speech recognition, especially speech-to-text, isn’t simple.
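To make the “memory” part a little more concrete, here is a toy single-unit LSTM cell in plain Python. It follows the standard textbook gate equations, but it is not Google's model: the weights are hand-picked rather than trained, and the names `lstm_step`, `run` and `weights` are invented for this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM cell.

    w maps a gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, squash):
        wx, wh, b = w[name]
        return squash(wx * x + wh * h_prev + b)

    f = gate("forget", sigmoid)        # how much old memory to keep
    i = gate("input", sigmoid)         # how much new information to admit
    o = gate("output", sigmoid)        # how much memory to expose
    g = gate("candidate", math.tanh)   # the new information itself
    c = f * c_prev + i * g             # updated cell memory
    h = o * math.tanh(c)               # updated output
    return h, c

# Hand-picked illustrative weights (untrained, purely hypothetical).
weights = {
    "forget": (0.0, 0.0, 2.0),
    "input": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

def run(sequence):
    """Feed a sequence through the cell, starting from empty memory."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, weights)
    return h, c

h_heard, c_heard = run([1.0, 0.0, 0.0])    # a sound, then silence
h_silent, c_silent = run([0.0, 0.0, 0.0])  # silence throughout
```

Two steps after the input has gone quiet, the first run still carries a non-zero memory while the silent run stays at zero. That lingering trace of earlier input is what lets an LSTM use what it has already heard when judging what comes next.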

Nigel Cannings, CTO of Intelligent Voice, revealed the difficulties of building incredibly accurate voice-recognition technology. Traditional speech-recognition tools work by listening on a syllable-by-syllable basis. But humans do things differently: we subconsciously listen and predict which words will come next to form, and near-instantly understand, a sentence.

“Speech recognition is purely temporal. DNN is very bad for that. It’s great for images, but bad for speech,” said Cannings. “Think of speech as a collection of a million images all in a row and – to understand the next image – you need to understand the 30 images before, and the 50 images that are coming next.”
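Cannings's picture of needing the frames on either side can be sketched as a simple sliding window. The 30-before and 50-after figures are his illustrative numbers from the quote above; the function name `context_window` and the choice to repeat the first or last frame at the edges are assumptions for the example, and the integers stand in for real audio frames.

```python
def context_window(frames, t, past=30, future=50):
    """Return the frames from t - past to t + future inclusive.

    Indexes that fall off either end are clamped, so the first or last
    frame is repeated as padding.
    """
    last = len(frames) - 1
    return [frames[min(max(i, 0), last)]
            for i in range(t - past, t + future + 1)]

# Integers stand in for audio frames so the window is easy to inspect.
frames = list(range(200))

window = context_window(frames, t=100)  # frames 70 through 150
edge = context_window(frames, t=0)      # padded with copies of frame 0
```

Each window holds 81 frames: the one being classified plus its context. The window centred on frame 100 cannot be assembled until frame 150 has arrived, which is one reason this kind of look-ahead is easier for voicemail than for live speech.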

Interestingly, RNNs are considered the “end of debate” when it comes to speech recognition, if they can be made to work. The dream is, according to Cannings, to be able to turn data into text and decode the information incredibly quickly, all with low file sizes. Currently, “the only problem with neural networks is the amount of frames they can hold”, claims Cannings. RNNs just aren’t big enough to handle the amount of data needed to decode whole sentences at a time.

It’s unclear exactly how Google is putting LSTM technology behind Google Voice. Perhaps, as Cannings suggests, it’s taking one word at a time and turning it into text – after all, Google Voice doesn’t need to transcribe a voicemail instantly, in real time.