Google’s AI can now pick out individual voices in a noisy room
People are, generally speaking, much better than computers at picking out a single voice in a crowd. You’ll know this if you’ve ever tried to say something to your smart speaker while someone else is talking at the same time. Chances are it asked you to repeat your command.
Now, this could be about to change, following Google’s announcement that it has trained an AI model to separate distinct speech signals from a single audio recording.
In a blog post, the company reveals its new deep learning model works by using both the auditory and visual signals of an input video – in short, it lip reads.
“The visual signal not only improves the speech separation quality significantly in cases of mixed speech (compared to speech separation using audio alone, as we demonstrate in our paper),” the post reads. “Importantly, it also associates the separated, clean speech tracks with the visible speakers in the video.”
Google demonstrates its new AI model using a series of videos, including one of two stand-up comedians talking loudly at the same time (which you can watch below), and its effectiveness is startling. It can pick out either man’s voice without any problems, and the speech is so clear there’s no clue anyone else was even speaking on the original recording.
Google says that all a user needs to do is select the face of the person in the video they want to hear. Alternatively, the software can pick a person’s face algorithmically based on context.
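Under the hood, models of this kind typically predict a time-frequency mask for each visible speaker and apply it to the mixed audio’s spectrogram. As a rough illustration only (the function names, shapes and face IDs below are invented for this sketch, not Google’s actual API), the final “select a face, hear that voice” step might look like:

```python
import numpy as np

def separate_selected_speaker(mixture_spec, speaker_masks, selected_face):
    """Keep only the time-frequency bins attributed to the chosen speaker.

    mixture_spec  : (freq, time) magnitude spectrogram of the mixed audio
    speaker_masks : dict mapping a face ID to a (freq, time) mask in [0, 1],
                    as predicted by the audio-visual model for each face
    selected_face : the face the user clicked (or the one picked by context)
    """
    mask = speaker_masks[selected_face]
    return mixture_spec * mask  # element-wise masking of the spectrogram

# Toy example: two "speakers" occupying disjoint frequency bands.
low = np.zeros((8, 10)); low[:4] = 1.0    # speaker A's energy (low bands)
high = np.zeros((8, 10)); high[4:] = 1.0  # speaker B's energy (high bands)
mixture = low + high                       # the noisy, mixed recording

masks = {"speaker_a": low, "speaker_b": high}
clean_a = separate_selected_speaker(mixture, masks, "speaker_a")
# clean_a now contains only speaker A's bands; speaker B is silenced
```

In the real system the masks come from a deep network conditioned on lip movements, and the masked spectrogram is inverted back to a waveform; the toy masks here just make the selection step concrete.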
There are a number of ways the technology could be used, and perhaps to allay the public’s likely (and probably well-founded) concerns about privacy, Google has led with the rather dry example of speech recognition for automatic video captioning.
None of the current generation of smart speakers use cameras to interact with users, but it’s not impossible to imagine such technology could be built into speakers in the future, especially if it’s under the guise of offering video calling from the comfort of your living room. The tech could also conceivably improve the performance of voice-control software on phones, tablets, PCs and even televisions.
Google’s AI isn’t the first to offer speech separation – last May, Mitsubishi unveiled a deep learning model that could separate two simultaneous speakers with 90% accuracy – but Google claims its model produces better results than both audio-only models like Mitsubishi’s and other recent audio-visual speech separation methods, which typically need to be retrained for every speaker of interest.