Inside Speech Graphics: creating the virtual faces of the future

I first encountered Edinburgh-based Speech Graphics far from the streets of Scotland, in Austin, Texas, at SXSW Interactive 2014. Fans of rap music, however, may know the company better for its work on the 2013 Kanye West video, “Black Skinhead”.

This proved an excellent showcase for the company’s technology, which creates realistic facial animation based on audio analysis. The video – three minutes of 3D animation and film footage of Kanye West as he stalks the floor – contained some controversial imagery, but for Speech Graphics it was above all a fascinating project to work on.

Michael Berger, co-founder and chief technology officer of Speech Graphics, explains: “They contacted us on a Friday, needing the animation by Tuesday for a three-minute video of continuous rap music – and they wanted all of this animation synchronised with the rap. So it was a fast turnaround, and really high pressure, but we did it.”

Through acoustic analysis of Kanye’s voice track, Speech Graphics’ technology was able to automatically identify the facial-muscle activations used in producing the sound, and use this information to animate a 3D model of the rapper’s face; the end result appeared to the viewer as a particularly raw, emotional delivery.
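Speech Graphics has never published the detail of that pipeline, but its broad shape – audio in, per-frame muscle activations out, the resulting curves applied to a character rig – can be sketched roughly as below. Everything in the sketch is an assumption for illustration: the librosa feature extraction, the muscle-channel names and the hand-written mapping are placeholders, not the company’s actual method.

```python
# Hypothetical sketch of an audio-driven facial-animation pipeline.
# None of this is Speech Graphics' actual code; the stages, channels
# and mappings are illustrative placeholders only.
import numpy as np
import librosa  # assumed here purely for basic audio analysis

FRAME_RATE = 30  # animation frames per second (assumption)

def extract_acoustic_features(wav_path: str) -> np.ndarray:
    """Load the voice track and compute simple per-frame features."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = sr // FRAME_RATE                                    # one analysis frame per animation frame
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]         # loudness proxy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop)   # rough phonetic content
    return np.vstack([rms, mfcc]).T                           # shape: (frames, features)

def features_to_muscle_activations(features: np.ndarray) -> dict:
    """Map acoustic features to per-frame activation curves for a few
    illustrative 'muscle' channels (a stand-in for a learned model)."""
    loudness = features[:, 0] / (features[:, 0].max() + 1e-8)
    return {
        "jaw_open":     loudness,                               # louder speech -> wider jaw
        "lip_pucker":   np.clip(features[:, 1] * -0.01, 0.0, 1.0),
        "cheek_raiser": 0.2 * loudness,
    }

def apply_to_rig(activations: dict, rig) -> None:
    """Write the activation curves onto a 3D character rig (placeholder API)."""
    for channel, curve in activations.items():
        for frame, value in enumerate(curve):
            rig.set_channel(channel, frame, float(value))
```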

Studious beginnings

Speech Graphics’ journey began at the University of Edinburgh, where co-founders Berger and Dr Gregor Hofer were both PhD students. The pair shared an interest in audio-driven animation, but came to the challenge from different angles: Berger’s academic career had begun with a linguistics degree, while Hofer’s background had been in psychology and computer science.

Together they came up with an idea that would save games and animation studios time and money, by making it possible to take the audio spoken by an actor and create a corresponding animation automatically, rather than having to painstakingly animate speech by hand. Hofer believes the key to their success is that, while developing their technology, they kept focus on their target industry’s needs.

“If you develop something on your own, and don’t get any feedback, you might not hit the right buttons,” Hofer says. “The main thing is to talk to industry as quickly as possible, even if they don’t buy from you straight away.”

Talking was fine in 2009, while Berger and Hofer were still students, but they needed a break to turn the technology into a viable business. That came in early 2010: “One day, we were contacted by a large games-development studio,” explains Hofer. “That was the key moment where we started to say, okay, this is something we can pursue.”

“The main thing is to talk to industry as quickly as possible”

By the end of the year they’d incorporated the company, headquartering it at their existing Edinburgh base. “We have a good network here, so we have valuable contacts to hire people,” says Hofer. The university also played its part, offering courses and putting the pair in touch with people who could help, including lawyers.

“The weather is terrible,” admits Berger. “But it’s a beautiful city and a lot of people want to work and live here, which helps when it comes to recruitment.”

Making money

Speech Graphics now employs ten people in Edinburgh, supported by a network of contractors around the country; much of the data is uploaded to the cloud for quality control by specialists around the globe.

As with most young companies, the path to success has had its hurdles. Speech Graphics started picking up contracts in 2011 and 2012, but the company’s first big project didn’t end quite as they’d hoped. “Unfortunately, the first game that we worked on was cancelled,” says Hofer. “We’d done a lot of work on it, but that was our welcome to the games industry – before this, we didn’t realise how much stuff gets cancelled. More games get cancelled than get released.” It wasn’t all bad news. The team was paid in full for its work, and the technology was nominated for a games industry award at the end of 2012.

Then, in 2013, the company landed a “very big” contract to provide the facial animation for Middle-earth: Shadow of Mordor (published by Warner Bros Games). Speech Graphics provided two hours of animation for the game, released in September 2014, so if you’re impressed by the realistic Uruk faces, you know who should get the credit.

Such contracts are key to the business, but improving the core technology remains crucial. “We’re doing the work of two companies,” says Berger. “We’re an animation company, producing animations based on our technology, and that’s our main source of revenue; but at the same time we’re still developing the technology that we’re using in that process.”

In the past few months, while the production side of the company has been focusing on the game, the development side of Speech Graphics has been working on improving its motion synthesiser – the component that translates audio data into facial movements.

“We’re an animation company… but at the same time we’re still developing the technology that we’re using in that process.”

“We’ve been making the movement we produce more organic and more realistic,” explains Berger. “We animate the whole face now. When you analyse the audio, you get a phonetic representation of the speech, but it also predicts non-verbal behaviour in the upper face, including the eyebrows, blinking and eye darts.”

How, I asked, do you deduce one from the other? “That’s one of our secrets,” replies Berger coyly, “but we extract features from the audio signal that tend to be correlated with certain facial expressions. For example, if you say something with a high pitch – when you get to the end of a sentence and your voice goes up, for instance – then your eyebrows will tend to go up with that pitch increase.” There are other cues, too: the greater the intensity of our speech, the greater the rate of blinking. Berger is confident that things will continue to improve. “I think we’re five years from photo-realistic facial animation, where you can’t tell the difference between the real and animated face – I’m talking about automated animation; not hand-touched animation.”
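Berger keeps the actual mapping under wraps, but the two correlations he does describe – pitch rises pulling the eyebrows up, and more intense speech producing more frequent blinks – can be illustrated with a naive sketch. The librosa calls, thresholds and scaling below are assumptions made purely for demonstration, not the company’s model.

```python
# Naive illustration of the two audio-to-face cues Berger describes:
# rising pitch -> raised eyebrows, higher speech intensity -> more blinking.
# Thresholds and scaling are arbitrary assumptions, not Speech Graphics' model.
import numpy as np
import librosa

def upper_face_cues(wav_path: str, fps: int = 30):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = sr // fps

    # Fundamental frequency (pitch) per frame; NaN where the frame is unvoiced.
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr, hop_length=hop)
    f0 = np.nan_to_num(f0, nan=0.0)

    # Eyebrow raise: follow pitch excursions above the speaker's median pitch.
    median_f0 = np.median(f0[f0 > 0]) if np.any(f0 > 0) else 1.0
    brow_raise = np.clip((f0 - median_f0) / median_f0, 0.0, 1.0)

    # Blink rate: scale with short-term intensity (RMS energy).
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    intensity = rms / (rms.max() + 1e-8)
    blink_rate_hz = 0.2 + 0.8 * intensity  # arbitrary range: 0.2-1.0 blinks per second

    return brow_raise, blink_rate_hz
```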

Into China

Perhaps surprisingly, Speech Graphics’ technology isn’t language-specific: since the analysis is based on bone and muscle structure, it works just as well for Mandarin Chinese as for Home Counties English. As it happens, another big project the company took on used the technology to help Chinese people improve their pronunciation of English.

“The Saundz project in 2012 and 2013 was for a company that was developing a website and an app to teach the Chinese market how to pronounce English without an accent,” says Berger. “They asked us to produce a large set of animations – a view of a woman saying words both from the front and also from inside the mouth and the vocal tract. We created an interior model of the vocal tract, with a lot of artist adaptation, and we drove that model with the same algorithm that we use to drive the external facial muscles.”

You can see the results at saundz.com or by downloading the company’s app. “It’s the most detailed animation to date of the human speech process,” claims Berger, who points out that the technology driving the animation is entirely proprietary to Speech Graphics.

On the horizon

So what’s next for Speech Graphics? Berger pauses. “There is another game project coming up in the autumn,” he says cagily. Names? “That will be revealed in the future,” says Hofer, with an apologetic laugh. “We partner with big companies, and we’re under NDAs [non-disclosure agreements] for a lot of things.”

The pair are more forthcoming when it comes to their own technology. One project the company is developing – together with a department of the Japanese government – is an interactive avatar, which responds when you speak to it.

“It’s already deployed in Japan with anime,” says Hofer, “but they want to bring it to Europe.” Speech Graphics has the ability to bring it to life through realistic facial animation, and has landed a deal to provide the characters and the speech-synthesis model to drive the face movements.

The first demonstration of the technology is due later this year at the University of Edinburgh, where visitors will be able to talk directly to an avatar. Hofer sees applications such as virtual shop assistants: “You might be in a shop, and you could ask: ‘Where can I find this particular perfume?’, or a toy, and the avatar could direct you, or show you on a map.”

“We’ve also been working on a similar technology for mobile devices,” says Berger. “But we’re still developing the application – I don’t want to say anything yet!”
