Microsoft’s Voice Recognition AI Now Outperforms Humans In Speech Transcription

“Anything you can do I can do better; I can do anything better than you.” That is likely Microsoft’s mantra, as its research wizards have reached a milestone in speech recognition, with a word error rate (WER) of just 5.9 percent. That figure itself is down from last month, when Microsoft’s speech recognition system stood at 6.3 percent WER.
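
For readers wondering what a word error rate actually measures, here is a minimal sketch in Python. It assumes the common textbook definition: the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The function name `word_error_rate` and the sample sentences are illustrative only and are not taken from Microsoft's evaluation setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.1%}")
```

By this definition, a 5.9 percent WER means roughly six word-level errors for every 100 words in the reference transcript.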

“We’ve reached human parity,” said Xuedong Huang, Microsoft’s chief speech scientist. “This is an historic achievement.” But things are a little bit better than that; Microsoft says that its speech recognition system actually “makes the same or fewer errors” than professional transcriptionists.

“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” added Harry Shum, executive vice president for Microsoft Artificial Intelligence and Research group.

Microsoft's team from the Speech & Dialog group

Microsoft’s impressive advances in speech recognition are made possible by its Computational Network Toolkit (CNTK), which runs deep learning algorithms on high-performance GPU accelerators. The system also uses neural language models that represent words as continuous vectors in space.
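
To give a rough feel for what "words as continuous vectors" means, here is a toy sketch in Python. The three-dimensional vectors below are invented purely for illustration and have nothing to do with Microsoft's actual models, which learn far larger vectors from data. The idea is simply that words used in similar ways end up close together, which can be measured with cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-picked vectors (not learned from any data).
toy_vectors = {
    "fast": [0.9, 0.1, 0.3],
    "quick": [0.8, 0.2, 0.35],
    "banana": [0.1, 0.9, 0.0],
}

similar = cosine_similarity(toy_vectors["fast"], toy_vectors["quick"])
unrelated = cosine_similarity(toy_vectors["fast"], toy_vectors["banana"])
print(f"fast vs quick:  {similar:.2f}")
print(f"fast vs banana: {unrelated:.2f}")
```

In this made-up example, "fast" and "quick" point in nearly the same direction, while "fast" and "banana" do not.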

So what does all this mean in the real world? What are the practical applications for this powerful speech recognition system? Microsoft envisions it being used in devices like your Xbox One or to help make Cortana even more intelligent in Windows 10. And of course, instant speech-to-text transcription with greater accuracy will do wonders for products like the Skype Translator.

Right now, Microsoft’s speech recognition system is optimized to work in lab environments where there is little background noise. However, in the future, Microsoft believes that its system can deal with much tougher environments, like highway driving (with road and wind noise) or a crowded restaurant. The system will also be tasked with adapting to a variety of voices and accents that it might encounter along the way.

Increased accuracy when it comes to speech transcription is definitely a nice perk, but the end game for Microsoft is to actually be able to understand the words that are coming out of someone’s mouth. “The next frontier is to move from recognition to understanding,” said Geoffrey Zweig, who oversees the Speech & Dialog research group. But as Shum explains, “It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown.”