Microsoft’s Voice Recognition AI Now Outperforms Humans In Speech Transcription

“Anything you can do I can do better; I can do anything better than you.” That is likely Microsoft’s mantra, as its research wizards have reached a milestone in speech recognition, with a word error rate (WER) of just 5.9 percent. That figure itself is down from last month, when Microsoft’s speech recognition system stood at 6.3 percent WER.
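Word error rate is the standard yardstick here: the number of word-level substitutions, deletions, and insertions the system makes, divided by the number of words in the reference transcript. As an illustrative sketch (not Microsoft's code), it can be computed with a word-level edit-distance table:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions)
    divided by the reference length, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One missed word out of six: WER of about 16.7 percent
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A 5.9 percent WER means roughly one error in every 17 words, which is the rate reported for professional human transcriptionists on the same benchmark.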

“We’ve reached human parity,” said Xuedong Huang, Microsoft’s chief speech scientist. “This is an historic achievement.” In fact, things are a little better than that; Microsoft says its speech recognition system actually “makes the same or fewer errors” than professional transcriptionists.

“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” added Harry Shum, executive vice president for Microsoft Artificial Intelligence and Research group.

Microsoft's team from the Speech & Dialog group

Microsoft’s impressive advances in speech recognition are made possible by its Computational Network Toolkit (CNTK), which runs deep learning algorithms on high-performance GPU accelerators. The system also uses neural language models that represent words as continuous vectors in space.
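The idea behind those continuous word vectors is that related words land close together in the vector space, so the language model can generalize across similar phrasings. A toy sketch with hand-made, three-dimensional vectors (purely illustrative; real models learn hundreds of dimensions from data):

```python
import math

def cosine(u, v):
    """Cosine similarity: higher means the two vectors point in
    more similar directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings for illustration only
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# Semantically related words score higher than unrelated ones
print(cosine(vectors["king"], vectors["queen"]))  # close to 1.0
print(cosine(vectors["king"], vectors["apple"]))  # much lower
```

This geometric view is what lets a model trained on “recognize speech” also assign sensible probabilities to phrases it has seen less often.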

So what does all this mean in the real world? What are the practical applications for this powerful speech recognition system? Microsoft envisions it being used in devices like your Xbox One, or to help make Cortana even more intelligent in Windows 10. And of course, instant speech-to-text transcription with greater accuracy will do wonders for products like Skype Translator.

Right now, Microsoft’s speech recognition system is optimized for lab environments with little background noise. However, Microsoft believes that in the future its system can handle much tougher environments, like highway driving (road/wind noise) or a crowded restaurant. The system will also be tasked with adapting to the variety of voices and accents it might encounter along the way.

Increased accuracy in speech transcription is definitely a nice perk, but the endgame for Microsoft is to actually understand the words coming out of someone’s mouth. “The next frontier is to move from recognition to understanding,” said Geoffrey Zweig, who oversees the Speech & Dialog research group. But as Shum explains, “It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown.”

Brandon Hill

Brandon received his first PC, an IBM Aptiva 310, in 1994 and hasn’t looked back since. He cut his teeth on computer building/repair working at a mom-and-pop computer shop as a plucky teen in the mid ’90s and went on to join AnandTech as the Senior News Editor in 1999. Brandon would later help to form DailyTech, where he served as Editor-in-Chief from 2008 until 2014. Brandon is a tech geek at heart, and family members always know where to turn when they need free tech support. When he isn’t writing about tech hardware or studying up on the latest in mobile gadgets, you’ll find him browsing forums that cater to his long-running passion: automobiles.

Opinions and content posted by HotHardware contributors are their own.