Researchers Develop SLIDE Algorithm For CPU AI Training That Outperforms GPUs

Multiple facets of technology are trending towards artificial intelligence these days, in applications both big and small. As that's been happening, graphics processing units (GPUs) have taken on the heavy lifting, though researchers at Rice University have cooked up a new machine learning scheme they say is more efficient when run on central processing units (CPUs).

It should come as no surprise that the computer scientists from Rice University are being supported by collaborators from Intel, which has a vested interest in anything that can tap into its CPUs. There's a lot of money to be made by leveraging AI. NVIDIA, for example, reported a 41 percent increase in its fourth quarter revenue, driven in part by its Tesla V100 Tensor Core GPUs.

What the computer scientists at Rice University created is called SLIDE, which stands for sub-linear deep learning engine. SLIDE is able to perform its machine learning magic using general purpose CPUs without any specialized graphics hardware. Not only that, but the researchers claim SLIDE performs better than GPU methods.

"Our tests show that SLIDE is the first smart algorithmic implementation of deep learning on CPU that can outperform GPU hardware acceleration on industry-scale recommendation datasets with large fully connected architectures," said Anshumali Shrivastava, an assistant professor in Rice’s Brown School of Engineering who invented SLIDE with graduate students Beidi Chen and Tharun Medini.

The researchers explain that the standard back-propagation training technique employed today for deep neural networks requires matrix multiplication, which is ideally suited for GPUs. SLIDE, however, turns neural network training into a search problem that can be solved with hash tables.
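To give a rough feel for what "a search problem that can be solved with hash tables" might look like, here is a minimal Python sketch. It is not SLIDE's code. It assumes a locality-sensitive hash (SimHash, i.e. signed random projections, chosen here purely for illustration) and uses hypothetical names such as HashedLayer and candidate_neurons. The idea it shows: index each neuron's weight vector in a few hash tables once, then for each input look up only the neurons that land in the same buckets and compute just those, instead of multiplying the input against the full weight matrix.

```python
import random
from collections import defaultdict


def make_planes(num_bits, dim, rng):
    """Draw num_bits random hyperplanes (illustrative SimHash setup)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simhash(vec, planes):
    """Pack one sign bit per hyperplane into an integer bucket key."""
    key = 0
    for plane in planes:
        key = (key << 1) | (1 if dot(plane, vec) >= 0 else 0)
    return key


class HashedLayer:
    """Hypothetical layer that picks neurons by hash lookup, not a full matmul."""

    def __init__(self, weights, num_tables=4, num_bits=6, seed=0):
        rng = random.Random(seed)
        self.weights = weights  # one weight vector per neuron
        dim = len(weights[0])
        self.tables = []
        for _ in range(num_tables):
            planes = make_planes(num_bits, dim, rng)
            buckets = defaultdict(list)
            for neuron_id, w in enumerate(weights):
                buckets[simhash(w, planes)].append(neuron_id)
            self.tables.append((planes, buckets))

    def candidate_neurons(self, x):
        """Union of neurons sharing a bucket with x in any table."""
        active = set()
        for planes, buckets in self.tables:
            active.update(buckets.get(simhash(x, planes), []))
        return active

    def sparse_forward(self, x):
        """Compute pre-activations only for the retrieved neurons."""
        return {i: dot(self.weights[i], x) for i in sorted(self.candidate_neurons(x))}


# Small demo: 1,000 neurons over 32-dimensional inputs.
demo_rng = random.Random(42)
demo_weights = [[demo_rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(1000)]
layer = HashedLayer(demo_weights)
x = [demo_rng.gauss(0.0, 1.0) for _ in range(32)]
outputs = layer.sparse_forward(x)
print(f"computed {len(outputs)} of {len(demo_weights)} neurons")
```

In this toy version, neurons whose weights point in a similar direction to the input are more likely to share a bucket with it, so the lookup tends to return the neurons that would have had large activations anyway. How many tables and bits to use, and how the tables are kept current as weights change during training, are design choices this sketch does not address.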

If this method works as advertised, it has the potential to save companies tons of money. According to Shrivastava, cutting edge GPU platforms employed by the likes of Amazon and Google can have eight Tesla V100 accelerators and cost $100,000.

"We have one in the lab, and in our test case we took a workload that’s perfect for V100, one with more than 100 million parameters in large, fully connected networks that fit in GPU memory. We trained it with the best (software) package out there, Google’s TensorFlow, and it took 3 1/2 hours to train. We then showed that our new algorithm can do the training in one hour, not on GPUs but on a 44-core Xeon-class CPU," Shrivastava added.

It's not clear if Shrivastava is talking about a pair of 22-core Xeon CPUs, or meant to reference threads rather than cores. Either way, the hurdle for SLIDE is that it leans much more heavily on memory than a GPU-based approach does. This method is prone to a problem called cache thrashing, where there are a lot of cache misses. Shrivastava says his research group ran into significant cache thrashing with the first set of SLIDE experiments, but that training times were still comparable to or faster than GPU training times.

This is where Intel stepped in. Intel said it could help with the problem to make SLIDE train even faster, and as a result, subsequent tests improved by around 50 percent.

"We’ve just scratched the surface," Shrivastava said. "There’s a lot we can still do to optimize. We have not used vectorization, for example, or built-in accelerators in the CPU, like Intel Deep Learning Boost. There are a lot of other tricks we could still use to make this even faster."

It will be interesting to see how this plays out, and how NVIDIA responds.