Mark Zuckerberg and the gang over at Meta, the parent company of Facebook, have a new and powerful toy to play with, courtesy of a collaboration with NVIDIA. Meta designed, and NVIDIA helped build, what is called the AI Research SuperCluster (RSC), a massive supercomputer that, once fully deployed, will rank as the largest customer installation of NVIDIA DGX A100 systems to date. By extension, it is also expected to be the fastest AI supercomputer in the galaxy.
While there is more work to be done, RSC is already up and running and training new AI models. As currently deployed, it packs 760
DGX A100 systems as its compute nodes, which house a combined 6,080 Ampere-based A100 Tensor Core GPUs. These are linked on an NVIDIA Quantum 200Gb/s InfiniBand network to deliver a staggering 1,895 petaflops of TF32 performance, NVIDIA says.
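Those figures line up on the back of an envelope. The sketch below assumes 8 A100 GPUs per DGX A100 node and NVIDIA's published peak of roughly 312 TFLOPS of TF32 throughput per A100 (with structured sparsity); neither per-unit number appears in the article itself.

```python
# Rough sanity check of the quoted RSC compute figures.
# Assumptions (not stated in the article): 8 GPUs per DGX A100 node,
# ~312 TFLOPS TF32 per A100 at peak with structured sparsity.
nodes = 760
gpus_per_node = 8
tf32_tflops_per_gpu = 312

total_gpus = nodes * gpus_per_node
total_petaflops = total_gpus * tf32_tflops_per_gpu / 1_000

print(total_gpus)               # 6080 GPUs, matching the article
print(round(total_petaflops))   # ~1897 PF, in line with the quoted 1,895
```

The small gap between ~1,897 and the quoted 1,895 petaflops comes down to rounding in the per-GPU spec.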
This is paired with a storage tier comprising 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade. When RSC is fully built out, the InfiniBand fabric will connect a mind-boggling 16,000 GPUs as endpoints. According to Meta, the caching and storage system can serve 16TB/s of training data and will scale up to 1 exabyte.
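For a sense of scale, the storage tiers quoted above can be tallied against the 1-exabyte target; the unit conversion (1 exabyte = 1,000 petabytes) is an assumption of this sketch, not a figure from the article.

```python
# Tally of RSC's storage tiers as quoted by Meta.
# Unit assumption: 1 exabyte = 1,000 petabytes (decimal units).
flasharray_pb = 175   # Pure Storage FlashArray (bulk storage)
altus_cache_pb = 46   # Penguin Computing Altus cache tier
flashblade_pb = 10    # Pure Storage FlashBlade

total_pb = flasharray_pb + altus_cache_pb + flashblade_pb
target_pb = 1_000     # the 1-exabyte goal

print(f"Current capacity: {total_pb} PB")   # 231 PB today
print(f"Target capacity:  {target_pb} PB")  # roughly 4x headroom to grow
```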
What will all that power be used for?
"We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they could seamlessly collaborate on a research project or play an AR game together," Meta said in a blog post.
A little more specifically, Meta envisions using RSC to build new and better AI models trained on trillions of examples, make it easier to work across hundreds of different languages, develop new augmented reality tools (Meta is hyper-focused on the
metaverse, after all), and seamlessly analyze text, images, and video, among other tasks.
With the help of RSC, Meta says it can create entirely new AI systems, such as real-time voice translation for large groups of people in which each person speaks a different language. That would be a boon for research teams collaborating around the world, but make no mistake, this is a big play toward the metaverse.
"Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform—the
metaverse, where AI-driven applications and products will play an important role," Meta added.
The jump from
6,080 to 16,000 GPUs will increase AI training performance by more than 2.5x, Meta says. The company expects this to benefit existing services through more accurate AI models, as well as enable completely new experiences.
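The raw GPU count alone supports that claim: going from 6,080 to 16,000 endpoints is a factor of about 2.6, assuming (as this sketch does) that training throughput scales roughly linearly with GPU count.

```python
# Ratio behind Meta's "more than 2.5x" training-performance claim,
# assuming near-linear scaling of training throughput with GPU count.
current_gpus = 6_080
planned_gpus = 16_000

scale_factor = planned_gpus / current_gpus
print(f"{scale_factor:.2f}x")  # 2.63x, consistent with "more than 2.5x"
```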