Microsoft Shares Inside Look At NVIDIA's Supercomputing AI Tech Powering ChatGPT

When Microsoft and OpenAI partnered in 2019, the two companies set out to build supercomputing resources designed and dedicated to training OpenAI's expanding suite of increasingly powerful AI models. OpenAI needed an unprecedented cloud-computing infrastructure, larger than anyone in the industry had attempted to build before. That infrastructure has evolved over time, with Microsoft announcing on March 13, 2023, new, powerful, and scalable virtual machines that integrate the latest NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking to continue tackling this monumental task.

"Co-designing supercomputers with Azure has been crucial for scaling our demanding AI training needs, making our research and alignment work on systems like ChatGPT possible," stated Greg Brockman, President and Co-Founder of OpenAI.

The ND H100 v5 Virtual Machine (VM) enables on-demand scaling in sizes ranging from eight to thousands of NVIDIA H100 GPUs interconnected by NVIDIA Quantum-2 InfiniBand networking. Microsoft says customers will see "significantly faster performance" for AI models compared to the last generation.
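For readers curious what requesting one of these VMs looks like in practice, here is a minimal sketch using the Azure CLI. The resource-group and VM names are hypothetical, and the exact SKU string (`Standard_ND96isr_H100_v5`) and image alias are assumptions based on Azure's naming conventions; check `az vm list-sizes --location <region>` for the sizes actually available to your subscription.

```shell
# Hypothetical resource names; SKU and image alias are assumed, not confirmed.
az group create --name my-ai-rg --location eastus

# Each ND H100 v5 VM exposes eight H100 GPUs; multi-node training jobs
# scale out across VMs over NVIDIA Quantum-2 InfiniBand.
az vm create \
  --resource-group my-ai-rg \
  --name nd-h100-node \
  --size Standard_ND96isr_H100_v5 \
  --image Ubuntu2204 \
  --admin-username azureuser \
  --generate-ssh-keys
```

Scaling beyond a single VM is typically done through a VM scale set or an orchestrator such as Azure CycleCloud rather than by creating nodes one at a time.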

"NVIDIA and Microsoft Azure have collaborated through multiple generations of products to bring leading AI innovations to enterprises around the world. The NDv5 H100 virtual machines will help power a new era of generative AI applications and services," remarked Ian Buck, Vice President of hyperscale and high-performance computing at NVIDIA.

Microsoft Azure data center in Washington state.

Nidhi Chappell, Microsoft Head of Product for Azure High-Performance Computing, says the key breakthrough came from learning how to build, operate, and maintain tens of thousands of co-located GPUs connected to one another on a high-throughput, low-latency InfiniBand network. Chappell explained that this was at a larger scale than even the suppliers of the GPUs and network equipment had ever tested. In short, she said, "It was uncharted territory. Nobody knew for sure if the hardware could be pushed that far without breaking."

Chappell added that a great deal of system-level optimization is needed to get the best performance, including software that makes effective use of the GPUs and networking equipment. The Azure infrastructure is currently optimized for large language model training and is available through Azure AI supercomputing capabilities in the cloud.

“This is the most extraordinary moment we have witnessed in the history of AI,” Jensen Huang remarked in a recent NVIDIA blog post. “New AI technologies and rapidly spreading adoption are transforming science and industry, and opening new frontiers for thousands of new companies. This will be our most important GTC yet.”

Microsoft touts that Azure alone offers the GPUs, the InfiniBand networking, and the unique AI infrastructure needed to build such transformational AI models at scale.