OpenAI And NVIDIA Collaborate On gpt-oss Open Source Reasoning Model And It Runs On GeForce

A curious detail about OpenAI's popular AI models, like the GPT-4o model that powers ChatGPT, is that despite the company's name, they are overwhelmingly not open source. OpenAI has now released two open-weight large language models, however, known as gpt-oss-20b and gpt-oss-120b, offering developers direct access to high-performance reasoning AI that can run on everything from cloud infrastructure to consumer-grade GeForce RTX graphics cards.

Built on a mixture-of-experts architecture and trained on NVIDIA H100 GPUs, these models are designed for complex, multi-step reasoning tasks such as code generation, document analysis, and tool use, including web search when that function is enabled.
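To give a feel for what "mixture-of-experts" means in practice, here's a minimal sketch in Python of top-k expert routing, the mechanism that lets a model this large activate only a fraction of its parameters per token. The dimensions and names below are purely illustrative, not the actual gpt-oss configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token, experts, router_weights, top_k=4):
    """Route one token through only the top_k highest-scoring experts.

    token:          (d,) hidden state for a single token
    experts:        list of (d, d) weight matrices, one per expert
    router_weights: (num_experts, d) router projection
    """
    scores = router_weights @ token      # one score per expert
    top = np.argsort(scores)[-top_k:]    # indices of the top_k experts
    gates = softmax(scores[top])         # normalize the selected scores
    # Only top_k experts actually run; the rest stay idle for this token.
    return sum(g * (experts[i] @ token) for g, i in zip(gates, top))

# Toy configuration: 8 experts, 16-dim hidden state, 4 active per token.
rng = np.random.default_rng(0)
d, num_experts = 16, 8
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]
router = rng.normal(size=(num_experts, d))
out = moe_layer(rng.normal(size=d), experts, router)
print(out.shape)  # (16,)
```

This is why the parameter counts can be misleading: gpt-oss-120b reportedly carries roughly 117 billion total parameters but activates only around 5 billion per token, which is what keeps inference tractable.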

[Chart: gpt-oss AIME benchmark results]
"Chain of thought" models can spend more tokens on a single query to improve responses.

The announcement is part of a broader push by both OpenAI and NVIDIA to make advanced AI more accessible to developers, researchers, and enthusiasts. It also underscores NVIDIA's ongoing strategy of tightly integrating its hardware and software ecosystem into the rapidly evolving open-source AI landscape. The company worked with OpenAI to optimize the new models for everything from multi-rack datacenter deployments to local inference on high-end PCs.

At cloud scale, NVIDIA reports that its Blackwell GB200 NVL72 system can push inference performance to 1.5 million tokens per second with the gpt-oss-120b model, a figure aimed squarely at organizations deploying large-scale AI services. Blackwell's NVFP4 4-bit precision isn't used here, but the MXFP4 format that the models do use helps keep power and memory use in check while still supporting trillion-parameter workloads in real time.
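Some back-of-the-envelope arithmetic shows why a 4-bit format matters. MXFP4 stores weights as 4-bit values in blocks of 32 that share an 8-bit scale, which works out to about 4.25 bits per parameter. The sketch below compares that footprint against FP16 for a 120-billion-parameter model; treat these as rough estimates, not official figures:

```python
# Rough weight-memory estimate for a 120B-parameter model.
# MXFP4: 4-bit elements in blocks of 32 sharing an 8-bit scale,
# i.e. (32 * 4 + 8) / 32 = 4.25 bits per parameter on average.
PARAMS = 120e9

def gigabytes(bits_per_param):
    return PARAMS * bits_per_param / 8 / 1e9

print(f"FP16 : {gigabytes(16):7.1f} GB")    # ~240 GB
print(f"FP8  : {gigabytes(8):7.1f} GB")     # ~120 GB
print(f"MXFP4: {gigabytes(4.25):7.1f} GB")  # ~ 64 GB
```

That roughly 4x reduction versus FP16 is what makes the 120B model plausible on a single high-memory datacenter GPU, and the 20B model feasible on a 16GB consumer card.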

Perhaps the most noteworthy part of this release is what it means for local inference. Developers can now run the very same models on GeForce RTX and RTX PRO GPUs, with performance purportedly scaling up to 256 tokens per second on the GeForce RTX 5090. That's fast enough to support snappy interactions in local chat UIs, and the models' support for 131,072-token (2^17) context windows opens the door to deep, document-level reasoning, something typically reserved for server-grade systems.
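To put those numbers in perspective, a quick calculation shows what that throughput and context window mean for interactive use; the reply length and words-per-token ratio below are illustrative assumptions:

```python
# What local throughput means in practice (illustrative numbers).
TOKENS_PER_SEC = 256      # reported peak on a GeForce RTX 5090
CONTEXT_WINDOW = 2 ** 17  # 131,072 tokens

reply = 500  # a typical chat response, in tokens
print(f"{reply}-token reply: ~{reply / TOKENS_PER_SEC:.1f} s")  # ~2.0 s

# Roughly 3/4 of an English word per token is a common approximation,
# so the full window holds on the order of a 100,000-word document.
print(f"Window holds ~{int(CONTEXT_WINDOW * 0.75):,} words")
```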


Fortunately, setup is also more streamlined than it used to be. The Ollama app now includes official support for the gpt-oss models, allowing users to load, chat with, and tinker with them right on their own systems. File attachments and context customizations are built in; Ollama's multimodal support, however, does not apply to these text-only models. For developers, there's also CLI and SDK access, plus support across other frameworks like llama.cpp and Microsoft AI Foundry Local.
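As a concrete starting point, the snippet below shows one way to query a locally served model through Ollama's Python library after pulling it from the command line with `ollama pull gpt-oss:20b`. The model tag and prompt here are assumptions for illustration; check Ollama's model library for the exact names:

```python
# Minimal sketch: chat with a local gpt-oss model via the `ollama`
# Python package (pip install ollama), assuming the Ollama app is
# running and the model was pulled with `ollama pull gpt-oss:20b`.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",  # assumed model tag from Ollama's library
    messages=[
        {
            "role": "user",
            "content": "In one sentence, why do mixture-of-experts "
                       "models activate only a few experts per token?",
        },
    ],
)
# Newer versions of the library also allow response.message.content.
print(response["message"]["content"])
```

Ollama also exposes an OpenAI-compatible HTTP endpoint on localhost, so existing client code can often be repointed at a local model with little more than a base-URL change.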

It's a notable shift: powerful reasoning models are no longer something you access only through an API. With the right hardware and a bit of setup, they can now run locally and still be fast enough to be useful. To get started with Ollama and try these models on your own RTX GPU with 16GB or more of VRAM, you can follow the instructions on NVIDIA's official blog, and you can try gpt-oss on NVIDIA's platform as well.