Fermi: Compute Capabilities
If you've followed the early announcements concerning Fermi, NVIDIA's next-generation GPU architecture, you already know that the new GPU core is both an evolution of the existing GT200 architecture and a significant new design in its own right.
The GF100 die, a.k.a. Fermi
While it carries many of the same features as the GT200 series, Fermi is distinctly its own animal. NVIDIA's Fermi whitepaper describes the new architecture as follows: "G80 was our initial vision of what a unified graphics and computing parallel processor should look like. GT200 extended the performance and functionality of G80. With Fermi, we have taken all we have learned from the two prior processors and all the applications that were written for them, and employed a completely new approach to design to create the world’s first computational GPU."
"Computational GPU" is shorthand for "a whole lot of number crunching". Where NVIDIA's G80 packed 128 cores and the GT200 raised the bar to 240, a full-scale Fermi implementation will pack 512 processor cores, ECC memory protection, and up to eight times the double-precision floating-point throughput of its predecessor. Peak number-crunching power has increased across the board. Fermi can execute 64-bit FP code at 50 percent of the speed of 32-bit FP code, compared with 12.5 percent in earlier product iterations.
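To see where a figure like "up to eight times" can come from, here is a rough back-of-the-envelope sketch that uses only the core counts and FP64-to-FP32 ratios quoted above. It assumes both chips run at the same clock speed and that the ratio applies uniformly to every core, neither of which is guaranteed for shipping parts; the function name `dp_ops_per_clock` is made up for illustration.

```python
def dp_ops_per_clock(cores: int, dp_ratio: float) -> float:
    """Relative double-precision operations per clock.

    Assumption: each core contributes one FP32 operation per clock,
    and FP64 runs at dp_ratio times that rate.
    """
    return cores * dp_ratio


gt200 = dp_ops_per_clock(cores=240, dp_ratio=0.125)  # 12.5 percent of FP32 rate
fermi = dp_ops_per_clock(cores=512, dp_ratio=0.5)    # 50 percent of FP32 rate
speedup = fermi / gt200

print(f"GT200: {gt200}")           # GT200: 30.0
print(f"Fermi: {fermi}")           # Fermi: 256.0
print(f"Ratio: {speedup:.2f}x")    # Ratio: 8.53x
```

At equal clocks, the quoted numbers work out to roughly 8.5 times the double-precision throughput per clock, which lines up with the "up to eight times" claim. Real-world gains depend on final clock speeds and on how many of the 512 cores a given product enables.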
Each of Fermi's 16 SMs (streaming multiprocessors) has access to 64K of configurable L1 cache, and the entire chip shares a 768K L2 cache. In aggregate, that's about 1.75MB of cache, significantly more than the GT200 architecture, which offered 16K of software-managed memory per SM.
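The aggregate cache figure is simple arithmetic on the numbers above. This small sketch spells it out, assuming a full 16-SM part and treating 1MB as 1,024K; the constant names are illustrative only.

```python
SM_COUNT = 16        # streaming multiprocessors on a full Fermi chip
L1_PER_SM_KB = 64    # configurable L1 cache per SM
L2_SHARED_KB = 768   # L2 cache shared by the whole chip

total_kb = SM_COUNT * L1_PER_SM_KB + L2_SHARED_KB
total_mb = total_kb / 1024

print(f"Total on-chip cache: {total_kb}K ({total_mb}MB)")
# Total on-chip cache: 1792K (1.75MB)
```

The per-SM caches account for 1,024K of the total, so most of Fermi's on-chip storage still sits close to the cores rather than in the shared L2.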