About a year ago, an AI startup called Recogni announced a patented number system for AI math, known as Pareto. Pareto is a logarithmic system, meaning that it stores the logarithm of each number rather than the value itself. This can be highly advantageous for AI, because it drastically simplifies the math you need to do to multiply two values; instead of A×B, it becomes log(A)+log(B). For computers, addition is much simpler than multiplication, and requires a lot less silicon (and thus less power) to perform.
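
To make the log trick concrete, here is a minimal Python sketch of a logarithmic number representation. It is purely illustrative and is not Recogni's Pareto format: the helper names (`lns_encode`, `lns_decode`, `lns_mul`) and the choice of a base-2 logarithm with a separate sign are assumptions made for this example.

```python
import math

# Hypothetical helpers for illustration only; not the Pareto format.

def lns_encode(x):
    """Store a nonzero real number as (sign, log2 of its magnitude)."""
    if x == 0:
        # Zero has no finite logarithm, so real formats need a special case.
        raise ValueError("zero needs special handling in a log format")
    return (1 if x > 0 else -1, math.log2(abs(x)))

def lns_decode(v):
    """Convert (sign, log2 magnitude) back to an ordinary number."""
    sign, exponent = v
    return sign * 2.0 ** exponent

def lns_mul(a, b):
    """Multiply two encoded values: signs combine, logarithms simply add."""
    return (a[0] * b[0], a[1] + b[1])

product = lns_mul(lns_encode(3.0), lns_encode(-4.0))
print(lns_decode(product))  # approximately -12.0
```

Note that the multiply step itself contains no multiplication of magnitudes, only an addition of the stored logarithms, which is the property the paragraph above describes.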

We didn't cover the original Pareto announcement because it was just that, an initial technology announcement. But now there's much bigger news: Recogni is rebranding itself as "Tensordyne", and Tensordyne is coming for the big boys like NVIDIA with an upcoming processor that promises to revolutionize AI computing—at least, according to Tensordyne.





By using Logarithmic Number Systems (LNS), you can slash both the energy cost and silicon area cost of AI inference computations. How much? Tensordyne says that its upcoming chips can beat an NVIDIA GB200 NVL72 rack in power efficiency by a factor of 8. In other words, it would draw one-eighth the power to generate the same number of AI tokens, specifically with the Llama3.3 70B open-source LLM.

Because performant log-to-linear conversion is necessarily approximation-based, does that mean the accuracy of Tensordyne's method suffers? Not at all, according to the company's numbers. In fact, the error rate on most models is under 1%, and the company also says that LNS offers the smallest "maximum error", meaning that its maximum potential quantization error for any given value is lower than that of INT and FP formats.
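
To see why log-to-linear conversion involves approximation at all, consider a sketch of one classic textbook shortcut, often attributed to Mitchell: treat 2^f as roughly 1 + f for a fractional part f between 0 and 1. This is not Tensordyne's method, which the company has not detailed here, and the function names below are hypothetical. The crude version shown is far less accurate than the figures the company quotes; it only illustrates the kind of trade-off involved.

```python
import math

def approx_pow2(x):
    """Approximate 2**x using the linear shortcut 2**f ~= 1 + f.

    Illustrative assumption only; not Tensordyne's conversion scheme.
    """
    i = math.floor(x)   # integer part: a cheap bit shift in hardware
    f = x - i           # fractional part, in [0, 1)
    return (1.0 + f) * 2.0 ** i

def max_relative_error(samples=10000):
    """Sweep the fractional range and report the worst relative error."""
    worst = 0.0
    for k in range(samples):
        x = k / samples
        exact = 2.0 ** x
        worst = max(worst, abs(approx_pow2(x) - exact) / exact)
    return worst

print(approx_pow2(3.0))       # exact at whole-number exponents
print(max_relative_error())   # worst case is roughly 6 percent
```

Practical designs narrow that gap with lookup tables or higher-order corrections, at some cost in silicon.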



The upshot of the low error rate is that Tensordyne's method produces functionally identical results for text-to-text and text-to-image models, and in some cases, it can actually produce better results for text-to-video. During a briefing, Tensordyne showed two AI video samples, one of which was generated using FP16, and the other using LNS. The LNS video was both more stable and less prone to artifacts, which the company says is a common result thanks to LNS being better able to represent the extremely wide dynamic range of generative video AI.

The biggest difference is in power efficiency, though. As Tensordyne points out, the demand for AI inference seems insatiable, yet power grids built in the middle of the previous century were simply not designed to scale to the amount of power that AI computation will soon require. The company says that the relative energy cost of its 16-bit add operations is 1/22, or less than 5%, of the energy required for an FP16 multiply. That doesn't translate to 20 times higher efficiency, since multiplies are only part of a chip's total energy budget, but as we noted earlier, the company does claim that it can beat NVIDIA by a factor of eight.
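
As a back-of-the-envelope illustration of why a 22x cheaper operation does not yield a 22x more efficient chip, the sketch below applies Amdahl-style reasoning. The fractions passed in are hypothetical inputs chosen for the example, not figures from Tensordyne, and the function name is an assumption.

```python
def overall_gain(multiply_fraction, op_saving=22.0):
    """Overall energy-efficiency gain when only part of the budget shrinks.

    multiply_fraction: hypothetical share of total energy spent on multiplies.
    op_saving: how many times cheaper the replacement operation is
               (22 matches the 1/22 ratio quoted above).
    """
    remaining = (1.0 - multiply_fraction) + multiply_fraction / op_saving
    return 1.0 / remaining

# Hypothetical shares of energy spent on multiplies, for illustration only.
for share in (0.5, 0.8, 0.95):
    print(share, round(overall_gain(share), 2))
```

Under this simple model, the gain approaches 22x only when nearly all of the energy goes to multiplies; memory access, data movement, and other logic cap the benefit well below that.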





When you hear about a revolutionary product like this, one of the key concerns surrounding it is software support. Tensordyne claims it's got that on lockdown already. The company has an SDK that supports popular AI frontends such as Torch and Triton, and can automatically convert models to the Tensordyne LNS format. It also exposes the intermediate representation to Python, so developers who choose to can code directly to it, much as they can with CUDA. The company says that its SDK has mixed-precision support and dynamic on-the-fly quantization, and that the process requires no additional training or calibration.

So, does Tensordyne actually have chips to show off? Not yet. All of the numbers the company has presented about performance and efficiency are based on simulations. However, according to Tensordyne CEO Marc Bolitho, tape-out of the company's first chips is "imminent", and it expects to launch its first hardware product around the middle of next year, ideally. What is that product going to be? See for yourself:







An AI inference processor, built on TSMC's N3 process, including 256MB of SRAM, 144GB of HBM3e memory, and a proprietary interconnect with 460GB/second of unidirectional bandwidth as well as the ability to connect up to 144 GPUs for tensor parallelism. Specifications like clock rates and core counts aren't available right now; Tensordyne is keeping details like that close to its chest. Still, given the performance numbers the company has quoted, its chips should be super speedy, if they meet their design targets.

Indeed, Tensordyne goes so far as to claim that its hardware is "the only path to profitable AI at scale." The above chart compares the tokens per second per user against the cost per 1 million tokens for a variety of providers, both cloud-based and hardware, and according to Tensordyne's own numbers, it beats essentially everyone on cost while providing performance at least on par with the absolute best of the best in the industry.

