NEWS A little-known startup threatens to humiliate Nvidia chips on two fronts at once: speed and energy

pinkman

Tensordyne has challenged the giant by replacing multiplication with addition, and so far no one can check whether its claims hold up.
California startup Tensordyne has sent the design of its first AI chip, Napier, to production and is already comparing the future system with Nvidia accelerators. According to the company, a node of 72 Napier chips will run large language models four times faster than a system of 72 Nvidia GB300 chips while using five times less energy. These figures cannot be verified yet: customers do not have finished hardware, and the first real systems will appear later.

Tensordyne's main bet is not on yet another increase in the number of compute units but on a different way of doing the mathematics of neural networks. Modern models are built on matrix multiplications. Chips constantly multiply large arrays of numbers, and these operations consume a lot of energy and take up noticeable area on the die. Napier tries to replace part of this load with cheaper operations.

The idea rests on a simple mathematical property: the logarithm of a product equals the sum of the logarithms. If numbers are converted to logarithmic form, multiplication can be replaced by addition. For a chip this matters: adders are simpler, smaller and more power-efficient than multipliers. Tensordyne argues that this is what lets Napier fit more compute into a smaller area and cut energy consumption.
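To make the property concrete, here is a minimal Python sketch of multiplying two numbers by adding their logarithms. It only illustrates the general math; the function names and the sign-plus-log2 encoding are assumptions made for this example and say nothing about how Napier actually stores numbers.

```python
import math

def log_encode(x: float) -> tuple[int, float]:
    """Represent a nonzero number as (sign, log2 of its magnitude).

    Illustrative encoding only, not Tensordyne's format.
    """
    if x == 0:
        raise ValueError("zero has no finite logarithm; a real design needs a special case")
    return (1 if x > 0 else -1, math.log2(abs(x)))

def log_multiply(a: tuple[int, float], b: tuple[int, float]) -> tuple[int, float]:
    """Multiply two encoded numbers: the signs multiply, the logarithms add."""
    return (a[0] * b[0], a[1] + b[1])

def log_decode(v: tuple[int, float]) -> float:
    """Convert back to an ordinary number."""
    sign, exponent = v
    return sign * 2.0 ** exponent

# 6.0 * -3.5 computed with one addition in the log domain
product = log_decode(log_multiply(log_encode(6.0), log_encode(-3.5)))
```

The result matches 6.0 * -3.5 up to floating-point rounding. Note that the only arithmetic inside `log_multiply` on the magnitudes is a single addition; the cost has moved into the encode and decode steps.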

The principle itself has long been known, but a practical problem used to get in the way. Neural networks usually work with numbers in floating-point formats, and converting to the logarithmic representation and back cost too much time, energy and accuracy. Tensordyne says it has found a way to perform these conversions quickly and accurately directly in silicon. The company has not yet published a detailed technical description, so the key part of the claim remains closed to outside verification.
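The conversion problem can be seen in a small Python sketch of a dot product done in the log domain. Each multiplication becomes an addition, but summing the products requires going back to ordinary numbers, and storing the logarithm with only a few fractional bits costs accuracy. The rounding scheme and the `frac_bits` parameter are assumptions chosen for illustration; Tensordyne has not disclosed its method.

```python
import math

def quantize_log2(x: float, frac_bits: int) -> float:
    """Round log2(x) to a fixed number of fractional bits (x must be positive)."""
    scale = 1 << frac_bits
    return round(math.log2(x) * scale) / scale

def log_domain_dot(xs: list[float], ws: list[float], frac_bits: int) -> float:
    """Dot product where each multiply is an add of quantized logarithms."""
    total = 0.0
    for x, w in zip(xs, ws):
        log_product = quantize_log2(x, frac_bits) + quantize_log2(w, frac_bits)
        # Accumulation needs the linear value again: this is the costly conversion.
        total += 2.0 ** log_product
    return total

xs = [1.5, 2.25, 3.0, 0.75]
ws = [0.5, 1.25, 2.0, 4.0]
exact = sum(x * w for x, w in zip(xs, ws))

# Relative error of the result for a coarse and a finer log representation
errors = {bits: abs(log_domain_dot(xs, ws, bits) - exact) / exact for bits in (2, 8)}
```

For these inputs the relative error is a few percent with 2 fractional bits and below 0.3% with 8. The sketch uses positive values only; signs and zeros need extra handling in any real design.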

Number formats have long shaped the development of AI accelerators. The shorter the representation of a number, the smaller the circuit, the faster the calculation and the lower the power consumption. That is why the industry has moved from heavier formats to FP16, BF16, FP8 and even more compact options. Nvidia has also attributed a significant part of its GPUs' performance gains to the move to short number formats. Tensordyne goes further and proposes not just to cut precision but to change the arithmetic itself.

The company is building Napier primarily for inference, that is, running already trained models. This market is growing rapidly: AI services respond to users, write code, act as agents, handle long requests and carry out chains of actions. Under such workloads, what matters is not only overall performance but also response latency, the cost per million tokens and energy consumption in the data center.

Running a large language model is usually split into two parts. First comes prefill: the model reads the input text, turns it into tokens and builds a working context for the answer. This step requires a lot of computation. Then the decode step begins: the model generates new tokens one at a time, relying on the context already built. Here memory, network bandwidth and latency between chips matter most, because each new piece of the response depends on the previous one.
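The difference between the two phases can be sketched with a toy Python loop. The "model" here is a made-up arithmetic function, not a neural network; `toy_next_token`, `prefill` and `decode` are hypothetical names used only to show that prefill handles the whole prompt in one go while decode is inherently sequential.

```python
def toy_next_token(cache: list[int]) -> int:
    """Hypothetical stand-in for a model forward pass; depends on the whole context."""
    return (sum(cache) + len(cache)) % 100

def prefill(prompt_tokens: list[int]) -> list[int]:
    """Process the entire prompt at once and return the working context."""
    return list(prompt_tokens)

def decode(cache: list[int], max_new_tokens: int) -> list[int]:
    """Generate tokens one at a time; each step needs every earlier token."""
    generated = []
    for _ in range(max_new_tokens):
        token = toy_next_token(cache)
        cache.append(token)  # the context grows with every generated token
        generated.append(token)
    return generated

cache = prefill([3, 14, 15])
output = decode(cache, max_new_tokens=4)
```

Nothing in the decode loop can start before the previous iteration finishes, which is why latency between chips and memory speed dominate that phase.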

AI hardware vendors increasingly split these tasks between different systems. Some accelerators are better suited to the heavy initial computation, others to fast, low-latency token generation. Tensordyne argues that Napier will handle both parts efficiently in a single architecture, without combining hardware from several suppliers and without deploying a large number of racks.

For the compute-heavy stage, the company relies on logarithmic math. For response generation, Napier has 144 GB of high-speed HBM memory on the chip and its own network between processors. Tensordyne's materials call this network Napier Link or TDN Link; the company claims a latency of one microsecond or less and high bandwidth within the 72-chip system.

One Tensordyne node occupies a quarter of a standard rack. It includes 72 Napier chips, eight Intel Xeon processors and 64 TB of solid-state storage. A full rack combines four such nodes, that is, 288 chips. According to the company, this configuration draws about 120 kW, runs on air cooling and is designed for large models, including ones with trillions of parameters.
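The rack figures can be cross-checked with simple arithmetic. The Python sketch below uses only numbers quoted by the company (72 chips per node, four nodes per rack, 144 GB of HBM per chip, about 120 kW per rack); the per-chip power value is a derived average, not a specification, because the rack budget also covers the Xeon processors, storage and cooling fans.

```python
CHIPS_PER_NODE = 72
NODES_PER_RACK = 4
HBM_PER_CHIP_GB = 144   # figure quoted by the company
RACK_POWER_KW = 120     # approximate figure quoted by the company

chips_per_rack = CHIPS_PER_NODE * NODES_PER_RACK      # 288
hbm_per_node_gb = CHIPS_PER_NODE * HBM_PER_CHIP_GB    # 10,368 GB
hbm_per_rack_gb = chips_per_rack * HBM_PER_CHIP_GB    # 41,472 GB

# Average rack power per chip; an upper bound on chip power, since CPUs,
# SSDs and fans share the same 120 kW budget.
watts_per_chip_share = RACK_POWER_KW * 1000 / chips_per_rack
```

That works out to roughly 10 TB of HBM per node, about 41 TB per rack, and an average of a little under 420 W of rack power per chip.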

The boldest claim concerns economics. Tensordyne claims that a rack of four nodes running a 2-trillion-parameter model will deliver about 1,300 tokens per second per user at a cost of $11 per million tokens. The company compares this with larger hybrid systems built on future Nvidia accelerators and separate response-generation solutions, which it estimates would require more racks and significantly more energy.
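To get a feel for what these numbers would mean in practice, here is a small Python calculation. The throughput and price are the company's unverified claims; the 500-token answer length is an assumption picked purely for illustration.

```python
TOKENS_PER_SECOND_PER_USER = 1300   # claimed by the company, not verified
COST_PER_MILLION_TOKENS_USD = 11    # claimed by the company, not verified
ANSWER_TOKENS = 500                 # ASSUMPTION: a hypothetical answer length

# Time and cost of one answer for one user
answer_seconds = ANSWER_TOKENS / TOKENS_PER_SECOND_PER_USER
answer_cost_usd = ANSWER_TOKENS / 1_000_000 * COST_PER_MILLION_TOKENS_USD

# One user stream generating continuously for an hour
tokens_per_hour = TOKENS_PER_SECOND_PER_USER * 3600
cost_per_stream_hour_usd = tokens_per_hour / 1_000_000 * COST_PER_MILLION_TOKENS_USD
```

Under these assumptions a 500-token answer would take about 0.38 seconds and cost about half a cent, while a single stream running nonstop for an hour would produce 4.68 million tokens for roughly $51.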

But for now all these figures remain a promise. Tensordyne has passed an important milestone: the Napier design has been sent to production. In the industry this is called tapeout. After that, the chip still has to be manufactured, received back, tested and debugged; systems have to be assembled, the software stack finished, and customers given the chance to run real tests. There is often a long engineering distance between a good-looking simulation and stable operation in a data center.

It is also important to understand that Nvidia, AMD and other large suppliers win not only on hardware but also thanks to a mature software ecosystem. Developers are used to their tools, libraries, compilers and profilers. It is not enough for a new chip to be efficient on paper. It has to run real models quickly, work well with popular frameworks and not force customers to rebuild their infrastructure from scratch.

If Tensordyne confirms its figures on finished hardware, Napier could become a notable alternative for inference. Energy consumption has already become one of the main problems of AI services: data centers need fast answers, a low cost per token and less pressure on the power grid. Any architecture that actually reduces latency and energy consumption will get the attention of cloud providers and model developers.

Tensordyne plans to open its first cloud beta tests by the end of 2026, with system deliveries to customers expected in 2027. Until then, the company's claims should be read with caution: Napier looks like an interesting attempt to rethink the mathematics of AI accelerators, but the market will believe in an advantage over Nvidia only after independent tests on real models, not after comparisons in simulation.
 