RNGD preview: The world’s most efficient AI chip for LLM inference
Summary
- FuriosaAI has developed RNGD, the most power-efficient accelerator chip for data center inference with computationally demanding AI models.
- RNGD offers native PyTorch 2.x support and delivers 256/512/1024 TOPS (BF16 / FP8 or INT8 / INT4, respectively).
- RNGD achieves 3x better performance per watt compared to H100 when running advanced large language models.
- Hardware testing of RNGD is proceeding quickly, and the chip will launch later this year.
To make AI computing both more sustainable and more accessible, FuriosaAI has developed RNGD, the world's most power-efficient chip for inference with large language models and multimodal models. We are currently testing the first hardware samples of RNGD in preparation for launch later this year.
In this blog post we’ll provide a preview of RNGD’s key features and capabilities, with more information to come as we get closer to launch.
While cutting-edge GPUs consume up to 1,200 watts, RNGD (pronounced “Renegade”) is designed to operate with a thermal design power (TDP) of just 150 watts. This makes RNGD an ideal choice for large-scale deployment of advanced generative AI models like Llama 2 and Llama 3.
Optimizing AI models to run efficiently on GPUs has often been a difficult process that requires significant time and expertise. This has proven even more true with many new alternative chip architectures, where real-world deployments require extensive hand-tuning of kernels and other complex heuristics.
RNGD’s tensor-based hardware architecture (described in more detail here) makes it possible to deploy and optimize new models automatically even when they use a novel architecture. (GPUs, by comparison, allocate resources dynamically, so it’s impossible to precisely predict how well each optimization tactic will work.)
The RNGD software stack offers native PyTorch 2.x support, as well as a model quantizer API, scheduler, Python and C++ SDK, model server, Kubernetes support, and low-level drivers.
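To make the shape of that workflow concrete, here is a minimal sketch of what native PyTorch 2.x support can look like from the user's side, using the standard torch.compile entry point. The toy module and the backend choice are illustrative assumptions, not RNGD SDK documentation; in practice, low-precision deployment (FP8/INT8/INT4) would presumably go through the stack's model quantizer API.

```python
import torch
import torch.nn as nn

# Toy feed-forward block (Llama 2 7B-like dimensions), used purely to show the
# PyTorch 2.x workflow; it is not an RNGD-specific model definition.
class FeedForward(nn.Module):
    def __init__(self, hidden=4096, expanded=11008):
        super().__init__()
        self.up = nn.Linear(hidden, expanded, bias=False)
        self.down = nn.Linear(expanded, hidden, bias=False)

    def forward(self, x):
        return self.down(torch.nn.functional.silu(self.up(x)))

model = FeedForward().eval()

# torch.compile is the standard PyTorch 2.x entry point. "inductor" is the
# stock backend; a vendor backend registered by the RNGD SDK would be selected
# here instead (its exact name is an assumption, not stated in this post).
compiled_model = torch.compile(model, backend="inductor")

x = torch.randn(8, 4096)
with torch.no_grad():
    y = compiled_model(x)
print(y.shape)  # torch.Size([8, 4096])
```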
Processing power and flexibility
Programmability and efficiency are of little use if a chip can’t deliver sufficient computational power. RNGD provides 512 TFLOPS (FP8) of compute, 48 GB of HBM3 memory, and 1.5 TB/s bandwidth, making it an ideal choice for deploying advanced generative AI models like Llama 2 and Llama 3.
The RNGD chip contains eight identical processing elements, which can operate independently or be “fused” together when needed. Each processing element has 64 “slices,” each with a compute pipeline and SRAM that stores a partitioned piece of the tensor it is working with.
Data can also be multicast (rather than just copied) to the operation units of multiple slices through a fetch network. This significantly reduces the number of SRAM accesses and improves data reuse. Scheduling is done explicitly through the hardware structure (rather than as threads), which significantly streamlines the datapath compared to GPUs.
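As a purely conceptual illustration of this slicing and multicast idea (a NumPy analogy we are assuming for explanation, not RNGD's actual programming model or instruction set):

```python
import numpy as np

# One activation tile is shared across many slices, while each slice keeps
# only its own partition of the weight tensor in local SRAM.
NUM_SLICES = 64
hidden, expanded = 4096, 11008                # Llama 2 7B feed-forward dimensions

activations = np.random.randn(16, hidden)     # one input tile, "multicast" to every slice
weights = np.random.randn(hidden, expanded)   # full weight matrix of the layer

# Partition the expanded (output) dimension across the 64 slices.
weight_shards = np.array_split(weights, NUM_SLICES, axis=1)

# Each slice consumes the same activation tile exactly once and produces its
# own partial output, instead of repeatedly re-reading shared data.
partial_outputs = [activations @ shard for shard in weight_shards]

# Stitching the per-slice results back together reproduces the full matmul.
output = np.concatenate(partial_outputs, axis=1)
assert np.allclose(output, activations @ weights)
```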
Load data once, then reuse it over and over
FuriosaAI detailed our chip architecture in a technical paper, TCP: A Tensor Contraction Processor for AI Workloads, which was submitted to the International Symposium on Computer Architecture (ISCA), the leading forum for new ideas in silicon design. A high-level overview of the TCP architecture is available in this blog post. In the paper, which ISCA accepted for publication, we provide sample performance comparisons between RNGD and leading GPUs when running Meta AI’s Llama 2 model. We’re currently updating those examples with other open source models, including Llama 3, and we will share more information before RNGD launches. For the purposes of this blog post, however, we’ll build on the Llama 2 examples shared earlier in the ISCA paper.
Using Llama 2, we can illustrate how RNGD and a GPU manage memory differently. In Llama 2 7B, the first computation of the feed-forward network has to store the intermediate activations it produces (4,096 input tokens x 11,008 expanded dimensions) while also loading about 45 million weights (4,096 hidden dimensions x 11,008 expanded dimensions) for the next layer.
The high-performance H100 GPU has approximately 30MB of shared memory and 50MB of L2 cache. That shared memory serves as a temporary workspace where calculations are carried out after the activations and weights are loaded. So the intermediate activations of the feed-forward network must be stored in the GPU’s L2 cache and also spill to off-chip high-bandwidth memory (HBM), since the data can’t all fit within the L2 cache.
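A quick back-of-the-envelope calculation, assuming 16-bit (2-byte) activations and weights, shows why this working set overwhelms the GPU's on-chip capacity. The figures below are rough illustrations, not measurements:

```python
# Back-of-the-envelope arithmetic for the feed-forward example above.
tokens, hidden, expanded = 4096, 4096, 11008
bytes_per_element = 2                                        # BF16/FP16 assumption

activation_bytes = tokens * expanded * bytes_per_element     # intermediate activations
weight_bytes = hidden * expanded * bytes_per_element         # next layer's weights

MB = 1_000_000
print(f"intermediate activations: ~{activation_bytes / MB:.0f} MB")                   # ~90 MB
print(f"weights to prefetch:      ~{weight_bytes / MB:.0f} MB")                       # ~90 MB
print(f"combined working set:     ~{(activation_bytes + weight_bytes) / MB:.0f} MB")  # ~180 MB
# vs. roughly 80 MB of on-chip storage on H100 (30 MB shared memory + 50 MB L2)
# and 256 MB of on-chip SRAM on RNGD.
```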
In the RNGD chip, by contrast, the 256MB of on-chip memory can not only hold all the intermediate tensors of the feed-forward network but also prefetch the weights of the next layer. The output of the first feed-forward computation can therefore be used directly in the next layer's calculation without ever leaving on-chip memory, whereas a GPU must load the data again from the global L2 cache or DRAM into shared memory.
In other words, RNGD only consumes memory bandwidth to load the weights once, whereas the GPU has to load not only the weights but also the intermediate activation results. Moreover, even when the L2 cache is used, there is an additional cost to traverse the on-chip network globally.
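Under the same 16-bit assumption, a rough sketch of the per-layer off-chip traffic this implies (illustrative only, since actual GPU caching behavior depends on the kernel and cache policy):

```python
MB = 1_000_000
weights_mb = 4096 * 11008 * 2 / MB        # ~90 MB of weights (loaded from HBM either way)
activations_mb = 4096 * 11008 * 2 / MB    # ~90 MB of intermediate results

# RNGD: intermediate results stay in the 256 MB on-chip SRAM, so off-chip
# bandwidth is spent only on loading the weights once.
rngd_traffic_mb = weights_mb

# GPU (worst case for this example): the weights are loaded, and the
# intermediate activations that spill past L2 are written out and read back.
gpu_traffic_mb = weights_mb + 2 * activations_mb

print(f"RNGD: ~{rngd_traffic_mb:.0f} MB/layer   GPU: ~{gpu_traffic_mb:.0f} MB/layer")
```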
Building the hardware for the next chapter of the AI revolution
Furiosa’s RNGD chip will launch this year for use with LLMs and other advanced transformer-based models. We look forward to sharing benchmark data and additional performance details as we get closer to launch. We are also designing additional variations for specific use cases, such as AI video- and image-generation tasks. Another will target a new generation of language models that are larger and more capable than even the best models that exist today.
We believe this family of chips will unlock AI applications in important new ways. Because of its improved power efficiency, for example, Furiosa’s second-gen chip can be deployed in a very wide range of data centers without the complex liquid cooling systems required by high-performance GPUs.
Greater energy efficiency has benefits beyond just simplifying data center deployments, of course. AI hardware consumed about as much energy in total in 2023 as the nation of Cyprus and is on track to use as much as Sweden by 2027. Even with powerful techniques (like quantization, distillation, speculative decoding and more) to make AI models more lightweight, the need for more compute is widely expected to accelerate significantly in coming years. This means that developing more power-efficient chips is an important part of making AI sustainable.
There are many reasons why innovation in AI compute is essential if the AI revolution is to continue. As researchers from OpenAI, the Centre for Governance of AI, the University of Cambridge, and other leading institutions note in a recent paper, compute is now a key limiting factor for both developing and deploying new AI systems. It’s become a constraint on everything from power grids to water consumption.
We believe RNGD, and the TCP chip architecture it is based on, will be an important part of addressing these challenges and enabling more people, businesses, and organizations everywhere to harness the benefits of AI.
Sign up here to be notified first about RNGD availability and product updates.