Introducing Furiosa NXT RNGD Server: Efficient AI inference at data center scale
News

We are excited to announce Furiosa NXT RNGD Server, an enterprise-ready turnkey appliance designed to bring world-class performance to any on-prem or private cloud deployment.
Engineered to meet the rapidly growing global demand for AI inference, NXT RNGD Server is not just powerful, but also a practical and power-efficient solution for real-world data center environments.
NXT RNGD Server addresses the core enterprise challenge of scaling AI workloads: high energy consumption and the costly infrastructure upgrades needed to support GPUs’ power and cooling requirements. While an NVIDIA DGX H100 server has a maximum power consumption of 10.2 kW, a typical NXT RNGD Server deployment uses only 3 kW. This enables businesses to scale AI quickly within existing facilities without significant additional capital investment. A standard 15 kW data center rack can accommodate up to five NXT RNGD Servers, compared to just one NVIDIA DGX H100 server.
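The rack-density arithmetic behind that claim can be checked directly. A minimal sketch using only the power figures quoted above:

```python
# Rack density check using the power figures quoted above.
RACK_BUDGET_KW = 15.0   # standard data center rack budget
NXT_RNGD_KW = 3.0       # typical NXT RNGD Server deployment
DGX_H100_KW = 10.2      # NVIDIA DGX H100 maximum power consumption

# How many of each server fit within the rack's power budget.
servers_per_rack = int(RACK_BUDGET_KW // NXT_RNGD_KW)  # 5
dgx_per_rack = int(RACK_BUDGET_KW // DGX_H100_KW)      # 1

print(servers_per_rack, dgx_per_rack)  # 5 1
```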

Key system specifications
NXT RNGD Server is powered by eight RNGD cards, delivering 4 petaFLOPS of FP8 (or 4 petaTOPS of INT8) compute in a single standard 4U chassis. The appliance is equipped with a total of 384 GB of HBM3 memory operating at 12 TB/s memory bandwidth. Total power consumption is 3 kW, compared to 10 kW or more for advanced GPU servers.
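The system totals follow from the per-card figures given later in this post (48 GB of HBM3 and a 180 W TDP per RNGD card). A quick cross-check; the power-budget breakdown at the end is an inference, not a published figure:

```python
# Cross-checking system totals against the stated per-card figures.
CARDS = 8
HBM_PER_CARD_GB = 48   # per-card HBM3 capacity, as stated in this post
TDP_PER_CARD_W = 180   # per-card TDP, as stated in this post

total_hbm_gb = CARDS * HBM_PER_CARD_GB          # 384 GB, matching the spec
cards_power_kw = CARDS * TDP_PER_CARD_W / 1000  # 1.44 kW for the cards alone

# The remaining ~1.5 kW of the 3 kW system budget presumably covers host
# CPUs, system memory, fans, and other platform components (an assumption,
# not a published breakdown).
print(total_hbm_gb, cards_power_kw)  # 384 1.44
```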
NXT RNGD Server offers several key benefits for enterprise customers:
Lower cost of ownership: Run state-of-the-art models without vendor lock-in or an unsustainable Total Cost of Ownership (TCO).
Deploy anywhere: Run advanced AI efficiently at scale within current infrastructure and power limitations – using on-prem servers or cloud data centers. The reality is that more than 80% of data centers are air-cooled and operate at 8 kW per rack or less.
Data sovereignty: Quickly deploy and scale high-performance local infrastructure for sensitive workloads, with complete control over enterprise data and model weights, enhanced regulatory compliance, and stronger security and privacy protections.
Flexibility for new models and use cases: Leverage a powerful SDK that offers a drop-in replacement for vLLM, OpenAI API compatibility, and a full suite of profiling capabilities.
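OpenAI API compatibility means existing clients and tooling can target the server without code changes, by sending the standard chat-completions request shape to its endpoint. A minimal sketch of such a request body; the endpoint URL and model name below are placeholders for illustration, not documented values:

```python
import json

# Placeholder endpoint; an OpenAI-compatible server exposes /v1 routes.
base_url = "http://localhost:8000/v1"

# Standard OpenAI chat-completions request body; the model id here is
# a made-up placeholder.
payload = {
    "model": "example-model",
    "messages": [
        {"role": "user", "content": "Summarize RNGD in one sentence."}
    ],
    "max_tokens": 64,
}

body = json.dumps(payload)
# This body would be POSTed to f"{base_url}/chat/completions" using any
# OpenAI client library or a plain HTTP request.
```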
At the core of RNGD’s performance and efficiency is our Tensor Contraction Processor (TCP) architecture, which is designed from the ground up to eliminate the inefficiencies of using GPUs for AI applications. This AI-native, software-driven approach maximizes parallelism and data reuse, resulting in world-class performance and radical energy efficiency. Each RNGD card is fabricated using TSMC’s 5nm process, with 48 GB of HBM3 memory and a TDP of just 180 W.
Proven real-world performance
NXT RNGD Server’s performance has been validated by global enterprises. In July, LG AI Research announced that it has adopted RNGD for inference computing with its EXAONE models. LG AI Research found that RNGD delivers 2.25x better LLM inference performance per watt vs. GPUs, while also meeting demanding latency and throughput requirements.
We are now working with LG AI Research to supply NXT RNGD Servers to enterprises using EXAONE across key sectors such as electronics, finance, telecommunications, and biotechnology.
Earlier this month, we partnered with OpenAI to showcase its new open-weight gpt-oss-120b model running live on just two RNGD cards, using MXFP4 precision.
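A back-of-envelope estimate shows why a 120B-parameter model fits on two 48 GB cards at MXFP4 precision, which stores 4-bit elements plus one shared 8-bit scale per 32-element block (about 4.25 bits per weight). This is a rough weight-only sketch; the actual runtime footprint also includes KV cache and activations:

```python
# Rough weight-memory estimate for gpt-oss-120b at MXFP4 precision.
PARAMS = 120e9                # approximate parameter count
BITS_PER_WEIGHT = 4 + 8 / 32  # 4-bit elements + one 8-bit scale
                              # per 32-element block = 4.25 bits

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9  # bytes -> GB
two_cards_gb = 2 * 48                            # 96 GB of HBM3

assert weights_gb < two_cards_gb
print(round(weights_gb, 2))  # 63.75
```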

FuriosaAI CTO Hanjoon Kim talks with guests at OpenAI's Seoul launch event.
New front-end features, multi-chip scaling, and more with SDK 2025.3 and 2025.3.1
The performance and efficiency of NXT RNGD Server are amplified by continuous updates to the Furiosa SDK. Our latest releases focus on multi-chip scaling, new performance optimizations for large models, and new front-end functionality.
Our latest 2025.3 and 2025.3.1 SDK updates bring several significant new features:
Inter-chip tensor parallelism across multiple RNGD cards, enabling efficient scaling for massive models and incorporating optimized PCIe paths for P2P communication, advanced communication scheduling, and compiler tactics that overlap inter-chip DMA with computation
Enhanced global compiler optimization, maximizing SRAM reuse between transformer blocks, reducing latency and boosting throughput
Runtime optimizations, improving synchronization across devices
Expanded model and quantization support, including Qwen 2 and Qwen 2.5
W8A16 quantization, added to the existing options of BF16, FP8, INT8, INT4, and MXFP4 precision
Improved logging and production metrics
Expanded support for tool calling
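The idea behind inter-chip tensor parallelism is to shard each layer's weights across cards so the devices compute partial results in parallel. A minimal NumPy sketch of column-wise sharding across two hypothetical devices; this illustrates the underlying technique, not the SDK's API:

```python
import numpy as np

rng = np.random.default_rng(0)

# A single linear layer y = x @ W, with W split column-wise across
# two "devices" (plain arrays here stand in for RNGD cards).
x = rng.standard_normal((4, 16))
W = rng.standard_normal((16, 8))

W0, W1 = np.split(W, 2, axis=1)  # each device holds half the columns

# Each device computes its shard independently (concurrently on real
# hardware); concatenating the partial outputs reproduces the full result.
y0 = x @ W0
y1 = x @ W1
y = np.concatenate([y0, y1], axis=1)

assert np.allclose(y, x @ W)
```

Row-wise sharding works analogously but requires a cross-device reduction instead of a concatenation, which is where optimized P2P communication paths matter.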
These enhancements build on previous releases that introduced Hugging Face Hub integration, reasoning model support, and PyTorch eager mode execution with torch.compile support.

Ready for enterprise customers
Today’s unveiling of NXT RNGD Server further shows that cutting-edge models can be deployed within the existing power budgets of typical data centers. This removes the prohibitive energy costs and complex infrastructure requirements of GPUs, making advanced AI truly accessible for enterprise customers.
NXT RNGD Server is currently sampling with global customers and is expected to be available for order in early 2026.
Sign up for RNGD updates here.