Furiosa RNGD - Gen 2 data center accelerator

Powerfully efficient AI inference for Enterprise and Cloud

View specs

Get notified first

#1 EFFICIENT LLAMA INFERENCE

token/s/W

token/s/W

Llama 3.1 70B

2,048 input tokens / 128 output tokens / x 8 cards

rngd

FuriosaSDK / FP8 / 957.05 token/s

h100 sxm

TensorRT-LLM 0.15.0 / FP8 / 2,064.53 token/s

l40s

TensorRT-LLM 0.15.0 / FP8 / 163.53 token/s

Llama 3.1 8B

128 input tokens / 4,096 output tokens / x 1 card

rngd

FuriosaSDK / FP8 / 3,935.25 token/s

h100 sxm

TensorRT-LLM 0.15.0 / FP8 / 13,222.06 token/s

l40s

TensorRT-LLM 0.15.0 / FP8 / 2989.17 token/s

	RNGD	H100 SXM	L40S
Technology	TSMC 5nm	TSMC 4nm	TSMC 5nm
BF16/FP8 (TFLOPS)	256/512	989/1979	362/733
INT8/INT4 (TOPS)	512/1024	1979/-	733/733
Memory Capacity (GB)	48	80	48
Memory Bandwidth (TB/s)	1.5	3.35	0.86
Host I/F	Gen5 x16	Gen5 x16	Gen4 x16
TDP (W)	180	700	350

Disclaimer: Measurements by FuriosaAI internally on current specifications and/or internal engineering calculations. Nvidia results were retrieved from Nvidia website, https://github.com/NVIDIA/Tens..... /perf-overview.md, on Aug 25, 2024.

EFFICIENT AI INFERENCE IS HERE

RNGD delivers high-performance LLM and multimodal deployment capabilities while maintaining a radically efficient 180W power profile.

512 TFLOPS: 64 TFLOPS (FP8) x 8 processing elements
48 GB: HBM3 memory capacity
2 x HBM3: CoWoS-S, 6.0 Gbps
256 MB SRAM: 384 TB/s on-chip bandwidth
1.5 TB/s: HBM3 memory bandwidth
180 W TDP: Targeting air-cooled data centers

PCIe P2P support for LLM BF16, FP8, INT8, INT4 support

Multiple-Instance and Virtualization Secure boot & model encryption

Tensor contraction, not matmul

Tensor Contraction Processor (TCP)

At the heart of Furiosa RNGD is Tensor Contraction Processor architecture (ISCA 2024), specifically designed for efficient tensor contraction operations. The fundamental computation of modern day deep learning is tensor contraction, a higher dimensional generalization of matrix multiplication. However, most commercial deep learning accelerators today incorporate fixed-sized matmul instructions as primitives. RNGD breaks away from that, unlocking powerful performance and efficiency.

Download TCP Paper

Learn more about TCP

Advanced packaging technology

For optimal single-chip compute density, memory bandwidth, and energy efficiency.

Learn more about implementing HBM3

Tensor Contraction Processor

TCP is the compute architecture underlying Furiosa accelerators. With tensor operation as the first-class citizen, Tensor Contraction Processor (TCP) unlocks unparalleled energy efficiency.

Tensor mapping for max utilization

We elevate the programming interface between hardware and software to treat tensor contraction as a single, unified operation. This fundamental design choice streamlines programming, maximizing parallelism and data reuse, while providing flexibility and reconfigurability of compute and maximizes memory resources based on tensor shapes. Furiosa Compiler leverages this flexibility and reconfigurability of hardware to select the most optimized tactics, delivering powerful and efficient deep learning acceleration for all scales of deployment.

SOFTWARE FOR LLM DEPLOYMENT

Furiosa SW Stack consists of a model compressor, serving framework, runtime, compiler, profiler, debugger, and a suite of APIs for ease of programming and deployment.

Available now.

Built for advanced inference deployment

Comprehensive software toolkit for optimizing large language models on RNGD. User-friendly APIs facilitate seamless state-of-the-art LLM deployment.

Maximizing Data Center Utilization

Ensure higher utilization and flexibility for small and large deployments with containerization, SR-IOV, Kubernetes, as well as other cloud native components.

Robust Ecosystem Support

Effortlessly deploy models from library to end-user with PyTorch 2.x integration. Leverage the vast advancements of open-source AI and seamlessly transition models into production.