Powerfully efficient AI inference for enterprise and cloud
EFFICIENT LLM INFERENCE
Efficiency (tokens/watt)
Llama 3 8B, 2048 input tokens / 2048 output tokens
- RNGD: FuriosaSDK / FP8 / 3047 tokens/s
- H100 SXM: TensorRT-LLM 0.11.0 / FP8 / 8399 tokens/s
- L40S: TensorRT-LLM 0.11.0 / FP8 / 1912 tokens/s
Efficiency (queries/watt)
GPT-J, MLPerf data center, closed, offline scenario / 99.9% accuracy
- RNGD: FuriosaSDK / FP8 / 15.13 queries/s
- H100 SXM: TensorRT-LLM 0.11.0 / FP8 / 30.375 queries/s
- L40S: TensorRT-LLM 0.11.0 / FP8 / 12.25 queries/s
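A back-of-the-envelope sketch of the per-watt normalization implied by the charts above, assuming the three entries in each chart correspond to RNGD, H100 SXM, and L40S respectively, and dividing raw throughput by each card's TDP from the spec table below:

```python
# TDP values (W) from the spec table below; normalizing throughput by TDP
# is an assumption about how the charts' efficiency figures are derived.
tdp_w = {"RNGD": 150, "H100 SXM": 700, "L40S": 350}
llama3_tokens_per_s = {"RNGD": 3047, "H100 SXM": 8399, "L40S": 1912}
gptj_queries_per_s = {"RNGD": 15.13, "H100 SXM": 30.375, "L40S": 12.25}

for name in tdp_w:
    print(f"{name}: {llama3_tokens_per_s[name] / tdp_w[name]:.1f} tokens/s/W, "
          f"{gptj_queries_per_s[name] / tdp_w[name]:.3f} queries/s/W")
# RNGD: 20.3 tokens/s/W, 0.101 queries/s/W
# H100 SXM: 12.0 tokens/s/W, 0.043 queries/s/W
# L40S: 5.5 tokens/s/W, 0.035 queries/s/W
```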
| | RNGD | H100 SXM | L40S |
|---|---|---|---|
| Technology | TSMC 5nm | TSMC 4nm | TSMC 5nm |
| BF16/FP8 (TFLOPS) | 256/512 | 989/1979 | 362/733 |
| INT8/INT4 (TOPS) | 512/1024 | 1979/- | 733/733 |
| Memory capacity (GB) | 48 | 80 | 48 |
| Memory bandwidth (TB/s) | 1.5 | 3.35 | 0.86 |
| Host I/F | Gen5 x16 | Gen5 x16 | Gen4 x16 |
| TDP (W) | 150 | 700 | 350 |
Disclaimer: Measurements were performed internally by FuriosaAI on current specifications and/or derived from internal engineering calculations. Nvidia results were retrieved from the Nvidia website, https://github.com/NVIDIA/Tens... /perf-overview.md, on Aug 25, 2024.
INFERENCE WITHOUT CONSTRAINTS
Performance
Deploy the most capable models with low latency and high throughput
Efficiency
Lower total cost of ownership with less energy, fewer racks, and compatibility with today's air-cooled data centers
Programmability
Stay future-proof for tomorrow’s models and transition with ease
EFFICIENT AI INFERENCE IS HERE
RNGD (pronounced "Renegade") delivers high-performance LLM and multimodal deployment capabilities while maintaining a radically efficient 150W power profile.
- 512 TFLOPS: 64 TFLOPS (FP8) x 8 processing elements
- 48GB HBM3 memory capacity
- 2 x HBM3: CoWoS-S, 6.0Gbps
- 256MB SRAM: 384TB/s on-chip bandwidth
- 1.5TB/s HBM3 memory bandwidth
- 150W TDP: targeting air-cooled data centers
- PCIe P2P support for LLMs
- BF16, FP8, INT8, INT4 support
- Multiple-instance and virtualization
- Secure boot & model encryption
Tensor contraction, not matmul
Tensor Contraction Processor (TCP)
At the heart of Furiosa RNGD is the Tensor Contraction Processor (TCP) architecture (ISCA 2024), designed specifically for efficient tensor contraction operations. The fundamental computation of modern deep learning is tensor contraction, a higher-dimensional generalization of matrix multiplication. Most commercial deep learning accelerators today, however, incorporate fixed-size matmul instructions as primitives. RNGD breaks away from that approach, unlocking powerful performance and efficiency.
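To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not Furiosa code): matrix multiplication is the rank-2 special case of tensor contraction, while higher-dimensional contractions such as attention scores can be expressed directly, without being lowered to fixed-size matmul tiles.

```python
import numpy as np

# Matrix multiplication is the rank-2 special case of tensor contraction:
# C[i, j] = sum_k A[i, k] * B[k, j]
A = np.random.rand(8, 16)
B = np.random.rand(16, 4)
C = np.einsum("ik,kj->ij", A, B)  # identical to A @ B

# A higher-dimensional contraction (e.g. attention-style scores over
# batch and head axes) is expressed just as directly, with no reshaping
# into fixed-size matmul primitives:
Q = np.random.rand(2, 12, 64, 32)  # (batch, heads, seq, head_dim)
K = np.random.rand(2, 12, 64, 32)
scores = np.einsum("bhqd,bhkd->bhqk", Q, K)
print(C.shape, scores.shape)  # (8, 4) (2, 12, 64, 64)
```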
Tensor Contraction Processor
TCP is the compute architecture underlying Furiosa accelerators. With tensor operations as first-class citizens, the Tensor Contraction Processor unlocks unparalleled energy efficiency.
Tensor mapping for max utilization
We elevate the programming interface between hardware and software to treat tensor contraction as a single, unified operation. This fundamental design choice streamlines programming and maximizes parallelism and data reuse, while allowing compute and memory resources to be flexibly reconfigured based on tensor shapes. The Furiosa Compiler leverages this flexibility and reconfigurability of the hardware to select the most optimized tactics, delivering powerful, efficient deep learning acceleration at every scale of deployment.
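As an analogy for shape-driven tactic selection (this uses NumPy's contraction-order optimizer, not the Furiosa Compiler), the sketch below searches over contraction orderings for a three-operand contraction and reports the chosen plan with its estimated FLOP cost; the best ordering depends entirely on the tensor shapes:

```python
import numpy as np

# Shapes deliberately chosen so that contraction order matters:
A = np.random.rand(32, 1024)
B = np.random.rand(1024, 8)
C = np.random.rand(8, 2048)

# Search over contraction orders; contracting (A, B) first produces a
# tiny 32x8 intermediate, while other orders are far more expensive.
path, report = np.einsum_path("ij,jk,kl->il", A, B, C, optimize="optimal")
print(report)  # prints the chosen path and its estimated speedup
```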
SOFTWARE FOR LLM DEPLOYMENT
The Furiosa SW Stack consists of a model compressor, serving framework, runtime, compiler, profiler, debugger, and a suite of APIs for ease of programming and deployment.
Coming publicly in 2025.
Built for advanced inference deployment
A comprehensive software toolkit for optimizing large language models on RNGD. User-friendly APIs make deploying state-of-the-art LLMs seamless.
Maximizing data center utilization
Ensure higher utilization and flexibility for small and large deployments with containerization, SR-IOV, Kubernetes, and other cloud-native components.
Robust ecosystem support
Effortlessly deploy models from library to end user with PyTorch 2.x integration. Leverage the vast advancements of open-source AI and seamlessly transition models into production.
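A minimal sketch of what PyTorch 2.x integration typically looks like; the vendor backend name is an assumption for illustration, and the built-in `inductor` backend is used here so the snippet runs anywhere:

```python
import torch

# Toy model standing in for a production LLM block.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).eval()
x = torch.randn(1, 16, 512)

# torch.compile captures the model as an FX graph and hands it to the
# selected backend; a vendor backend (e.g. a hypothetical "furiosa")
# would be registered and selected the same way.
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    y = compiled(x)
print(y.shape)  # torch.Size([1, 16, 512])
```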
Series RNGD
RNGD-S
Leadership performance for creatives, media and entertainment, and video AI
RNGD
150W versatile inference for all infrastructure deployments
RNGD-MAX
350W powerful inference with maximum compute density