The most efficient data center accelerator for high-performance LLM and multimodal deployment
Tensor Contraction Architecture (TCA)
Tensor Contraction Architecture (TCA) is the architecture behind all Furiosa accelerators – designed to unlock powerful performance and unparalleled energy efficiency on the most capable AI models.
Llama 2 7B
| Perf/Watt (tokens/sec/W) | L40S | H100 | RNGD |
|---|---|---|---|
| Batch Size=16, Input Length=2K, Output Length=2K | 1.52 | Undisclosed | 6.24 |
| Batch Size=32, Input Length=2K, Output Length=2K | Undisclosed | 3.19 | 8.62 |
| 1st Token Latency (ms) | L40S | H100 | RNGD |
|---|---|---|---|
| Batch Size=1, Sequence Length=128 | 14 | 7 | 8 |
| Throughput (tokens/s) | L40S | H100 | RNGD |
|---|---|---|---|
| Batch Size=16, Input Length=2K, Output Length=2K | 531 | Undisclosed | 935 |
| Batch Size=32, Input Length=2K, Output Length=2K | Undisclosed | 2230 | 1293 |
Disclaimer: Measurements were made internally by FuriosaAI based on current specifications and/or internal engineering calculations. Nvidia results were retrieved from the Nvidia website, https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference, on February 14, 2024.
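The Perf/Watt figures above line up with throughput divided by board TDP (the TDP values appear in the spec table that follows). A quick sanity check, assuming TDP is the power figure used and allowing for rounding of the published numbers:

```python
# Sanity check: Perf/Watt ≈ throughput (tokens/s) / TDP (W).
# Assumes board TDP is the relevant power figure; small rounding
# differences are expected since the published values are rounded.
tdp = {"L40S": 350, "H100": 700, "RNGD": 150}

# (device, throughput tokens/s, published tokens/sec/W)
rows = [
    ("L40S", 531, 1.52),   # batch 16
    ("RNGD", 935, 6.24),   # batch 16
    ("H100", 2230, 3.19),  # batch 32
    ("RNGD", 1293, 8.62),  # batch 32
]

for device, tps, published in rows:
    derived = tps / tdp[device]
    print(f"{device}: {derived:.2f} tokens/sec/W (published {published})")
```

Each derived value matches its published counterpart to within about 0.01 tokens/sec/W.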
| | L40S | H100 | RNGD |
|---|---|---|---|
| Technology | TSMC 5nm | TSMC 4nm | TSMC 5nm |
| BF16/FP8 (TFLOPS) | 362/733 | 989/1979 | 256/512 |
| INT8/INT4 (TOPS) | 733/733 | 1979/- | 512/1024 |
| Memory Capacity (GB) | 48 | 80 | 48 |
| Memory Bandwidth (TB/s) | 0.86 | 3.35 | 1.5 |
| Host I/F | Gen4 x16 | Gen5 x16 | Gen5 x16 |
| TDP (W) | 350 | 700 | 150 |
Purpose-built for tensor contraction
How Furiosa TCA unlocks powerful performance and energy efficiency
AI models structure their data as tensors of various dimensions. TCA adapts to each tensor contraction through compiler-defined tactics, and intermediary tensors are kept in on-chip memory (SRAM), akin to model-wide operator fusion. This lets the chip fully exploit parallelism and maximize data reuse, achieving high utilization in inference deployment.
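As an illustrative analogy only (not FuriosaAI's actual compiler or hardware), the idea of expressing consecutive layers as tensor contractions and fusing them so the intermediate tensor never needs to be fully materialized can be sketched with NumPy's einsum:

```python
import numpy as np

# Illustrative analogy, not Furiosa's compiler: two back-to-back
# linear layers expressed as tensor contractions.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 128, 256))   # (batch, tokens, features)
w1 = rng.standard_normal((256, 512))
w2 = rng.standard_normal((512, 256))

# Unfused: the (8, 128, 512) intermediate is fully materialized,
# which on real hardware implies a round trip to off-chip memory.
intermediate = np.einsum("btf,fh->bth", x, w1)
out_unfused = np.einsum("bth,hg->btg", intermediate, w2)

# "Fused": one contraction over the shared index h. A scheduling
# compiler can tile this so intermediate values stay in fast SRAM.
out_fused = np.einsum("btf,fh,hg->btg", x, w1, w2, optimize=True)

assert np.allclose(out_unfused, out_fused)
```

The fused and unfused forms compute the same result; the difference, on hardware like this, is where the intermediate data lives.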
Meet Renegade
The most efficient data center accelerator for high-performance LLM and multimodal deployment
- 512 TFLOPS (FP8): 64 TFLOPS x 8 Processing Elements
- 48GB HBM3 Memory Capacity
- 1.5TB/s Memory Bandwidth
- 150W Thermal Design Power
RNGD Series
RNGD-S
Leadership performance for creatives, media and entertainment, and video AI
RNGD
Versatile cloud and on-prem LLM and multimodal deployment
RNGD-Max
Powerful cloud and on-prem LLM and multimodal deployment