FuriosaAI

Furiosa’s second-generation AI chip. Coming soon.

The most efficient data center accelerator for high-performance LLM and multimodal deployment

Tensor Contraction Architecture (TCA)

Tensor Contraction Architecture (TCA) is the architecture behind all Furiosa accelerators – designed to unlock powerful performance and unparalleled energy efficiency on the most capable AI models.

[Chart: Llama 2 7B — energy efficiency (3x and 4x), latency, and throughput comparisons.]

Disclaimer: Measurements performed internally by FuriosaAI based on current specifications and/or internal engineering calculations. Nvidia results were retrieved from the Nvidia website, https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference, on February 14, 2024.

                         L40S           H100           RNGD
Technology               TSMC 5nm       TSMC 4nm       TSMC 5nm
BF16/FP8 (TFLOPS)        362/733        989/1979       256/512
INT8/INT4 (TOPS)         733/733        1979/-         512/1024
Memory Capacity (GB)     48             80             48
Memory Bandwidth (TB/s)  0.86           3.35           1.5
Host I/F                 PCIe Gen4 x16  PCIe Gen5 x16  PCIe Gen5 x16
TDP (W)                  350            700            150
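As a back-of-the-envelope reading of the table above, peak FP8 throughput per watt can be computed directly from the datasheet figures. This is only a sketch: delivered efficiency depends on achieved utilization in real workloads, not peak numbers.

```python
# Peak FP8 TFLOPS per watt, taken from the spec table above.
# Datasheet peaks only -- delivered efficiency depends on utilization.
specs = {"L40S": (733, 350), "H100": (1979, 700), "RNGD": (512, 150)}

tflops_per_watt = {name: fp8 / tdp for name, (fp8, tdp) in specs.items()}
for name, eff in tflops_per_watt.items():
    print(f"{name}: {eff:.2f} peak FP8 TFLOPS/W")
```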

Purpose-built for tensor contraction

How Furiosa TCA unlocks powerful performance and energy efficiency

AI models structure data in tensors of various dimensions. The architecture adapts to each tensor contraction via compiler-defined tactics. 

Intermediary tensors are maintained in the on-chip memory (SRAM), akin to model-wise operator fusion.

This allows the chip to fully exploit parallelism and maximize data reuse, achieving high utilization in inference deployments.
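A minimal sketch of what keeping intermediates on-chip via fusion means at the math level, using NumPy einsum as a stand-in for compiler-scheduled contractions (all names and shapes here are hypothetical, not Furiosa APIs):

```python
import numpy as np

# Hypothetical shapes, for illustration only.
rng = np.random.default_rng(0)
batch, seq, d_model, d_head = 2, 4, 8, 8

x = rng.standard_normal((batch, seq, d_model))   # activations
w = rng.standard_normal((d_model, d_head))       # projection weight

# Unfused: the intermediate tensor q round-trips through memory.
q = np.einsum("bsd,dh->bsh", x, w)               # contraction over d
scores = np.einsum("bsh,bth->bst", q, q)         # contraction over h

# Fused: one contraction expression; a compiler scheduling it as a whole
# can keep the intermediate in on-chip SRAM instead of materializing it.
scores_fused = np.einsum("bsd,dh,bte,eh->bst", x, w, x, w)
```

Both paths compute the same result; the fused form is the kind of schedule a contraction-oriented compiler can choose to avoid off-chip traffic.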

Meet Renegade

Coming soon.
RNGD chip die

512 TFLOPS
64 TFLOPS (FP8) x 8 Processing Elements
48GB
HBM3 Memory Capacity
1.5TB/s
Memory Bandwidth
150W
Thermal Design Power

RNGD Series

RNGD-S

Leadership performance for creatives, media and entertainment, and video AI

RNGD

Versatile cloud and on-prem LLM and multimodal deployment

RNGD-Max

Powerful cloud and on-prem LLM and multimodal deployment