Why we’re joining the UEC: The future of LLM inference is multi-chip
The AI industry needs better ways to connect chips together. Model sizes are outpacing what any single chip can handle, and proprietary interconnect solutions limit innovation. That's why we're excited to join the Ultra Ethernet Consortium (UEC) and help build open standards for chip-to-chip communication.
As part of the UEC, we will help the organization develop open standards for high-bandwidth, low-latency chip-to-chip communication – a key part of building scalable and sustainable AI infrastructure.
Why multi-chip AI is here to stay
Although there has been exciting progress in building small language models, today's leading LLMs and multimodal models exceed the memory capacity of most single chips, even those with large on-chip SRAM.
Llama 70B, a current benchmark for LLMs, needs roughly 140 GB just to hold its weights in 16-bit precision, and MoE architectures like Mixtral 8x7B must keep every expert resident in memory even though only a fraction of them is active per token. With transistor density improvements plateauing and model sizes continuing to grow, multi-chip solutions are essential to scaling AI inference.
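As a rough illustration, here is the back-of-envelope math in Python. The parameter counts are the models' publicly known sizes; the 48 GiB capacity figure is a hypothetical stand-in for a single accelerator's memory, not any specific product's spec.

```python
# Back-of-envelope weight-memory math (illustrative numbers only).
GB = 1024**3

def weight_footprint_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights; KV cache and activations add more."""
    return num_params * bytes_per_param / GB

# Llama 70B in 16-bit precision: ~70e9 params * 2 bytes ≈ 130 GiB of weights.
print(f"Llama 70B @ 16-bit:    {weight_footprint_gb(70e9, 2):.0f} GiB")

# Mixtral 8x7B (~47e9 total params): all experts stay resident even though
# only 2 of 8 are active per token, so the full footprint is what matters.
print(f"Mixtral 8x7B @ 16-bit: {weight_footprint_gb(47e9, 2):.0f} GiB")

# Against a hypothetical accelerator with 48 GiB of on-package memory,
# neither model fits on one chip: the weights must be sharded across several.
SINGLE_CHIP_GIB = 48
print(f"Chips needed for Llama 70B: {weight_footprint_gb(70e9, 2) / SINGLE_CHIP_GIB:.1f}+")
```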
Parallelism is key
Multi-chip deployments use several different parallelism strategies. Data parallelism distributes inference requests across chips; tensor parallelism splits individual model layers across them; and expert parallelism in MoE models like Mixtral routes each token to a subset of expert subnetworks, which can live on different chips.
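A minimal single-process sketch of two of these strategies, using numpy arrays as stand-ins for per-chip shards (the names, shapes, and two-chip split are illustrative assumptions, not Furiosa APIs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))          # a batch of 4 token activations

# --- Tensor parallelism: split one layer's weight column-wise across 2 chips.
W = rng.standard_normal((512, 1024))
W_shards = np.split(W, 2, axis=1)          # chip 0 holds cols 0..511, chip 1 the rest
partials = [x @ shard for shard in W_shards]   # each chip computes its slice
y = np.concatenate(partials, axis=1)       # all-gather over the interconnect
assert np.allclose(y, x @ W)

# --- Expert parallelism: route each token to one of 2 experts (MoE style).
experts = [rng.standard_normal((512, 512)) for _ in range(2)]
router = rng.integers(0, 2, size=x.shape[0])   # which expert each token goes to
out = np.stack([x[i] @ experts[router[i]] for i in range(x.shape[0])])
# In a real deployment each expert lives on a different chip, so routing
# a token means sending its activations across the chip-to-chip fabric.
```

The concatenate step in the tensor-parallel path and the token routing in the expert-parallel path are exactly the operations that become interconnect traffic in a real multi-chip deployment.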
The Tensor Contraction Processor (TCP) architecture in FuriosaAI’s second-gen chip, RNGD (“Renegade”), excels at leveraging all these forms of parallelism because it operates on tensors directly, eliminating the need to first break them down into matrices.
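To see what operating on tensors directly means in general terms, compare a direct einsum contraction with the matrix-centric detour through reshapes. This sketch illustrates the concept of tensor contraction, not RNGD's internal implementation:

```python
import numpy as np

# A batched attention-score contraction expressed directly on 4-D tensors.
rng = np.random.default_rng(1)
q = rng.standard_normal((2, 8, 16, 64))   # (batch, heads, seq, head_dim)
k = rng.standard_normal((2, 8, 16, 64))

# Direct tensor contraction over the head_dim axis: no reshaping required.
scores = np.einsum('bhqd,bhkd->bhqk', q, k)

# The matrix-centric equivalent: flatten to 2-D matmuls, then reshape back.
b, h, s, d = q.shape
scores_mm = (q.reshape(b * h, s, d) @ k.reshape(b * h, s, d).transpose(0, 2, 1)
             ).reshape(b, h, s, s)
assert np.allclose(scores, scores_mm)
```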
For all these approaches, high-bandwidth interconnect is crucial.
The power of open standards in AI
Open approaches have been key to accelerating progress in AI, from the public COCO dataset released in 2014 right through to today’s Llama models. Hardware initiatives like the Open Compute Project and open memory standards like HBM3 and HBM3e have driven innovation through standardization and competition.
Ultra Ethernet will be a similarly important step forward for the industry, offering several compelling advantages over current solutions. Its vendor-neutral approach ensures broad compatibility, and it delivers much greater bandwidth than PCIe.
The architecture also enables superior performance per watt in chip-to-chip communication, lowering the cost and complexity of inference at scale.
And it makes it possible to mix different vendors’ accelerators in a single deployment and to scale easily from individual servers to large multi-rack configurations.
Ultra Ethernet and Furiosa
This aligns perfectly with Furiosa’s approach to AI inference. RNGD’s TCP architecture scales easily across multiple chips, and our software stack has been built for distributed deployment from day one.
By joining the UEC, we’re reinforcing our commitment to solutions that combine performance, power efficiency and programmability in ways that easily scale to tomorrow’s models and use cases.