Why we’re joining the UEC: The future of LLM inference is multi-chip
The AI industry needs better ways to connect chips together. Model sizes are outpacing what any single chip can handle, and proprietary interconnect solutions limit innovation. That's why we're excited to join the Ultra Ethernet Consortium (UEC) and help build open standards for chip-to-chip communication.
As part of the UEC, we will help the organization develop open standards for high-bandwidth, low-latency chip-to-chip communication – a key part of building scalable and sustainable AI infrastructure.
Why multi-chip AI is here to stay
Although there has been exciting progress in building small language models, today's leading LLMs and multimodal models exceed the memory capacity of most single chips, even those with large on-chip SRAM.
Llama 70B, a current benchmark for LLMs, needs roughly 140 GB just to hold its weights in 16-bit precision, and MoE architectures like Mixtral 8x7B must keep every expert resident in memory even though only a fraction of them is active per token. With transistor density improvements plateauing and model sizes continuing to grow, multi-chip solutions are essential to scaling AI inference.
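As a rough illustration, here is the back-of-envelope math in Python. The parameter counts are the models' publicly known sizes; the 48 GiB capacity figure is a hypothetical stand-in for a single accelerator's memory, not any specific product's spec.

```python
# Back-of-envelope weight-memory math (illustrative numbers only).
GB = 1024**3

def weight_footprint_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights; KV cache and activations add more."""
    return num_params * bytes_per_param / GB

# Llama 70B in 16-bit precision: ~70e9 params * 2 bytes ≈ 130 GiB of weights.
print(f"Llama 70B @ 16-bit:    {weight_footprint_gb(70e9, 2):.0f} GiB")

# Mixtral 8x7B (~47e9 total params): all experts stay resident even though
# only 2 of 8 are active per token, so the full footprint is what matters.
print(f"Mixtral 8x7B @ 16-bit: {weight_footprint_gb(47e9, 2):.0f} GiB")

# Against a hypothetical accelerator with 48 GiB of on-package memory,
# neither model fits on one chip: the weights must be sharded across several.
SINGLE_CHIP_GIB = 48
print(f"Chips needed for Llama 70B: {weight_footprint_gb(70e9, 2) / SINGLE_CHIP_GIB:.1f}+")
```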
Parallelism is key
Multi-chip deployments use several different parallelism strategies. Data parallelism distributes inference requests across chips; tensor parallelism splits individual model layers across them; and expert parallelism in MoE models like Mixtral routes each token to a subset of expert subnetworks, which can live on different chips.
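A minimal single-process sketch of two of these strategies, using numpy arrays as stand-ins for per-chip shards (the names, shapes, and two-chip split are illustrative assumptions, not Furiosa APIs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))          # a batch of 4 token activations

# --- Tensor parallelism: split one layer's weight column-wise across 2 chips.
W = rng.standard_normal((512, 1024))
W_shards = np.split(W, 2, axis=1)          # chip 0 holds cols 0..511, chip 1 the rest
partials = [x @ shard for shard in W_shards]   # each chip computes its slice
y = np.concatenate(partials, axis=1)       # all-gather over the interconnect
assert np.allclose(y, x @ W)

# --- Expert parallelism: route each token to one of 2 experts (MoE style).
experts = [rng.standard_normal((512, 512)) for _ in range(2)]
router = rng.integers(0, 2, size=x.shape[0])   # which expert each token goes to
out = np.stack([x[i] @ experts[router[i]] for i in range(x.shape[0])])
# In a real deployment each expert lives on a different chip, so routing
# a token means sending its activations across the chip-to-chip fabric.
```

The concatenate step in the tensor-parallel path and the token routing in the expert-parallel path are exactly the operations that become interconnect traffic in a real multi-chip deployment.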
The Tensor Contraction Processor (TCP) architecture in FuriosaAI’s second-gen chip, RNGD (“Renegade”), excels at leveraging all these forms of parallelism because it operates on tensors directly, eliminating the need to first break them down into matrices.
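To see what operating on tensors directly means in general terms, compare a direct einsum contraction with the matrix-centric detour through reshapes. This sketch illustrates the concept of tensor contraction, not RNGD's internal implementation:

```python
import numpy as np

# A batched attention-score contraction expressed directly on 4-D tensors.
rng = np.random.default_rng(1)
q = rng.standard_normal((2, 8, 16, 64))   # (batch, heads, seq, head_dim)
k = rng.standard_normal((2, 8, 16, 64))

# Direct tensor contraction over the head_dim axis: no reshaping required.
scores = np.einsum('bhqd,bhkd->bhqk', q, k)

# The matrix-centric equivalent: flatten to 2-D matmuls, then reshape back.
b, h, s, d = q.shape
scores_mm = (q.reshape(b * h, s, d) @ k.reshape(b * h, s, d).transpose(0, 2, 1)
             ).reshape(b, h, s, s)
assert np.allclose(scores, scores_mm)
```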
For all these approaches, high-bandwidth interconnect is crucial.
The power of open standards in AI
Open approaches have been key to accelerating progress in AI, from the public COCO dataset released in 2014 right through to today’s Llama models. Hardware initiatives like the Open Compute Project and open memory standards like HBM3 and HBM3e have driven innovation through standardization and competition.
Ultra Ethernet will be a similarly important step forward for the industry, offering several compelling advantages over current solutions. Its vendor-neutral approach ensures broad compatibility, and it delivers much greater bandwidth than PCIe.
The architecture also enables superior performance per watt in chip-to-chip communication, lowering the cost and complexity of inference at scale.
And it makes it possible to mix different vendors’ accelerators in a single deployment and to scale easily from individual servers to large multi-rack configurations.
Ultra Ethernet and Furiosa
This aligns perfectly with Furiosa’s approach to AI inference. RNGD’s TCP architecture scales easily across multiple chips, and our software stack has been built for distributed deployment from day one.
By joining the UEC, we’re reinforcing our commitment to solutions that combine performance, power efficiency and programmability in ways that easily scale to tomorrow’s models and use cases.