Furiosa SDK 2025.3 boosts RNGD performance with multichip scaling and more
Technical Updates

We are continuously advancing our software stack to unlock the full potential of RNGD, our AI accelerator for high-performance, energy-efficient data center inference.
Our latest major release, Furiosa SDK 2025.3 (along with the minor 2025.3.1, 2025.3.2, and 2025.3.3 updates), unlocks significant performance and efficiency gains, particularly for large-scale models and agentic AI.
These updates focus on robust inter-chip tensor parallelism, additional compiler and runtime optimizations, and improved developer functionality. For Llama 3.3 70B, these SDKs deliver up to 3x better average throughput and up to 35% average reduction in Time to First Token (TTFT) compared to SDK 2025.2.
Unlocking scalability and multichip performance
SDK 2025.3 adds support for inter-chip tensor parallelism across multiple RNGD cards, enabling efficient scaling with large models and significantly improved throughput within the same server power constraints. Key features include:
- PCIe Gen 5 for inter-chip Peer-to-Peer (P2P) communication, allowing multiple chips to transfer data efficiently at up to 64 GB/s in each direction.
- Optimized PCIe paths for P2P communication and advanced communication scheduling to manage the flow of data between chips.
For an eight-card RNGD server running Llama 3.3 70B, the total power consumption is less than half the power consumed by competing NVIDIA solutions.
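To give a feel for the pattern (a conceptual NumPy sketch only; RNGD's compiler and runtime handle the actual partitioning and P2P scheduling), inter-chip tensor parallelism amounts to splitting a layer's weight matrix across devices, computing the partial results locally on each device, and gathering them:

import numpy as np

def column_parallel_linear(x, w, num_devices):
    """Conceptual tensor parallelism: shard the weight matrix column-wise,
    compute each shard's partial output (as each chip would locally),
    then gather the shards (as P2P communication would between chips)."""
    shards = np.array_split(w, num_devices, axis=1)    # one shard per device
    partial_outputs = [x @ shard for shard in shards]  # independent local matmuls
    return np.concatenate(partial_outputs, axis=-1)    # gather of the partial results

x = np.random.randn(4, 512)     # a batch of activations
w = np.random.randn(512, 2048)  # a linear layer's weights
assert np.allclose(column_parallel_linear(x, w, num_devices=4), x @ w)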
Compiler and runtime optimizations
The recent SDKs use advanced compiler and runtime techniques to extract significant performance gains from RNGD’s unique Tensor Contraction Processor (TCP) chip architecture.
The Furiosa Compiler's global optimization capabilities now maximize SRAM reuse between transformer blocks, while the runtime reduces interference between RNGD and the host. We also developed compiler tactics that explicitly overlap inter-chip DMA with computation, further reducing latency.
These enhancements translate directly into reduced memory access latency, higher overall throughput, improved synchronization across devices, and minimized overhead between consecutive decoding steps. When running Llama 3.1 8B, the average throughput improved by 4.5% and the TTFT declined by 55%.
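As a purely conceptual sketch of the overlap idea (the Furiosa Compiler applies it at the DMA and instruction-scheduling level, not in Python, and the block/fetch functions here are placeholders), the snippet below prefetches the next block of data on a background thread while the current block is being processed:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fetch(block_id):
    """Stand-in for a DMA transfer that brings a block's data on-chip."""
    return np.random.randn(256, 256)

def compute(data):
    """Stand-in for the computation performed on an already-resident block."""
    return data @ data.T

results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    next_block = pool.submit(fetch, 0)             # start the first transfer
    for block_id in range(1, 9):
        data = next_block.result()                 # wait only if the transfer isn't done
        next_block = pool.submit(fetch, block_id)  # overlap the next transfer...
        results.append(compute(data))              # ...with this block's computation
    results.append(compute(next_block.result()))   # drain the final block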
Expanded model support
We’ve also added support for the popular Qwen 2 and Qwen 2.5 models, as well as W8A16 quantization. Precompiled artifacts on the Hugging Face Hub now support context lengths up to 32K tokens, enabling more complex and context-aware applications.
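As a rough numerical sketch of what W8A16 means, with 8-bit integer weights and 16-bit activations (the actual quantization flow is handled by Furiosa's tooling; this toy example just quantizes one weight tensor with a single symmetric scale):

import numpy as np

w = np.random.randn(512, 512).astype(np.float32)

# Quantize weights to 8-bit integers with a symmetric per-tensor scale (the "W8" part).
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Activations stay in 16-bit floating point (the "A16" part).
x = np.random.randn(4, 512).astype(np.float16)

# Dequantize the weights and run the matmul at 16-bit precision.
y = x @ (w_int8.astype(np.float16) * np.float16(scale))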
Improved observability features
SDK 2025.3 also adds improved monitoring and debugging capabilities. Production metrics are exposed through the /metrics endpoint, and logs now report average throughput, KV cache usage, and the number of running and waiting requests, giving developers deeper insight into application performance.
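For example, the /metrics endpoint can be polled with any HTTP client; a minimal sketch (the port is an assumption based on the serving example later in this post, and the exact metric names will depend on your deployment):

import requests

# Fetch the raw metrics exposed by the serving endpoint.
resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()

# Print the non-comment lines, e.g. throughput and request-queue gauges.
for line in resp.text.splitlines():
    if line and not line.startswith("#"):
        print(line)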
Structured output support for agentic AI and MCP
Our SDK now supports the Structured Outputs functionality that OpenAI has added to its API. This feature constrains the model's output to conform to a specific JSON schema, providing a simple and reliable way to get correctly formatted output, a crucial need for projects that use the Model Context Protocol (MCP) or rely on agents to call functions and interact with APIs. In the example below, guided choice constrains the model to one of three sentiment labels:
from openai import OpenAI
base_url = "http://localhost:8000/v1" # Replace this with your base URL
api_key = "EMPTY"
client = OpenAI(api_key=api_key, base_url=base_url)
# Sample review to classify
review = "This movie was absolutely fantastic!"
response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": f"Classify sentiment: '{review}'"}],
    extra_body={"guided_choice": ["positive", "negative", "neutral"]},
    temperature=0.0,
)
print(response.choices[0].message.content)
Users can generate structured outputs using both OpenAI’s Completions API and Chat Completions API.
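Beyond guided choice, the output can also be constrained to a full JSON schema through the Chat Completions API's response_format parameter. A minimal sketch, reusing the client and review from the example above (the schema and field names are illustrative):

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": f"Classify sentiment: '{review}'"}],
    # Constrain the output to a JSON object matching this schema.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "sentiment",
            "schema": {
                "type": "object",
                "properties": {"label": {"type": "string"}},
                "required": ["label"],
            },
        },
    },
    temperature=0.0,
)
print(response.choices[0].message.content)  # e.g. {"label": "positive"}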
Rapid SDK iteration
Our goal is to ship frequent updates to our software stack. In May, we released SDK 2025.2.0, adding major new functionality, including Hugging Face Hub integration, reasoning model support, and support for chunked prefill. This summer, LG AI Research adopted RNGD for inference with its EXAONE models, using recent SDK enhancements to achieve 2.25x better performance per watt vs. its previous GPU solution.
The SDK 2025.3 release solidifies RNGD's position as a leading AI inference platform.
By delivering power efficiency, robust performance, and critical scalability features, RNGD is uniquely positioned to meet the practical demands of real-world datacenter environments. We look forward to announcing additional features and improvements soon.
RNGD is sampling now with enterprise customers globally. Contact us via this form to learn more.