FuriosaAI has released SDK 2026.2 for RNGD, our high-performance AI inference accelerator.
With RNGD now in mass production, this release introduces major software advancements and new features for enterprise customers deploying and scaling agentic AI systems, coding agents, and high-throughput LLM applications.
SDK 2026.2 combines automated performance tuning with significant improvements in throughput and routing efficiency. This marks our second RNGD SDK release this year, underscoring our commitment to rapid iteration and continuous optimization for developers running inference at scale in their data centers.
RNGD is available as a standalone PCIe card operating at a 180W TDP and as a turnkey server configuration that delivers up to 3.5x greater compute density compared to H100-based systems in standard data center environments.
Boosting throughput across the serving stack
Maximizing hardware utilization is the most direct path to reducing AI inference operating costs. With SDK 2026.2, we’ve introduced deep compiler optimizations alongside a hybrid KV cache management system, delivering a 74.9% average throughput improvement for the Qwen3 and EXAONE 4.0 model families compared to our previous release. By optimizing how memory is allocated between prefill and decode phases, we have effectively doubled service capacity on existing hardware without increasing the power footprint.
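To make the idea of hybrid KV cache management concrete, here is a minimal sketch of one way a fixed KV-cache block budget could be split between prefill and decode, with each phase able to borrow the other's idle blocks. The class name, the 30/70 split, and the borrowing policy are hypothetical illustrations, not the SDK's actual implementation.

```python
# Hypothetical sketch: a fixed KV-cache block budget partitioned between
# prefill (prompt processing) and decode (token generation), where either
# phase may borrow the other's idle blocks instead of stalling.

class HybridKVPool:
    def __init__(self, total_blocks: int = 4096, prefill_share: float = 0.3):
        # Static split of the block budget between the two phases.
        self.free = {"prefill": int(total_blocks * prefill_share)}
        self.free["decode"] = total_blocks - self.free["prefill"]

    def alloc(self, phase: str, n: int) -> bool:
        other = "decode" if phase == "prefill" else "prefill"
        if self.free[phase] >= n:
            # Served entirely from the phase's own pool.
            self.free[phase] -= n
            return True
        if self.free[phase] + self.free[other] >= n:
            # Hybrid case: drain this pool and borrow the rest from
            # the other phase's idle blocks.
            borrow = n - self.free[phase]
            self.free[phase] = 0
            self.free[other] -= borrow
            return True
        return False  # not enough blocks anywhere; the request must wait

pool = HybridKVPool()
pool.alloc("prefill", 2000)  # exceeds the prefill pool, borrows from decode
```

The point of the borrowing step is that neither phase is capped at its static share, so a burst of long prompts or a large decode batch can use otherwise idle memory.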
Zero-touch performance with bucket presets
Optimizing LLM inference typically requires manual, expert-level tuning of "buckets" (prefill and decode lengths) to minimize wasted compute cycles. Our ArtifactBuilder now includes Per-Model Bucket Presets, eliminating this complexity.
These expert-tuned configurations are now applied automatically during compilation, so developers get near-optimal RNGD performance out of the box with standard configurations and no manual tuning.
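To illustrate what bucketing does, the sketch below pads each prompt up to the smallest compiled length bucket that fits it. The bucket values and the helper name are hypothetical examples of the concept, not ArtifactBuilder's API or its actual presets.

```python
# Hypothetical prefill length buckets a model might be compiled for.
PREFILL_BUCKETS = [128, 512, 2048, 8192]

def pick_bucket(prompt_len: int, buckets=PREFILL_BUCKETS) -> int:
    """Return the smallest compiled bucket that fits the prompt.

    Tokens short of the bucket boundary are padding, i.e. wasted compute.
    Well-chosen presets keep that padding small for the prompt lengths a
    given model actually sees, which is what per-model tuning targets.
    """
    for b in buckets:
        if prompt_len <= b:
            return b
    return buckets[-1]  # longer prompts fall back to the largest bucket

print(pick_bucket(300))   # pads 300 tokens up to the 512 bucket
```

A poor preset (say, only a 8192 bucket) would pad a 300-token prompt with thousands of wasted positions; this is the compute the presets are tuned to avoid.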
Prefix-aware data parallelism
In RAG and agentic workflows, redundant computation across shared system prompts can significantly degrade efficiency. SDK 2026.2 introduces a Prefix-Aware Data Parallel (DP) Router that inspects each request's tokenized prefix and routes it to the RNGD replica already holding a matching prefix-cache entry.
Combined with Prefix Cache Hit Deferral — which briefly delays incoming requests likely to match an in-flight prefix in order to maximize cross-request hit rates — the system lowers Time-to-First-Token (TTFT) and avoids re-processing large shared contexts across replicas.
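The routing idea can be sketched as follows: key each request by a hash of its leading tokens and send it to the replica that already caches that prefix, falling back to round-robin on a miss. The class, the 64-token prefix window, and the replica names are hypothetical; real routers would also handle eviction and the in-flight deferral window described above.

```python
# Illustrative sketch of prefix-aware data-parallel routing.

class PrefixAwareRouter:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.prefix_owner = {}  # prefix key -> replica with a warm cache
        self.rr = 0             # round-robin cursor for cache misses

    def _prefix_key(self, tokens, prefix_len: int = 64):
        # Requests sharing their first 64 tokens (e.g. a common system
        # prompt) map to the same key.
        return hash(tuple(tokens[:prefix_len]))

    def route(self, tokens):
        key = self._prefix_key(tokens)
        owner = self.prefix_owner.get(key)
        if owner is not None:
            return owner  # warm replica: the shared context is not recomputed
        # Miss: pick the next replica and remember it as this prefix's owner.
        replica = self.replicas[self.rr % len(self.replicas)]
        self.rr += 1
        self.prefix_owner[key] = replica
        return replica

router = PrefixAwareRouter(["rngd-0", "rngd-1"])
```

Hit deferral would extend the miss path: if the same key is already being prefilled on some replica, the router briefly holds the new request so it lands after the cache entry exists, trading a few milliseconds of queueing for a guaranteed prefix hit.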
Data center-ready reliability
These improvements further establish RNGD as a leading high-performance inference solution purpose-built for the standard air-cooled data centers (10–15 kW racks) where most enterprise workloads run today. We will continue to ship rapid, iterative improvements across both hardware and software to meet the evolving demands of large-scale AI systems in the months and years ahead.
Read the complete SDK documentation here.
Written by
The Furiosa Team