FuriosaAI has released SDK 2026.2 for RNGD, our high-performance AI inference accelerator.
With RNGD now in mass production, this release introduces major software advancements and new features for enterprise customers deploying and scaling agentic AI systems, coding agents, and high-throughput LLM applications.
SDK 2026.2 combines automated performance tuning with significant improvements in throughput and routing efficiency. This marks our second RNGD SDK release this year, underscoring our commitment to rapid iteration and continuous optimization for developers running inference at scale in their data centers.
RNGD is available as a standalone PCIe card operating at a 180W TDP and as a turnkey server configuration that delivers up to 3.5x greater compute density compared to H100-based systems in standard data center environments.
Boosting throughput across the serving stack
Maximizing hardware utilization is the most direct path to reducing AI inference operating costs. With SDK 2026.2, we’ve introduced deep compiler optimizations alongside a hybrid KV cache management system, delivering a 74.9% average throughput improvement for the Qwen3 and EXAONE 4.0 model families compared to our previous release. By optimizing how memory is allocated between prefill and decode phases, we have effectively doubled service capacity on existing hardware without increasing the power footprint.
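To make the idea of hybrid KV cache management concrete, here is a minimal sketch of one way a fixed KV-cache block budget could be split between prefill and decode, with each phase able to borrow the other's idle blocks. The class name, the 30/70 split, and the borrowing policy are hypothetical illustrations, not the SDK's actual implementation.

```python
# Hypothetical sketch: a fixed KV-cache block budget partitioned between
# prefill (prompt processing) and decode (token generation), where either
# phase may borrow the other's idle blocks instead of stalling.

class HybridKVPool:
    def __init__(self, total_blocks: int = 4096, prefill_share: float = 0.3):
        # Static split of the block budget between the two phases.
        self.free = {"prefill": int(total_blocks * prefill_share)}
        self.free["decode"] = total_blocks - self.free["prefill"]

    def alloc(self, phase: str, n: int) -> bool:
        other = "decode" if phase == "prefill" else "prefill"
        if self.free[phase] >= n:
            # Served entirely from the phase's own pool.
            self.free[phase] -= n
            return True
        if self.free[phase] + self.free[other] >= n:
            # Hybrid case: drain this pool and borrow the rest from
            # the other phase's idle blocks.
            borrow = n - self.free[phase]
            self.free[phase] = 0
            self.free[other] -= borrow
            return True
        return False  # not enough blocks anywhere; the request must wait

pool = HybridKVPool()
pool.alloc("prefill", 2000)  # exceeds the prefill pool, borrows from decode
```

The point of the borrowing step is that neither phase is capped at its static share, so a burst of long prompts or a large decode batch can use otherwise idle memory.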
Zero-touch performance with bucket presets
Optimizing LLM inference typically requires manual, expert-level tuning of "buckets" (prefill and decode lengths) to minimize wasted compute cycles. Our ArtifactBuilder now includes Per-Model Bucket Presets, eliminating this complexity.
These expert-tuned configurations are now applied automatically during compilation, so developers get near-optimal RNGD performance out of the box with standard configurations and no manual tuning.
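To illustrate what bucketing does, the sketch below pads each prompt up to the smallest compiled length bucket that fits it. The bucket values and the helper name are hypothetical examples of the concept, not ArtifactBuilder's API or its actual presets.

```python
# Hypothetical prefill length buckets a model might be compiled for.
PREFILL_BUCKETS = [128, 512, 2048, 8192]

def pick_bucket(prompt_len: int, buckets=PREFILL_BUCKETS) -> int:
    """Return the smallest compiled bucket that fits the prompt.

    Tokens short of the bucket boundary are padding, i.e. wasted compute.
    Well-chosen presets keep that padding small for the prompt lengths a
    given model actually sees, which is what per-model tuning targets.
    """
    for b in buckets:
        if prompt_len <= b:
            return b
    return buckets[-1]  # longer prompts fall back to the largest bucket

print(pick_bucket(300))   # pads 300 tokens up to the 512 bucket
```

A poor preset (say, only a 8192 bucket) would pad a 300-token prompt with thousands of wasted positions; this is the compute the presets are tuned to avoid.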
Prefix-aware data parallelism
In RAG and agentic workflows, redundant computation across shared system prompts can significantly degrade efficiency. SDK 2026.2 introduces a Prefix-Aware Data Parallel (DP) Router that inspects each request's tokenized prefix and routes it to the RNGD replica already holding a matching prefix-cache entry.
Combined with Prefix Cache Hit Deferral — which briefly delays incoming requests likely to match an in-flight prefix in order to maximize cross-request hit rates — the system lowers Time-to-First-Token (TTFT) and avoids re-processing large shared contexts across replicas.
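The routing idea can be sketched as follows: key each request by a hash of its leading tokens and send it to the replica that already caches that prefix, falling back to round-robin on a miss. The class, the 64-token prefix window, and the replica names are hypothetical; real routers would also handle eviction and the in-flight deferral window described above.

```python
# Illustrative sketch of prefix-aware data-parallel routing.

class PrefixAwareRouter:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.prefix_owner = {}  # prefix key -> replica with a warm cache
        self.rr = 0             # round-robin cursor for cache misses

    def _prefix_key(self, tokens, prefix_len: int = 64):
        # Requests sharing their first 64 tokens (e.g. a common system
        # prompt) map to the same key.
        return hash(tuple(tokens[:prefix_len]))

    def route(self, tokens):
        key = self._prefix_key(tokens)
        owner = self.prefix_owner.get(key)
        if owner is not None:
            return owner  # warm replica: the shared context is not recomputed
        # Miss: pick the next replica and remember it as this prefix's owner.
        replica = self.replicas[self.rr % len(self.replicas)]
        self.rr += 1
        self.prefix_owner[key] = replica
        return replica

router = PrefixAwareRouter(["rngd-0", "rngd-1"])
```

Hit deferral would extend the miss path: if the same key is already being prefilled on some replica, the router briefly holds the new request so it lands after the cache entry exists, trading a few milliseconds of queueing for a guaranteed prefix hit.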
Data center-ready reliability
These improvements further establish RNGD as a leading high-performance inference solution purpose-built for the standard air-cooled data centers (10–15 kW racks) where most enterprise workloads run today. We will continue to ship rapid, iterative improvements across both hardware and software to meet the evolving demands of large-scale AI systems in the months and years ahead.
Read the complete SDK documentation here.
Written by
The Furiosa Team