
Furiosa SDK 2026.1: Hybrid batching, prefix caching, and native k8s support

Technical Updates


FuriosaAI has released SDK 2026.1 for RNGD, our high-performance AI inference accelerator. With RNGD now in mass production, this major software update ensures the platform is ready for enterprise customers to move from experimentation to production-grade operation.

This release delivers the full-stack infrastructure required for RAG and agentic AI workflows, combining new cloud-native orchestration with massive improvements in throughput and observability.

RNGD is available as a standalone PCIe card operating at a strict 180W TDP and as a turnkey server that delivers 3.5x greater compute density (throughput per rack) than H100-based systems in standard data center environments.

New tools to enable developer velocity

Deploying specialized hardware often comes with an "integration tax" of custom scripts and opaque monitoring. SDK 2026.1 eliminates this by moving core telemetry to a Rust-native implementation with full OpenTelemetry support.

SDK 2026.1 introduces several critical serving-layer optimizations:

  • Hybrid Batching: Intelligently combines prefill and decode requests within a single batch, boosting requests per second (RPS) by up to 2x while controlling tail latency (see the sketch after this list).

  • Prefix Caching: Automatically reuses common prompt prefixes via a branch-compressed radix tree, significantly reducing Time-To-First-Token (TTFT) for RAG and multi-turn agents.

  • Pooling Model Support: Comprehensive support for embeddings, scoring, and reranking (e.g., Qwen3-8B), providing the complete toolkit for high-accuracy retrieval systems.

  • Advanced Quantization: Support for fine-grained dynamic FP8 quantization, including DeepSeek-style 2D-block weight quantization, ensuring accuracy at scale.

  • ARM64 Support: The RNGD Linux driver (requires Linux kernel 6.3 or later), firmware updater, furiosa-smi, and Furiosa-LLM packages are now available for the ARM64 architecture.

  • Expanded Model Support: Native support for EXAONE 4.0 (including 128k context lengths) and the Qwen3 family.
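
As a rough illustration of the first two items, the sketch below fires several requests that share one long system prompt at an OpenAI-compatible furiosa-llm endpoint: the shared prefix is what prefix caching reuses, and sending the requests concurrently gives the hybrid-batching scheduler prefill and decode work to interleave. The endpoint path, model-name placeholder, and prompts are assumptions for illustration only.

import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Start an OpenAI-compatible server first, e.g.:
#   furiosa-llm serve path/to/model
base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
MODEL = "<served-model-name>"  # placeholder: use the name your server reports

# A long system prompt shared by every request below; prefix caching lets the
# server reuse the prefill work for this prefix instead of recomputing it.
SYSTEM_PROMPT = (
    "You are a support assistant for ACME Corp. Answer using the product "
    "manual excerpts provided in the conversation, and keep answers short."
)

questions = [
    "How do I reset my device to factory settings?",
    "What does error code E42 mean?",
    "Is the device covered by a two-year warranty?",
]

def ask(question: str) -> str:
    # Each request shares the same system prompt; hybrid batching lets the
    # scheduler mix these prefills with in-flight decode work.
    resp = requests.post(
        f"{base_url}/chat/completions",
        json={
            "model": MODEL,
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question},
            ],
            "max_tokens": 128,
        },
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Fire the requests concurrently so the server can batch them together.
with ThreadPoolExecutor(max_workers=len(questions)) as pool:
    for question, answer in zip(questions, pool.map(ask, questions)):
        print(f"Q: {question}\nA: {answer}\n")

Example of concurrent requests sharing a common system prompt prefix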

This release also brings enhanced tools for cloud-native distributed inference:

  • llm-d Framework: Our Kubernetes-native distributed inference framework handles disaggregated serving (splitting prefill and decode workloads) and intelligent request routing.

  • NPU Operator: Automates device discovery and firmware lifecycle management within Kubernetes clusters.

  • Dynamic Resource Allocation (DRA): PCIe topology-aware strategy that automatically places model weights and KV caches for maximum bandwidth.

  • Rust-Native Telemetry: We have migrated core telemetry and metrics collection to a Rust-native implementation to provide high-performance, OpenTelemetry-compatible observability with minimal overhead.

from furiosa_llm import LLM, PoolingParams

# Load an embedding model
llm = LLM("furiosa-ai/Qwen3-Embedding-8B")

# ============================================================
# Example 1: Single prompt embedding
# ============================================================
prompt = "What is the capital of France?"
output = llm.embed(prompt)
embedding = output[0].outputs.embedding
print(f"Prompt: {prompt!r}")
print(f"Embedding dimension: {len(embedding)}")
print(f"Embedding (first 10 values): {embedding[:10]}")
print("-" * 80)

# ============================================================
# Example 2: Batch embedding (multiple prompts)
# ============================================================
prompts = [
    "What is the capital of France?",
    "What is the capital of Germany?",
    "What is the capital of Italy?",
]
outputs = llm.embed(prompts)
for prompt, output in zip(prompts, outputs):
    embedding = output.outputs.embedding
    print(f"Prompt: {prompt!r}")
    print(f"Embedding dimension: {len(embedding)}")
    print(f"Embedding (first 5 values): {embedding[:5]}")
print("-" * 80)

# ============================================================
# Example 3: Using PoolingParams for truncation
# ============================================================
# Truncate long prompts to fit within token limits
pooling_params = PoolingParams(truncate_prompt_tokens=128)

long_prompts = [
    "This is a very long text that might exceed the model's context window. " * 50,
    "Another lengthy document that needs to be truncated for processing. " * 50,
]

outputs = llm.embed(long_prompts, pooling_params=pooling_params)
for i, output in enumerate(outputs):
    embedding = output.outputs.embedding
    print(f"Long prompt {i}: embedding dimension = {len(embedding)}")

Example of using LLM.embed() for embedding generation

An inference solution that’s ready for real-world deployments

This release demonstrates Furiosa’s ability to ship new features quickly and to ensure that developers using RNGD can achieve excellent real-world performance.

Features like the Ahead-of-Time (AOT) Wired Pipeline and AVX-512-accelerated normalization show that Furiosa is optimizing the entire execution path. Whether you are deploying the ixi-Edge One appliance for on-prem security or scaling via NPU-based GPUaaS, SDK 2026.1 is the production backbone of the RNGD ecosystem.


These features augment core Furiosa SDK functionality, such as native torch.compile support and vLLM-compatible APIs. For a full list of technical specifications, refer to our latest documentation at developer.furiosa.ai.

import os

import requests

# Start server with: furiosa-llm serve path/to/reranker-model

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")

# 1-to-N scoring via HTTP API
response = requests.post(
    f"{base_url}/score",
    json={
        "model": "reranker",
        "text_1": "What is machine learning?",
        "text_2": [
            "Machine learning is a subset of AI.",
            "Python is a programming language.",
            "Deep learning uses neural networks.",
        ],
    },
)

data = response.json()
for item in data["data"]:
    print(f"Index {item['index']}: score = {item['score']:.4f}")

Example of how to use the Score API for similarity scoring

Major features & improvements

LLM serving improvements

  • Hybrid batching
    • Description: Implements a scheduler that intelligently combines multiple prefill and decode requests within a single batch rather than processing them sequentially.

    • Impact: Boosts overall throughput while maintaining low tail latency, achieving up to 2x higher requests per second compared to previous releases.

  • Prefix caching
    • Description: Automatically detects and reuses common prompt prefixes across multiple requests using a branch-compressed radix tree to eliminate redundant computation.

    • Impact: Ideal for applications with shared context—such as chatbots with system prompts and RAG systems—this significantly reduces Time-To-First-Token (TTFT) through SIMD-optimized prefix matching and cache eviction.

  • Pooling model support
    • Description: Adds comprehensive support for pooling models, including PoolingParams.normalize for normalized embedding outputs. Currently supports the Qwen3-8B embedding and reranking models.

    • Impact: Enables critical NLP tasks such as generating vector representations for semantic search (encode, embed), evaluating query-document relevance (score), and improving search results through candidate reordering (rerank).

  • Structured output with multiple backends
    • Description: Production-ready structured output generation supporting JSON schema validation, regular expression constraints, and grammar-based generation, using both the outlines and xgrammar backends.

    • Impact: Optimized guided decoding with bitmask prefetching reduces latency during NPU task execution, providing flexibility and performance for data-extraction use cases (see the sketch after this list).

  • Fine-grained dynamic FP8 quantization
    • Description: Support for high-precision quantization including DeepSeek-style 2D-block weight quantization and per-token group activation quantization.

    • Impact: Allows enterprises to deploy massive models with a reduced memory footprint while maintaining frontier-level accuracy, lowering the barrier for local inference.
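
To make the structured-output item above concrete, the sketch below asks an OpenAI-compatible furiosa-llm endpoint for JSON constrained by a schema. The response_format convention, model-name placeholder, and schema are illustrative assumptions modeled on common vLLM-style servers; check the documentation for the exact request shape.

import json
import os

import requests

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")

# JSON schema describing the object we want the model to emit.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
}

response = requests.post(
    f"{base_url}/chat/completions",
    json={
        "model": "<served-model-name>",  # placeholder
        "messages": [
            {
                "role": "user",
                "content": "Extract the invoice fields from: "
                           "'ACME Corp charged EUR 1,240.50 for Q3 licenses.'",
            }
        ],
        # Assumes the OpenAI-style structured-output convention commonly
        # accepted by vLLM-compatible servers.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "invoice", "schema": invoice_schema},
        },
    },
)
response.raise_for_status()

content = response.json()["choices"][0]["message"]["content"]
print(json.dumps(json.loads(content), indent=2))

Example of requesting schema-constrained JSON output (request shape assumed)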

Distributed inference and cloud-native support

  • llm-d framework
    • Description: Enables seamless deployment of large language models across multiple nodes, intelligently handling request routing based on KV-cache usage, prefixes, and model awareness.

    • Impact: Supports disaggregated serving (splitting prefill and decode workloads) across clusters, allowing for massive scaling of long-context models across the data center.

  • NPU operator and DRA
    • Description: A native Kubernetes NPU operator that automates device discovery and firmware upgrades, paired with PCIe topology-aware Dynamic Resource Allocation (DRA).

    • Impact: Ensures model weights and KV caches are placed for maximum bandwidth without manual intervention, streamlining the management of large-scale NPU clusters.

  • Rust-native telemetry
    • Description: Migrated the core metrics collection and telemetry stack to a Rust-native implementation with full OpenTelemetry and Prometheus integration.

    • Impact: Provides high-performance observability—including per-device metrics and KV cache utilization—with significantly reduced CPU overhead compared to previous Python-based implementations (see the metrics-scraping sketch after this list).

  • ARM64 compatibility
    • Description: Native support for the ARM64 architecture within furiosa-llm, allowing the SDK to run on energy-efficient CPUs like Ampere.

    • Impact: Enables RNGD deployment in thermally constrained or power-capped data centers and edge appliances where ARM is the preferred architecture for its performance-per-watt.
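
As a small sketch of how the Prometheus-compatible metrics can be consumed, the snippet below scrapes a metrics endpoint and prints cache- and utilization-related series. The /metrics path and the filter substrings are assumptions; refer to the documentation for the exact endpoint and metric names exposed by your deployment.

import requests

# Assumption: the serving process exposes Prometheus-format metrics at /metrics;
# adjust the host, port, and path to match your deployment.
metrics_url = "http://localhost:8000/metrics"

body = requests.get(metrics_url).text

# Print non-comment series that look related to KV cache or device utilization.
for line in body.splitlines():
    if line.startswith("#"):
        continue
    if "cache" in line or "util" in line:
        print(line)

Example of scraping Prometheus-format metrics from the serving stack (endpoint assumed)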

Expanded model support

  • Description: Comprehensive support for the Qwen3 Family (including 32B variants and 8B embedding models) and EXAONE 4.0.

  • Impact: Unlocks advanced capabilities such as 128k context lengths, sparse attention, and sliding window attention for national-scale AI projects and enterprise RAG.
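
The sketch below shows what loading one of the newly supported models for a long-context prompt could look like, assuming the vLLM-style LLM/SamplingParams interface that furiosa-llm mirrors; the model identifier and parameters are illustrative, so substitute the artifact name used in your deployment.

from furiosa_llm import LLM, SamplingParams  # SamplingParams assumed per the vLLM-compatible API

# Model identifier shown for illustration; use the artifact provided for your deployment.
llm = LLM("LGAI-EXAONE/EXAONE-4.0-32B")

long_report = "..."  # stand-in for a document spanning tens of thousands of tokens

prompt = f"Summarize the following report in five bullet points:\n\n{long_report}"
outputs = llm.generate(prompt, SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)

Example of loading a long-context model for summarization (identifiers assumed)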

For a full list of technical specifications, refer to our latest documentation at developer.furiosa.ai.
