We're excited to announce Furiosa SDK 2026.3.
If we had to describe this release in two words, they would be model enablement.
AI models are evolving faster than ever. As new architectures emerge at an accelerating pace, software agility has become a key differentiator alongside hardware performance.
SDK 2026.3 represents a major step in how quickly new open architectures can be compiled and deployed to RNGD, FuriosaAI's inference accelerator for large language models and agentic AI.
Underpinning everything is a new kernel framework called TCL (Tensor Contraction Language) and the furiosa-kernels package built on top of it. On that foundation, SDK 2026.3 delivers:
- Native, first-class multimodal (vision-language) serving with Qwen3-VL
- Accelerated bring-up for large Mixture-of-Experts (MoE) models, including gpt-oss, Solar-Open, K-EXAONE, and Qwen3 MoE
- Zero-recompilation model shipping via a portable FXB (Furiosa Executable Bundle) artifact format
- An opt-in overlap scheduler that maximizes NPU utilization, plus Data Parallel routing generalized into a scoring-based policy
If you're upgrading from SDK 2026.2, be sure to read the Breaking Changes and Upgrade Guide sections before you start.
TCL: A new kernel framework for bringing up models
To see why adding a model got so much faster in SDK 2026.3, it helps to look at how graph compilation previously operated.
Our previous path started from a model in PyTorch, traced with TorchDynamo into an FX graph of Torch ATen ops. While this was great for getting a model running quickly, it diluted the model's intent. Capturing through ATen decomposes the graph into fine-grained primitives. For example, an aten.mm becomes tiling, then elementwise multiply, then reduce, so the compiler no longer sees a "matmul" and has to re-recognize one before it can optimize it. Axis semantics and op boundaries blurred in the same way.
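To make that loss of intent concrete, here is a minimal pure-Python sketch. It is not PyTorch or Furiosa code, and every function name in it is illustrative. It writes the same matrix multiply twice: once as a single named op, and once the way a decomposed graph presents it, as a broadcast elementwise multiply followed by a reduction.

```python
# Illustrative only: plain-Python stand-ins, not actual ATen or Furiosa ops.

def matmul(a, b):
    """High-level op: the intent 'matrix multiply' is explicit in the name."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def elementwise_multiply(a, b):
    """Decomposed step 1: broadcast both operands to [i][j][k] and multiply."""
    return [[[a[i][k] * b[k][j] for k in range(len(b))]
             for j in range(len(b[0]))]
            for i in range(len(a))]

def reduce_sum_last_axis(t):
    """Decomposed step 2: sum over the innermost (k) axis."""
    return [[sum(cell) for cell in row] for row in t]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

direct = matmul(a, b)
decomposed = reduce_sum_last_axis(elementwise_multiply(a, b))
print(direct == decomposed)
```

Both paths compute the same values, but only the first one tells a compiler that a matmul is happening. Given just the second form, the compiler has to pattern-match the multiply-then-reduce pair back into a contraction before it can apply matmul-specific optimizations.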
Just as importantly, the modeling stage left few good places to attach the hints a compiler needs for an RNGD target, such as the intended tensor parallelism (TP) strategy. The software was essentially optimizing after much of the meaning had already been flattened away.
TCL (Tensor Contraction Language) was designed to preserve that model meaning and structure. It's a declarative Python eDSL in which a kernel author writes a high-level, @tcl.kernel-decorated function that says what to compute and leaves how to run it — tiling, scheduling, fusion, hardware mapping — to the compiler.
Its key idea is to treat the tensor contraction operations that DNN models intrinsically rely on as first-class primitives, so a model is described directly in terms of those contractions, with its structure intact. This matches RNGD's underlying TCP (Tensor Contraction Processor) architecture, whose compiler is built around the same primitive — and a TCL kernel is then compiled down to the executable binaries (EDFs) that run on it.
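As a rough illustration of what "contraction as a first-class primitive" means, the sketch below implements a tiny einsum-style contraction in plain Python. This is not TCL syntax, and the `contract` helper and its dict-based tensor representation are assumptions made purely for the example. The point is that a single spec string names every axis and says which ones are summed, so the structure stays visible.

```python
from itertools import product

def contract(spec, a, b, sizes):
    """Contract two tensors using an einsum-style spec such as 'ik,kj->ij'.

    Tensors are dicts mapping index tuples to numbers, and `sizes` maps each
    axis name to its extent. Axes missing from the output are summed over.
    """
    inputs, out_axes = spec.split("->")
    a_axes, b_axes = inputs.split(",")
    all_axes = sorted(set(a_axes) | set(b_axes))
    result = {}
    for values in product(*(range(sizes[ax]) for ax in all_axes)):
        idx = dict(zip(all_axes, values))
        a_key = tuple(idx[ax] for ax in a_axes)
        b_key = tuple(idx[ax] for ax in b_axes)
        out_key = tuple(idx[ax] for ax in out_axes)
        result[out_key] = result.get(out_key, 0) + a[a_key] * b[b_key]
    return result

a = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
b = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
sizes = {"i": 2, "j": 2, "k": 2}

matmul_result = contract("ik,kj->ij", a, b, sizes)   # sums over k
trace_like = contract("ik,ki->", a, b, sizes)        # sums over i and k
print(matmul_result, trace_like)
```

The same helper expresses a matmul, a batched matmul, or a full reduction just by changing the spec, which is the property a contraction-centric compiler can exploit.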
That design pays off in three ways:
- Intuitive and compiler-friendly. Authors express the computation's meaning rather than thread layouts or synchronization, and because TCL is built to be compiled, the toolchain can fuse multiple kernels and compile them as a single unit.
- Modular and reusable. Kernels like RMSNorm, Linear, or MLP become building blocks that many models share, so enabling a new architecture is mostly composing existing blocks and adding only what's genuinely new.
- Native to RNGD. Padding, sharding, broadcast, and multi-chip collectives are expressed at the language and type level, where the compiler can reason about their correctness and optimization together.
The upshot is that enablement now scales with the number of reusable blocks, not the number of models. The furiosa-kernels package is the concrete result: a collection of TCL kernels (attention, MoE, sampler, vision encoder, and architecture-specific blocks) covering the families Furiosa-LLM supports. The breadth of new models in this release follows directly from it.
A wave of new model families
Leveraging the modular capabilities of TCL, SDK 2026.3 brings up a broad set of new models on RNGD, including several large MoE architectures:
- Qwen3-VL (e.g. Qwen3-VL-32B). The first vision-language family on RNGD: a dense transformer paired with a vision encoder (more on this below)
- gpt-oss (e.g. gpt-oss-120b). An MoE family with MXFP4-quantized expert weights
- Solar-Open (e.g. Solar-Open-100B). An MoE family with NVFP4-quantized weights and 16-bit activations and KV cache (NVFP4A16)
- Qwen3 MoE (e.g. Qwen3-30B-A3B). An MoE family with dynamic FP8 activation quantization at runtime; Instruct, Thinking, and Coder variants
- K-EXAONE (e.g. K-EXAONE-236B-A23B). A multilingual MoE family using a hybrid sliding-window + global attention scheme; NVFP4A16
Each of these ships a Furiosa Executable Bundle (FXB) so it can be served directly from its Hugging Face repository. Check the per-model cards for the exact repository IDs and serving commands.
FXB: a compiled model you can ship and reuse
New models are only useful if they're easy to deploy. That's the problem FXB solves.
An .fxb file is Furiosa-LLM's shareable compiled-artifact format: a single archive that bundles the compiled kernels needed to run a model on the NPU together with the metadata describing what they were built for. Once a model is compiled into an .fxb, you can serve it without recompiling, copy it to another machine, or publish it to the Hugging Face Hub for others to reuse.
The most interesting property of an FXB is its architecture fingerprint. That metadata records the model architecture and the configuration fields that determine the compiled kernels — hidden size, head counts, vocabulary size, quantization, and so on.
This decoupled design delivers a major advantage for production environments: day-zero support for fine-tuned models. Because compatibility is keyed on that fingerprint rather than on a specific checkpoint, a single bundle is reusable across any Hugging Face model that shares the same fingerprint — not just the one it was built from.
In practice, this means you can serve a model whose own repository ships no .fxb, including fine-tuned or weight-updated variants of a supported model, by reusing a compatible bundle from your local cache, instead of recompiling for every variant.
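The fingerprint idea can be sketched in a few lines of Python. This is a conceptual illustration, not the actual FXB metadata format: the field list, the config keys, and the use of SHA-256 are all assumptions, and the real fingerprint may include different fields. It shows why two checkpoints with different weights but identical kernel-shaping configuration map to the same bundle.

```python
import hashlib
import json

# Hypothetical field list: the real fingerprint may cover other fields.
FINGERPRINT_FIELDS = (
    "architecture",
    "hidden_size",
    "num_attention_heads",
    "num_key_value_heads",
    "vocab_size",
    "quantization",
)

def architecture_fingerprint(config: dict) -> str:
    """Hash only the fields that determine the compiled kernels."""
    relevant = {name: config.get(name) for name in FINGERPRINT_FIELDS}
    canonical = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative config values; "checkpoint" stands in for the weights themselves.
base = {
    "architecture": "Qwen3ForCausalLM",
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "vocab_size": 151936,
    "quantization": "fp8",
    "checkpoint": "base-weights",
}
fine_tuned = dict(base, checkpoint="fine-tuned-weights")
resized = dict(base, hidden_size=5120)

same_as_fine_tuned = architecture_fingerprint(base) == architecture_fingerprint(fine_tuned)
same_as_resized = architecture_fingerprint(base) == architecture_fingerprint(resized)
print(same_as_fine_tuned, same_as_resized)
```

Because the weights never enter the hash, the fine-tuned variant matches the base bundle, while a change to any kernel-shaping field such as the hidden size produces a different fingerprint and requires a new build.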
A dedicated fxb command manages the full lifecycle: building, downloading, caching, compatibility checking, and inspection. For example, to serve Qwen/Qwen3-8B-FP8 (which ships no bundle of its own) by reusing the published, fingerprint-compatible furiosa-ai/Qwen3-8B-FP8 bundle:
# Download a published bundle into the local cache
fxb download furiosa-ai/Qwen3-8B-FP8
# Confirm it is fingerprint-compatible with the target model
fxb check Qwen/Qwen3-8B-FP8
# Serve as usual — the compatible cached bundle is found automatically
furiosa-llm serve Qwen/Qwen3-8B-FP8
At serving time, Furiosa-LLM resolves the bundle in order: an explicit --fxb path, an .fxb shipped inside the model repository, then the local cache. FuriosaAI publishes pre-compiled bundles for popular models on the Hugging Face Hub.
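The resolution order above can be pictured as a simple first-match lookup. The sketch below is an assumption-laden illustration, not Furiosa-LLM's implementation: `resolve_bundle`, the directory layout, and the "first .fxb found" rule are hypothetical, and the real cache lookup also checks fingerprint compatibility, which is omitted here.

```python
import tempfile
from pathlib import Path
from typing import Optional

def resolve_bundle(explicit: Optional[Path], repo_dir: Path,
                   cache_dir: Path) -> Optional[Path]:
    """Return the first bundle found: explicit path, then repo, then cache."""
    # 1. An explicit path (the --fxb flag) always wins.
    if explicit is not None:
        return explicit
    # 2. An .fxb shipped inside the model repository.
    shipped = sorted(repo_dir.glob("*.fxb"))
    if shipped:
        return shipped[0]
    # 3. The local cache (hypothetical layout; compatibility check omitted).
    cached = sorted(cache_dir.glob("*.fxb"))
    return cached[0] if cached else None

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp, "repo")
    cache = Path(tmp, "cache")
    repo.mkdir()
    cache.mkdir()

    (cache / "cached.fxb").touch()
    from_cache = resolve_bundle(None, repo, cache)

    (repo / "shipped.fxb").touch()
    from_repo = resolve_bundle(None, repo, cache)

    from_flag = resolve_bundle(Path("explicit.fxb"), repo, cache)

print(from_cache.name, from_repo.name, from_flag.name)
```

Each later source is consulted only when the earlier ones come up empty, so an explicit path is a reliable way to pin a specific bundle.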
And if no compatible bundle exists yet, you don't have to wait for one. As long as the model belongs to a family Furiosa-LLM supports, you can compile your own bundle once and reuse it from then on:
# Compile an .fxb for a supported-family model
fxb build Qwen/Qwen3-8B-FP8 ./qwen3-8b-fp8.fxb
See the Furiosa Executable Bundles (FXB) guide to learn more.
Multimodal serving arrives with Qwen3-VL
SDK 2026.3 brings vision-language (multimodal) serving to RNGD, with Qwen3-VL-32B as the first supported model.
Image-and-text requests are served through the standard OpenAI-compatible Chat Completions API using image_url content parts, backed by a dedicated multimodal scheduler with chunked prefill for the vision-preprocessing stage and support for concurrent multimodal requests.
A minimal example using the OpenAI Python client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)
To avoid re-sending and re-preprocessing the same image across requests, multimodal inputs can be tagged with a stable UUID and reused from a server-side processor cache, sized via --mm-processor-cache-gb.
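To give a feel for how a UUID-keyed, size-bounded processor cache might behave, here is a small conceptual sketch. It is not the server's implementation: the `ProcessorCache` class is hypothetical, and least-recently-used eviction is an assumption, since the post does not say which eviction policy the real cache uses.

```python
from collections import OrderedDict

class ProcessorCache:
    """Hypothetical LRU cache keyed by a client-supplied UUID, bounded in bytes."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()

    def get(self, uuid: str):
        if uuid not in self._entries:
            return None
        self._entries.move_to_end(uuid)  # mark as most recently used
        return self._entries[uuid]

    def put(self, uuid: str, features: bytes) -> None:
        if uuid in self._entries:
            self.used_bytes -= len(self._entries.pop(uuid))
        self._entries[uuid] = features
        self.used_bytes += len(features)
        # Evict least recently used entries until the budget is respected.
        while self.used_bytes > self.capacity_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)

cache = ProcessorCache(capacity_bytes=8)
cache.put("img-a", b"aaaa")
cache.put("img-b", b"bbbb")
cache.get("img-a")           # touch img-a so img-b becomes the oldest entry
cache.put("img-c", b"cccc")  # over budget: img-b is evicted

print(cache.get("img-a"), cache.get("img-b"), cache.get("img-c"))
```

A request that repeats a known UUID hits the cache and skips preprocessing, while a larger byte budget keeps more preprocessed inputs resident at the cost of host memory.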
This is the first step in our multimodal story — scheduling and batching of multimodal requests are still being optimized, with more improvements landing in the next release.
See the Vision-Language Models guide for the supported models, request format, and reuse semantics.
Overlap scheduling: toward zero-overhead batching
Here's a subtle source of wasted NPU time. Between forward passes, an inference engine spends real CPU time on host-side work — batch scheduling, KV-block allocation, and prefix matching against the radix cache. When that work runs between NPU forward passes, the NPU sits idle waiting for the next batch to be prepared. On short decode steps, this CPU overhead can become a significant fraction of each iteration.
SDK 2026.3 adds an overlap scheduler that removes the stall by running one batch ahead: while the NPU executes the current batch, the scheduler concurrently prepares all the metadata for the next one. The host-side scheduling cost is overlapped with NPU compute instead of serialized in front of it, keeping the NPU continuously fed.
For throughput-oriented workloads, this improves overall throughput and TPOT (time-per-output-token), at the cost of a small, bounded increase in TTFT (time-to-first-token).
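The run-one-batch-ahead pattern can be sketched with a background worker. This is a toy model under stated assumptions, not the scheduler's code: `prepare` and `execute` are hypothetical stand-ins for host-side scheduling and an NPU forward pass, and a Python thread only gives real overlap when the device call releases the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

def prepare(step):
    """Host-side stand-in: scheduling, KV-block allocation, prefix matching."""
    return {"step": step, "tokens": [step * 10 + i for i in range(3)]}

def execute(batch):
    """Device-side stand-in for an NPU forward pass."""
    return sum(batch["tokens"])

def run_overlapped(num_steps):
    results = []
    if num_steps <= 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as host:
        pending = host.submit(prepare, 0)
        for step in range(num_steps):
            batch = pending.result()          # metadata for the current step
            if step + 1 < num_steps:
                # Start preparing the next batch before running this one.
                pending = host.submit(prepare, step + 1)
            results.append(execute(batch))    # runs while the worker prepares
    return results

outputs = run_overlapped(3)
print(outputs)
```

In a serialized loop each iteration costs roughly the host time plus the device time. With overlap it approaches the larger of the two, which is why the gain is most visible on short decode steps where host work is a large share of the iteration.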
It's opt-in for now, and we plan to make it the default as it matures. To enable it explicitly:
furiosa-llm serve <model> --enable-overlap-scheduling
Scoring-based Data Parallel routing
The prefix-aware Data Parallel (DP) router introduced in SDK 2026.2 evolves in SDK 2026.3 into a scoring-based policy. When picking a replica, the router now balances two signals:
- Prefix locality: preferring a replica that already holds matching prefix cache entries
- Token-footprint load: preferring a less-loaded replica
The relative weight of the two is selected through a scoring profile.
# Scoring-based routing (default), balanced profile (default)
furiosa-llm serve <model> --data-parallel-size <N>
# Bias toward prefix cache affinity (equivalent to the old prefix-aware behavior)
furiosa-llm serve <model> --data-parallel-size <N> \
  --data-parallel-routing-policy scoring \
  --data-parallel-scoring-profile locality
--data-parallel-routing-policy accepts scoring (default) or round-robin; --data-parallel-scoring-profile accepts balanced (default), locality, or load.
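As a rough mental model of how a scoring profile trades the two signals off, consider the sketch below. It is an illustration only: the weight values, the replica fields, and the linear scoring formula are all hypothetical, since the post names the profiles but does not describe how the router computes its scores.

```python
# Hypothetical (locality_weight, load_weight) pairs for each profile name.
PROFILES = {
    "balanced": (0.5, 0.5),
    "locality": (0.9, 0.1),
    "load": (0.1, 0.9),
}

def score(replica, prompt_len, profile):
    """Higher is better: reward cached prefix tokens and spare capacity."""
    locality_w, load_w = PROFILES[profile]
    locality = replica["matched_prefix_tokens"] / prompt_len
    headroom = 1.0 - replica["tokens_in_flight"] / replica["token_capacity"]
    return locality_w * locality + load_w * headroom

def pick_replica(replicas, prompt_len, profile="balanced"):
    best = max(replicas, key=lambda r: score(r, prompt_len, profile))
    return best["name"]

replicas = [
    # dp0 holds most of the prompt's prefix but is heavily loaded.
    {"name": "dp0", "matched_prefix_tokens": 600,
     "tokens_in_flight": 9000, "token_capacity": 10000},
    # dp1 has no cached prefix but plenty of headroom.
    {"name": "dp1", "matched_prefix_tokens": 0,
     "tokens_in_flight": 1000, "token_capacity": 10000},
]

choices = {p: pick_replica(replicas, prompt_len=1000, profile=p) for p in PROFILES}
print(choices)
```

With these made-up numbers, the locality profile sticks with the busy replica that already holds the prefix, while the balanced and load profiles send the request to the idle one.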
See the Data Parallel Routing guide for details.
Wrapping up
The thread through SDK 2026.3 is architectural scalability via modular, reusable building blocks. TCL and furiosa-kernels significantly shorten the path from "a new architecture exists" to "it runs well on RNGD." The breadth of new model families in this release, including vision-language for the first time, demonstrates the effectiveness of this approach.
FXB then makes the compiled result portable and reusable across variants, while the overlap scheduler and scoring-based DP routing keep the hardware busy and balanced once a model is serving under real load.
For the complete list of changes, see the All Changes section of the full release notes. And if you're coming from SDK 2026.2, don't skip the Breaking Changes and Upgrade Guide.
Happy serving. 🚀