Furiosa SDK 2025.2 is here: Hugging Face Hub integration, reasoning model support, enhanced APIs and more
Technical Updates

Furiosa SDK 2025.2.0 is now available, bringing significant new functionality to RNGD, our second-gen chip for power-efficient inference with LLMs and agentic AI.
Highlights of this beta release include:
Support for reasoning parser and reasoning models, including DeepSeek-R1-Distill
Direct building of bfloat16, float32, and float16 models from the Hugging Face Hub without a separate quantization step
Support for the new metadata endpoints /v1/models and /version
Support for chat with tools in the LLM API, enabling conversations with tool calling
A new metrics endpoint (/metrics) for monitoring
Support for chunked prefill, which splits large prefills into smaller chunks
Simplified Linux setup: NPU device access no longer requires membership in the 'furiosa' group
Support for Ubuntu 24.04 (Noble Numbat)
Support for Python 3.11 and 3.12
This is our fourth major SDK release for RNGD in just six months, reflecting our commitment to rapid iteration for developers using RNGD for data center inference.
Streamlined model access with Hugging Face Hub
A major focus of this release is tighter integration with the Hugging Face ecosystem:
The LLM API, furiosa-mlperf, and furiosa-llm serve now support loading model artifacts directly from the Hugging Face Hub.
Pre-compiled model artifacts are also available on Hugging Face, so developers can use optimized models immediately.
Models can be accessed via commands like this:
furiosa-llm serve furiosa-ai/Llama-3.1-8B-Instruct
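Once the server is running, it exposes an OpenAI-compatible API, so any OpenAI-compatible client can talk to it. Below is a minimal sketch using the openai Python package; the localhost address and port 8000 are assumptions, so adjust them to however you launched furiosa-llm serve.

from openai import OpenAI

# Point the client at the local Furiosa-LLM server.
# The port is an assumption; use whatever host/port your server listens on.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="furiosa-ai/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me a one-sentence summary of RNGD."}],
)
print(response.choices[0].message.content)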
Larger (70B) model artifacts are also available for download:
furiosa-ai/DeepSeek-R1-Distill-Llama-70B
furiosa-ai/Llama-2-70b-chat-hf-FP8
furiosa-ai/Llama-3.3-70B-Instruct
furiosa-ai/Llama-3.3-70B-Instruct-FP8
A much simpler way to build model artifacts
Before this release, building a model artifact required calibration and quantization steps. Release 2025.2 allows you to build a bfloat16 model artifact directly, without those steps. Additionally, if you specify --auto-bfloat16-cast, you can also build float16 and float32 models directly by casting them to bfloat16. For example, the following command builds the LGAI EXAONE 3.5 7.8B model as a bfloat16 artifact:
furiosa-llm build \
LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct ./Output-EXAONE-3.5-7.8B-Instruct \
-tp 8 \
--max-seq-len-to-capture 32768 \
--prefill-chunk-size 8192 \
-db 4,32768 \
--auto-bfloat16-cast \
--trust-remote-code
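Once built, the artifact directory can be loaded with the Furiosa-LLM Python API. The sketch below follows the pattern in the Quick Start; LLM.load_artifact, SamplingParams, and the shape of the returned outputs are assumptions here, so check the LLM API reference for the exact names and signatures in your SDK version.

from furiosa_llm import LLM, SamplingParams

# Load the artifact built above (method name assumed from the Quick Start).
llm = LLM.load_artifact("./Output-EXAONE-3.5-7.8B-Instruct")

params = SamplingParams(max_tokens=128)
# Generate from the loaded artifact; the structure of the returned objects
# may differ between SDK versions, so inspect them or see the docs.
outputs = llm.generate(["Explain chunked prefill in one paragraph."], params)
print(outputs)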
Reasoning model support
Furiosa-LLM supports models with reasoning capabilities, such as the DeepSeek-R1 series. These models generate explicit reasoning steps before producing a final answer, and they require a special parser to recognize those reasoning steps. To use a reasoning model, specify --enable-reasoning and --reasoning-parser as follows:
furiosa-llm serve furiosa-ai/DeepSeek-R1-Distill-Llama-8B --enable-reasoning --reasoning-parser deepseek_r1
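With the parser enabled, the server separates the model's reasoning steps from its final answer. The sketch below reads both from the OpenAI-compatible endpoint; the local address, port, and the reasoning_content field name (a convention used by several OpenAI-compatible servers) are assumptions, so verify them against the Furiosa-LLM documentation.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="furiosa-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "How many prime numbers are there below 20?"}],
)
msg = resp.choices[0].message
# reasoning_content is an assumed field name for the parsed reasoning steps;
# check the Furiosa-LLM docs for the exact field exposed by the reasoning parser.
print(getattr(msg, "reasoning_content", None))  # reasoning steps
print(msg.content)                              # final answer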
Chunked prefill
Furiosa-LLM now supports an experimental feature, “chunked prefill,” which splits large prefills into smaller chunks. Chunked prefill in Furiosa-LLM is still under development and doesn’t yet batch a single prefill with multiple decode requests; however, it is already useful when you need to handle large context lengths.
To enable chunked prefill, add the --prefill-chunk-size [CHUNK_SIZE] option to the furiosa-llm build command. The EXAONE build command shown above does exactly this, combining a 32k max sequence length with an 8192-token prefill chunk size.
If you are not familiar with furiosa-llm, please check out the Quick Start with Furiosa LLM guide.
Enhanced API functionality
Furiosa LLM now offers enhanced APIs compatible with OpenAI standards. The Embedding API allows developers to generate high-quality embeddings.
The new Chat API, accessible via the LLM.chat() method, enables creation of dynamic, multi-turn chat applications (see the sketch below).
We’ve added /v1/models (and /v1/models/{model_id}) and /version endpoints to the OpenAI-compatible server for better introspection and management, and a new /metrics endpoint allows server monitoring (see the example below).
Support for abort() has been added to the LLMEngine and AsyncLLMEngine APIs.
We’ve also simplified setup: users can now access NPU devices on Linux systems without joining the furiosa group. This release additionally introduces support for Ubuntu 24.04 and Python 3.11/3.12.
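As a quick illustration of the Chat API mentioned above, here is a minimal offline sketch. LLM.chat() is the method named in this release; constructing the LLM directly from a Hub artifact ID and the OpenAI-style message format are assumptions based on the Hub integration described earlier, so consult the LLM API reference for the exact constructor and return types. Per the release notes, chat conversations can also include tool calls.

from furiosa_llm import LLM

# Constructing the LLM from a Hugging Face Hub artifact ID is an assumption
# based on the Hub integration above; a local artifact path should also work.
llm = LLM("furiosa-ai/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does chunked prefill do?"},
]
# LLM.chat() is the new Chat API entry point; the exact return shape may vary.
outputs = llm.chat(messages)
print(outputs)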
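The new metadata and metrics endpoints can be checked with any HTTP client once the server is running. Here is a minimal sketch with the requests package, assuming the server listens on the default local port and that the metadata endpoints return JSON.

import requests

BASE = "http://localhost:8000"  # adjust to your furiosa-llm serve host/port

print(requests.get(f"{BASE}/v1/models").json())    # models currently served
print(requests.get(f"{BASE}/version").json())      # server/SDK version info (assumed JSON)
print(requests.get(f"{BASE}/metrics").text[:500])  # monitoring metrics (plain text)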
Support for standard container runtimes
This release adds official support for industry-standard container runtimes: Docker v25.0.0 or later, containerd v1.7.0 or later, and CRI-O v1.28.0 or later.
Documentation is available on the FuriosaAI documentation site.
Much more to come
For our next major SDK release, we plan to add additional tensor parallelism support for inter-chip communication, as well as speculative decoding in Furiosa-LLM. Stay tuned for more updates coming soon.
🔗Sign up to be notified first about RNGD: https://furiosa.ai/signup.