Furiosa SDK 2025.2 is here: Hugging Face Hub integration, reasoning model support, enhanced APIs and more
Technical Updates

Furiosa SDK 2025.2.0 is now available, bringing significant new functionality to RNGD, our second-gen chip for power-efficient inference with LLMs and agentic AI.
Highlights of this beta release include:
Support for reasoning parser and reasoning models, including DeepSeek-R1-Distill
Direct building of bfloat16, float32, and float16 models from the Hugging Face Hub without a separate quantization step
Support for the new metadata endpoints /v1/models and /version
Support for chat with tools in the LLM API, enabling conversations with tool calling
A new metrics endpoint (/metrics) for monitoring
Support for chunked prefill, which splits large prefills into smaller chunks
Simplified Linux setup: NPU device access no longer requires membership in the 'furiosa' group
Support for Ubuntu 24.04 (Noble Numbat)
Support for Python 3.11 and 3.12
This is our fourth major SDK release for RNGD in just six months, reflecting our commitment to rapid iteration for developers using RNGD for data center inference.
Streamlined model access with Hugging Face Hub
A major focus of this release is tighter integration with the Hugging Face ecosystem:
The LLM API, furiosa-mlperf, and furiosa-llm serve now support loading model artifacts directly from the Hugging Face Hub.
Pre-compiled model artifacts are also available on Hugging Face, so developers can use optimized models immediately.
Models can be accessed via commands like this:
furiosa-llm serve furiosa-ai/Llama-3.1-8B-Instruct
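Once the server is running, it exposes an OpenAI-compatible API, so any OpenAI-compatible client can talk to it. Below is a minimal sketch using the openai Python package; the localhost address and port 8000 are assumptions, so adjust them to however you launched furiosa-llm serve.

from openai import OpenAI

# Point the client at the local Furiosa-LLM server.
# The port is an assumption; use whatever host/port your server listens on.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="furiosa-ai/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me a one-sentence summary of RNGD."}],
)
print(response.choices[0].message.content)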
Larger (70B) model artifacts are also available for download:
furiosa-ai/DeepSeek-R1-Distill-Llama-70B
furiosa-ai/Llama-2-70b-chat-hf-FP8
furiosa-ai/Llama-3.3-70B-Instruct
furiosa-ai/Llama-3.3-70B-Instruct-FP8
A much simpler way to build model artifacts
Before this release, building a model artifact required calibration and quantization steps. Release 2025.2 allows you to build a bfloat16 model artifact directly, without those steps. Additionally, if you specify --auto-bfloat16-cast, you can also build float16 and float32 models directly by casting them to bfloat16. For example, the following command builds the LGAI EXAONE 3.5 7.8B model as a bfloat16 artifact:
furiosa-llm build \
LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct ./Output-EXAONE-3.5-7.8B-Instruct \
-tp 8 \
--max-seq-len-to-capture 32768 \
--prefill-chunk-size 8192 \
-db 4,32768 \
--auto-bfloat16-cast \
--trust-remote-code
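Once built, the artifact directory can be loaded with the Furiosa-LLM Python API. The sketch below follows the pattern in the Quick Start; LLM.load_artifact, SamplingParams, and the shape of the returned outputs are assumptions here, so check the LLM API reference for the exact names and signatures in your SDK version.

from furiosa_llm import LLM, SamplingParams

# Load the artifact built above (method name assumed from the Quick Start).
llm = LLM.load_artifact("./Output-EXAONE-3.5-7.8B-Instruct")

params = SamplingParams(max_tokens=128)
# Generate from the loaded artifact; the structure of the returned objects
# may differ between SDK versions, so inspect them or see the docs.
outputs = llm.generate(["Explain chunked prefill in one paragraph."], params)
print(outputs)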
Reasoning model support
Furiosa-LLM supports models with reasoning capabilities, such as the DeepSeek-R1 series. These models generate explicit reasoning steps before producing a final answer, and they require a special parser to recognize those reasoning steps. To use a reasoning model, specify --enable-reasoning and --reasoning-parser as follows:
furiosa-llm serve furiosa-ai/DeepSeek-R1-Distill-Llama-8B --enable-reasoning --reasoning-parser deepseek_r1
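With the parser enabled, the server separates the model's reasoning steps from its final answer. The sketch below reads both from the OpenAI-compatible endpoint; the local address, port, and the reasoning_content field name (a convention used by several OpenAI-compatible servers) are assumptions, so verify them against the Furiosa-LLM documentation.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="furiosa-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "How many prime numbers are there below 20?"}],
)
msg = resp.choices[0].message
# reasoning_content is an assumed field name for the parsed reasoning steps;
# check the Furiosa-LLM docs for the exact field exposed by the reasoning parser.
print(getattr(msg, "reasoning_content", None))  # reasoning steps
print(msg.content)                              # final answer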
Chunked prefill
Furiosa-LLM now supports an experimental feature, “chunked prefill,” which splits large prefills into smaller chunks. Chunked prefill in Furiosa-LLM is still under development and doesn’t yet batch a single prefill with multiple decode requests; however, it is already useful when you need to handle large context lengths.
To enable chunked prefill, add the --prefill-chunk-size [CHUNK_SIZE] option to the furiosa-llm build command. The EXAONE build command shown above does exactly this, combining a 32k max sequence length with an 8192-token prefill chunk size.
If you are not familiar with furiosa-llm, please check out the Quick Start with Furiosa LLM guide.
Enhanced API functionality
Furiosa LLM now offers enhanced APIs compatible with OpenAI standards. The Embedding API allows developers to generate high-quality embeddings.
The new Chat API, accessible via the LLM.chat() method, enables creation of dynamic, multi-turn chat applications (see the sketch below).
We’ve added /v1/models (and /v1/models/{model_id}) and /version endpoints to the OpenAI-compatible server for better introspection and management, and a new /metrics endpoint allows server monitoring (see the example below).
Support for abort() has been added to the LLMEngine and AsyncLLMEngine APIs.
We’ve also simplified setup: users can now access NPU devices on Linux systems without joining the furiosa group. This release additionally introduces support for Ubuntu 24.04 and Python 3.11/3.12.
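As a quick illustration of the Chat API mentioned above, here is a minimal offline sketch. LLM.chat() is the method named in this release; constructing the LLM directly from a Hub artifact ID and the OpenAI-style message format are assumptions based on the Hub integration described earlier, so consult the LLM API reference for the exact constructor and return types. Per the release notes, chat conversations can also include tool calls.

from furiosa_llm import LLM

# Constructing the LLM from a Hugging Face Hub artifact ID is an assumption
# based on the Hub integration above; a local artifact path should also work.
llm = LLM("furiosa-ai/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does chunked prefill do?"},
]
# LLM.chat() is the new Chat API entry point; the exact return shape may vary.
outputs = llm.chat(messages)
print(outputs)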
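The new metadata and metrics endpoints can be checked with any HTTP client once the server is running. Here is a minimal sketch with the requests package, assuming the server listens on the default local port and that the metadata endpoints return JSON.

import requests

BASE = "http://localhost:8000"  # adjust to your furiosa-llm serve host/port

print(requests.get(f"{BASE}/v1/models").json())    # models currently served
print(requests.get(f"{BASE}/version").json())      # server/SDK version info (assumed JSON)
print(requests.get(f"{BASE}/metrics").text[:500])  # monitoring metrics (plain text)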
Support for standard container runtimes
This release adds official support for industry-standard container runtimes: Docker v25.0.0 or later, containerd v1.7.0 or later, and CRI-O v1.28.0 or later.
Documentation is available on the FuriosaAI documentation site.
Much more to come
For our next major SDK release, we plan to add additional tensor parallelism support for inter-chip communication, as well as speculative decoding in Furiosa-LLM. Stay tuned for more updates coming soon.
🔗Sign up to be notified first about RNGD: https://furiosa.ai/signup.