Furiosa SDK 2025.1.0 Release Blog
Technical Updates

We’re excited to introduce Furiosa SDK 2025.1.0 (release notes), packed with major enhancements for seamless LLM deployment with our user-friendly Furiosa LLM APIs. This release includes significant latency optimizations, tool calling support, flexible device remapping in containers, and a streamlined process for converting Hugging Face models.
This marks our third major SDK release for RNGD in just five months, reflecting our rapid innovation to deliver the most efficient LLM inference for data centers.
LLM Latency Optimization
Furiosa SDK 2025.1.0 delivers significant LLM latency optimizations, improving time to first token (TTFT) by up to 11.66% and time per output token (TPOT) by up to 11.45% for large inputs (30k tokens) and outputs (1k tokens).
This means faster response times and better efficiency for high-throughput AI workloads.
Get started by updating to the latest SDK and benchmarking supported models.
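When benchmarking, TTFT is the delay until the first generated token arrives and TPOT is the average gap between subsequent tokens. A minimal, framework-agnostic sketch for computing both from recorded token arrival timestamps (the helper name and timing approach are our own illustration, not part of the SDK):

```python
def compute_ttft_tpot(request_start, token_arrival_times):
    """Compute time to first token (TTFT) and time per output token (TPOT)
    from a request-start timestamp and per-token arrival timestamps,
    all in seconds."""
    if not token_arrival_times:
        raise ValueError("no tokens were generated")
    ttft = token_arrival_times[0] - request_start
    if len(token_arrival_times) == 1:
        return ttft, 0.0
    # TPOT: average inter-token latency after the first token.
    tpot = (token_arrival_times[-1] - token_arrival_times[0]) / (
        len(token_arrival_times) - 1
    )
    return ttft, tpot


# Synthetic example: first token after 0.5 s, then one token every 0.02 s.
ttft, tpot = compute_ttft_tpot(0.0, [0.5, 0.52, 0.54, 0.56])
```

With a streaming client, `request_start` is taken just before sending the request and each arrival timestamp is recorded as a token chunk is received.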
OpenAI API’s Tool Calling Support
Furiosa LLM now supports tool calling, enabling models to interact with external tools and functions.
This allows developers to seamlessly integrate AI-driven automation into their applications with minimal changes, which is critical for building Agentic AI applications.
Get started by using the `llama3_json` parser, with more options coming in future releases. For more information, refer to the `tool calling` documentation.
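Because Furiosa LLM exposes an OpenAI-compatible server, tool calling follows the standard OpenAI function-calling request shape. A minimal sketch of building such a request (the weather tool, model name, and all field values are illustrative placeholders, not part of the Furiosa SDK):

```python
import json

# A tool definition in the OpenAI function-calling schema.
# The tool itself is a hypothetical example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]


def build_tool_call_request(model, user_message):
    """Build a chat-completions request body that offers the model our
    tool; works against any OpenAI-compatible endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
        "tool_choice": "auto",
    }


body = build_tool_call_request(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    "What is the weather in Seoul?",
)
print(json.dumps(body, indent=2))
```

In practice you would send this body with the `openai` Python client (pointing `base_url` at your local Furiosa LLM server) and then inspect `tool_calls` in the response to dispatch the requested function.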
`furiosa-llm build` for easily converting Hugging Face models
The new `furiosa-llm build` command simplifies converting Hugging Face models into optimized model artifacts for RNGD.
This streamlines deployment, reducing manual setup while ensuring peak performance.
Get started by referring to the building a model artifact documentation.
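The exact flags depend on your model and target configuration; the invocation below is only a sketch of the general shape (the model ID and output directory are placeholders — check `furiosa-llm build --help` and the documentation for the real options):

```shell
# Convert a Hugging Face model into an optimized RNGD model artifact
# (placeholder model ID and output path).
furiosa-llm build meta-llama/Llama-3.1-8B-Instruct ./llama-3.1-8b-artifact
```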
Automatic optimization of Blocked KV cache allocation
Furiosa SDK 2025.1.0 automatically maximizes blocked KV cache allocation by reducing memory fragmentation, improving memory efficiency for LLM inference.
There is no manual tuning required to get the best performance: with the latest Furiosa LLM engine, maximum memory utilization comes automatically.
Get started with Furiosa LLM with this quick start documentation.
And there’s more coming! In the next few months, we’re rolling out enhanced tensor parallelism, speculative decoding with a draft model, embeddings API support, a torch.compile() backend, and more.
With RNGD now in key enterprise customers’ hands, we’re prioritizing rapid SDK updates, so join us on our journey.
🔗 Sign up to be notified first about RNGD: https://furiosa.ai/signup.