
Demonstrating High-Speed Inference Throughput with the Furiosa SDK




One of the biggest challenges in AI inference is handling multiple concurrent large language model (LLM) requests while ensuring fast response times and high utilization. In this post, we demonstrate how the Furiosa SDK enables high-throughput batch inference with a single 180W RNGD (pronounced “renegade”) card, showing how our unique Tensor Contraction Processor (TCP) architecture performs and scales efficiently.

RNGD is FuriosaAI's flagship TCP-based chip, designed specifically for data center inference with LLMs, agentic AI, and other advanced AI applications. Unlike traditional GPUs, which divide computations into matrix multiplications, RNGD uses tensor contraction as its fundamental computational primitive, enabling our compiler to find much more efficient data-handling strategies and delivering superior performance per watt for demanding inference workloads.
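To make the contrast concrete, a matrix multiplication is simply the special case of a tensor contraction that sums over a single shared index. The NumPy snippet below is purely illustrative (it says nothing about RNGD's internals): the same einsum notation expresses both an ordinary matmul and a higher-order contraction that a matmul-centric pipeline would first have to reshape into 2D matrices.

    import numpy as np

    # A matrix multiplication is a contraction over one shared index (k).
    A = np.random.rand(64, 128)           # shape (i, k)
    B = np.random.rand(128, 32)           # shape (k, j)
    C = np.einsum("ik,kj->ij", A, B)      # identical to A @ B
    assert np.allclose(C, A @ B)

    # A higher-order contraction, e.g. an attention-style score computation,
    # contracts over the head dimension for every batch element without
    # flattening the operands into 2D matrices first.
    Q = np.random.rand(8, 16, 64)         # (batch, seq, head_dim)
    K = np.random.rand(8, 16, 64)         # (batch, seq, head_dim)
    scores = np.einsum("bqd,bkd->bqk", Q, K)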

Demo

Running LLMs efficiently at scale requires handling many concurrent requests without compromising speed or performance. The following demo simulates a large-scale, real-world scenario by simultaneously generating numerous quizzes, mimicking the pressure of serving many users at once. Using just one RNGD card and the Furiosa SDK, we show how automated batching, real-time streaming, and lightweight deployment make high-throughput inference not just possible, but easy:


With the Furiosa SDK, we demonstrate that our system can:

  • Easily deploy an OpenAI-compatible server with a single command

  • Use FuriosaLLM, our drop-in replacement for vLLM, for efficient and high-throughput serving of models (see the sketch after this list)

  • Handle high-volume parallel requests efficiently through automatic batching and scheduling

  • Provide real-time system metrics, including power consumption, chip temperature, and token throughput
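The OpenAI-compatible server itself is launched by the SDK's own command-line entry point, which we don't reproduce here. As a rough illustration of the FuriosaLLM side, the sketch below assumes its offline Python API mirrors vLLM's LLM/SamplingParams interface, as the "drop-in replacement" description suggests; the import path, class names, and model identifier are assumptions rather than code from the demo.

    # Hedged sketch: assumes a vLLM-style offline API; the names below are assumptions.
    from furiosa_llm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")        # placeholder model id
    params = SamplingParams(max_tokens=512, temperature=0.7)

    prompt = ("Generate a short quiz with an answer related to space travel. "
              "The answer should be 6-7 lines, around 500 words. Make only one quiz.")
    outputs = llm.generate([prompt], params)
    print(outputs[0].outputs[0].text)       # vLLM-style output structure, assumed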

For more details on how we put this demo together, read on.

Setting up the demo

For the frontend, we built a streamlined quiz generation system with a React-based interface using WebSockets for real-time updates. For the backend, we used the Furiosa SDK to deploy a RNGD server compatible with the industry-standard OpenAI API, running Llama 3.1 8B quantized to FP8 precision, along with a FastAPI-based quiz generation server that simulates multi-user quiz generation.

Before scaling up, we validated that the system could generate high-quality quizzes while maintaining performance. This involved:

  1. Defining a structured prompt to ensure consistent quiz generation:

    Prompt: "Generate a short quiz with an answer related to {topic}. The answer should be 6-7 lines, around 500 words. Make only one quiz."

Example Output:
"Why is the perfect pizza cooked in a wood-fired oven? It cooks the crust evenly and crisply at a high temperature, leaving a smoky flavor."

  2. Testing quiz generation using the OpenAI API Python library, ensuring the LLM server produces relevant, high-quality responses across multiple topics (a sketch of such a test follows below)
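A hedged sketch of what such a test can look like with the OpenAI Python client pointed at the locally running RNGD server; the base URL, API key, and model identifier are placeholder assumptions, not the demo's actual configuration:

    from openai import OpenAI

    # Point the standard OpenAI client at the local, OpenAI-compatible RNGD server.
    # Base URL, API key, and model id are placeholder assumptions.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    for topic in ["astronomy", "world history", "cooking"]:
        response = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{
                "role": "user",
                "content": (
                    f"Generate a short quiz with an answer related to {topic}. "
                    "The answer should be 6-7 lines, around 500 words. Make only one quiz."
                ),
            }],
            max_tokens=512,
        )
        print(f"--- {topic} ---")
        print(response.choices[0].message.content)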

Scaling the workload

Our goal was to ensure the demo would let the user increase or decrease task volume dynamically through the interface and that the system would respond instantly by rebalancing workloads.

The demo thus enables simultaneous quiz generation through streamed responses (similar to ChatGPT, where text appears progressively as it is generated) and shows the user multiple quiz generations in real time. Users can define the generation speed, triggering new LLM tasks at regular intervals. This flexibility shows developers that RNGD can sustain performance as application demands change.
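Streaming maps naturally onto the OpenAI-compatible API: the backend requests a streamed completion and forwards token deltas as they arrive. A minimal sketch, using the same placeholder server address and model id as above:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholders

    # stream=True returns chunks containing token deltas instead of one final message.
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Generate a short quiz about volcanoes."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            # Render text progressively, the way the demo paints quizzes on screen.
            print(chunk.choices[0].delta.content, end="", flush=True)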

We used an asynchronous backend built with WebSockets to ensure low-latency, real-time data communications capable of handling multiple user requests in parallel.

To reduce network overhead, we optimized data transmission by batching multiple responses and sending a batch every 50ms instead of transmitting each response individually. The batches are sent quickly enough that the frontend receives updates at a steady rate, ensuring smooth real-time rendering.
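As a rough illustration of this batching pattern (a self-contained sketch, not the demo's actual code), the FastAPI WebSocket handler below buffers generated text chunks and flushes whatever has accumulated every 50ms:

    import asyncio
    from fastapi import FastAPI, WebSocket

    app = FastAPI()  # served with an ASGI server such as uvicorn

    @app.websocket("/ws/quizzes")
    async def quiz_stream(websocket: WebSocket):
        await websocket.accept()
        buffer: list[dict] = []          # chunks accumulated since the last flush

        async def flush_every_50ms():
            # Send one message per 50 ms instead of one per chunk, cutting the
            # number of WebSocket messages the frontend has to process.
            while True:
                await asyncio.sleep(0.05)
                if buffer:
                    await websocket.send_json({"chunks": buffer.copy()})
                    buffer.clear()

        flusher = asyncio.create_task(flush_every_50ms())
        try:
            # In the real demo these chunks come from streamed LLM completions;
            # here they are simulated to keep the sketch self-contained.
            for i in range(200):
                buffer.append({"quiz_id": i % 5, "text": f"token {i} "})
                await asyncio.sleep(0.01)
        finally:
            flusher.cancel()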

Performance monitoring

We also included a live performance dashboard with the following metrics:

  • Throughput (higher is better), measured as tokens per second, to show RNGD’s raw processing capacity

  • Efficiency (higher is better), measured as tokens per second per watt, to show how much work RNGD is doing given the same amount of power

  • Power (lower is better), measured in watts, to show how much electricity RNGD is using

  • Temperature (lower is better), measured in Celsius, to show how hot RNGD is getting

The first two metrics come from the demo’s backend, written in Python, which counts tokens as they are generated, and the frontend, written in JavaScript, which displays the number of tokens generated per request, updated every second. The last two metrics come from the Furiosa System Management Interface Library, which tracks power consumption and temperature in real time.
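A simplified sketch of how the throughput and efficiency numbers can be derived; the power reading here is a stand-in constant, whereas the actual demo reads it from the Furiosa System Management Interface Library (whose API is not reproduced here):

    import time

    class ThroughputMeter:
        """Counts generated tokens and reports tokens/s and tokens/s per watt."""

        def __init__(self):
            self.window_start = time.monotonic()
            self.tokens_in_window = 0

        def add_tokens(self, n: int) -> None:
            self.tokens_in_window += n

        def snapshot(self, power_watts: float) -> dict:
            # Called roughly once per second by the dashboard.
            elapsed = time.monotonic() - self.window_start
            throughput = self.tokens_in_window / elapsed if elapsed > 0 else 0.0
            self.window_start = time.monotonic()
            self.tokens_in_window = 0
            return {
                "throughput_tok_per_s": throughput,                      # higher is better
                "efficiency_tok_per_s_per_w": throughput / power_watts,  # higher is better
                "power_w": power_watts,                                  # lower is better
            }

    # In the real demo power_watts comes from the Furiosa System Management
    # Interface Library; 180.0 is just the card's rated power as a stand-in.
    meter = ThroughputMeter()
    meter.add_tokens(3000)
    time.sleep(1.0)
    print(meter.snapshot(power_watts=180.0))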

We showed this demo running on a single RNGD card at ISSCC 2025, where attendees interacted with it and saw an industry-leading average throughput of 3,000 tokens per second at 180W.


ISSCC 2025 attendees try the real-time batch inference demo using Furiosa’s RNGD chip and SDK.

While further optimization opportunities exist, this demo illustrates that a single RNGD can handle several hundred requests concurrently.

Beyond the demo

This demo shows that with just a single 180W RNGD card and our deployment software, it’s possible to serve concurrent LLM requests at scale with strong performance, low power consumption, and seamless real-time responsiveness. AI applications don’t have to come with complexity and cost.

To get a 1:1 demo for your specific use cases, contact us.

