For chatbots, coding agents, and other real-time AI services, providers typically target an output Service Level Objective (SLO) of 30–50 tokens per second (TPS) per user. Since this exceeds the typical human reading speed of 15–25 TPS, it ensures the interaction feels instantaneous.
In production inference infrastructure, maximizing concurrent users while maintaining this SLO is a much higher priority than chasing peak “hero” speeds for a single user.
RNGD, FuriosaAI's purpose-built inference accelerator for data centers, excels at this crucial task, delivering 2.2x to 7.4x more SLO-compliant users than the NVIDIA RTX Pro 6000.
This performance stems from two structural factors: optimization tailored to the low-batch “effective service range,” and RNGD's 180W TDP, which enables 2.5x greater hardware density within the same power-constrained environment.
In this blog, we will explain the architectural and software breakthroughs that give RNGD a decisive advantage in this range, and explore the specific enterprise services where it can be applied.
Performance optimization within the effective service range
While many benchmarks focus on peak theoretical throughput, real-world deployments prioritize consistently high service quality. By targeting the widely used Qwen3-32B model, we tuned our SDK (v2026.1) specifically to lift performance in the low-batch regions (below batch 64) where real-time traffic actually operates.

A common pattern emerges across the most widely adopted LLM deployments: the goal is not peak token throughput, but maintaining consistent service quality against a specific SLO, and then serving as many concurrent users as possible within that SLO.
Among these deployments, many enterprises use TPS/User as their SLO metric, and a significant number of services place particular importance on the 30–50 TPS/user range, which delivers an experience with virtually no perceived latency.
On this foundation, service providers can secure higher concurrent session counts, serve more end customers, and extract more effective value from their infrastructure. This strategy also aligns with the growing enterprise effort to support the increasing number of agents each user now operates.
Against this backdrop, we conducted performance optimization targeting 32B-class LLM models widely used in enterprise today.
The graph below compares Qwen3-32B performance on a server with 4x RNGD against one with 4x RTX Pro 6000.
The comparison to the RTX Pro 6000 matters because customers running 32B-class inference consider this tier the most attractive in purchase price and infrastructure utilization. It is therefore the product RNGD competes against most directly in this segment.

As the graph shows, the SDK tuned through March 13, 2026 was optimized to outperform the RTX Pro 6000 in the 30+ TPS/user range. This work focused on lifting performance below batch 64 (b64), which differs from traditional tuning for maximum concurrency (b256 or higher).
Compared to performance as of February 14, 2026, batch-64 (b64) throughput improved from approximately 1,200 TPS to 1,500 TPS, and batch-32 (b32) throughput from approximately 750 TPS to 1,100 TPS, gains of 25% and 47%, respectively.
From a service perspective, the effect is far more dramatic: a system previously capable of serving only 5.8 SLO-compliant users can now serve 47.5, an 8.2x increase in service capacity over our previous SDK.
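The arithmetic behind these gains can be checked in a few lines. This is a minimal sketch using only the figures quoted in this post; the `pct_gain` helper is ours for illustration, not part of any Furiosa SDK.

```python
# Recompute the quoted gains from the figures in this post
# (1,200 -> 1,500 TPS at b64; 750 -> 1,100 TPS at b32; 5.8 -> 47.5 users).
def pct_gain(old: float, new: float) -> float:
    """Relative gain of `new` over `old`, as a percentage."""
    return (new - old) / old * 100

print(f"b64 throughput: +{pct_gain(1200, 1500):.0f}%")  # +25%
print(f"b32 throughput: +{pct_gain(750, 1100):.0f}%")   # +47%
print(f"service capacity: {47.5 / 5.8:.1f}x")           # 8.2x
```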

To extend this effect across the broader 30–50 TPS/user range, optimization was carried out for batch sizes 64, 32, 16, and 8. As a result, RNGD supported 20% more users than the RTX Pro 6000 at a 40 TPS SLO, and 2.4x more at a 50 TPS SLO.

Latency comparison: TTFT and TPOT
For Time-Per-Output-Token (TPOT), which has a greater impact on overall throughput, the graph shows that RNGD and RTX Pro 6000 exhibit similar numbers. However, RNGD maintains a slight edge in the targeted optimization range.
What is even more encouraging is the gap in Time to First Token (TTFT). RNGD's TTFT is superior across all concurrency levels, and between batch 8 (b8) and batch 64 (b64) it is roughly half that of the RTX Pro 6000, so response times will feel notably fast in real-world services.
For example, at a 30 TPS/User SLO, the RTX Pro 6000 produces its first token after 2.7 to 4.4 seconds, whereas RNGD outputs its first token in 1.1 to 2.1 seconds—less than half the latency of the RTX Pro 6000.
RNGD is also significantly more power efficient across all ranges; the implications of this are explained in greater detail in the next section.

Greater service density through high power efficiency
Scaling AI services is not captured by a simple 1:1 card-to-card comparison. Solutions built on different cards consume different amounts of power, which significantly changes the number of servers or cards that can be installed in the same customer environment.
A server equipped with 8 RNGD cards consumes 3kW of power, while a server equipped with 8 RTX Pro 6000 cards consumes 6.6kW. In a standard 15kW rack, you can install 5 RNGD servers but only 2 RTX Pro 6000 servers. Because hardware can only be installed in whole units, the number of serviceable users increases step-wise rather than linearly.
This is where RNGD's power efficiency translates into service density. Even if the per-card user count were identical, RNGD's 180W TDP would allow it to serve 2.2x more users within the same power envelope.
This can be illustrated in the graph below. The X-axis represents the rack's power capacity, and the Y-axis represents the number of serviceable users. It shows that users increase in a step-wise pattern each time the power capacity allows for an additional server to be installed.

For example, consider a customer with a 15kW rack serving at a 30 TPS/user SLO. RNGD allows 5 servers to be installed, enabling 474 concurrent users, while the RTX Pro 6000 allows only 2 servers, enabling 187 concurrent users. The customer can therefore serve 2.5x more users with RNGD.
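The step-wise rack math can be sketched in a few lines. This is a minimal illustration assuming the figures in this post (3kW and 6.6kW per 8-card server, a 15kW rack); the per-server user counts at the 30 TPS/user SLO are derived from the 474 and 187 totals, and `users_in_rack` is our hypothetical helper, not a vendor tool.

```python
# Step-wise capacity: only whole servers fit, so serviceable users jump
# each time the rack's power budget admits one more server.
def users_in_rack(rack_kw: float, server_kw: float, users_per_server: float) -> int:
    servers = int(rack_kw // server_kw)    # whole servers only
    return round(servers * users_per_server)

rngd = users_in_rack(15, 3.0, 474 / 5)    # 5 servers -> 474 users
rtx = users_in_rack(15, 6.6, 187 / 2)     # 2 servers -> 187 users
print(rngd, rtx, round(rngd / rtx, 1))    # 474 187 2.5
```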
The graph below shows this difference normalized per kW. When the number of serviceable users from a 1:1 card comparison at each SLO is multiplied by the number of installable servers, the difference in user count becomes substantial.

This increase in serviceable users translates directly into a lower Total Cost of Ownership (TCO) for the customer: TCO ultimately comes down to how few resources are needed to reach a target service level. Thanks to RNGD's optimization within the effective service range, many customers can configure their services more efficiently.
Below, we walk through a scenario estimating how many RNGD and RTX Pro 6000 servers would need to be purchased, and how much the operating environment would need to expand, for a realistic service configuration.
Case study: Enterprise internal AI assistant
Company A has 10,000 employees and, during peak usage, must serve 2,000 concurrent users at a 40 TPS/user SLO.
Employees at Company A currently use only one agent each, but multi-agent services are being introduced. Based on the adoption trend, within one year each user is expected to run around 10 agents simultaneously.
The currently secured infrastructure is configured at approximately 15kW per rack, with plans for further expansion in the future.
Company A's peak-time concurrent connections total 2,000 (10,000 employees × 20%).
Since one RNGD server supports 43 users at the 40 TPS/user SLO, and one RTX Pro 6000 server supports 37.8, the totals are:
- RNGD requires 47 servers across 10 racks (141kW total).
- RTX Pro 6000 requires 53 servers across 27 racks (350kW total).
The required datacenter capacity is the number of servers multiplied by the power consumed per server. This covers the accelerators only; switches, compute servers, and other equipment would require additional capacity.
If the rack power in the secured datacenter is at the 15kW level, each rack can accommodate 5 RNGD Servers or 2 RTX Pro 6000 Servers. Therefore, the total number of racks required for floor space is 10 racks for RNGD and 27 racks for RTX Pro 6000.
- RNGD NXT Server: 47 servers ÷ 5 per rack = 9.4, rounded up to 10 racks
- RTX Pro 6000 Server: 53 servers ÷ 2 per rack = 26.5, rounded up to 27 racks
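The sizing above can be reproduced with a short sketch. It assumes the per-server user counts and power figures quoted in this post; `size_deployment` is our illustrative helper, not part of any shipping tool.

```python
import math

# Size a deployment: servers needed for the peak user count, racks
# needed to house them, and the accelerator-only power draw.
def size_deployment(users, users_per_server, server_kw, servers_per_rack):
    servers = math.ceil(users / users_per_server)   # whole servers only
    racks = math.ceil(servers / servers_per_rack)   # whole racks only
    return servers, racks, servers * server_kw

# 2,000 peak users at the 40 TPS/user SLO:
print(size_deployment(2000, 43.0, 3.0, 5))   # RNGD: 47 servers, 10 racks, 141 kW
print(size_deployment(2000, 37.8, 6.6, 2))   # RTX Pro 6000: 53 servers, 27 racks, ~350 kW
```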

As shown above, the two options differ not only in the number of servers to purchase, but also in required datacenter capacity (2.5x more for the RTX Pro 6000) and floor space (2.7x more racks).
This calculation is proportional to current usage. If usage increases 10x due to multi-agent adoption, the absolute TCO gap widens dramatically, from today's 208.8kW difference to approximately 2MW at the datacenter level.
More servers in the same space with RNGD
FuriosaAI recently optimized RNGD for the service areas most widely utilized in real-world business. Most enterprise services prioritize high concurrency over extreme speed for a single user, aiming to maintain service quality at 30–50 TPS/User. We directed RNGD's optimization efforts to align with this critical service range.
As a result, RNGD delivers 20% more concurrent services than the RTX Pro 6000 on a card-to-card basis for a 40 TPS/User SLO and 2.4x more concurrent services for a 50 TPS/User SLO. The difference in concurrent service count is not merely a benchmark figure, but a gap that is directly felt in actual service operations.
When RNGD's superior power efficiency is factored in, this gap is further amplified. Since an RNGD server consumes less than half the power of the RTX Pro 6000 equivalent, more servers can be packed into the same datacenter environment. Multiplying the card-level performance advantage by the server density advantage expands the actual service capacity to 2.5x–2.7x relative to a card-to-card comparison. This directly translates to a significant TCO reduction that widens even further as multi-agent adoption grows.
Furthermore, the 30–50 TPS/User SLO is not limited to specific services. It is commonly applied across virtually all real-time AI services, such as enterprise internal AI assistants, real-time customer support, AI tutoring, medical AI, and game NPCs. This means that the area RNGD has optimized for covers the broad service spectrum in the market.
Additional performance gains to come
What is particularly noteworthy about this optimization effort is that all of these results were achieved purely through software changes in just one month. Without hardware modifications, SDK tuning alone delivered a 25–47% throughput improvement and up to an 8.2x increase in SLO-compliant concurrent users compared to the earlier version of our SDK.
This demonstrates that RNGD's hardware still holds considerable untapped potential. The current optimization has only lifted a portion of the effective service range, and there remains ample room to drive additional performance gains at the software level in immediate response to evolving customer requirements or market shifts.
This rapid response is possible because of the inherent flexibility and scalability of Furiosa’s TCP architecture. RNGD was designed from the ground up to adapt to diverse workloads and models, and our mature software stack is built to swiftly address new models and service requirements. These optimization results are only the beginning as we continue to expand performance boundaries in alignment with customer and market demands.
Written by
The Furiosa Team