Is Furiosa’s chip architecture actually innovative? Or just a fancy systolic array?
Technical Updates
When FuriosaAI unveiled RNGD at the Hot Chips conference in August, we claimed there were two main reasons why the AI world should pay attention: 1) it delivers a “trifecta” of performance, programmability and power efficiency that has eluded GPUs and other chips that are designed for inference with LLMs and multimodal models. And 2) RNGD does this through an innovative chip architecture that we call the Tensor Contraction Processor.
At Hot Chips, the AI Hardware Summit, and the PyTorch Conference, several engineers asked about how TCP compares to a systolic array.
So what’s so special about TCP? Is it really that different from a systolic array? Why is it a better solution than GPUs for a wide range of real-world use cases?
In the Q&A below, two Furiosa hardware engineers, Younggeun Choi and Junyoung Park, explain what TCP is, what makes it unique, and what benefits it provides. You can learn more about RNGD here and read our ISCA paper on TCP here.
For those who don’t know, can you begin by briefly describing what a standard systolic array is? And what are the benefits and drawbacks of using them to accelerate deep learning models?
Picture a grid of Processing Elements (PEs) where data flows in a synchronized wave – like a heartbeat, hence the term "systolic." Each element in a systolic array performs a simple multiplication and addition before passing results to its neighbors. This architecture is particularly well-suited for matrix multiplication, which is fundamental to deep learning models.
When everything aligns perfectly, a systolic array is remarkably efficient. Data moves predictably through the grid, and each Processing Element stays busy, maximizing both energy efficiency and computational throughput. It's an elegant solution that has served as the foundation for many successful AI accelerators.
However, systolic arrays come with inherent limitations. Their rigid structure – typically a fixed-size grid with data flowing in predetermined directions – means they work best only when the computation perfectly matches their dimensions.
This creates a challenging trade-off. Make the array relatively large to accommodate bigger matrices, and you risk significant underutilization when processing smaller ones. Make it small, and you lose the efficiency benefits that made systolic arrays attractive in the first place. These limitations become particularly apparent in inference workloads, where batch sizes and tensor dimensions can vary significantly.
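To make that trade-off concrete, here is a minimal back-of-the-envelope sketch (our own illustration with assumed matrix sizes, not a model of any particular chip) that computes how much of a fixed PE grid does useful work when output matrices of various sizes are tiled onto it:

```python
import math

def systolic_utilization(m: int, n: int, array_h: int = 128, array_w: int = 128) -> float:
    """Fraction of PEs doing useful work when an m x n output is tiled
    onto a fixed array_h x array_w grid (ignoring pipeline fill/drain)."""
    tiles = math.ceil(m / array_h) * math.ceil(n / array_w)
    return (m * n) / (tiles * array_h * array_w)

for m, n in [(2048, 2048), (128, 128), (200, 200), (32, 32), (1, 4096)]:
    print(f"{m:>4} x {n:<4} -> {systolic_utilization(m, n):6.1%} utilized")
```

Large, well-aligned matrices hit 100% utilization, while a 32x32 problem uses only 6.25% of a 128x128 grid, and a single-token (1 x N) workload – common in LLM decoding – uses under 1%.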
At a high level how does TCP relate to systolic arrays?
TCP and systolic arrays share some similarities – they are both designed to perform efficient parallel computations by systematically moving data through an array of Processing Elements.
But there are key differences.
Unlike a traditional systolic array, which has a fixed grid, TCP has smaller configurable compute units that can be rearranged dynamically. Think of them as building blocks that can be reassembled as needed, rather than a rigid, pre-defined structure.
This flexibility means TCP can adapt to different tensor shapes. It's particularly important for inference workloads where batch sizes are often small and tensor dimensions vary widely. When a systolic array might be partially idle, TCP can reconfigure its resources to maintain high utilization.
This adaptability is one of the key innovations that sets TCP apart from traditional architectures.
What are the key architectural differences between TCP's compute units and those in a typical systolic array?
TCP and systolic arrays differ significantly in how they handle data movement and reuse.
In a systolic array, data flows in a fixed pattern through a rigid grid – moving from one edge to another, with data reuse limited by the array's physical width or height. (Picture a waterfall, with data cascading through the Processing Elements in one direction.) Data flow is aligned with the shape of the Processing Elements rather than the tensor’s shape.
TCP takes a more flexible approach. Instead of a fixed grid, it uses smaller compute units called “slices” that are connected by a fetch network. This network can broadcast data to multiple slices simultaneously, while each slice contains parallel dot product engines that further multiply data reuse.
TCP also adds a temporal dimension to data reuse through strategic placement of buffers and sequencing logic. This means data can be reused not just across space (like in systolic arrays) but also across time.
Looking at the diagrams, you can see how a traditional 128x128 systolic array has a rigid structure with fixed data flow patterns. In contrast, TCP can be configured in different ways (like Wx4H or (2WxH)x2) to match the computation at hand, with its fetch network enabling more flexible data movement and reuse patterns.
This flexibility is particularly valuable when processing the dynamic shapes common in modern AI workloads.
What are the most important differences between TCP and other NPU architectures?
The TCP architecture, as its name suggests, uses tensor contraction as its primitive operation. In a typical NPU architecture, the hardware is designed to perform matrix multiplication efficiently, and the software is responsible for breaking tensor operations down into matrix multiplications so the hardware can accelerate them.
In contrast, having tensor contraction as the primitive means that the software can process tensor operations directly or transform them into other tensor forms, while the hardware is designed to efficiently accelerate these tensor-based operations.
Since tensor contraction operations are static and predictable, the hardware can establish an efficient structure for fetching, data reuse, and committing operations without needing to allocate excessive resources.
Additionally, the software can optimize operations directly at the tensor level, without breaking them down into lower-dimensional matrices. This allows TCP to achieve more efficient tensor operation acceleration with lower complexity in optimization, compared to other architectures.
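As a rough software analogy (a NumPy sketch of the abstraction-level difference, not Furiosa's actual compiler stack; the shapes B, H, S, D are arbitrary illustrative values), here is an attention-style computation expressed as a direct tensor contraction versus the flatten-to-matmul lowering a matrix-multiply-centric stack must perform:

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, S, D = 4, 8, 128, 64          # batch, heads, sequence, head dim
q = rng.standard_normal((B, H, S, D))
k = rng.standard_normal((B, H, S, D))

# Tensor-contraction view: contract the D axis directly; the B, H, S axes
# (and the reuse opportunities they imply) stay visible.
scores = np.einsum("bhsd,bhtd->bhst", q, k)

# Matmul-centric view: flatten the batch-like axes first, run a 2-D matmul
# per flattened batch, then restore the original shape.
q2 = q.reshape(B * H, S, D)
k2 = k.reshape(B * H, S, D)
scores2 = (q2 @ k2.transpose(0, 2, 1)).reshape(B, H, S, S)

assert np.allclose(scores, scores2)   # same math, different abstraction
```

Both paths compute the same result, but the contraction form keeps the full tensor structure explicit, which is the information TCP's hardware and compiler exploit.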
Let’s talk first about why TCP is an ideal architecture for AI/ML service providers who need to support a wide range of workloads.
At a high level, TCP offers two key advantages for anyone who wants to deploy applications with large language models or multimodal models:
First, TCP achieves significantly better power efficiency than GPUs when running these kinds of models. This comes from a fundamental difference in how data is handled:
- Moving data between off-chip memory (DRAM) and on-chip processing elements consumes up to 10,000x more energy than the computations themselves
- TCP's tensor-native architecture enables more extensive data reuse by preserving the natural structure of AI workloads
- This means data can be loaded once and reused across multiple operations, dramatically reducing energy-intensive memory transfers (see the sketch below)
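A toy calculation illustrates why reuse dominates the energy picture. The per-byte and per-MAC energy constants below are generic order-of-magnitude placeholders, not measured RNGD figures:

```python
# Illustrative, order-of-magnitude constants – placeholders, not RNGD data.
PJ_PER_DRAM_BYTE = 100.0    # energy to move one byte from off-chip DRAM
PJ_PER_MAC = 0.1            # energy for one BF16 multiply-accumulate

def total_energy_pj(weight_bytes: int, macs: int, reuse: int) -> float:
    """Each weight byte is fetched from DRAM once per `reuse` uses;
    compute energy stays fixed, only memory traffic shrinks."""
    dram_pj = (weight_bytes / reuse) * PJ_PER_DRAM_BYTE
    return dram_pj + macs * PJ_PER_MAC

weight_bytes = 4096 * 4096 * 2       # one BF16 weight matrix
macs = 4096 * 4096 * 128             # applying it to 128 tokens

for reuse in (1, 8, 128):
    uj = total_energy_pj(weight_bytes, macs, reuse) / 1e6
    print(f"reuse x{reuse:<3} -> {uj:7.1f} uJ")
```

With these placeholder numbers, raising weight reuse from 1x to 128x cuts total energy by more than an order of magnitude, because the compute cost is dwarfed by the data-movement cost.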
Second, TCP simplifies the process of optimizing new models for deployment:
- AI models naturally work with multi-dimensional tensors (for example, handling batch size, sequence length, and feature dimensions simultaneously)
- TCP processes these tensor operations directly while preserving their structure
- GPUs must first flatten these tensors into 2D matrices, which:
  - Obscures natural opportunities for parallelism and data reuse
  - Requires complex kernel optimizations to recover efficiency
  - Makes it harder to automatically optimize new models
The result is that deploying new or modified models (like a customized version of Llama) requires less engineering effort while achieving better efficiency.
How does TCP's ability to dynamically reconfigure its compute units provide advantages for different tensor shapes and batch sizes?
A fixed-size systolic array has high utilization for large tensor operations that can fully occupy it, but its utilization drops for smaller or unaligned tensors. In contrast, TCP can maintain high utilization even for smaller or unaligned tensors by reconfiguring the large processing unit into multiple smaller units that handle the tasks efficiently.
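A toy model of that difference (with assumed unit sizes, not the actual RNGD partitioning logic) makes the effect easy to see:

```python
import math

def utilization(m: int, n: int, unit_h: int, unit_w: int) -> float:
    """Utilization of a single unit_h x unit_w compute unit on an
    m x n output tile."""
    tiles = math.ceil(m / unit_h) * math.ceil(n / unit_w)
    return (m * n) / (tiles * unit_h * unit_w)

# A small 32 x 32 problem on one monolithic 128 x 128 array:
print(f"fixed 128x128          : {utilization(32, 32, 128, 128):6.1%}")

# The same MAC lanes reconfigured as sixteen 32 x 32 units, each kept
# busy with its own small problem (e.g., 16 concurrent requests):
print(f"reconfigured 16x(32x32): {utilization(32, 32, 32, 32):6.1%} per unit")
```

The monolithic array sits at 6.25% utilization on the small tensor, while the reconfigured units can each run fully occupied, provided there is enough independent work to keep them all busy.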
What specific operations benefit most from TCP's architecture compared to a fixed systolic array?
In inference scenarios – where all input data is available up front and less computation is required than in training – TCP’s architecture shows significant advantages. When processing a single batch, for example, TCP can reconfigure the entire compute unit into multiple smaller processing units. This achieves higher utilization of the processing units, resulting in lower latency.
What advantages does TCP's architecture provide for handling dynamic shapes in AI workloads?
In a systolic array, achieving high utilization requires that batching or tensor partitioning fit the array size precisely. This becomes challenging, however, when specific axes of the tensor are not statically defined and change dynamically, making it difficult to determine the optimal size.
The TCP architecture addresses this issue by allowing dynamic shapes to be handled along the temporal axis as needed. This flexibility provides greater freedom compared to the limitations inherent in systolic arrays.
How does TCP's architecture impact power efficiency compared to a traditional systolic array design?
TCP can achieve significantly higher data reuse compared to a traditional systolic array, leading to fewer SRAM read/write operations.
While SRAM consumes less power than DRAM, it still consumes significantly more power than the flip-flops used to construct data buffers. Therefore, by reducing the number of SRAM accesses, TCP can achieve greater power efficiency.
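A sketch of that accounting, again with illustrative access energies rather than measurements:

```python
# Illustrative access energies (generic placeholders, not measurements):
PJ_PER_SRAM_READ = 5.0     # read one operand word from on-chip SRAM
PJ_PER_BUFFER_READ = 0.1   # re-read the same word from a flip-flop buffer

def operand_energy_pj(total_uses: int, sram_reuse: int) -> float:
    """Each SRAM read is amortized over `sram_reuse` uses; every use
    still costs one cheap local-buffer read."""
    sram_reads = total_uses / sram_reuse
    return sram_reads * PJ_PER_SRAM_READ + total_uses * PJ_PER_BUFFER_READ

uses = 1_000_000
for reuse in (1, 8, 128):
    print(f"SRAM reuse x{reuse:<3} -> {operand_energy_pj(uses, reuse) / 1e6:6.3f} uJ")
```

Once most operand reads are served from cheap local buffers instead of SRAM, the energy per use drops sharply – the same amortization logic as with DRAM, one level down the memory hierarchy.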
Are there any trade-offs or potential drawbacks to TCP's more flexible architecture compared to a fixed systolic array?
When an operation perfectly matches the shape of a specific systolic array, the additional elements TCP introduces to accommodate various shapes and data reuse carry some overhead. In such cases, TCP may be somewhat less efficient than a traditional systolic array.
How does the data flow in TCP differ from a traditional systolic array?
In a traditional systolic array, elements of a 2-D partial tensor are sequentially fed according to the shape of the processing units, and the full tensor computation is completed by repeatedly supplying these partial tensors. As a result, the data flow is aligned with the shape of the processing units rather than the tensor’s shape, and data reuse is restricted to the granularity of these partial tensors.
In TCP, the logical shape of the processing units is dynamically adjusted to fit the tensor’s shape, and the optimal data flow is configured to maximize data reuse according to the tensor’s shape.
Can you provide a simplified example that contrasts how data moves through a systolic array vs. how it moves through TCP?
Let’s say our goal is to perform 16K multiply-accumulate (MAC) operations on data stored in BF16 format. In a 128x128 systolic array, data flows through a fixed grid of 16,384 processing elements. The weights stay fixed while input data flows through the grid systematically.
TCP performs the same calculations, but in a very different manner.
A single Processing Element has 64 slices, each containing memory and a contraction engine.
Data stored in the 64 slices can be broadcast in units of four slices via the fetch network, allowing the data to be reused four times without having to read it again from SRAM.
In each slice, the Contraction Engine (CE) receives data from the Fetch Unit and temporarily buffers it in a space called the feed buffer. This buffer enables additional data reuse (beyond the reuse achieved through the fetch network).
The feed buffer connects to eight dot product engines, all of which use the same input data. As a result, temporal reuse occurs within the CE based on the number of times the same data is supplied from the feed buffer, while spatial reuse is achieved based on the number of dot product engines in use.
In this setup, by feeding the same data four times and using all eight dot product engines (DPEs), the fetched data is reused 4 (feed reuse) * 8 (# of dot product engines) = 32 times. The final count of data reuse from SRAM, achieved through both the Fetch Unit and CE, reaches 4 * 32 = 128 times.
Since the reuse through broadcasting in the Fetch Unit and reuse within the CE can be independently configured, the data flow can be adjusted to enable the required level of reuse.
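Written out as code, the reuse arithmetic from this configuration is simply:

```python
# Reuse factors from the configuration described above:
FETCH_BROADCAST = 4   # fetch network broadcasts each SRAM read to 4 slices
FEED_REUSE = 4        # feed buffer supplies the same data 4 times
DOT_ENGINES = 8       # dot product engines sharing each feed

reuse_in_ce = FEED_REUSE * DOT_ENGINES           # 4 * 8  = 32x inside each CE
reuse_from_sram = FETCH_BROADCAST * reuse_in_ce  # 4 * 32 = 128x per SRAM read

print(f"reuse inside each CE: {reuse_in_ce}x")
print(f"reuse per SRAM read : {reuse_from_sram}x")
```

Because each factor is set independently, other combinations of broadcast width, feed reuse, and active dot product engines yield different total reuse counts to match the tensor at hand.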
Did systolic arrays influence the original vision or intuition for our TCP architecture?
Yes, the under-utilization of large systolic arrays had a significant influence on shaping the initial concept of the TCP architecture. Additionally, we realized that very small systolic arrays come with overhead in controlling too many units, highlighting the need to introduce computation units of an optimal size.
As mentioned earlier, one of the primary goals was to address the underutilization of MAC units caused by the fixed structure of the systolic array. This was identified as a key issue to optimize for better performance and efficiency.
What new insight enabled us to go beyond what others have done before with systolic arrays for AI acceleration? What challenges did we face in creating the TCP architecture vs. building a traditional systolic array?
As mentioned earlier, traditional fixed-size systolic arrays (e.g., 64x64 or 128x128) often struggle to achieve sufficient utilization in inference tasks, particularly for CNNs or small-sized models, or operations where the tensor itself may be large, but certain axes are small.
Data reuse is bound by the array size, which is also a significant concern.
So with TCP, we chose to make the basic computational unit a dot product processing unit, rather than a systolic array designed for 2-D matrix multiplication. These units are dynamically configured according to the tensor shape being processed, allowing for more flexible and efficient operations tailored to the specific computation.
Additionally, since there is a significant difference in power consumption between DRAM, SRAM, and internal buffers (in that order), we designed a mechanism to maximize data reuse within the internal buffer, where power consumption is lowest. We developed strategies to maximize data reuse across various scenarios, ensuring that computations are performed in a power-efficient manner.
While designing a dynamically configurable computation architecture instead of a single large one, it was essential to keep the optimization problem space from becoming too large. This is why tensor contraction was selected as the primitive.
Supercomputing is happening this week and there are several presentations and panels about hardware challenges with AI at scale. How does TCP fit into broader trends in the AI hardware industry?
There’s a lot of AI hardware innovation happening right now. GPUs have powered the incredible AI innovation over the past 10 or 15 years, but it seems the industry understands new types of chips are needed to keep this going. That’s true for both training and inference. And one of the critical considerations is the efficiency of the processing units that are running large language models and other memory-intensive algorithms.
Our TCP architecture is structured to enable efficient operations that overcome the fundamental limitations of traditional systolic arrays. We believe our approach will fundamentally contribute to enhancing the efficiency of computing systems and addressing the various challenges faced by the AI industry.
One thing that’s exciting about events like Supercomputing is the volume of papers and research being shared. That’s so important for the whole industry to keep progressing. Furiosa presented a TCP architecture paper at the ISCA conference in June and gave a technical overview of our second-gen RNGD chip at Hot Chips in August. It’s been great to see a lot of interest in our approach.
The original research around systolic arrays was first published in the 1970s. But it’s only been in the past decade or so that the approach has gained traction in the AI hardware community. Our TCP architecture is different from systolic arrays in key respects. But we benefited greatly from decades of open research about systolic arrays and many other approaches to designing chips.