FuriosaAI

Tensor contraction vs. matrix multiplication for attention, Part 1: Conciseness and greater clarity

Technical Updates

Written by Oliver Libaw

The transformer architecture has revolutionized machine learning, and at its heart lies the attention mechanism. While there are many excellent videos and articles explaining the high-level intuition of attention, here we'll look at how tensor operations actually implement these mechanisms, specifically focusing on tensor contractions using Einstein summation (einsum) notation.

Understanding tensor contraction and einsum notation

Tensor contraction is a fundamental operation in deep learning. It can be understood as a generalized form of matrix multiplication, extending to tensors of higher dimensions. Instead of just multiplying matrices (two-dimensional arrays), tensor contraction allows for the multiplication and summation of elements across specified dimensions of multidimensional tensors. In deep learning, these tensors might represent multiple inputs (such as several images being fed into an object detection model) or a single input with more than two dimensions (such as an image with a separate dimension for each of the RGB channels).

And of course tensors can be combined to make even higher-dimensional tensors (such as a single tensor with multiple images, each with separate dimensions for the RGB channels).
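
For instance, here is a minimal NumPy sketch of such a combined tensor (the shapes are purely illustrative):

import numpy as np

# A batch of 32 RGB images, each 224x224 pixels, stored as one 4-D tensor:
# [batch, height, width, channels]
images = np.zeros((32, 224, 224, 3))
print(images.shape)  # (32, 224, 224, 3)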

Tensor contraction is crucial in many deep learning computations, including attention mechanisms and recurrent neural networks (RNNs). It is also used in convolutional operations, which are a core part of diffusion models and many computer vision models.

Einsum notation is simply a concise way to express tensor contractions. For example, 'bte,ehd->bthd' means: multiply the two tensors along their shared dimension 'e' and sum over it, while keeping dimensions 'b', 't', 'h', and 'd' in the output. Dimension 'e' is the summation index, and 'b', 't', 'h', and 'd' are the output indices.
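
To make this concrete, here is a minimal sketch (x_demo and w_demo are illustrative random tensors, with sizes matching the toy example later in this post) showing that 'bte,ehd->bthd' is just an explicit sum over the shared index e:

import numpy as np

b, t, e, h, d = 1, 2, 4, 2, 2  # illustrative sizes
x_demo = np.random.rand(b, t, e)
w_demo = np.random.rand(e, h, d)

# One-line tensor contraction
out_einsum = np.einsum('bte,ehd->bthd', x_demo, w_demo)

# The same contraction written as explicit loops: sum over the shared index e
out_loops = np.zeros((b, t, h, d))
for bi in range(b):
    for ti in range(t):
        for hi in range(h):
            for di in range(d):
                out_loops[bi, ti, hi, di] = sum(
                    x_demo[bi, ti, ei] * w_demo[ei, hi, di] for ei in range(e)
                )

print(np.allclose(out_einsum, out_loops))  # True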

While matrix multiplication is a common approach for calculating attention and other AI computations, tensor contraction with einsum is a more direct and arguably more readable alternative: it specifies how dimensions interact without requiring tensors to be explicitly reshaped first.

This conciseness can be valuable in itself. But, as we will see later in this blog post, the real power comes when tensor contraction is treated as a fundamental hardware operation, or “primitive.”

Instead of executing independent matrix multiplications, tensor contraction-based hardware can coordinate multiple computation units to process tensor operations holistically – much like a synchronized factory production line. The regular patterns inherent in high-dimensional AI data allow data to flow through the hardware in a structured and optimized manner, maximizing data reuse.

By contrast, traditional chip architectures, such as GPUs, are built around fixed-size matrix multiplication as their fundamental “primitive” operation. This necessitates breaking down larger tensor contractions into a series of smaller, fixed-size matrix multiplications, which can introduce overhead and limit optimization potential.

In this post, we'll focus specifically on the conciseness and clarity of tensor contraction and einsum. It’s important to note that programmability and efficiency gains aren’t inherent in tensor contraction; they emerge only when tensor contraction is paired with hardware specifically designed to use it as a primitive operation.

A follow-up post will explore how specialized architectures, like FuriosaAI's Tensor Contraction Processor (TCP), can unlock these advantages through the use of tensor contraction. You can learn about our second-gen TCP chip, RNGD (“Renegade”), here.

Understanding multi-head attention through tensors

Multi-head attention allows a model to jointly attend to information from different representation subspaces at different positions. But how does this work at the tensor level? Let's break it down step by step using a concrete example.

Consider a toy example with:

  • Batch size: 1 (processing a single sequence)

  • Sequence length: 2 (just two words)

  • Number of attention heads: 2

  • Head dimension (length of the vector generated by each attention head): 2

  • Embedding dimension (length of the vector representing each input or output token): 4

Let's start with the input tensor:

import numpy as np

x = np.array([ # Creating an input tensor with this shape: [batch=1, seq_len=2, embedding_dim=4]
    [
        [0.2, -0.1, 0.4, 0.3],  # word 1 embedding
        [0.1, 0.5, -0.2, 0.4]   # word 2 embedding
    ]
])

w_q = np.array([  # Creating a query weight tensor with this shape: [embedding_dim=4, num_heads=2, head_dim=2]
    [[0.3, -0.2], [0.1, 0.4]],  # first embedding dimension
    [[0.2, 0.3], [-0.1, 0.2]],  # second embedding dimension
    [[0.3, -0.2], [0.1, 0.4]],  # third embedding dimension
    [[0.2, 0.3], [-0.1, 0.2]],  # fourth embedding dimension
])
      

The first step in attention is creating the query, key, and value vectors from the input vectors (and positional embeddings). Let's focus on just the query transformation:

'''
Explaining the tensors' dimensions: 
b = batch size
t = sequence length
e = embedding dimensions
h = number of attention heads
d = dimensions per head
'''

q = np.einsum('bte,ehd->bthd', x, w_q) 
      

This einsum operation creates a tensor of shape [1, 2, 2, 2] where:

  • First dimension (1): batch size

  • Second dimension (2): sequence length (two words)

  • Third dimension (2): number of heads

  • Fourth dimension (2): head dimension

The resulting q tensor contains a representation for each combination of word and attention head:

  • Word 1 processed through Head 1 creates a vector of length 2

  • Word 1 processed through Head 2 creates a vector of length 2

  • Word 2 processed through Head 1 creates a vector of length 2

  • Word 2 processed through Head 2 creates a vector of length 2

So in this example, the output q tensor is:

q = [                         # batch dimension (1)
    [                         # word dimension (2)
        [                     # word 1's representations
            [0.22, -0.06],    # word 1 through head 1 (for example, 0.22 = (0.2×0.3) + (-0.1×0.2) + (0.4×0.3) + (0.3×0.2))
            [0.04, 0.28]      # word 1 through head 2
        ],
        [                     # word 2's representations
            [0.15, 0.29],     # word 2 through head 1
            [-0.10, 0.14]     # word 2 through head 2
        ]
    ]
]
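
As a quick sanity check (a minimal sketch assuming the x and w_q arrays defined above), we can confirm the shape and reproduce one entry by hand:

print(q.shape)  # (1, 2, 2, 2)

# Recompute q[0, 0, 0, 0] by hand: word 1's embedding dotted with head 1's first weight column
manual = sum(x[0, 0, e] * w_q[e, 0, 0] for e in range(4))
print(round(manual, 2))                    # 0.22
print(np.allclose(q[0, 0, 0, 0], manual))  # True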
      

Now let’s perform the equivalent computation using matrix multiplication:

# For simplicity's sake, let's explicitly define variables representing the tensor dimensions
batch_size = 1
sequence_length = 2
embedding_dim = 4
head_dim = 2
num_heads = 2

# Reshape the input tensor: [batch, seq_len, embedding_dim] -> [batch * seq_len, embedding_dim]
x_reshaped = x.reshape(-1, embedding_dim)

# Reshape the weight tensor: [embedding_dim, num_heads, head_dim] -> [embedding_dim, num_heads * head_dim]
w_q_reshaped = w_q.reshape(embedding_dim, -1)

# Perform the matrix multiplication
q_flat = np.matmul(x_reshaped, w_q_reshaped)

# Reshape to the correct output shape by separating the combined batch and sequence length dimensions and the combined num_heads and head_dim dimensions.
q_matmul = q_flat.reshape(batch_size, sequence_length, num_heads, head_dim)
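
A quick check (assuming the q and q_matmul tensors computed above) confirms that the matmul path reproduces the einsum result:

print(np.allclose(q, q_matmul))  # True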
      

The advantages of einsum: Clarity and conciseness

As you can see, the einsum notation for the query transformation is quite compact. It expresses the operation in a single line, directly reflecting the intended tensor manipulation.

  1. Direct Multi-Dimensional Operations: einsum specifies dimension interactions directly. In our example, 'bte,ehd->bthd' tells us precisely which dimensions are being multiplied and which are being summed over. With matmul, we have to manually reshape and transpose tensors to align the correct dimensions, obscuring the underlying logic.

  2. Conciseness: einsum often expresses complex operations in a single, compact expression. The matmul approach requires multiple lines of code and intermediate variables, making it more verbose.

Let's look at another step in the attention mechanism: calculating the attention scores by multiplying the query tensor q with the key tensor k (we'll assume k has the same shape as q for simplicity).

# einsum:
qk = np.einsum('bthd,bshd->bhts', q, k)

# matmul:
q_reshaped = q.transpose(0, 2, 1, 3).reshape(batch_size * num_heads, sequence_length, head_dim)  # Transpose and reshape to [2, 2, 2]
k_reshaped = k.transpose(0, 2, 1, 3).reshape(batch_size * num_heads, sequence_length, head_dim)  # Transpose and reshape to [2, 2, 2]
qk = np.matmul(q_reshaped, k_reshaped.transpose(0, 2, 1))  # Transpose and matmul: [batch*heads, seq_len, seq_len]
qk = qk.reshape(batch_size, num_heads, sequence_length, sequence_length)  # Reshape back to [1, 2, 2, 2]
      

Again, einsum provides a much more direct and understandable representation of the operation. This becomes even more valuable when working with transformer models, which use thousands of embedding dimensions across dozens of attention heads.
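
For instance, the same einsum string carries over unchanged to larger shapes (the sizes below are purely illustrative):

# Illustrative larger shapes: seq_len=512, embedding_dim=1024, 16 heads of dim 64
x_big = np.random.rand(1, 512, 1024).astype(np.float32)
w_big = np.random.rand(1024, 16, 64).astype(np.float32)

q_big = np.einsum('bte,ehd->bthd', x_big, w_big, optimize=True)
print(q_big.shape)  # (1, 512, 16, 64)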

The caveat: No performance advantage with GPUs

It's crucial to acknowledge that the advantages of einsum discussed above – clarity and conciseness – do not automatically translate to performance gains on conventional hardware (CPUs, GPUs). Deep learning compilers are highly sophisticated and can optimize matmul-based code to achieve excellent performance.

The compiler might even internally transform a matmul-based computation into something akin to a contraction if it identifies that as beneficial. And often a compiler will simply lower einsum operations to a series of optimized matmuls (or BLAS calls) to run on a GPU.

Any performance benefit comes from hardware designed specifically for tensor contraction, not merely from using einsum on its own on general-purpose hardware (such as GPUs).
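
As a rough illustration, here is a minimal timing sketch (the shapes, iteration count, and the optimize=True flag are illustrative choices; exact numbers depend on your NumPy build and BLAS backend, but the two paths typically land in the same ballpark):

import timeit
import numpy as np

# Illustrative shapes: batch=8, seq_len=512, 16 heads of dim 64
q_rand = np.random.rand(8, 512, 16, 64).astype(np.float32)
k_rand = np.random.rand(8, 512, 16, 64).astype(np.float32)

def scores_einsum():
    return np.einsum('bthd,bshd->bhts', q_rand, k_rand, optimize=True)

def scores_matmul():
    qt = q_rand.transpose(0, 2, 1, 3)  # [batch, heads, seq_len, head_dim]
    kt = k_rand.transpose(0, 2, 3, 1)  # [batch, heads, head_dim, seq_len]
    return np.matmul(qt, kt)           # [batch, heads, seq_len, seq_len]

print("einsum:", timeit.timeit(scores_einsum, number=10))
print("matmul:", timeit.timeit(scores_matmul, number=10))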

Here is a complete code snippet to compare tensor contraction and matrix multiplication for our toy example:

import numpy as np

# Define the input tensors (using random data for demonstration)
batch_size = 1
seq_len = 2
num_heads = 2
head_dim = 2
embedding_dim = 4


x = np.random.rand(batch_size, seq_len, embedding_dim)
w_q = np.random.rand(embedding_dim, num_heads, head_dim)
w_k = np.random.rand(embedding_dim, num_heads, head_dim)


# Calculate q and k tensors
q = np.einsum('bte,ehd->bthd', x, w_q)
k = np.einsum('bte,ehd->bthd', x, w_k)


# Tensor contraction (einsum) qk calculation
qk_einsum = np.einsum('bthd,bshd->bhts', q, k)


# Matrix multiplication equivalent (step by step)
q_transposed = q.transpose(0, 2, 1, 3)  # [batch, heads, seq_len, head_dim]
k_transposed = k.transpose(0, 2, 3, 1)  # [batch, heads, head_dim, seq_len]


qk_matmul = np.matmul(q_transposed, k_transposed)  # [batch, heads, seq_len, seq_len]


print("q.shape:", q.shape)
print("k.shape:", k.shape)
print("qk_matmul.shape:", qk_matmul.shape)
print("qk_einsum.shape:", qk_einsum.shape)


# Check that the results match (allowing for small floating-point differences)
print("Are results close?", np.allclose(qk_matmul, qk_einsum))


print("Matrix Multiplication result:")
print(qk_matmul)
print("Tensor Contraction result:")
print(qk_einsum)
      

Performance advantages – when combined with the right hardware

Now that we've seen the conceptual advantages of tensor contraction with einsum notation, let's explore how specialized hardware can translate this into real-world performance gains.

Furiosa’s TCP architecture is optimized to leverage tensor contractions through:

  • Efficient systolic array mapping: TCP’s specialized dataflow patterns maintain the logical structure of tensors, allowing the compiler to map operations effectively to the configurable systolic array.

  • Enhanced memory locality: Einsum notation explicitly reveals data dependencies, enabling the compiler to generate schedules that maximize data reuse in on-chip buffers, significantly reducing external memory access.

  • Dynamic adaptation: Unlike fixed-size systolic arrays, RNGD’s configurable compute units adapt to different tensor shapes to maintain high utilization across different operations (attention, convolutions, diffusion) and various deployment scenarios (like different batch sizes).

Traditional computing architectures like GPUs rely on Single Instruction, Multiple Threads (SIMT) approaches, where thousands of threads execute parallel matrix multiplication instructions. This method introduces several computational challenges, however:

  • Complex thread switching mechanisms

  • Significant overhead in executing and switching threads

  • Increased memory access complexity

  • Lower efficiency, due to less data reuse

TCP enables a more deterministic and streamlined approach to data reuse than the dynamic scheduling mechanisms in GPUs.

In a follow-up blog post, we will delve into exactly how TCP and our RNGD chip take advantage of tensor contraction to deliver these performance benefits.
