Attention Is All You Need, Annotated
The 2017 Transformer paper introduced a mechanism so elegant it became the backbone of virtually every frontier model since. Here is the core of it, unpacked.
Scaled dot-product attention
Given queries $Q$, keys $K$, and values $V$, attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
The $1/\sqrt{d_k}$ scaling prevents the dot products from growing large in magnitude as $d_k$ increases: for unit-variance components, $q \cdot k$ has variance $d_k$. Without it, the softmax saturates and its gradients shrink toward zero.
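A quick numerical check of that claim (a minimal sketch; the choice of $d_k = 512$, sixteen keys, and the softmax helper are illustrative, not from the paper): with unit-variance random vectors, unscaled logits have standard deviation $\sqrt{d_k} \approx 22.6$, enough to collapse the softmax to nearly one-hot.

import numpy as np

np.random.seed(0)
d_k = 512
q = np.random.randn(d_k)          # one query
K = np.random.randn(16, d_k)      # sixteen keys

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

raw = K @ q                  # logits with variance ~ d_k
scaled = raw / np.sqrt(d_k)  # logits with variance ~ 1

print(softmax(raw).max())    # typically ~1.0: near one-hot, saturated
print(softmax(scaled).max()) # well below 1: mass spread across keys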
Toy implementation in NumPy
import numpy as np
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq)
    # numerically stable softmax
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
np.random.seed(0)
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)

Multi-head attention
Rather than running one attention operation, the model runs $h$ independent heads in parallel:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)$$
Each head projects into a lower-dimensional space ($d_k = d_{\text{model}} / h$) and learns to attend to different positional and semantic relationships. Eight heads at $d_{\text{model}} = 512$ give the model eight distinct 64-dimensional views of each token pair.
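Here is a minimal sketch of multi-head attention in the same NumPy style, reusing scaled_dot_product_attention from above. The name multi_head_attention and the random (rather than learned) projection matrices are illustrative assumptions. It also uses the common trick of fusing the $h$ per-head projections $W_i^Q$ into one $d_{\text{model}} \times d_{\text{model}}$ matrix and splitting the result, which is equivalent to the paper's per-head formulation:

import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    # x: (seq, d_model); each W: (d_model, d_model), standing in for learned weights
    seq_len, d_model = x.shape
    d_k = d_model // n_heads
    # fused projection, then split the feature axis into heads: (heads, seq, d_k)
    Q = (x @ W_q).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
    K = (x @ W_k).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
    V = (x @ W_v).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
    # independent scaled dot-product attention per head
    heads = [scaled_dot_product_attention(Q[i], K[i], V[i]) for i in range(n_heads)]
    # concatenate head outputs and mix them with the output projection
    return np.concatenate(heads, axis=-1) @ W_o  # (seq, d_model)

np.random.seed(0)
seq_len, d_model, n_heads = 4, 512, 8
x = np.random.randn(seq_len, d_model)
W_q, W_k, W_v, W_o = [np.random.randn(d_model, d_model) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (4, 512)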
What the paper actually proved
The empirical claim was modest: a pure attention architecture could match or beat recurrent and convolutional seq2seq models on translation benchmarks at a fraction of the training time, thanks to full parallelism across sequence positions.
The deeper claim — that attention is a general-purpose sequence model — took the next five years of scaling to fully validate.