
Attention Is All You Need, Annotated

The 2017 Transformer paper introduced a mechanism so elegant it became the backbone of virtually every frontier model since. Here is the core of it, unpacked.

Scaled dot-product attention

Given queries Q, keys K, and values V, attention is:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

The \frac{1}{\sqrt{d_k}} factor counteracts the growth in dot-product magnitude as d_k increases. Without it the softmax saturates and its gradients shrink toward zero.
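
A quick numeric check makes the point. This is a minimal sketch (the sequence length of 10 and the two d_k values are arbitrary choices): the spread of the raw dot products grows roughly with \sqrt{d_k}, while the scaled scores stay near unit scale.

import numpy as np

np.random.seed(0)
for d_k in (8, 512):
    q = np.random.randn(d_k)             # a single query
    keys = np.random.randn(10, d_k)      # ten keys
    raw = keys @ q                       # unscaled dot products
    scaled = raw / np.sqrt(d_k)          # scaled as in the paper
    print(d_k, round(raw.std(), 1), round(scaled.std(), 1))
# the raw scores' spread grows with sqrt(d_k);
# the scaled scores stay near 1, so the softmax keeps useful gradients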

Toy implementation in NumPy

import numpy as np
 
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq, seq)
    # numerically stable softmax
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
 
np.random.seed(0)
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
 
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)

Multi-head attention

Rather than running one attention operation, the model runs h independent heads in parallel:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O
\text{head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)

Each head projects into a lower-dimensional space (d_k = d_{\text{model}} / h) and learns to attend to different positional and semantic relationships. Eight heads at d_k = 64 give the model eight distinct views of each token pair.
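
Continuing the NumPy sketch, a minimal multi-head wrapper could look like the following. It reuses scaled_dot_product_attention from above; the random matrices stand in for the learned projections, and slicing columns of a single (d_model, d_model) weight matrix is just a compact way of holding the per-head W_i^Q, W_i^K, W_i^V.

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    d_model = Q.shape[-1]
    d_k = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)   # this head's slice of the projections
        heads.append(scaled_dot_product_attention(
            Q @ W_q[:, sl], K @ W_k[:, sl], V @ W_v[:, sl]))
    # concatenate the h outputs back to (seq, d_model), then project with W^O
    return np.concatenate(heads, axis=-1) @ W_o

np.random.seed(0)
seq_len, d_model, h = 4, 64, 8
X = np.random.randn(seq_len, d_model)
W_q, W_k, W_v, W_o = [np.random.randn(d_model, d_model) for _ in range(4)]

out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (4, 64)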

What the paper actually proved

The empirical claim was modest: a pure attention architecture could match or beat recurrent and convolutional seq2seq models on translation benchmarks at a fraction of the training time, thanks to full parallelism across sequence positions.

The deeper claim — that attention is a general-purpose sequence model — took the next five years of scaling to fully validate.