Attention Is All You Need, Annotated
The 2017 Transformer paper introduced a mechanism so elegant it became the backbone of virtually every frontier model since. Here is the core of it, unpacked.
Scaled dot-product attention
Given queries $Q$, keys $K$, and values $V$, attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
The $1/\sqrt{d_k}$ scaling prevents the dot products from growing large in magnitude as $d_k$ increases: for unit-variance components, $q \cdot k$ has variance $d_k$. Without it, the softmax saturates and its gradients shrink toward zero.
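A quick numerical check of that claim (a minimal sketch; the choice of $d_k = 512$, sixteen keys, and the softmax helper are illustrative, not from the paper): with unit-variance random vectors, unscaled logits have standard deviation $\sqrt{d_k} \approx 22.6$, enough to collapse the softmax to nearly one-hot.

import numpy as np

np.random.seed(0)
d_k = 512
q = np.random.randn(d_k)          # one query
K = np.random.randn(16, d_k)      # sixteen keys

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

raw = K @ q                  # logits with variance ~ d_k
scaled = raw / np.sqrt(d_k)  # logits with variance ~ 1

print(softmax(raw).max())    # typically ~1.0: near one-hot, saturated
print(softmax(scaled).max()) # well below 1: mass spread across keys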
Toy implementation in NumPy
import numpy as np
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq)
    # numerically stable softmax
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
np.random.seed(0)
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)

Multi-head attention
Rather than running one attention operation, the model runs $h$ independent heads in parallel:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)$$
Each head projects into a lower-dimensional space ($d_k = d_{\text{model}} / h$) and learns to attend to different positional and semantic relationships. Eight heads at $d_{\text{model}} = 512$ give the model eight distinct 64-dimensional views of each token pair.
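Here is a minimal sketch of multi-head attention in the same NumPy style, reusing scaled_dot_product_attention from above. The name multi_head_attention and the random (rather than learned) projection matrices are illustrative assumptions. It also uses the common trick of fusing the $h$ per-head projections $W_i^Q$ into one $d_{\text{model}} \times d_{\text{model}}$ matrix and splitting the result, which is equivalent to the paper's per-head formulation:

import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    # x: (seq, d_model); each W: (d_model, d_model), standing in for learned weights
    seq_len, d_model = x.shape
    d_k = d_model // n_heads
    # fused projection, then split the feature axis into heads: (heads, seq, d_k)
    Q = (x @ W_q).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
    K = (x @ W_k).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
    V = (x @ W_v).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
    # independent scaled dot-product attention per head
    heads = [scaled_dot_product_attention(Q[i], K[i], V[i]) for i in range(n_heads)]
    # concatenate head outputs and mix them with the output projection
    return np.concatenate(heads, axis=-1) @ W_o  # (seq, d_model)

np.random.seed(0)
seq_len, d_model, n_heads = 4, 512, 8
x = np.random.randn(seq_len, d_model)
W_q, W_k, W_v, W_o = [np.random.randn(d_model, d_model) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (4, 512)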
What the paper actually proved
The empirical claim was modest: a pure attention architecture could match or beat recurrent and convolutional seq2seq models on translation benchmarks at a fraction of the training time, thanks to full parallelism across sequence positions.
The deeper claim — that attention is a general-purpose sequence model — took the next five years of scaling to fully validate.