Self-attention: the core innovation
The key insight of the Transformer (Vaswani et al., 2017) is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input when producing each output. When processing the word "bank" in a sentence, attention allows the model to look at surrounding context ("river" vs. "money") and dynamically determine the relevant meaning.
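The weighting described above can be sketched in a few lines. This is a minimal illustration, not a full implementation: it assumes scaled dot-product attention and, for brevity, skips the learned query/key/value projections, so each token vector acts as its own query, key, and value. The function names and the 2-dimensional toy embeddings are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with identity projections (an assumption).

    x: list of token vectors, each a list of floats of equal length.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Similarity of this token (the query) to every token (the keys).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all token vectors (the values).
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights

# Hypothetical toy embeddings: "bank" is placed closer to "river" than to "money".
tokens = ["river", "bank", "money"]
x = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
outputs, weights = self_attention(x)
```

With these toy vectors, the attention row for "bank" puts more weight on "river" than on "money", simply because its embedding was chosen to lie closer to "river". In a trained model the embeddings and projections are learned, so the weights reflect the actual context.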
- Multi-head attention: multiple attention heads run in parallel, each learning different relationship types.
- Positional encoding: since Transformers process all tokens simultaneously, positional information is injected to preserve word order.
- Encoder-decoder: the original Transformer used both an encoder and a decoder; modern LLMs typically use decoder-only architectures.
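To make the positional-encoding point concrete, here is a hedged sketch of one way to inject position: the fixed sinusoidal scheme from the original Transformer paper, where dimension pairs (2k, 2k+1) hold the sine and cosine of pos / 10000^(2k/d_model). Other schemes exist (learned or relative positions, for example), and the function names here are illustrative rather than taken from any library.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: sine on even, cosine on odd.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add each position's encoding to the token embedding at that position."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]
```

Because the encoding is added to the embeddings before attention, two occurrences of the same token at different positions end up with different input vectors, which is what lets an otherwise order-blind attention layer distinguish word order.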