## Phase 3: The Attention Era
**Goal:** Master global context and scaling.
### Modules
| Module | Status | Description |
|---|---|---|
| Tokenization | 🔲 | BPE, SentencePiece |
| Positional Encodings | 🔲 | Sinusoidal, Learned, RoPE, ALiBi |
| Attention Mechanism | 🔲 | Scaled dot-product attention |
| Multi-Head Attention | 🔲 | Learned subspaces |
| Transformer Block | 🔲 | Pre-Norm, FFN, residuals |
| Encoder (BERT-style) | 🔲 | Bidirectional, MLM |
| Decoder (GPT-style) | 🔲 | Causal masking, generation |
| KV-Cache | 🔲 | Inference optimization |
| Flash Attention | 🔲 | Memory-efficient attention |
| Sparse Attention | 🔲 | Longformer, BigBird patterns |
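The positional-encoding module above starts from the original sinusoidal scheme, where even dimensions get sine and odd dimensions cosine at geometrically spaced wavelengths. A minimal NumPy sketch of that table (function name is mine, not from the modules):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Build the (seq_len, d_model) sinusoidal position table.

    Even columns hold sin, odd columns cos; wavelengths grow
    geometrically, controlled by the usual 10000 base.
    """
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))    # (seq_len, d_model/2)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe
```

Because the table depends only on position and dimension, it can be precomputed once and added to token embeddings of any batch.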
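The attention-mechanism module centers on one formula, softmax(QKᵀ/√d_k)·V, and the decoder module adds the causal mask on top of it. A minimal NumPy sketch of both (the function name and the optional `causal` flag are my choices for illustration):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k: (seq_len, d_k); v: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len) similarity logits
    if causal:
        # Mask strictly-future positions so token i attends only to j <= i.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax with max-subtraction for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Multi-head attention then runs this same computation in several learned subspaces in parallel and concatenates the results.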
### 🎯 Capstone
Build a mini-GPT and train it on a small text corpus. Implement streaming generation with a KV-cache.
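The point of the KV-cache in streaming generation is that each new token only needs its own query; the keys and values of the prefix are appended once and reused, so a decode step costs O(n) instead of recomputing the whole O(n²) attention. A toy single-head sketch of that loop, with random projections standing in for learned weights (all names here are hypothetical, not a prescribed API):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Toy "model": fixed random projections standing in for learned Wq/Wk/Wv.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """Process one new token embedding using the KV-cache.

    Only the new token's query is computed; its key and value are
    appended to the cache so earlier positions are never re-projected.
    """
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (cached_len, d)
    scores = K @ q / np.sqrt(d)                   # attend over the whole prefix
    scores -= scores.max()                        # numerical stability
    w = np.exp(scores)
    w /= w.sum()
    return w @ V  # attention output for the newest position only

k_cache, v_cache = [], []
for t in range(5):                    # stream 5 tokens one at a time
    x = rng.normal(size=d)            # stand-in for an embedded sampled token
    out = decode_step(x, k_cache, v_cache)
```

In a real mini-GPT this sits inside each transformer block, with one cache per layer and per head, and `x` comes from embedding the token sampled at the previous step.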