Phase 3: The Attention Era

Goal: Master global context and scaling.

Modules

| Module | Status | Description |
| --- | --- | --- |
| Tokenization | 🔲 | BPE, SentencePiece |
| Positional Encodings | 🔲 | Sinusoidal, Learned, RoPE, ALiBi (sinusoidal sketched below) |
| Attention Mechanism | 🔲 | Scaled dot-product attention (sketched below) |
| Multi-Head Attention | 🔲 | Learned subspaces |
| Transformer Block | 🔲 | Pre-Norm, FFN, residuals |
| Encoder (BERT-style) | 🔲 | Bidirectional, MLM |
| Decoder (GPT-style) | 🔲 | Causal masking, generation |
| KV-Cache | 🔲 | Inference optimization |
| Flash Attention | 🔲 | Memory-efficient attention |
| Sparse Attention | 🔲 | Longformer, BigBird patterns |
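
For the Positional Encodings row, a minimal sketch of the sinusoidal variant, assuming PyTorch; the function name `sinusoidal_encoding` is illustrative, and an even `d_model` is assumed:

```python
import math

import torch

def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] uses cos.

    Assumes an even d_model so the sin/cos columns interleave cleanly.
    """
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # Frequencies decay geometrically from 1 down to 1/10000 across dimensions.
    freqs = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)  # even columns
    pe[:, 1::2] = torch.cos(positions * freqs)  # odd columns
    return pe
```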

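For the Attention Mechanism row, a minimal sketch of scaled dot-product attention, again assuming PyTorch; the `causal` flag is an illustrative addition that previews the Decoder module's masking:

```python
import math

import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """softmax(QK^T / sqrt(d_k)) V over (batch, seq_len, d_k) tensors."""
    d_k = q.size(-1)
    # Scale by sqrt(d_k) so the softmax stays in a well-conditioned range.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if causal:
        # Mask out future positions (GPT-style decoding).
        seq_len = q.size(-2)
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```
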
🎯 Capstone

Build a mini-GPT and train it on a small text corpus. Implement streaming generation with a KV-cache.
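
As a starting point for the streaming half of the capstone, a sketch of the per-step KV-cache pattern, assuming PyTorch; `generate_step` and the plain-dict cache are hypothetical names, and a real mini-GPT would run this inside each attention layer:

```python
import math

import torch

def generate_step(q_t, k_t, v_t, cache):
    """One decode step: append this token's K/V to the cache, then attend.

    q_t, k_t, v_t: (batch, 1, d_k) projections for the newest token.
    cache: dict of past keys/values; pass {} on the first step.
    """
    if "k" in cache:
        cache["k"] = torch.cat([cache["k"], k_t], dim=1)
        cache["v"] = torch.cat([cache["v"], v_t], dim=1)
    else:
        cache["k"], cache["v"] = k_t, v_t
    # The new query attends over all cached positions; no causal mask is
    # needed because the cache only ever holds past tokens.
    scores = q_t @ cache["k"].transpose(-2, -1) / math.sqrt(q_t.size(-1))
    return torch.softmax(scores, dim=-1) @ cache["v"]
```

Caching keys and values makes each new token cost attention work linear in the sequence length, instead of recomputing the full quadratic score matrix at every step.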