# TensorToIntelligence: The Full Path
A "Build-in-Public" deep learning education series. Each module is a first-principles implementation of a DL primitive, accompanied by a deep-dive technical blog post.
## 🛠️ Technical Stack
| Tool | Purpose |
|---|---|
| PyTorch | Tensors & Autograd (reference implementation) |
| MkDocs + Material | Documentation & Blog |
| MathJax | LaTeX rendering |
| mkdocstrings | Autodoc from docstrings |
| Jupyter Notebooks | Visual experiments |
| GitHub Pages | Hosting |
## 📚 Roadmap
### Phase 1: The Atomic Era (Primitives)
Goal: Master numerical stability and gradient flow.
| Module | Blog Entry | Code |
|---|---|---|
| Linear Layer | "Why Weight Initialization Matters (Xavier vs. Kaiming)" | Manual weight/bias init, forward pass |
| Activations | "The Gradient Vanishing Problem: From Sigmoid to SwiGLU" | ReLU, GELU, SiLU, SwiGLU |
| Loss Functions | "LogSumExp: The Secret to Numerical Stability in Cross-Entropy" | MSE, CrossEntropy, BCE |
| Normalization | "BatchNorm to RMSNorm: A History of Training Stability" | BatchNorm, LayerNorm, RMSNorm, GroupNorm |
| Regularization | "Dropout: The Surprising Power of Random Noise" | Dropout, L1/L2 penalty |
| Optimizers | "Adam vs. AdamW: Why Decoupling Weight Decay Changed Everything" | SGD, Momentum, Adam, AdamW |
| LR Schedulers | "Warmup and Decay: Navigating the Loss Landscape" | Warmup, StepLR, CosineAnnealing |
| Gradient Clipping | "Exploding Gradients: When to Clip and Why" | Norm clipping, value clipping |
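The numerical-stability theme of this phase can be previewed with a small sketch of the LogSumExp trick from the Loss Functions module. This is an illustrative NumPy version (the series itself targets PyTorch), and `cross_entropy` is a name chosen here, not code from the repo:

```python
import numpy as np

def cross_entropy(logits, target):
    """Numerically stable cross-entropy via the LogSumExp trick.

    Subtracting the row max before exponentiating keeps exp() from
    overflowing, and the shift cancels out of the final result.
    """
    m = logits.max()                            # shift for stability
    lse = m + np.log(np.exp(logits - m).sum())  # log-sum-exp of logits
    return lse - logits[target]                 # -log softmax(logits)[target]

# Logits this large would overflow a naive exp(); the shifted version is fine.
loss = cross_entropy(np.array([1000.0, 1000.0]), target=0)  # → log(2) ≈ 0.693
```

A naive `-np.log(np.exp(l[t]) / np.exp(l).sum())` produces `inf/inf = nan` on the same input, which is exactly the failure mode the module explores.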
🎯 Capstone: Train an MLP on MNIST using only your primitives. Compare to torch.nn.
### Phase 2: The Architectural Era (Vision & Sequences)
Goal: Master spatial and temporal data processing.
| Module | Blog Entry | Code |
|---|---|---|
| Convolutions | "im2col: Making Convolutions Fast" | Conv2d, depthwise, dilated |
| Pooling | "Max vs. Average: The Information Bottleneck" | MaxPool, AvgPool, GlobalPool |
| Skip Connections | "ResNet's Insight: Why Identity Mappings Enable Depth" | Residual blocks |
| CNN Architecture | "Building ResNet-18 from Scratch" | Full ResNet implementation |
| Vanilla RNN | "Backprop Through Time: The Temporal Chain Rule" | RNN cell, BPTT |
| LSTM & GRU | "Gating Mechanisms: How LSTMs Solve Short-Term Memory" | LSTM, GRU cells |
| Seq2Seq | "Encoder-Decoder: The Foundation of Translation" | Seq2Seq + Teacher Forcing |
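The im2col idea from the Convolutions module can be sketched in a few lines: unfold every patch into a column, then convolution collapses to one matrix product. A minimal NumPy illustration (stride 1, no padding, single channel; function names are mine, not the repo's):

```python
import numpy as np

def im2col(x, k):
    """Unfold every k×k patch of a 2-D input into a column (stride 1, no padding)."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d(x, kernel):
    """2-D cross-correlation as a single matrix product: kernel row @ patch columns."""
    k = kernel.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    return (kernel.ravel() @ im2col(x, k)).reshape(out_h, out_w)
```

The trade-off the blog post covers: im2col duplicates memory (each pixel appears in up to k² columns) in exchange for turning convolution into a highly optimized GEMM call.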
🎯 Capstone: Train ResNet-18 on CIFAR-10 and a character-level LSTM language model.
### Phase 3: The Attention Era (The Paradigm Shift)
Goal: Master global context and scaling.
| Module | Blog Entry | Code |
|---|---|---|
| Tokenization | "BPE, WordPiece, SentencePiece: How Text Becomes Tensors" | BPE from scratch |
| Positional Encodings | "Sinusoidal vs. Learned vs. RoPE vs. ALiBi" | All encoding variants |
| Attention Mechanism | "Attention Is All You Need? A Breakdown of Scaled Dot-Product" | Single-head attention |
| Multi-Head Attention | "Why Multiple Heads? Learned Subspaces" | MHA implementation |
| Transformer Block | "Pre-Norm vs. Post-Norm: Training Stability at Scale" | LayerNorm, FFN, residuals |
| Encoder (BERT-style) | "Bidirectional Context: Masked Language Modeling" | BERT encoder |
| Decoder (GPT-style) | "Causal Masking: How LLMs Generate Text" | GPT decoder + sampling |
| KV-Cache | "Inference Optimization: Why KV-Cache Makes LLMs Fast" | Cached generation |
| Flash Attention | "IO-Aware Attention: Tiling for Memory Efficiency" | Flash Attention concepts |
| Sparse Attention | "Longformer & BigBird: Scaling to Long Sequences" | Sparse patterns |
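The core operation of this phase, scaled dot-product attention with a causal mask, fits in a short sketch. This is a single-head NumPy illustration under my own naming, not the series' reference code:

```python
import numpy as np

def softmax(x):
    # subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal (lower-triangular) mask."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (T, T) similarities
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                           # block future positions
    weights = softmax(scores)                        # each row sums to 1
    return weights @ V, weights
```

Masking with `-inf` before the softmax (rather than zeroing afterwards) keeps each row a proper probability distribution over the visible positions, which is why the first token's output is exactly its own value vector.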
🎯 Capstone: Build a mini-GPT and train on a small text corpus. Implement streaming generation with KV-cache.
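The streaming-generation part of the capstone hinges on one observation: at step t only the new query is needed, because past keys and values can be cached. A minimal single-head NumPy sketch of that loop (names and structure are illustrative, not the capstone's actual code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_with_cache(q_seq, k_seq, v_seq):
    """Process tokens one at a time, appending each K/V row to a growing cache.

    Step t attends q_t against cached k_0..k_t only, so per-step cost is
    O(t·d) instead of recomputing the full T×T attention matrix.
    """
    d = q_seq.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(q_seq, k_seq, v_seq):
        k_cache.append(k)
        v_cache.append(v)
        K = np.stack(k_cache)              # (t+1, d)
        V = np.stack(v_cache)
        w = softmax(q @ K.T / np.sqrt(d))  # causal by construction:
        outputs.append(w @ V)              #   the cache holds only the past
    return np.stack(outputs)
```

Because the cache never contains future tokens, no explicit causal mask is needed at inference time, and the outputs match a full masked forward pass over the whole sequence.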
### Phase 4: The Generative Era (Probabilistic Models)
Goal: Master probabilistic modeling and sampling.
| Module | Blog Entry | Code |
|---|---|---|
| Autoencoders | "Compression as Learning: The Autoencoder Bottleneck" | Vanilla AE |
| VAE | "The Reparameterization Trick: Making VAEs Trainable" | VAE + latent sampling |
| GAN Basics | "Generator vs. Discriminator: The Adversarial Game" | Vanilla GAN |
| GAN Training | "Mode Collapse and Wasserstein: Stabilizing GANs" | WGAN, training tricks |
| DDPM | "Denoising as Generative Modeling: The Math of Diffusion" | Forward/reverse process |
| U-Net | "The Architecture of Noise Prediction" | U-Net with cross-attention |
| DDIM | "Deterministic Diffusion: Faster Sampling" | DDIM sampler |
| Latent Diffusion | "VAE + Diffusion: Why Latent Space Matters" | Latent diffusion pipeline |
| Classifier-Free Guidance | "CFG: How Stable Diffusion Does Conditioning" | CFG implementation |
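The reparameterization trick from the VAE module is small enough to preview here. A NumPy sketch (function name is mine, not the repo's):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma²) as z = mu + sigma · eps, with eps ~ N(0, I).

    Moving the randomness into eps makes the sample a deterministic,
    differentiable function of (mu, log_var), so gradients can flow
    through the sampling step during VAE training.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

Predicting `log_var` rather than `sigma` directly is the usual convention: the exponential guarantees a positive standard deviation without any constraint on the network output.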
🎯 Capstone: Train a VAE on MNIST, then train a DDPM to generate digits. Implement CFG-guided generation.
### Phase 5: The Frontier Era (What's Next)
Goal: Explore cutting-edge techniques.
| Module | Blog Entry | Code |
|---|---|---|
| State Space Models | "Mamba & S6: Beyond Attention?" | SSM concepts |
| Mixture of Experts | "MoE: How Mixtral Scales Efficiently" | Sparse routing |
| LoRA | "Parameter-Efficient Fine-Tuning: Adapting Giants" | LoRA implementation |
| Quantization | "INT8 & FP8: Making Models Small and Fast" | PTQ, QAT concepts |
| Speculative Decoding | "Draft Models: Parallel Generation" | Speculative sampling |
| RLHF | "From GPT to ChatGPT: Aligning with Humans" | Conceptual overview |
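The LoRA module's central equation, W' = W + (α/r)·BA, can be sketched directly. An illustrative NumPy version (shapes and the `lora_forward` name are my choices for the sketch):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Forward pass with a LoRA adapter: h = x @ (W + (alpha/r) · B @ A).T

    W (d_out, d_in) stays frozen; only the low-rank factors A (r, d_in)
    and B (d_out, r) are trained. B is conventionally initialized to
    zero, so the adapter starts as an exact no-op.
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)   # rank-r update to the frozen weight
    return x @ (W + delta).T
```

The parameter saving is the point: for d_out = d_in = 4096 and r = 8, the adapter trains 2·8·4096 ≈ 65K parameters per matrix instead of 16.8M.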
🎯 Capstone: Apply LoRA to fine-tune a small transformer. Implement INT8 quantization and measure speedup.
## 📝 Blog Entry Template
Every module follows this structure:
- The Intuition — 1 paragraph: "Why does this exist?"
- The Math — Formal LaTeX derivation of forward and backward pass
- The Code — Annotated Python implementation
- The Test — Proof of parity with `torch.nn` (numerical accuracy)
- The Experiment — Visualization (gradient flow, loss curve, generated samples)
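One pattern "The Test" step can use, alongside direct comparison against `torch.nn`, is a finite-difference gradient check: verify a hand-derived backward pass against central differences of the forward pass. A NumPy sketch using MSE as the example (helper names are illustrative):

```python
import numpy as np

def mse(pred, target):
    return ((pred - target) ** 2).mean()

def mse_grad(pred, target):
    # analytic gradient of mean squared error w.r.t. pred
    return 2.0 * (pred - target) / pred.size

def grad_check(f, grad, x, eps=1e-6):
    """Max absolute gap between an analytic gradient and central differences."""
    num = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        num.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)  # central difference
    return np.max(np.abs(num - grad(x)))

target = np.array([1.0, 2.0, 3.0])
err = grad_check(lambda p: mse(p, target),
                 lambda p: mse_grad(p, target),
                 np.zeros(3))
```

A gap near machine precision is strong evidence the derivation is right; a gap near `eps` usually means a missing factor or sign somewhere in the backward pass.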
## 📊 Progress Tracker
| Phase | Status | Modules |
|---|---|---|
| Phase 1 | 🔲 Not Started | 0/8 |
| Phase 2 | 🔲 Not Started | 0/7 |
| Phase 3 | 🔲 Not Started | 0/10 |
| Phase 4 | 🔲 Not Started | 0/9 |
| Phase 5 | 🔲 Not Started | 0/6 |
Total: 0/40 modules completed