
# TensorToIntelligence: The Full Path

A "Build-in-Public" deep learning education series. Each module is a first-principles implementation of a DL primitive, accompanied by a deep-dive technical blog post.

## 🛠️ Technical Stack

| Tool | Purpose |
| --- | --- |
| PyTorch | Tensors & autograd (reference implementation) |
| MkDocs + Material | Documentation & blog |
| MathJax | LaTeX rendering |
| mkdocstrings | Autodoc from docstrings |
| Jupyter Notebooks | Visual experiments |
| GitHub Pages | Hosting |

## 📚 Roadmap


### Phase 1: The Atomic Era (Primitives)

Goal: Master numerical stability and gradient flow.

| Module | Blog Entry | Code |
| --- | --- | --- |
| Linear Layer | "Why Weight Initialization Matters (Xavier vs. Kaiming)" | Manual weight/bias init, forward pass |
| Activations | "The Gradient Vanishing Problem: From Sigmoid to SwiGLU" | ReLU, GELU, SiLU, SwiGLU |
| Loss Functions | "LogSumExp: The Secret to Numerical Stability in Cross-Entropy" | MSE, CrossEntropy, BCE |
| Normalization | "BatchNorm to RMSNorm: A History of Training Stability" | BatchNorm, LayerNorm, RMSNorm, GroupNorm |
| Regularization | "Dropout: The Surprising Power of Random Noise" | Dropout, L1/L2 penalty |
| Optimizers | "Adam vs. AdamW: Why Decoupling Weight Decay Changed Everything" | SGD, Momentum, Adam, AdamW |
| LR Schedulers | "Warmup and Decay: Navigating the Loss Landscape" | Warmup, StepLR, CosineAnnealing |
| Gradient Clipping | "Exploding Gradients: When to Clip and Why" | Norm clipping, value clipping |

🎯 Capstone: Train an MLP on MNIST using only your primitives. Compare to torch.nn.
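As a taste of the stability work in this phase, here is a minimal pure-Python sketch (plain lists and `math`, no PyTorch — not the module's actual code) of the LogSumExp trick behind the Loss Functions entry:

```python
import math

def log_softmax(logits):
    # Subtract the max before exponentiating so exp() never overflows,
    # then apply log-sum-exp to get log-probabilities in one stable pass.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cross_entropy(logits, target):
    # Negative log-probability of the target class.
    return -log_softmax(logits)[target]
```

A naive `exp(1000.0)` overflows to `inf`; with the max subtracted first, `cross_entropy([1000.0, 0.0], 0)` stays finite and close to zero, as expected when the target class dominates.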


### Phase 2: The Architectural Era (Vision & Sequences)

Goal: Master spatial and temporal data processing.

| Module | Blog Entry | Code |
| --- | --- | --- |
| Convolutions | "im2col: Making Convolutions Fast" | Conv2d, depthwise, dilated |
| Pooling | "Max vs. Average: The Information Bottleneck" | MaxPool, AvgPool, GlobalPool |
| Skip Connections | "ResNet's Insight: Why Identity Mappings Enable Depth" | Residual blocks |
| CNN Architecture | "Building ResNet-18 from Scratch" | Full ResNet implementation |
| Vanilla RNN | "Backprop Through Time: The Temporal Chain Rule" | RNN cell, BPTT |
| LSTM & GRU | "Gating Mechanisms: How LSTMs Solve Short-Term Memory" | LSTM, GRU cells |
| Seq2Seq | "Encoder-Decoder: The Foundation of Translation" | Seq2Seq + teacher forcing |

🎯 Capstone: Train ResNet-18 on CIFAR-10 and a character-level LSTM language model.
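The im2col idea from the Convolutions entry can be sketched in a few lines of pure Python (single channel, stride 1, no padding — an illustration, not the module's implementation):

```python
def im2col(img, k):
    # Unroll every k×k patch of a 2-D image into a row, so that
    # convolution becomes one matrix product — the trick behind fast Conv2d.
    h, w = len(img), len(img[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([img[i + di][j + dj] for di in range(k) for dj in range(k)])
    return rows

def conv2d(img, kernel):
    # Cross-correlation via im2col: dot each unrolled patch with the
    # flattened kernel, then reshape the flat output back to 2-D.
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(img[0]) - k + 1
    flat_out = [sum(a * b for a, b in zip(patch, flat_kernel))
                for patch in im2col(img, k)]
    return [flat_out[i:i + out_w] for i in range(0, len(flat_out), out_w)]
```

For example, `conv2d([[1, 2], [3, 4]], [[1, 1], [1, 1]])` sums the single 2×2 patch and returns `[[10]]`.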


### Phase 3: The Attention Era (The Paradigm Shift)

Goal: Master global context and scaling.

| Module | Blog Entry | Code |
| --- | --- | --- |
| Tokenization | "BPE, WordPiece, SentencePiece: How Text Becomes Tensors" | BPE from scratch |
| Positional Encodings | "Sinusoidal vs. Learned vs. RoPE vs. ALiBi" | All encoding variants |
| Attention Mechanism | "Attention Is All You Need? A Breakdown of Scaled Dot-Product" | Single-head attention |
| Multi-Head Attention | "Why Multiple Heads? Learned Subspaces" | MHA implementation |
| Transformer Block | "Pre-Norm vs. Post-Norm: Training Stability at Scale" | LayerNorm, FFN, residuals |
| Encoder (BERT-style) | "Bidirectional Context: Masked Language Modeling" | BERT encoder |
| Decoder (GPT-style) | "Causal Masking: How LLMs Generate Text" | GPT decoder + sampling |
| KV-Cache | "Inference Optimization: Why KV-Cache Makes LLMs Fast" | Cached generation |
| Flash Attention | "IO-Aware Attention: Tiling for Memory Efficiency" | Flash Attention concepts |
| Sparse Attention | "Longformer & BigBird: Scaling to Long Sequences" | Sparse patterns |

🎯 Capstone: Build a mini-GPT and train on a small text corpus. Implement streaming generation with KV-cache.
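The core of this phase, scaled dot-product attention softmax(QKᵀ/√d)·V, fits in a short pure-Python sketch (lists of lists instead of tensors, no batching or masking — an illustration, not the module's code):

```python
import math

def softmax(xs):
    # Max-subtracted softmax for numerical stability.
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    # Scaled dot-product attention, one query row at a time:
    # weights = softmax(q·kᵀ / sqrt(d)); output = weighted sum of value rows.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out
```

With a single query/key pair the softmax weight is 1, so the output is just the value row; with a query that strongly matches one key, the output converges to that key's value.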


### Phase 4: The Generative Era (Probabilistic Models)

Goal: Master probabilistic modeling and sampling.

| Module | Blog Entry | Code |
| --- | --- | --- |
| Autoencoders | "Compression as Learning: The Autoencoder Bottleneck" | Vanilla AE |
| VAE | "The Reparameterization Trick: Making VAEs Trainable" | VAE + latent sampling |
| GAN Basics | "Generator vs. Discriminator: The Adversarial Game" | Vanilla GAN |
| GAN Training | "Mode Collapse and Wasserstein: Stabilizing GANs" | WGAN, training tricks |
| DDPM | "Denoising as Generative Modeling: The Math of Diffusion" | Forward/reverse process |
| U-Net | "The Architecture of Noise Prediction" | U-Net with cross-attention |
| DDIM | "Deterministic Diffusion: Faster Sampling" | DDIM sampler |
| Latent Diffusion | "VAE + Diffusion: Why Latent Space Matters" | Latent diffusion pipeline |
| Classifier-Free Guidance | "CFG: How Stable Diffusion Does Conditioning" | CFG implementation |

🎯 Capstone: Train a VAE on MNIST, then train a DDPM to generate digits. Implement CFG-guided generation.
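The reparameterization trick from the VAE entry can be sketched without any autograd machinery (scalar case, stdlib only — an illustration, not the module's code):

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    # z = mu + sigma * eps keeps the sample differentiable w.r.t. mu and
    # log_var: the randomness lives entirely in eps ~ N(0, 1), so gradients
    # flow through the deterministic mu and sigma paths.
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps
```

With a very negative `log_var` the standard deviation collapses toward 0 and the sample reduces to `mu`; in a real VAE, `mu` and `log_var` are vectors produced by the encoder.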


### Phase 5: The Frontier Era (What's Next)

Goal: Explore cutting-edge techniques.

| Module | Blog Entry | Code |
| --- | --- | --- |
| State Space Models | "Mamba & S6: Beyond Attention?" | SSM concepts |
| Mixture of Experts | "MoE: How Mixtral Scales Efficiently" | Sparse routing |
| LoRA | "Parameter-Efficient Fine-Tuning: Adapting Giants" | LoRA implementation |
| Quantization | "INT8 & FP8: Making Models Small and Fast" | PTQ, QAT concepts |
| Speculative Decoding | "Draft Models: Parallel Generation" | Speculative sampling |
| RLHF | "From GPT to ChatGPT: Aligning with Humans" | Conceptual overview |

🎯 Capstone: Apply LoRA to fine-tune a small transformer. Implement INT8 quantization and measure speedup.
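The LoRA forward pass y = x·W + (α/r)·x·A·B is simple enough to sketch with plain nested lists (no PyTorch, no gradients — an illustration of the idea, not the module's code):

```python
def matmul(A, B):
    # Plain nested-list matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_forward(x, W, A, B, alpha, r):
    # y = x·W + (alpha/r)·x·A·B. Only the low-rank factors A (d×r) and
    # B (r×d) are trained; the frozen pretrained weight W is never touched.
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    s = alpha / r
    return [[b + s * d for b, d in zip(br, dr)] for br, dr in zip(base, delta)]
```

Initializing `B` to zeros (as LoRA does) makes the adapted model start out exactly equal to the frozen base model, so fine-tuning begins from the pretrained behavior.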


## 📝 Blog Entry Template

Every module follows this structure:

  1. The Intuition — 1 paragraph: "Why does this exist?"
  2. The Math — Formal LaTeX derivation of forward and backward pass
  3. The Code — Annotated Python implementation
  4. The Test — Proof of parity with torch.nn (numerical accuracy)
  5. The Experiment — Visualization (gradient flow, loss curve, generated samples)

## 📊 Progress Tracker

| Phase | Status | Modules |
| --- | --- | --- |
| Phase 1 | 🔲 Not Started | 0/8 |
| Phase 2 | 🔲 Not Started | 0/7 |
| Phase 3 | 🔲 Not Started | 0/10 |
| Phase 4 | 🔲 Not Started | 0/9 |
| Phase 5 | 🔲 Not Started | 0/6 |

Total: 0/40 modules completed