# TensorToIntelligence: The Full Path
A "Build-in-Public" deep learning education series. Each module is a first-principles implementation of a DL primitive, accompanied by a deep-dive technical blog post.
## 🛠️ Technical Stack
| Tool | Purpose |
|---|---|
| PyTorch | Tensors & Autograd (reference implementation) |
| MkDocs + Material | Documentation & Blog |
| MathJax | LaTeX rendering |
| mkdocstrings | Autodoc from docstrings |
| Jupyter Notebooks | Visual experiments |
| GitHub Pages | Hosting |
## 📚 Roadmap
### Phase 1: The Atomic Era (Primitives)
Goal: Master numerical stability and gradient flow.
| Module | Blog Entry | Code |
|---|---|---|
| Linear Layer | "Why Weight Initialization Matters (Xavier vs. Kaiming)" | Manual weight/bias init, forward pass |
| Activations | "The Gradient Vanishing Problem: From Sigmoid to SwiGLU" | ReLU, GELU, SiLU, SwiGLU |
| Loss Functions | "LogSumExp: The Secret to Numerical Stability in Cross-Entropy" | MSE, CrossEntropy, BCE |
| Normalization | "BatchNorm to RMSNorm: A History of Training Stability" | BatchNorm, LayerNorm, RMSNorm, GroupNorm |
| Regularization | "Dropout: The Surprising Power of Random Noise" | Dropout, L1/L2 penalty |
| Optimizers | "Adam vs. AdamW: Why Decoupling Weight Decay Changed Everything" | SGD, Momentum, Adam, AdamW |
| LR Schedulers | "Warmup and Decay: Navigating the Loss Landscape" | Warmup, StepLR, CosineAnnealing |
| Gradient Clipping | "Exploding Gradients: When to Clip and Why" | Norm clipping, value clipping |
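The numerical-stability theme of this phase can be previewed with a small sketch of the LogSumExp trick from the Loss Functions module. This is an illustrative NumPy version (the series itself targets PyTorch), and `cross_entropy` is a name chosen here, not code from the repo:

```python
import numpy as np

def cross_entropy(logits, target):
    """Numerically stable cross-entropy via the LogSumExp trick.

    Subtracting the row max before exponentiating keeps exp() from
    overflowing, and the shift cancels out of the final result.
    """
    m = logits.max()                            # shift for stability
    lse = m + np.log(np.exp(logits - m).sum())  # log-sum-exp of logits
    return lse - logits[target]                 # -log softmax(logits)[target]

# Logits this large would overflow a naive exp(); the shifted version is fine.
loss = cross_entropy(np.array([1000.0, 1000.0]), target=0)  # → log(2) ≈ 0.693
```

A naive `-np.log(np.exp(l[t]) / np.exp(l).sum())` produces `inf/inf = nan` on the same input, which is exactly the failure mode the module explores.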
🎯 Capstone: Train an MLP on MNIST using only your primitives. Compare to torch.nn.
### Phase 2: The Architectural Era (Vision & Sequences)
Goal: Master spatial and temporal data processing.
| Module | Blog Entry | Code |
|---|---|---|
| Convolutions | "im2col: Making Convolutions Fast" | Conv2d, depthwise, dilated |
| Pooling | "Max vs. Average: The Information Bottleneck" | MaxPool, AvgPool, GlobalPool |
| Skip Connections | "ResNet's Insight: Why Identity Mappings Enable Depth" | Residual blocks |
| CNN Architecture | "Building ResNet-18 from Scratch" | Full ResNet implementation |
| Vanilla RNN | "Backprop Through Time: The Temporal Chain Rule" | RNN cell, BPTT |
| LSTM & GRU | "Gating Mechanisms: How LSTMs Solve Short-Term Memory" | LSTM, GRU cells |
| Seq2Seq | "Encoder-Decoder: The Foundation of Translation" | Seq2Seq + Teacher Forcing |
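The im2col idea from the Convolutions module can be sketched in a few lines: unfold every patch into a column, then convolution collapses to one matrix product. A minimal NumPy illustration (stride 1, no padding, single channel; function names are mine, not the repo's):

```python
import numpy as np

def im2col(x, k):
    """Unfold every k×k patch of a 2-D input into a column (stride 1, no padding)."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d(x, kernel):
    """2-D cross-correlation as a single matrix product: kernel row @ patch columns."""
    k = kernel.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    return (kernel.ravel() @ im2col(x, k)).reshape(out_h, out_w)
```

The trade-off the blog post covers: im2col duplicates memory (each pixel appears in up to k² columns) in exchange for turning convolution into a highly optimized GEMM call.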
🎯 Capstone: Train ResNet-18 on CIFAR-10 and a character-level LSTM language model.
### Phase 3: The Attention Era (The Paradigm Shift)
Goal: Master global context and scaling.
| Module | Blog Entry | Code |
|---|---|---|
| Tokenization | "BPE, WordPiece, SentencePiece: How Text Becomes Tensors" | BPE from scratch |
| Positional Encodings | "Sinusoidal vs. Learned vs. RoPE vs. ALiBi" | All encoding variants |
| Attention Mechanism | "Attention Is All You Need? A Breakdown of Scaled Dot-Product" | Single-head attention |
| Multi-Head Attention | "Why Multiple Heads? Learned Subspaces" | MHA implementation |
| Transformer Block | "Pre-Norm vs. Post-Norm: Training Stability at Scale" | LayerNorm, FFN, residuals |
| Encoder (BERT-style) | "Bidirectional Context: Masked Language Modeling" | BERT encoder |
| Decoder (GPT-style) | "Causal Masking: How LLMs Generate Text" | GPT decoder + sampling |
| KV-Cache | "Inference Optimization: Why KV-Cache Makes LLMs Fast" | Cached generation |
| Flash Attention | "IO-Aware Attention: Tiling for Memory Efficiency" | Flash Attention concepts |
| Sparse Attention | "Longformer & BigBird: Scaling to Long Sequences" | Sparse patterns |
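The core operation of this phase, scaled dot-product attention with a causal mask, fits in a short sketch. This is a single-head NumPy illustration under my own naming, not the series' reference code:

```python
import numpy as np

def softmax(x):
    # subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal (lower-triangular) mask."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (T, T) similarities
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                           # block future positions
    weights = softmax(scores)                        # each row sums to 1
    return weights @ V, weights
```

Masking with `-inf` before the softmax (rather than zeroing afterwards) keeps each row a proper probability distribution over the visible positions, which is why the first token's output is exactly its own value vector.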
🎯 Capstone: Build a mini-GPT and train on a small text corpus. Implement streaming generation with KV-cache.
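The streaming-generation part of the capstone hinges on one observation: at step t only the new query is needed, because past keys and values can be cached. A minimal single-head NumPy sketch of that loop (names and structure are illustrative, not the capstone's actual code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_with_cache(q_seq, k_seq, v_seq):
    """Process tokens one at a time, appending each K/V row to a growing cache.

    Step t attends q_t against cached k_0..k_t only, so per-step cost is
    O(t·d) instead of recomputing the full T×T attention matrix.
    """
    d = q_seq.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(q_seq, k_seq, v_seq):
        k_cache.append(k)
        v_cache.append(v)
        K = np.stack(k_cache)              # (t+1, d)
        V = np.stack(v_cache)
        w = softmax(q @ K.T / np.sqrt(d))  # causal by construction:
        outputs.append(w @ V)              #   the cache holds only the past
    return np.stack(outputs)
```

Because the cache never contains future tokens, no explicit causal mask is needed at inference time, and the outputs match a full masked forward pass over the whole sequence.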
### Phase 4: The Generative Era (Probabilistic Models)
Goal: Master probabilistic modeling and sampling.
| Module | Blog Entry | Code |
|---|---|---|
| Autoencoders | "Compression as Learning: The Autoencoder Bottleneck" | Vanilla AE |
| VAE | "The Reparameterization Trick: Making VAEs Trainable" | VAE + latent sampling |
| GAN Basics | "Generator vs. Discriminator: The Adversarial Game" | Vanilla GAN |
| GAN Training | "Mode Collapse and Wasserstein: Stabilizing GANs" | WGAN, training tricks |
| DDPM | "Denoising as Generative Modeling: The Math of Diffusion" | Forward/reverse process |
| U-Net | "The Architecture of Noise Prediction" | U-Net with cross-attention |
| DDIM | "Deterministic Diffusion: Faster Sampling" | DDIM sampler |
| Latent Diffusion | "VAE + Diffusion: Why Latent Space Matters" | Latent diffusion pipeline |
| Classifier-Free Guidance | "CFG: How Stable Diffusion Does Conditioning" | CFG implementation |
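The reparameterization trick from the VAE module is small enough to preview here. A NumPy sketch (function name is mine, not the repo's):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma²) as z = mu + sigma · eps, with eps ~ N(0, I).

    Moving the randomness into eps makes the sample a deterministic,
    differentiable function of (mu, log_var), so gradients can flow
    through the sampling step during VAE training.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

Predicting `log_var` rather than `sigma` directly is the usual convention: the exponential guarantees a positive standard deviation without any constraint on the network output.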
🎯 Capstone: Train a VAE on MNIST, then train a DDPM to generate digits. Implement CFG-guided generation.
### Phase 5: The Frontier Era (What's Next)
Goal: Explore cutting-edge techniques.
| Module | Blog Entry | Code |
|---|---|---|
| State Space Models | "Mamba & S6: Beyond Attention?" | SSM concepts |
| Mixture of Experts | "MoE: How Mixtral Scales Efficiently" | Sparse routing |
| LoRA | "Parameter-Efficient Fine-Tuning: Adapting Giants" | LoRA implementation |
| Quantization | "INT8 & FP8: Making Models Small and Fast" | PTQ, QAT concepts |
| Speculative Decoding | "Draft Models: Parallel Generation" | Speculative sampling |
| RLHF | "From GPT to ChatGPT: Aligning with Humans" | Conceptual overview |
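The LoRA module's central equation, W' = W + (α/r)·BA, can be sketched directly. An illustrative NumPy version (shapes and the `lora_forward` name are my choices for the sketch):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Forward pass with a LoRA adapter: h = x @ (W + (alpha/r) · B @ A).T

    W (d_out, d_in) stays frozen; only the low-rank factors A (r, d_in)
    and B (d_out, r) are trained. B is conventionally initialized to
    zero, so the adapter starts as an exact no-op.
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)   # rank-r update to the frozen weight
    return x @ (W + delta).T
```

The parameter saving is the point: for d_out = d_in = 4096 and r = 8, the adapter trains 2·8·4096 ≈ 65K parameters per matrix instead of 16.8M.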
🎯 Capstone: Apply LoRA to fine-tune a small transformer. Implement INT8 quantization and measure speedup.
## 📝 Blog Entry Template
Every module follows this structure:
- The Intuition — 1 paragraph: "Why does this exist?"
- The Math — Formal LaTeX derivation of forward and backward pass
- The Code — Annotated Python implementation
- The Test — Proof of parity with `torch.nn` (numerical accuracy)
- The Experiment — Visualization (gradient flow, loss curve, generated samples)
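One pattern "The Test" step can use, alongside direct comparison against `torch.nn`, is a finite-difference gradient check: verify a hand-derived backward pass against central differences of the forward pass. A NumPy sketch using MSE as the example (helper names are illustrative):

```python
import numpy as np

def mse(pred, target):
    return ((pred - target) ** 2).mean()

def mse_grad(pred, target):
    # analytic gradient of mean squared error w.r.t. pred
    return 2.0 * (pred - target) / pred.size

def grad_check(f, grad, x, eps=1e-6):
    """Max absolute gap between an analytic gradient and central differences."""
    num = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        num.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)  # central difference
    return np.max(np.abs(num - grad(x)))

target = np.array([1.0, 2.0, 3.0])
err = grad_check(lambda p: mse(p, target),
                 lambda p: mse_grad(p, target),
                 np.zeros(3))
```

A gap near machine precision is strong evidence the derivation is right; a gap near `eps` usually means a missing factor or sign somewhere in the backward pass.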
## 📊 Progress Tracker
| Phase | Status | Modules |
|---|---|---|
| Phase 1 | 🔲 Not Started | 0/8 |
| Phase 2 | 🔲 Not Started | 0/7 |
| Phase 3 | 🔲 Not Started | 0/10 |
| Phase 4 | 🔲 Not Started | 0/9 |
| Phase 5 | 🔲 Not Started | 0/6 |
Total: 0/40 modules completed