mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-05-26 06:00:54 -05:00

Vijay Janapa Reddi 9767c78155 Add milestone system with clean architecture

- Single source of truth in milestone_tracker.py
- Zero code duplication across codebase
- Clean API: check_module_export(module_name, console)
- Gamified learning experience through ML history
- Progressive unlocking of 5 major milestones
- Comprehensive documentation for students and developers
- Integration with module workflow and CLI commands

2025-11-22 20:29:34 -05:00


From Perceptron to Transformer: How Neural Networks Evolved

This document traces five key innovations in neural network history and shows how each breakthrough solved a specific problem left by its predecessor. The goal is not to list milestones but to show how they connect.

1957: PERCEPTRON
┌─────────────────┐
│   Input (x)     │
│       ↓         │
│   w·x + b       │  ← Single neuron, linear decision boundary
│       ↓         │
│   Output (y)    │
└─────────────────┘
Problem: Can't learn XOR (non-linear patterns)
         ↓

1986: BACKPROPAGATION (XOR)
┌─────────────────┐
│   Input (x)     │
│       ↓         │
│   Linear(2,4)   │  ← Hidden layer
│       ↓         │
│   Tanh()        │  ← Non-linearity (KEY!)
│       ↓         │
│   Linear(4,1)   │
│       ↓         │
│   Sigmoid()     │
│       ↓         │
│   Output (y)    │
└─────────────────┘
Problem: Only tested on toy problems
         ↓

1989: MLP ON REAL DATA
┌─────────────────┐
│  Image (8×8)    │
│       ↓         │
│   Flatten()     │  ← Loses spatial structure!
│       ↓         │
│  Linear(64,128) │
│       ↓         │
│   ReLU()        │
│       ↓         │
│  Linear(128,64) │
│       ↓         │
│   ReLU()        │
│       ↓         │
│  Linear(64,10)  │  ← 10 classes
│       ↓         │
│   Softmax       │
└─────────────────┘
Problem: Flattening destroys spatial relationships
         ↓

1998: CONVOLUTIONAL NETWORKS
┌─────────────────┐
│  Image (1,8,8)  │  ← Preserves 2D structure!
│       ↓         │
│ Conv2d(1,8,3×3) │  ← Spatial filters
│       ↓         │
│   ReLU()        │
│       ↓         │
│ MaxPool2d(2×2)  │  ← Spatial downsampling
│       ↓         │
│   Flatten()     │
│       ↓         │
│  Linear(72,10)  │
└─────────────────┘
Problem: Local filters suit spatial structure, but sequences need long-range (temporal) dependencies
         ↓

2017: TRANSFORMER (ATTENTION)
┌─────────────────────┐
│  Sequence [1,2,3,4] │
│         ↓           │
│  Embedding(10,16)   │  ← Token → vector
│         ↓           │
│  PositionalEnc(16)  │  ← Add position info
│         ↓           │
│  MultiHeadAttn(2)   │  ← Attend to all positions
│         ↓           │
│  Linear(16,10)      │  ← Predict tokens
│         ↓           │
│  Output [1,2,3,4]   │
└─────────────────────┘

How Each Innovation Builds on the Last

1. Perceptron → XOR: Adding Non-linearity

# Perceptron (fails on XOR)
y = w·x + b

# MLP (solves XOR)
h = tanh(W1·x + b1)  # Hidden layer learns features
y = σ(W2·h + b2)     # Output layer combines features
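The first claim can be checked directly. The sketch below trains a single threshold unit on XOR with the classic perceptron learning rule; the function names, learning rate, and epoch count are illustrative choices, not taken from the TinyTorch code. Because XOR is not linearly separable, no setting of `w` and `b` can classify more than three of the four points.

```python
# Perceptron learning rule on XOR (illustrative sketch, stdlib only).
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def predict(w, b, x):
    """Threshold unit: outputs 1 if w·x + b > 0, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

def accuracy(w, b, data):
    """Fraction of samples classified correctly."""
    return sum(predict(w, b, x) == t for x, t in data) / len(data)

def train_perceptron(data, epochs=100, lr=0.1):
    """Run the perceptron rule and return the best accuracy seen after any epoch."""
    w, b = [0.0, 0.0], 0.0
    best = accuracy(w, b, data)
    for _ in range(epochs):
        for x, t in data:
            err = t - predict(w, b, x)   # -1, 0, or +1
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
        best = max(best, accuracy(w, b, data))
    return best

best_xor_accuracy = train_perceptron(XOR)
print(best_xor_accuracy)  # stays at or below 0.75 on XOR
```

Running the same function on a linearly separable problem such as AND reaches 100%, which is what makes the XOR failure informative rather than a training bug.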

Here's the thing: without non-linearity, stacking layers doesn't help. Two linear layers collapse into one:

Layer 1: y = W1·x + b1
Layer 2: z = W2·y + b2 = W2·(W1·x + b1) + b2 = (W2·W1)·x + (W2·b1 + b2)
Result: Still just a linear function!

The activation function (tanh, sigmoid, ReLU) is what makes depth meaningful.
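The collapse of two linear layers can also be confirmed numerically. This is a minimal sketch with small hand-picked matrices (the values and helper names are arbitrary, chosen only for illustration): it evaluates the two-layer form and the merged single-layer form on the same input.

```python
# Two stacked linear layers versus the single equivalent layer (stdlib only).
def matvec(W, x):
    """Matrix-vector product for lists of lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # shape 3x2
b1 = [0.5, -0.5, 1.0]
W2 = [[2.0, 0.0, -1.0]]                      # shape 1x3
b2 = [0.25]
x = [1.0, -2.0]

# Layer 2 applied to layer 1, with no activation in between
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Merged layer: W = W2·W1 and b = W2·b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)  # both are [-6.75]
```

Inserting `tanh` between the two layers breaks this equality, which is the whole point of the activation function.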

2. XOR → MLP: Scaling to Real Data

# XOR: 4 samples, 2 features
X = [[0,0], [0,1], [1,0], [1,1]]

# Digits: 1000 samples, 64 features (8×8 images)
X = load_tiny_digits()  # Real-world complexity

Solving XOR was a proof of concept. But to be useful, neural networks needed to handle:

- Real datasets (not 4 hand-crafted samples)
- High-dimensional inputs (images have 64+ pixels, not 2 features)
- Multiple classes (10 digits, not binary)

That's what the MLP milestone demonstrates.
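Moving from binary to 10-class output means the last layer produces ten scores that have to be turned into a probability distribution. A minimal softmax sketch follows; the logits are made-up values for illustration, not outputs of the actual digits model.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for digits 0..9 from a final Linear(64, 10) layer
logits = [0.1 * k for k in range(10)]
probs = softmax(logits)

# The predicted class is the index with the highest probability
predicted_digit = max(range(10), key=lambda k: probs[k])
print(predicted_digit, round(sum(probs), 6))
```

Subtracting the maximum does not change the result, but it keeps `math.exp` from overflowing when logits are large.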

3. MLP → CNN: Preserving Spatial Structure

# MLP: Flatten destroys structure
image = [[1,2,3],
         [4,5,6],
         [7,8,9]]
flat = [1,2,3,4,5,6,7,8,9]  # Lost neighborhood info!

# CNN: Preserve structure
conv = Conv2d(1, 8, kernel_size=3)
features = conv(image)  # Learns spatial patterns

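When you flatten an image into a vector, you lose neighborhood information. Pixel (1,1) is spatially close to (0,1), (1,0), (2,1), and (1,2), but after flattening the network has no built-in way to know that: a fully connected layer treats every input position as unrelated to every other.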
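To make the loss of neighborhood information concrete, the sketch below maps 2D coordinates to positions in the flattened vector, assuming row-major flattening (the helper name `flat_index` is made up for this example).

```python
def flat_index(row, col, width):
    """Position of pixel (row, col) after row-major flattening."""
    return row * width + col

# The 3x3 example: center pixel and its four direct neighbors
width = 3
center = flat_index(1, 1, width)
neighbors = [(0, 1), (1, 0), (1, 2), (2, 1)]
neighbor_indices = [flat_index(r, c, width) for r, c in neighbors]
print(center, neighbor_indices)  # 4 [1, 3, 5, 7]

# On an 8x8 digit image, vertical neighbors end up 8 positions apart
vertical_gap = flat_index(2, 1, 8) - flat_index(1, 1, 8)
print(vertical_gap)  # 8
```

Nothing in the flat vector marks indices 1 and 7 as adjacent to index 4, so an MLP has to discover that from data.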

Convolution fixes this by scanning a small filter across the image, preserving local structure. As a bonus, you get massive parameter savings:

MLP:     64 inputs × 128 hidden = 8,192 weights (first layer, biases excluded)
CNN:     3×3 kernel × 8 filters = 72 weights (about 114× fewer!)
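The shapes in the CNN diagram and the weight counts above can be reproduced with a few lines of arithmetic. This sketch assumes stride 1 and no padding for the convolution and a non-overlapping pooling window, which is what the `Linear(72,10)` layer in the diagram implies; the helper names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size along one spatial dimension of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    """Output size along one dimension of non-overlapping pooling."""
    return size // window

after_conv = conv_out(8, 3)                   # 8x8 image, 3x3 kernel -> 6x6
after_pool = pool_out(after_conv, 2)          # 2x2 max pool -> 3x3
flat_features = 8 * after_pool * after_pool   # 8 channels -> 72 features

mlp_first_layer_weights = 64 * 128            # 8,192
conv_weights = 3 * 3 * 1 * 8                  # kernel x in_channels x filters = 72
ratio = mlp_first_layer_weights / conv_weights

print(after_conv, after_pool, flat_features)  # 6 3 72
print(round(ratio, 1))                        # 113.8
```

The saving comes from weight sharing: the same 3×3 filter is reused at every image position instead of learning a separate weight per pixel.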

4. CNN → Transformer: From Spatial to Sequential

# CNN: Spatial relationships (2D)
image[i,j] relates to image[i±1, j±1]

# Transformer: Temporal relationships (1D)
sequence[t] relates to sequence[0...T]

CNNs work well for images because the useful spatial relationships are mostly local: edges, corners, textures. Sequences (text, time series) are structured differently. The first word in a sentence can affect the meaning of the 100th word.

Attention solves this by letting every position look at every other position:

# For each position, compute attention to all positions
Q = query(x)    # What am I looking for?
K = key(x)      # What do I contain?
V = value(x)    # What should I output?

attention = softmax(Q @ K.T / √d)  # Where to look
output = attention @ V              # What to copy
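The same computation can be written out without any tensor library. This is a minimal sketch that takes `Q`, `K`, and `V` as ready-made matrices (skipping the learned `query`/`key`/`value` projections) and uses hand-picked values so that each position attends almost entirely to itself, as in a copy task.

```python
import math

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; d is the key dimension."""
    d = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return weights, matmul(weights, V)

# Two positions whose queries match their own keys strongly
Q = [[10.0, 0.0], [0.0, 10.0]]
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

weights, output = attention(Q, K, V)
print(weights)  # each row is close to one-hot on its own position
print(output)   # close to V, i.e. every position copies its own value
```

Each row of `weights` sums to 1, so every output row is a weighted average of the rows of `V`.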

The Common Thread: Gradient Flow

Every innovation requires proper backpropagation:

# Forward pass
y = f(x, θ)

# Backward pass (compute ∂L/∂θ)
loss.backward()

# Update
θ = θ - lr * ∂L/∂θ
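The three steps can be seen end to end on a toy problem. The sketch below minimizes a one-parameter quadratic with a hand-written gradient in place of `loss.backward()`; the loss function, learning rate, and step count are arbitrary illustrative choices.

```python
# Forward / backward / update on a one-parameter toy loss (stdlib only).
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    """Analytic derivative of the loss with respect to theta."""
    return 2.0 * (theta - 3.0)

theta, lr = 0.0, 0.1
history = []
for _ in range(100):
    history.append(loss(theta))   # forward pass
    g = grad(theta)               # backward pass
    theta = theta - lr * g        # update

print(round(theta, 6))            # 3.0
print(history[0], history[-1])    # loss falls from 9.0 to nearly 0
```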

Gradient Flow Examples:

Perceptron:

∂L/∂w = ∂L/∂y · ∂y/∂w = (y - target) · x
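The formula above holds for a linear output y = w·x + b with the loss L = ½(y − target)², which is the assumption made in this sketch. It compares the analytic gradient against a central finite difference; all names and values are illustrative.

```python
def forward(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def loss(w, b, x, target):
    return 0.5 * (forward(w, b, x) - target) ** 2

def analytic_grad_w(w, b, x, target):
    """dL/dw = (y - target) * x for the squared-error loss above."""
    err = forward(w, b, x) - target
    return [err * xi for xi in x]

def numeric_grad_w(w, b, x, target, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grads = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        diff = loss(w_plus, b, x, target) - loss(w_minus, b, x, target)
        grads.append(diff / (2 * eps))
    return grads

w, b, x, target = [0.5, -1.0], 0.1, [2.0, 3.0], 1.0
analytic = analytic_grad_w(w, b, x, target)
numeric = numeric_grad_w(w, b, x, target)
print(analytic)  # approximately [-5.8, -8.7]
print(numeric)
```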

MLP (chain rule):

∂L/∂W1 = ∂L/∂y · ∂y/∂h · ∂h/∂W1
         └─────┴─────┴──────┘
         Chain through layers
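Here is the chain written out for the smallest possible case: one input, one tanh hidden unit, and a linear output with squared-error loss. The linear output is a simplification for this sketch (the XOR model in the diagram ends in a sigmoid), and the parameter values are arbitrary.

```python
import math

def forward(w1, b1, w2, b2, x):
    h = math.tanh(w1 * x + b1)   # hidden activation
    y = w2 * h + b2              # linear output (simplifying assumption)
    return h, y

def loss(w1, b1, w2, b2, x, target):
    _, y = forward(w1, b1, w2, b2, x)
    return 0.5 * (y - target) ** 2

def grad_w1(w1, b1, w2, b2, x, target):
    """Chain rule: dL/dW1 = dL/dy * dy/dh * dh/dW1."""
    h, y = forward(w1, b1, w2, b2, x)
    dL_dy = y - target
    dy_dh = w2
    dh_dw1 = (1.0 - h * h) * x   # derivative of tanh is 1 - tanh^2
    return dL_dy * dy_dh * dh_dw1

params = dict(w1=0.7, b1=-0.2, w2=1.5, b2=0.3, x=0.9, target=0.5)
analytic = grad_w1(**params)

# Finite-difference check on w1 only
eps = 1e-6
plus = dict(params, w1=params["w1"] + eps)
minus = dict(params, w1=params["w1"] - eps)
numeric = (loss(**plus) - loss(**minus)) / (2 * eps)
print(analytic, numeric)
```

If `w2` were zero, the product would be zero and the hidden layer would receive no gradient, which is one way gradient flow can silently break.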

CNN (convolution):

∂L/∂kernel = ∂L/∂output · ∂output/∂kernel
            = input ⋆ ∂L/∂output  (cross-correlation of the input with the output gradient)
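The kernel gradient is itself a sliding-window operation: each kernel weight accumulates the product of the output gradient with the input values it touched. The sketch below shows this in 1D to keep it short (the 2D case follows the same pattern) and assumes the loss L = ½·Σ out², chosen only so that ∂L/∂output is easy to write down.

```python
def correlate_valid(signal, kernel):
    """'Valid' cross-correlation: out[i] = sum_k signal[i + k] * kernel[k]."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + k] * kernel[k] for k in range(len(kernel)))
            for i in range(n)]

def loss(x, w):
    return 0.5 * sum(o * o for o in correlate_valid(x, w))

x = [1.0, 2.0, -1.0, 0.5, 3.0]
w = [0.5, -1.0, 2.0]

out = correlate_valid(x, w)
grad_out = out                           # dL/dout for L = 0.5 * sum(out^2)
grad_w = correlate_valid(x, grad_out)    # dL/dw[k] = sum_i grad_out[i] * x[i + k]

# Finite-difference check of each kernel weight
eps = 1e-6
numeric = []
for k in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[k] += eps
    w_minus[k] -= eps
    numeric.append((loss(x, w_plus) - loss(x, w_minus)) / (2 * eps))

print(grad_w)
print(numeric)
```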

Transformer (attention):

∂L/∂Q = ∂L/∂attn · ∂attn/∂scores · ∂scores/∂Q
        └────────┴────────────┴──────────┘
        Through softmax and matmul
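The middle factor, ∂attn/∂scores, is the softmax derivative. For one row of scores it can be applied without building the full Jacobian. The sketch below assumes an upstream gradient `grad_attn` is already available and checks the result with finite differences against a simple linear loss; the names and numbers are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(attn, grad_attn):
    """Gradient w.r.t. scores: attn_i * (grad_i - sum_j attn_j * grad_j)."""
    dot = sum(a * g for a, g in zip(attn, grad_attn))
    return [a * (g - dot) for a, g in zip(attn, grad_attn)]

scores = [0.5, -1.0, 2.0]
coeffs = [1.0, -2.0, 0.5]        # L = sum_i coeffs[i] * attn[i], so dL/dattn = coeffs

def loss(z):
    return sum(c * a for c, a in zip(coeffs, softmax(z)))

analytic = softmax_backward(softmax(scores), coeffs)

eps = 1e-6
numeric = []
for i in range(len(scores)):
    plus, minus = list(scores), list(scores)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((loss(plus) - loss(minus)) / (2 * eps))

print(analytic)
print(numeric)
```

The entries of the score gradient sum to zero, reflecting that adding a constant to every score leaves the softmax output unchanged.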

Learning Verification: What We Test

1. Perceptron (1957)

✅ Loss decreases (optimization works)
✅ Accuracy >90% (learns linear boundary)
✅ Gradients flow to w, b
✅ Weights update

2. XOR (1986)

✅ Loss decreases >50%
✅ Accuracy >90% (solves non-linear problem!)
✅ Gradients flow through all layers
✅ Hidden layer learns useful features

3. MLP Digits (1989)

✅ Test accuracy >80% (generalizes)
✅ Loss decreases >50%
✅ All 6 parameter groups receive gradients
✅ Works on real data (1000 samples)

4. CNN (1998)

✅ Test accuracy >80%
✅ Convolution gradients flow properly
✅ More efficient than MLP (68% vs 52% loss reduction)
✅ Spatial features learned

5. Transformer (2017)

✅ Perfect accuracy (100%) on copy task
✅ All 19 parameters receive gradients
✅ Embeddings learn token representations
✅ Positional encoding preserves order
✅ Attention learns to copy
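Checks of this kind are straightforward to express in code. The helpers below are a hypothetical sketch, not the functions used in the TinyTorch test suite: they compute the percentage loss decrease from a recorded loss history and confirm that every named parameter received a non-zero gradient.

```python
def loss_decrease_pct(losses):
    """Percentage drop from the first recorded loss to the last."""
    first, last = losses[0], losses[-1]
    return 100.0 * (first - last) / first

def all_received_gradients(grads):
    """grads maps a parameter name to a list of gradient values, or None."""
    return all(g is not None and any(v != 0.0 for v in g)
               for g in grads.values())

# Made-up training record for illustration
losses = [2.0, 1.5, 0.8]
grads = {"W1": [0.1, -0.2], "b1": [0.05], "W2": [0.3], "b2": [-0.01]}

print(loss_decrease_pct(losses) > 50.0)   # True
print(all_received_gradients(grads))      # True

# A parameter that never got a gradient should fail the check
grads_broken = dict(grads, b1=None)
print(all_received_gradients(grads_broken))  # False
```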

Fair Comparisons

MLP vs CNN (Digits)

Setup (identical training budget):

# Both models
batch_size = 32
epochs = 25
samples = 1000
updates = 25 × ⌊1000 ÷ 32⌋ = 25 × 31 = 775 gradient updates (last partial batch dropped)
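The update count depends on how the final partial batch is handled: 1000 samples split into 31 full batches of 32 with 8 samples left over. A small sketch of the arithmetic, where the `drop_last` flag is an assumption about the data loader's behavior:

```python
def gradient_updates(samples, batch_size, epochs, drop_last=True):
    """Total optimizer steps for a fixed number of epochs."""
    full_batches, remainder = divmod(samples, batch_size)
    per_epoch = full_batches if (drop_last or remainder == 0) else full_batches + 1
    return epochs * per_epoch

print(gradient_updates(1000, 32, 25, drop_last=True))   # 775
print(gradient_updates(1000, 32, 25, drop_last=False))  # 800
```

Whichever convention is used, both models must use the same one for the comparison to stay fair.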

Results:

Model   Accuracy   Loss Decrease   Parameters
MLP     82.0%      52.3%           10,890
CNN     82.0%      68.1%           1,098

Insights:

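- Same accuracy (8×8 images are too small for the CNN to show an accuracy advantage)
- CNN reduces loss more within the same budget (68.1% vs 52.3%)
- CNN uses about 10× fewer parameters (1,098 vs 10,890)
- On larger images, the CNN's advantage is expected to grow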

The Big Picture

PERCEPTRON (1957)
    ↓ Add hidden layers + non-linearity
BACKPROPAGATION (1986)
    ↓ Scale to real data + deeper networks
MLP (1989)
    ↓ Preserve spatial structure
CNN (1998)
    ↓ Handle sequential/temporal structure
TRANSFORMER (2017)
    ↓ Scale to billions of parameters
MODERN DEEP LEARNING (2020s)

Key Takeaways

1. Each innovation solves a specific limitation
- Perceptron → XOR: Need non-linearity
- XOR → MLP: Need to scale
- MLP → CNN: Need spatial awareness
- CNN → Transformer: Need long-range dependencies
2. All require proper gradients
- Every new operation needs backward pass
- Chain rule connects everything
- Tests verify gradients actually flow
3. Learning verification is critical
- Code running ≠ model learning
- Must verify: loss ↓, accuracy ↑, gradients flow
- Fair comparisons require matched training budgets
4. Building blocks compound
- Transformers reuse earlier ideas: linear layers (1957), non-linear activations such as ReLU (popularized around 2010), embeddings (2013)
- Each milestone stands on previous work
- Modern systems combine all these ideas

What's Next?

The journey continues:

- Residual connections (ResNet, 2015)
- Batch normalization (2015)
- Transformers at scale (GPT, BERT, 2018+)
- Diffusion models (2020+)
- Mixture of Experts at scale (2023+)

But they all build on these five fundamental milestones! 🚀
