🎯 TinyTorch Learning Milestones

A chronological journey through the history of neural networks, verifying that each breakthrough actually learns.

📚 Overview

This test suite validates that TinyTorch correctly implements the fundamental breakthroughs in neural network history. Each test verifies that the model actually learns - not just that the code runs, but that gradients flow, weights update, and performance improves.

🧪 The Five Milestones

1 1957 - The Perceptron (Frank Rosenblatt)

The Beginning: The first learning algorithm that could automatically adjust its weights.

# Single neuron learning a linear decision boundary
perceptron = Linear(2, 1)  # 2 inputs → 1 output

What it learns: Linearly separable patterns (AND, OR gates)

Key innovations:

  • Automatic weight updates via gradient descent
  • Proof that machines can learn from data

Verification:

  • Loss decreases by >50%
  • Accuracy reaches >90%
  • Gradients flow to all parameters
  • Weights actually change during training
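
Outside the test suite, the same mechanism fits in a few lines of numpy. Here is a minimal sketch of one neuron learning the AND gate by gradient descent (illustrative only; the milestone itself uses TinyTorch's Linear layer):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [0], [0], [1]], dtype=float)   # AND gate

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 1)), np.zeros(1)

for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ W + b)))   # forward pass: sigmoid(XW + b)
    grad = (p - y) / len(X)              # dLoss/dz for cross-entropy loss
    W -= 0.5 * (X.T @ grad)              # gradient descent on weights...
    b -= 0.5 * grad.sum(axis=0)          # ...and on the bias

p = 1 / (1 + np.exp(-(X @ W + b)))
print(((p > 0.5) == (y > 0.5)).mean())   # accuracy reaches 1.0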

2 1986 - Backpropagation for XOR (Rumelhart, Hinton, Williams)

The Breakthrough: Solving the problem that killed neural networks in the 1960s.

# Multi-layer network with hidden layer
model = Sequential([
    Linear(2, 4),    # Input → Hidden
    Tanh(),          # Non-linearity (critical!)
    Linear(4, 1),    # Hidden → Output
    Sigmoid()
])

What it learns: XOR - the canonical non-linearly separable problem

Key innovations:

  • Backpropagation: Chain rule applied to compute gradients through layers
  • Hidden layers: Learn intermediate representations
  • Non-linearity: Without it, stacked linear layers collapse into a single linear layer (see the sketch after this list)
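
A quick numpy illustration of that collapse (biases fold in the same way):

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))   # "hidden layer" weights
W2 = rng.normal(size=(4, 1))   # "output layer" weights
x = rng.normal(size=(5, 2))

# Two stacked linear maps are exactly one linear map:
assert np.allclose((x @ W1) @ W2, x @ (W1 @ W2))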

Why XOR matters:

Input: (0,0) → 0    Input: (0,1) → 1
Input: (1,0) → 1    Input: (1,1) → 0

No single line can separate these! You need a hidden layer.

Verification:

  • Solves XOR (>90% accuracy)
  • Gradients flow through all layers
  • Hidden layer learns useful features
  • Loss decreases significantly

3 1989 - Multi-Layer Perceptron on Real Data (LeCun)

Scaling Up: From toy problems to real-world pattern recognition.

# Deeper network for image classification
model = Sequential([
    Linear(64, 128),   # Input (8×8 images flattened)
    ReLU(),            # Modern activation
    Linear(128, 64),   # Hidden layer
    ReLU(),
    Linear(64, 10)     # 10 digit classes
])

What it learns: Handwritten digit recognition (TinyDigits dataset)

Key innovations:

  • Deeper architectures: Multiple hidden layers
  • Real data: 1000 training images, 200 test images
  • Classification: Multi-class output (10 digits)

Why it matters:

  • Proved neural networks work on real-world data
  • Showed that depth helps (but flattening images loses spatial structure)
  • Foundation for modern deep learning

Verification:

  • Test accuracy >80%
  • Loss decreases >50%
  • All layers receive gradients
  • Generalizes to unseen test data

Training setup (fair comparison with CNN):

  • Batch size: 32
  • Epochs: 25
  • Total updates: 775 (derived below)
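
The update count follows directly from the numbers above, assuming the data loader drops the final partial batch:

updates_per_epoch = 1000 // 32           # 31 full batches from 1000 images
total_updates = updates_per_epoch * 25   # 31 * 25 = 775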

4 1998 - Convolutional Neural Networks (Yann LeCun)

Spatial Structure: Stop flattening images - preserve their 2D structure!

# Convolutional architecture
model = Sequential([
    Conv2d(1, 8, kernel_size=3),   # Learn spatial filters
    ReLU(),
    MaxPool2d(kernel_size=2),      # Spatial downsampling
    Flatten(),
    Linear(8 * 3 * 3, 10)          # Classification head
])
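
The classification head's input size (8 * 3 * 3 = 72) comes from tracing shapes through the stack, assuming no padding and stride 1:

# (1, 8, 8)  input: one-channel 8×8 image
# (8, 6, 6)  after Conv2d: 8 - 3 + 1 = 6 per side, 8 channels
# (8, 3, 3)  after MaxPool2d: 6 // 2 = 3 per side
# 72         after Flatten: 8 * 3 * 3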

What it learns: Same digit recognition, but with spatial awareness

Key innovations:

  • Convolution: Shared weights that scan across the image
  • Spatial hierarchy: Early layers detect edges, later layers detect shapes
  • Translation invariance: Digit in any position gets recognized
  • Parameter efficiency: Far fewer parameters than the MLP (counted below)
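
A rough count from the two architectures above (weights plus biases) makes the gap concrete, over 20× fewer parameters:

mlp_params = (64*128 + 128) + (128*64 + 64) + (64*10 + 10)   # 17,226
cnn_params = (1*8*3*3 + 8) + (8*3*3*10 + 10)                 # 810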

MLP vs CNN comparison (fair setup):

Architecture  Batch Size  Epochs  Updates  Final Accuracy  Loss Decrease
MLP           32          25      775      82.0%           52.3%
CNN           32          25      775      82.0%           68.1%

Key insights:

  • Same final accuracy on 8×8 images (too small for CNNs to shine)
  • CNN converges faster (68% vs 52% loss reduction)
  • On larger images (32×32, 224×224), CNNs dominate
  • Spatial inductive bias helps even when images are tiny

Verification:

  • Test accuracy >80%
  • Convolution gradients flow properly
  • Spatial features learned
  • More efficient learning than MLP

5 2017 - The Transformer (Vaswani et al.)

Sequence Processing: From spatial structure to temporal/sequential structure.

# Transformer architecture
model = Sequential([
    Embedding(vocab_size, d_model),           # Token → vector
    PositionalEncoding(d_model, max_len),     # Add position info
    MultiHeadAttention(d_model, num_heads),   # Attend to all positions
    Linear(d_model, vocab_size)               # Predict next token
])
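
Assuming TinyTorch follows the sinusoidal scheme from the paper, the positional encoding gives each position a unique fixed pattern:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))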

What it learns: Sequence copying - the foundation of language modeling

Key innovations:

  • Self-attention: Each position attends to all other positions (see the sketch after this list)
  • Positional encoding: Inject sequence order information
  • No recurrence: Parallel processing of entire sequence
  • Multi-head attention: Learn multiple attention patterns
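
An illustrative numpy sketch of single-head scaled dot-product attention, the core of the layer above (not TinyTorch's actual implementation):

import numpy as np

def attention(Q, K, V):
    # How strongly each position (query) matches every position (key)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over positions turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors
    return weights @ V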

The copy task:

Input:  [1, 2, 3, 4]
Target: [1, 2, 3, 4]

Simple, but requires:

  1. Embeddings to represent tokens
  2. Positional encoding to know order
  3. Attention to copy the right token to each position
  4. Gradient flow through all components

Why copy matters:

  • Tests attention mechanism in isolation
  • Proves positional encoding works
  • Foundation for language modeling (predict next token)
  • If it can't copy, it can't do language

Verification:

  • Perfect accuracy (100%) on copy task
  • All 19 parameters receive gradients
  • Embeddings, positions, attention all learn
  • Attention weights show correct patterns

Training setup:

  • Batch size: 32
  • Epochs: 50
  • Sequence length: 4
  • Vocabulary: 10 tokens (see the data sketch below)
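
A hypothetical batch generator matching this setup (input and target are identical, since the task is to copy):

import numpy as np

rng = np.random.default_rng(0)

def make_copy_batch(batch_size=32, seq_len=4, vocab_size=10):
    seqs = rng.integers(0, vocab_size, size=(batch_size, seq_len))
    return seqs, seqs.copy()   # target equals input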

🔗 How They Connect: The Through-Line

1. Perceptron → Backpropagation

  • Problem: Perceptron can't learn XOR (non-linear patterns)
  • Solution: Add hidden layers + non-linearity
  • Requirement: Need backpropagation to train multiple layers

2. Backpropagation → MLP

  • Problem: XOR is a toy problem
  • Solution: Scale to real data (images, many classes)
  • Requirement: Deeper networks, more data, better optimization

3. MLP → CNN

  • Problem: Flattening images loses spatial structure
  • Solution: Convolution preserves 2D relationships
  • Requirement: New operations (Conv2d, MaxPool2d) with proper gradients

4. CNN → Transformer

  • Problem: Images have spatial structure, but sequences have temporal structure
  • Solution: Attention mechanism to relate positions
  • Requirement: Embeddings, positional encoding, attention with proper gradients

5. The Common Thread

Every breakthrough requires:

  1. New architecture (more expressive)
  2. Proper gradients (backprop through new operations)
  3. Verification (actually learns on appropriate task)

🎓 Educational Value

For Students:

  1. Historical context: See why each innovation mattered
  2. Hands-on verification: Run the tests, see them learn
  3. Building blocks: Each milestone uses previous ones
  4. Debugging skills: A failing test usually means gradients aren't flowing somewhere

For Instructors:

  1. Progression: Natural curriculum from simple to complex
  2. Verification: Proof that implementations are correct
  3. Comparisons: Fair benchmarks (MLP vs CNN)
  4. Debugging: Tests catch common implementation errors

🚀 Running the Tests

Run all milestones:

pytest tests/milestones/test_learning_verification.py -v

Run individual milestones:

# Test 1: Perceptron
pytest tests/milestones/test_learning_verification.py::test_perceptron_learning -v

# Test 2: XOR
pytest tests/milestones/test_learning_verification.py::test_xor_learning -v

# Test 3: MLP Digits
pytest tests/milestones/test_learning_verification.py::test_mlp_digits_learning -v

# Test 4: CNN
pytest tests/milestones/test_learning_verification.py::test_cnn_learning -v

# Test 5: Transformer
pytest tests/milestones/test_learning_verification.py::test_transformer_learning -v

Expected output:

✅ 5 passed in 90s

📊 What Each Test Verifies

Milestone     Loss ↓  Accuracy  Gradients  Weights Updated
Perceptron    >50%    >90%      2/2        ✅
XOR           >50%    >90%      8/8        ✅
MLP Digits    >50%    >80%      6/6        ✅
CNN           >50%    >80%      6/6        ✅
Transformer   >50%    100%      19/19      ✅

🐛 Common Issues

If a test fails:

  1. No gradients: Check requires_grad=True on parameters
  2. Gradients don't flow: Check backward functions in operations
  3. Loss doesn't decrease: Check learning rate, optimizer
  4. Low accuracy: Check model architecture, training duration
  5. Weights don't update: Check optimizer step, zero_grad

Debugging workflow:

# 1. Check gradients exist (None means no gradient reached this parameter)
for param in model.parameters():
    print(param.grad)

# 2. Check gradient magnitudes (near zero → vanishing; very large → exploding)
for name, param in model.named_parameters():
    print(f"{name}: {param.grad.data.abs().mean()}")

# 3. Check weight changes (snapshot before training, compare after)
initial_weights = [p.data.copy() for p in model.parameters()]
# ... train ...
for i, param in enumerate(model.parameters()):
    diff = (param.data - initial_weights[i]).abs().mean()
    print(f"Param {i} changed by: {diff}")

📖 Further Reading

  • Perceptron: Rosenblatt (1958) "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain"
  • Backpropagation: Rumelhart et al. (1986) "Learning representations by back-propagating errors"
  • MLP: LeCun et al. (1989) "Backpropagation Applied to Handwritten Zip Code Recognition"
  • CNN: LeCun et al. (1998) "Gradient-Based Learning Applied to Document Recognition"
  • Transformer: Vaswani et al. (2017) "Attention Is All You Need"

🎯 Success Criteria

All tests pass when:

  • Loss decreases significantly (>50%)
  • Accuracy meets threshold (varies by task)
  • All parameters receive gradients
  • Weights actually update during training
  • Model generalizes to test data

🏆 Current Status

All 5 milestones passing

test_perceptron_learning ✅
test_xor_learning ✅
test_mlp_digits_learning ✅
test_cnn_learning ✅
test_transformer_learning ✅

TinyTorch successfully implements 60+ years of neural network history!