# 🎯 TinyTorch Learning Milestones
A chronological journey through the history of neural networks, verifying that each breakthrough actually learns.
## 📚 Overview
This test suite validates that TinyTorch correctly implements the fundamental breakthroughs in neural network history. Each test verifies that the model actually learns - not just that the code runs, but that gradients flow, weights update, and performance improves.
## 🧪 The Five Milestones

### 1️⃣ 1957 - The Perceptron (Frank Rosenblatt)
The Beginning: The first learning algorithm that could automatically adjust its weights.
```python
# Single neuron learning a linear decision boundary
perceptron = Linear(2, 1)  # 2 inputs → 1 output
```
What it learns: Linearly separable patterns (AND, OR gates)
Key innovation:
- Automatic weight updates via gradient descent
- Proof that machines can learn from data
Verification:
- ✅ Loss decreases by >50%
- ✅ Accuracy reaches >90%
- ✅ Gradients flow to all parameters
- ✅ Weights actually change during training
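To make the weight-update mechanics concrete, here is a minimal NumPy sketch (deliberately independent of the TinyTorch API) of a single linear neuron learning the AND gate by gradient descent on mean squared error. The learning rate and epoch count are illustrative choices, not the test's actual settings.

```python
import numpy as np

# AND gate: a linearly separable problem a single neuron can solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.2          # illustrative learning rate

for _ in range(200):
    pred = X @ w + b                  # linear output
    err = pred - y                    # residuals
    w -= lr * (X.T @ err) / len(X)    # gradient step on weights
    b -= lr * err.mean()              # gradient step on bias

acc = ((X @ w + b) > 0.5) == (y > 0.5)
print(f"accuracy: {acc.mean():.0%}")  # reaches 100% on AND
```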
### 2️⃣ 1986 - Backpropagation for XOR (Rumelhart, Hinton, Williams)
The Breakthrough: Solving the problem that killed neural networks in the 1960s.
```python
# Multi-layer network with hidden layer
model = Sequential([
    Linear(2, 4),   # Input → Hidden
    Tanh(),         # Non-linearity (critical!)
    Linear(4, 1),   # Hidden → Output
    Sigmoid()
])
```
What it learns: XOR - the canonical non-linearly separable problem
Key innovation:
- Backpropagation: Chain rule applied to compute gradients through layers
- Hidden layers: Learn intermediate representations
- Non-linearity: Without it, multiple layers = single layer
Why XOR matters:
```
Input: (0,0) → 0    Input: (0,1) → 1
Input: (1,0) → 1    Input: (1,1) → 0
```
No single line can separate these! You need a hidden layer.
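To see why one hidden layer suffices, here is a hand-built two-layer network in plain NumPy that computes XOR exactly. The hard thresholds and hand-picked weights are for illustration only; the milestone test learns comparable features with Tanh/Sigmoid units via backpropagation.

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)   # hard threshold, illustration only

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden unit 1 computes OR, hidden unit 2 computes AND
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output computes OR(x) AND NOT(AND(x)), which is exactly XOR
W2 = np.array([1.0, -1.0])
b2 = -0.5

h = step(X @ W1 + b1)   # hidden representation: [OR, AND]
y = step(h @ W2 + b2)
print(y)                # [0. 1. 1. 0.]
```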
Verification:
- ✅ Solves XOR (>90% accuracy)
- ✅ Gradients flow through all layers
- ✅ Hidden layer learns useful features
- ✅ Loss decreases significantly
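For reference, the whole milestone fits in a short NumPy sketch: the same 2 → 4 → 1 Tanh/Sigmoid architecture trained on XOR with hand-written backpropagation, i.e. the chain rule applied layer by layer. Initialization, learning rate, and epoch count here are illustrative, not the test's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# 2 → 4 → 1 network, matching the architecture above
W1 = rng.normal(0, 1.0, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1.0, size=(4, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)                 # hidden layer (Tanh)
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))   # output layer (Sigmoid)
    loss = ((out - y) ** 2).mean()           # MSE

    # Backward pass: chain rule, layer by layer
    d_out = 2 * (out - y) / out.size         # dL/d(out)
    d_z2 = d_out * out * (1 - out)           # back through Sigmoid
    dW2, db2 = h.T @ d_z2, d_z2.sum(0)
    d_h = d_z2 @ W2.T                        # into the hidden layer
    d_z1 = d_h * (1 - h ** 2)                # back through Tanh
    dW1, db1 = X.T @ d_z1, d_z1.sum(0)

    # Gradient descent step on all four parameter tensors
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

acc = ((out > 0.5) == (y > 0.5)).mean()
print(f"loss={loss:.4f}  accuracy={acc:.0%}")  # typically 100%; rerun with another seed if it stalls
```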
### 3️⃣ 1989 - Multi-Layer Perceptron on Real Data (LeCun)
Scaling Up: From toy problems to real-world pattern recognition.
```python
# Deeper network for image classification
model = Sequential([
    Linear(64, 128),   # Input (8×8 images flattened)
    ReLU(),            # Modern activation
    Linear(128, 64),   # Hidden layer
    ReLU(),
    Linear(64, 10)     # 10 digit classes
])
```
What it learns: Handwritten digit recognition (TinyDigits dataset)
Key innovations:
- Deeper architectures: Multiple hidden layers
- Real data: 1000 training images, 200 test images
- Classification: Multi-class output (10 digits)
Why it matters:
- Proved neural networks work on real-world data
- Showed that depth helps (but flattening images loses spatial structure)
- Foundation for modern deep learning
Verification:
- ✅ Test accuracy >80%
- ✅ Loss decreases >50%
- ✅ All layers receive gradients
- ✅ Generalizes to unseen test data
Training setup (fair comparison with CNN):
- Batch size: 32
- Epochs: 25
- Total updates: 775
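The 775 figure follows directly from the numbers above, assuming each epoch uses only full batches (i.e. the final partial batch of the 1000-image training set is dropped):

```python
# Where "775 updates" comes from (assumes the last partial batch per epoch is dropped)
train_images = 1000
batch_size = 32
epochs = 25

batches_per_epoch = train_images // batch_size    # 31 full batches
total_updates = batches_per_epoch * epochs        # 31 * 25 = 775
print(total_updates)                              # 775
```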
### 4️⃣ 1998 - Convolutional Neural Networks (Yann LeCun)
Spatial Structure: Stop flattening images - preserve their 2D structure!
```python
# Convolutional architecture
model = Sequential([
    Conv2d(1, 8, kernel_size=3),   # Learn spatial filters
    ReLU(),
    MaxPool2d(kernel_size=2),      # Spatial downsampling
    Flatten(),
    Linear(8 * 3 * 3, 10)          # Classification head
])
```
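The `8 * 3 * 3` input to the classification head follows from the spatial shapes: a 3×3 convolution without padding shrinks the 8×8 digits to 6×6, and 2×2 max pooling halves that to 3×3 over 8 channels (stride 1 and no padding are assumed here, consistent with the flatten size). Counting weights and biases for both models also quantifies the parameter-efficiency point made below.

```python
# Spatial shapes for the 8×8 TinyDigits images (stride 1, no padding assumed)
after_conv = 8 - 3 + 1          # 3×3 kernel: 8×8 → 6×6
after_pool = after_conv // 2    # 2×2 max pool: 6×6 → 3×3
flat_features = 8 * after_pool * after_pool
print(flat_features)            # 72, hence Linear(8 * 3 * 3, 10)

# Parameter counts (weights + biases) for the two architectures shown above
mlp = (64 * 128 + 128) + (128 * 64 + 64) + (64 * 10 + 10)
cnn = (8 * 1 * 3 * 3 + 8) + (flat_features * 10 + 10)
print(mlp, cnn)                 # 17226 vs 810 parameters
```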
What it learns: Same digit recognition, but with spatial awareness
Key innovations:
- Convolution: Shared weights that scan across the image
- Spatial hierarchy: Early layers detect edges, later layers detect shapes
- Translation invariance: Digit in any position gets recognized
- Parameter efficiency: Fewer parameters than MLP
MLP vs CNN comparison (fair setup):
| Architecture | Batch Size | Epochs | Updates | Final Accuracy | Loss Decrease |
|---|---|---|---|---|---|
| MLP | 32 | 25 | 775 | 82.0% | 52.3% |
| CNN | 32 | 25 | 775 | 82.0% | 68.1% |
Key insights:
- Same final accuracy on 8×8 images (too small for CNNs to shine)
- CNN converges faster (68% vs 52% loss reduction)
- On larger images (32×32, 224×224), CNNs dominate
- Spatial inductive bias helps even when images are tiny
Verification:
- ✅ Test accuracy >80%
- ✅ Convolution gradients flow properly
- ✅ Spatial features learned
- ✅ More efficient learning than MLP
### 5️⃣ 2017 - Transformer (Attention) (Vaswani et al.)
Sequence Processing: From spatial structure to temporal/sequential structure.
```python
# Transformer architecture
model = Sequential([
    Embedding(vocab_size, d_model),          # Token → vector
    PositionalEncoding(d_model, max_len),    # Add position info
    MultiHeadAttention(d_model, num_heads),  # Attend to all positions
    Linear(d_model, vocab_size)              # Predict next token
])
```
What it learns: Sequence copying - the foundation of language modeling
Key innovations:
- Self-attention: Each position attends to all other positions (see the numerical sketch after this list)
- Positional encoding: Inject sequence order information
- No recurrence: Parallel processing of entire sequence
- Multi-head attention: Learn multiple attention patterns
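At its core, the self-attention step is `softmax(QKᵀ / √d) · V` from the 2017 paper. Here is a tiny single-head NumPy illustration with random weights; the sequence length of 4 matches the copy task below, while `d_model = 16` is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 16                         # seq_len from the copy task; d_model arbitrary

x = rng.normal(size=(seq_len, d_model))          # embedded + position-encoded tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # project to queries, keys, values
scores = Q @ K.T / np.sqrt(d_model)              # every position scores every position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
out = weights @ V                                # weighted mix of all positions

print(weights.shape)   # (4, 4): position i attends to all 4 positions
print(out.shape)       # (4, 16): new representation for each position
```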
The copy task:
```
Input:  [1, 2, 3, 4]
Target: [1, 2, 3, 4]
```
Simple, but requires:
- Embeddings to represent tokens
- Positional encoding to know order
- Attention to copy the right token to each position
- Gradient flow through all components
Why copy matters:
- Tests attention mechanism in isolation
- Proves positional encoding works
- Foundation for language modeling (predict next token)
- If it can't copy, it can't do language
Verification:
- ✅ Perfect accuracy (100%) on copy task
- ✅ All 19 parameters receive gradients
- ✅ Embeddings, positions, attention all learn
- ✅ Attention weights show correct patterns
Training setup:
- Batch size: 32
- Epochs: 50
- Sequence length: 4
- Vocabulary: 10 tokens
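A minimal sketch of how copy-task batches can be built from these numbers (plain NumPy; the test's actual data pipeline may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len, batch_size = 10, 4, 32   # values from the setup above

tokens = rng.integers(0, vocab_size, size=(batch_size, seq_len))
inputs, targets = tokens, tokens.copy()       # copy task: the target is the input itself
print(inputs[0], targets[0])                  # two identical length-4 sequences
```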
## 🔗 How They Connect: The Through-Line
1. Perceptron → Backpropagation
- Problem: Perceptron can't learn XOR (non-linear patterns)
- Solution: Add hidden layers + non-linearity
- Requirement: Need backpropagation to train multiple layers
2. Backpropagation → MLP
- Problem: XOR is a toy problem
- Solution: Scale to real data (images, many classes)
- Requirement: Deeper networks, more data, better optimization
3. MLP → CNN
- Problem: Flattening images loses spatial structure
- Solution: Convolution preserves 2D relationships
- Requirement: New operations (Conv2d, MaxPool2d) with proper gradients
4. CNN → Transformer
- Problem: Images have spatial structure, but sequences have temporal structure
- Solution: Attention mechanism to relate positions
- Requirement: Embeddings, positional encoding, attention with proper gradients
5. The Common Thread
Every breakthrough requires:
- New architecture (more expressive)
- Proper gradients (backprop through new operations)
- Verification (actually learns on appropriate task)
## 🎓 Educational Value
For Students:
- Historical context: See why each innovation mattered
- Hands-on verification: Run the tests, see them learn
- Building blocks: Each milestone uses previous ones
- Debugging skills: If a test fails, gradients aren't flowing
For Instructors:
- Progression: Natural curriculum from simple to complex
- Verification: Proof that implementations are correct
- Comparisons: Fair benchmarks (MLP vs CNN)
- Debugging: Tests catch common implementation errors
## 🚀 Running the Tests
Run all milestones:
```bash
pytest tests/milestones/test_learning_verification.py -v
```
Run individual milestones:
```bash
# Test 1: Perceptron
pytest tests/milestones/test_learning_verification.py::test_perceptron_learning -v

# Test 2: XOR
pytest tests/milestones/test_learning_verification.py::test_xor_learning -v

# Test 3: MLP Digits
pytest tests/milestones/test_learning_verification.py::test_mlp_digits_learning -v

# Test 4: CNN
pytest tests/milestones/test_learning_verification.py::test_cnn_learning -v

# Test 5: Transformer
pytest tests/milestones/test_learning_verification.py::test_transformer_learning -v
```
Expected output:
```
✅ 5 passed in 90s
```
## 📊 What Each Test Verifies
| Milestone | Loss ↓ | Accuracy | Gradients | Weights Updated |
|---|---|---|---|---|
| Perceptron | >50% | >90% | 2/2 | ✅ |
| XOR | >50% | >90% | 8/8 | ✅ |
| MLP Digits | >50% | >80% | 6/6 | ✅ |
| CNN | >50% | >80% | 6/6 | ✅ |
| Transformer | >50% | 100% | 19/19 | ✅ |
## 🐛 Common Issues
If a test fails:
- No gradients: Check `requires_grad=True` on parameters
- Gradients don't flow: Check backward functions in operations
- Loss doesn't decrease: Check learning rate, optimizer
- Low accuracy: Check model architecture, training duration
- Weights don't update: Check optimizer step, zero_grad
Debugging workflow:
```python
# 1. Check gradients exist
for param in model.parameters():
    print(param.grad)

# 2. Check gradient magnitudes
for name, param in model.named_parameters():
    print(f"{name}: {param.grad.data.abs().mean()}")

# 3. Check weight changes
initial_weights = [p.data.copy() for p in model.parameters()]
# ... train ...
for i, param in enumerate(model.parameters()):
    diff = (param.data - initial_weights[i]).abs().mean()
    print(f"Param {i} changed by: {diff}")
```
## 📖 Further Reading
- Perceptron: Rosenblatt (1957) "The Perceptron: A Probabilistic Model"
- Backpropagation: Rumelhart et al. (1986) "Learning representations by back-propagating errors"
- MLP: LeCun et al. (1989) "Backpropagation Applied to Handwritten Zip Code Recognition"
- CNN: LeCun et al. (1998) "Gradient-Based Learning Applied to Document Recognition"
- Transformer: Vaswani et al. (2017) "Attention Is All You Need"
## 🎯 Success Criteria
All tests pass when:
- ✅ Loss decreases significantly (>50%)
- ✅ Accuracy meets threshold (varies by task)
- ✅ All parameters receive gradients
- ✅ Weights actually update during training
- ✅ Model generalizes to test data
## 🏆 Current Status
All 5 milestones passing ✅
test_perceptron_learning ✅
test_xor_learning ✅
test_mlp_digits_learning ✅
test_cnn_learning ✅
test_transformer_learning ✅
TinyTorch successfully implements 60+ years of neural network history!