# 🎯 TinyTorch Learning Milestones
A chronological journey through the history of neural networks, verifying that each breakthrough actually learns.
## 📚 Overview
This test suite validates that TinyTorch correctly implements the fundamental breakthroughs in neural network history. Each test verifies that the model actually learns - not just that the code runs, but that gradients flow, weights update, and performance improves.
## 🧪 The Five Milestones

### 1️⃣ 1957 - The Perceptron (Frank Rosenblatt)
The Beginning: The first learning algorithm that could automatically adjust its weights.
```python
# Single neuron learning a linear decision boundary
perceptron = Linear(2, 1)  # 2 inputs → 1 output
```
What it learns: Linearly separable patterns (AND, OR gates)
Key innovation:
- Automatic weight updates via gradient descent
- Proof that machines can learn from data
Verification:
- ✅ Loss decreases by >50%
- ✅ Accuracy reaches >90%
- ✅ Gradients flow to all parameters
- ✅ Weights actually change during training
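To make the weight-update mechanics concrete, here is a minimal NumPy sketch (deliberately independent of the TinyTorch API) of a single linear neuron learning the AND gate by gradient descent on mean squared error. The learning rate and epoch count are illustrative choices, not the test's actual settings.

```python
import numpy as np

# AND gate: a linearly separable problem a single neuron can solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.2          # illustrative learning rate

for _ in range(200):
    pred = X @ w + b                  # linear output
    err = pred - y                    # residuals
    w -= lr * (X.T @ err) / len(X)    # gradient step on weights
    b -= lr * err.mean()              # gradient step on bias

acc = ((X @ w + b) > 0.5) == (y > 0.5)
print(f"accuracy: {acc.mean():.0%}")  # reaches 100% on AND
```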
### 2️⃣ 1986 - Backpropagation for XOR (Rumelhart, Hinton, Williams)
The Breakthrough: Solving the problem that killed neural networks in the 1960s.
```python
# Multi-layer network with hidden layer
model = Sequential([
    Linear(2, 4),   # Input → Hidden
    Tanh(),         # Non-linearity (critical!)
    Linear(4, 1),   # Hidden → Output
    Sigmoid()
])
```
What it learns: XOR - the canonical non-linearly separable problem
Key innovation:
- Backpropagation: Chain rule applied to compute gradients through layers
- Hidden layers: Learn intermediate representations
- Non-linearity: Without it, multiple layers = single layer
Why XOR matters:
```
Input: (0,0) → 0    Input: (0,1) → 1
Input: (1,0) → 1    Input: (1,1) → 0
```
No single line can separate these! You need a hidden layer.
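To see why one hidden layer suffices, here is a hand-built two-layer network in plain NumPy that computes XOR exactly. The hard thresholds and hand-picked weights are for illustration only; the milestone test learns comparable features with Tanh/Sigmoid units via backpropagation.

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)   # hard threshold, illustration only

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden unit 1 computes OR, hidden unit 2 computes AND
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output computes OR(x) AND NOT(AND(x)), which is exactly XOR
W2 = np.array([1.0, -1.0])
b2 = -0.5

h = step(X @ W1 + b1)   # hidden representation: [OR, AND]
y = step(h @ W2 + b2)
print(y)                # [0. 1. 1. 0.]
```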
Verification:
- ✅ Solves XOR (>90% accuracy)
- ✅ Gradients flow through all layers
- ✅ Hidden layer learns useful features
- ✅ Loss decreases significantly
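For reference, the whole milestone fits in a short NumPy sketch: the same 2 → 4 → 1 Tanh/Sigmoid architecture trained on XOR with hand-written backpropagation, i.e. the chain rule applied layer by layer. Initialization, learning rate, and epoch count here are illustrative, not the test's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# 2 → 4 → 1 network, matching the architecture above
W1 = rng.normal(0, 1.0, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1.0, size=(4, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)                 # hidden layer (Tanh)
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))   # output layer (Sigmoid)
    loss = ((out - y) ** 2).mean()           # MSE

    # Backward pass: chain rule, layer by layer
    d_out = 2 * (out - y) / out.size         # dL/d(out)
    d_z2 = d_out * out * (1 - out)           # back through Sigmoid
    dW2, db2 = h.T @ d_z2, d_z2.sum(0)
    d_h = d_z2 @ W2.T                        # into the hidden layer
    d_z1 = d_h * (1 - h ** 2)                # back through Tanh
    dW1, db1 = X.T @ d_z1, d_z1.sum(0)

    # Gradient descent step on all four parameter tensors
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

acc = ((out > 0.5) == (y > 0.5)).mean()
print(f"loss={loss:.4f}  accuracy={acc:.0%}")  # typically 100%; rerun with another seed if it stalls
```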
### 3️⃣ 1989 - Multi-Layer Perceptron on Real Data (LeCun)
Scaling Up: From toy problems to real-world pattern recognition.
```python
# Deeper network for image classification
model = Sequential([
    Linear(64, 128),   # Input (8×8 images flattened)
    ReLU(),            # Modern activation
    Linear(128, 64),   # Hidden layer
    ReLU(),
    Linear(64, 10)     # 10 digit classes
])
```
What it learns: Handwritten digit recognition (TinyDigits dataset)
Key innovations:
- Deeper architectures: Multiple hidden layers
- Real data: 1000 training images, 200 test images
- Classification: Multi-class output (10 digits)
Why it matters:
- Proved neural networks work on real-world data
- Showed that depth helps (but flattening images loses spatial structure)
- Foundation for modern deep learning
Verification:
- ✅ Test accuracy >80%
- ✅ Loss decreases >50%
- ✅ All layers receive gradients
- ✅ Generalizes to unseen test data
Training setup (fair comparison with CNN):
- Batch size: 32
- Epochs: 25
- Total updates: 775
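The 775 figure follows directly from the numbers above, assuming each epoch uses only full batches (i.e. the final partial batch of the 1000-image training set is dropped):

```python
# Where "775 updates" comes from (assumes the last partial batch per epoch is dropped)
train_images = 1000
batch_size = 32
epochs = 25

batches_per_epoch = train_images // batch_size    # 31 full batches
total_updates = batches_per_epoch * epochs        # 31 * 25 = 775
print(total_updates)                              # 775
```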
### 4️⃣ 1998 - Convolutional Neural Networks (Yann LeCun)
Spatial Structure: Stop flattening images - preserve their 2D structure!
```python
# Convolutional architecture
model = Sequential([
    Conv2d(1, 8, kernel_size=3),   # Learn spatial filters
    ReLU(),
    MaxPool2d(kernel_size=2),      # Spatial downsampling
    Flatten(),
    Linear(8 * 3 * 3, 10)          # Classification head
])
```
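The `8 * 3 * 3` input to the classification head follows from the spatial shapes: a 3×3 convolution without padding shrinks the 8×8 digits to 6×6, and 2×2 max pooling halves that to 3×3 over 8 channels (stride 1 and no padding are assumed here, consistent with the flatten size). Counting weights and biases for both models also quantifies the parameter-efficiency point made below.

```python
# Spatial shapes for the 8×8 TinyDigits images (stride 1, no padding assumed)
after_conv = 8 - 3 + 1          # 3×3 kernel: 8×8 → 6×6
after_pool = after_conv // 2    # 2×2 max pool: 6×6 → 3×3
flat_features = 8 * after_pool * after_pool
print(flat_features)            # 72, hence Linear(8 * 3 * 3, 10)

# Parameter counts (weights + biases) for the two architectures shown above
mlp = (64 * 128 + 128) + (128 * 64 + 64) + (64 * 10 + 10)
cnn = (8 * 1 * 3 * 3 + 8) + (flat_features * 10 + 10)
print(mlp, cnn)                 # 17226 vs 810 parameters
```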
What it learns: Same digit recognition, but with spatial awareness
Key innovations:
- Convolution: Shared weights that scan across the image
- Spatial hierarchy: Early layers detect edges, later layers detect shapes
- Translation invariance: Digit in any position gets recognized
- Parameter efficiency: Fewer parameters than MLP
MLP vs CNN comparison (fair setup):
| Architecture | Batch Size | Epochs | Updates | Final Accuracy | Loss Decrease |
|---|---|---|---|---|---|
| MLP | 32 | 25 | 775 | 82.0% | 52.3% |
| CNN | 32 | 25 | 775 | 82.0% | 68.1% |
Key insights:
- Same final accuracy on 8×8 images (too small for CNNs to shine)
- CNN converges faster (68% vs 52% loss reduction)
- On larger images (32×32, 224×224), CNNs dominate
- Spatial inductive bias helps even when images are tiny
Verification:
- ✅ Test accuracy >80%
- ✅ Convolution gradients flow properly
- ✅ Spatial features learned
- ✅ More efficient learning than MLP
### 5️⃣ 2017 - Transformer (Attention) (Vaswani et al.)
Sequence Processing: From spatial structure to temporal/sequential structure.
```python
# Transformer architecture
model = Sequential([
    Embedding(vocab_size, d_model),          # Token → vector
    PositionalEncoding(d_model, max_len),    # Add position info
    MultiHeadAttention(d_model, num_heads),  # Attend to all positions
    Linear(d_model, vocab_size)              # Predict next token
])
```
What it learns: Sequence copying - the foundation of language modeling
Key innovations:
- Self-attention: Each position attends to all other positions (see the numerical sketch after this list)
- Positional encoding: Inject sequence order information
- No recurrence: Parallel processing of entire sequence
- Multi-head attention: Learn multiple attention patterns
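At its core, the self-attention step is `softmax(QKᵀ / √d) · V` from the 2017 paper. Here is a tiny single-head NumPy illustration with random weights; the sequence length of 4 matches the copy task below, while `d_model = 16` is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 16                         # seq_len from the copy task; d_model arbitrary

x = rng.normal(size=(seq_len, d_model))          # embedded + position-encoded tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # project to queries, keys, values
scores = Q @ K.T / np.sqrt(d_model)              # every position scores every position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
out = weights @ V                                # weighted mix of all positions

print(weights.shape)   # (4, 4): position i attends to all 4 positions
print(out.shape)       # (4, 16): new representation for each position
```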
The copy task:
```
Input:  [1, 2, 3, 4]
Target: [1, 2, 3, 4]
```
Simple, but requires:
- Embeddings to represent tokens
- Positional encoding to know order
- Attention to copy the right token to each position
- Gradient flow through all components
Why copy matters:
- Tests attention mechanism in isolation
- Proves positional encoding works
- Foundation for language modeling (predict next token)
- If it can't copy, it can't do language
Verification:
- ✅ Perfect accuracy (100%) on copy task
- ✅ All 19 parameters receive gradients
- ✅ Embeddings, positions, attention all learn
- ✅ Attention weights show correct patterns
Training setup:
- Batch size: 32
- Epochs: 50
- Sequence length: 4
- Vocabulary: 10 tokens
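A minimal sketch of how copy-task batches can be built from these numbers (plain NumPy; the test's actual data pipeline may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len, batch_size = 10, 4, 32   # values from the setup above

tokens = rng.integers(0, vocab_size, size=(batch_size, seq_len))
inputs, targets = tokens, tokens.copy()       # copy task: the target is the input itself
print(inputs[0], targets[0])                  # two identical length-4 sequences
```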
## 🔗 How They Connect: The Through-Line
1. Perceptron → Backpropagation
- Problem: Perceptron can't learn XOR (non-linear patterns)
- Solution: Add hidden layers + non-linearity
- Requirement: Need backpropagation to train multiple layers
2. Backpropagation → MLP
- Problem: XOR is a toy problem
- Solution: Scale to real data (images, many classes)
- Requirement: Deeper networks, more data, better optimization
3. MLP → CNN
- Problem: Flattening images loses spatial structure
- Solution: Convolution preserves 2D relationships
- Requirement: New operations (Conv2d, MaxPool2d) with proper gradients
4. CNN → Transformer
- Problem: Images have spatial structure, but sequences have temporal structure
- Solution: Attention mechanism to relate positions
- Requirement: Embeddings, positional encoding, attention with proper gradients
5. The Common Thread
Every breakthrough requires:
- New architecture (more expressive)
- Proper gradients (backprop through new operations)
- Verification (actually learns on appropriate task)
## 🎓 Educational Value
For Students:
- Historical context: See why each innovation mattered
- Hands-on verification: Run the tests, see them learn
- Building blocks: Each milestone uses previous ones
- Debugging skills: If a test fails, gradients aren't flowing
For Instructors:
- Progression: Natural curriculum from simple to complex
- Verification: Proof that implementations are correct
- Comparisons: Fair benchmarks (MLP vs CNN)
- Debugging: Tests catch common implementation errors
## 🚀 Running the Tests
Run all milestones:
```bash
pytest tests/milestones/test_learning_verification.py -v
```
Run individual milestones:
```bash
# Test 1: Perceptron
pytest tests/milestones/test_learning_verification.py::test_perceptron_learning -v

# Test 2: XOR
pytest tests/milestones/test_learning_verification.py::test_xor_learning -v

# Test 3: MLP Digits
pytest tests/milestones/test_learning_verification.py::test_mlp_digits_learning -v

# Test 4: CNN
pytest tests/milestones/test_learning_verification.py::test_cnn_learning -v

# Test 5: Transformer
pytest tests/milestones/test_learning_verification.py::test_transformer_learning -v
```
Expected output:
```
✅ 5 passed in 90s
```
## 📊 What Each Test Verifies
| Milestone | Loss ↓ | Accuracy | Gradients | Weights Updated |
|---|---|---|---|---|
| Perceptron | >50% | >90% | 2/2 | ✅ |
| XOR | >50% | >90% | 8/8 | ✅ |
| MLP Digits | >50% | >80% | 6/6 | ✅ |
| CNN | >50% | >80% | 6/6 | ✅ |
| Transformer | >50% | 100% | 19/19 | ✅ |
## 🐛 Common Issues
If a test fails:
- No gradients: Check `requires_grad=True` on parameters
- Gradients don't flow: Check backward functions in operations
- Loss doesn't decrease: Check learning rate, optimizer
- Low accuracy: Check model architecture, training duration
- Weights don't update: Check optimizer step, zero_grad
Debugging workflow:
```python
# 1. Check gradients exist
for param in model.parameters():
    print(param.grad)

# 2. Check gradient magnitudes
for name, param in model.named_parameters():
    print(f"{name}: {param.grad.data.abs().mean()}")

# 3. Check weight changes
initial_weights = [p.data.copy() for p in model.parameters()]
# ... train ...
for i, param in enumerate(model.parameters()):
    diff = (param.data - initial_weights[i]).abs().mean()
    print(f"Param {i} changed by: {diff}")
```
## 📖 Further Reading
- Perceptron: Rosenblatt (1957) "The Perceptron: A Probabilistic Model"
- Backpropagation: Rumelhart et al. (1986) "Learning representations by back-propagating errors"
- MLP: LeCun et al. (1989) "Backpropagation Applied to Handwritten Zip Code Recognition"
- CNN: LeCun et al. (1998) "Gradient-Based Learning Applied to Document Recognition"
- Transformer: Vaswani et al. (2017) "Attention Is All You Need"
## 🎯 Success Criteria
All tests pass when:
- ✅ Loss decreases significantly (>50%)
- ✅ Accuracy meets threshold (varies by task)
- ✅ All parameters receive gradients
- ✅ Weights actually update during training
- ✅ Model generalizes to test data
## 🏆 Current Status
All 5 milestones passing ✅
test_perceptron_learning ✅
test_xor_learning ✅
test_mlp_digits_learning ✅
test_cnn_learning ✅
test_transformer_learning ✅
TinyTorch successfully implements 60+ years of neural network history!