Fix tutorial master plan: Logical module sequence for Phase 2

- Phase 2 now: Autograd → Spatial → Optimizers → DataLoader → Training
- Move Spatial (CNNs) from Phase 3 to Phase 2 Module 7
- Integrate BatchNorm into Spatial module (mirrors PyTorch patterns)
- Fix milestone: CNN training achievable at end of Phase 2 (Module 10)
- Phase 3 focuses on language: Tokenization → Embeddings → Attention → Transformers
- Logical dependency flow: understand conv operations before optimizing them
This commit is contained in:
Vijay Janapa Reddi
2025-09-23 18:28:44 -04:00
parent 3edd6af0cd
commit 65f662e00b

View File

@@ -0,0 +1,179 @@
# 🎯 TinyTorch Tutorial Master Plan: Complete ML Systems Engineering
## Vision Statement
**Students build a complete ML framework from scratch, learning systems engineering through hands-on implementation. From basic tensors to production-optimized transformers, every line of code teaches both algorithms AND systems thinking.**
---
## 📚 **Core Curriculum: 15 Modules (Complete ML Systems Education)**
### **Phase 1: Foundation (Modules 1-5)**
*Build → Use: Mathematical foundations with immediate application*
| Module | Name | What Students Build | Systems Engineering Concepts |
|--------|------|---------------------|----------------------------|
| **1** | **Setup** | • Virtual environment configuration<br>• Rich CLI progress tracking<br>• Memory profiler setup<br>• Testing infrastructure | • Development environment best practices<br>• Profiling and measurement tools<br>• Testing frameworks<br>• Dependency management |
| **2** | **Tensor** | • N-dimensional Tensor class<br>• Broadcasting operations<br>• Memory views and slicing<br>• Basic math ops (+, -, *, /) | • Memory layout (row-major vs column-major)<br>• Zero-copy operations with views<br>• Cache-friendly memory access patterns<br>• Vectorization opportunities |
| **3** | **Layers** | • Module base class<br>• Parameter management<br>• Linear/Dense layer implementation<br>• Forward/backward protocol | • Object-oriented design for ML<br>• Parameter memory overhead<br>• Matrix multiplication complexity O(N³)<br>• Cache effects in GEMM |
| **4** | **Activations** | • ReLU, Sigmoid, Tanh, Softmax<br>• Backward passes for each<br>• In-place operations<br>• Numerical stability fixes | • In-place vs copy memory tradeoffs<br>• Numerical stability (overflow/underflow)<br>• Memory allocation patterns<br>• Why nonlinearity enables learning |
| **5** | **Networks** | • Sequential container<br>• Multi-layer composition<br>• Weight initialization strategies<br>• Complete neural network class | • Network depth vs memory scaling<br>• Gradient flow in deep networks<br>• Initialization impact on convergence<br>• Parameter scaling with network size |
**🎉 Milestone: Inference Examples Unlocked**
- Students can run pretrained XOR, MNIST, and CIFAR-10 models
- **Learning Validation**: "I built the mathematical foundation for all neural networks"
---
### **Phase 2: Vision Training (Modules 6-10)**
*Learn → Optimize: Complete CNN training capabilities*
| Module | Name | What Students Build | Systems Engineering Concepts |
|--------|------|---------------------|----------------------------|
| **6** | **Autograd** | • Computational graph<br>• Automatic differentiation<br>• Gradient accumulation<br>• Memory checkpointing | • Graph memory explosion O(N)<br>• Forward vs reverse mode AD<br>• Gradient checkpointing tradeoffs<br>• Memory efficient backpropagation |
| **7** | **Spatial (CNNs)** | • Conv2d layer implementation<br>• BatchNorm for training stability<br>• MaxPool2d operations<br>• Complete CNN architectures | • Convolution complexity O(N²K²C²)<br>• Feature map memory scaling<br>• BatchNorm parameter overhead<br>• Cache-friendly convolution patterns |
| **8** | **Optimizers** | • SGD with momentum<br>• Adam optimizer<br>• Memory buffers for conv weights<br>• Learning rate scheduling | • Adam memory cost: 3× parameters<br>• Conv weight memory scaling<br>• Momentum buffer allocation<br>• Convergence vs memory tradeoffs |
| **9** | **DataLoader** | • Dataset abstraction class<br>• CIFAR-10 image data loader<br>• Batch sampling for CNNs<br>• Image preprocessing pipeline | • I/O bottlenecks for image data<br>• Memory vs disk tradeoffs<br>• Image batch size impact on throughput<br>• Data pipeline optimization for vision |
| **10** | **Training** | • CNN training loops<br>• CrossEntropy loss for classification<br>• Validation on CIFAR-10<br>• Model checkpointing | • CNN memory during training (conv + BatchNorm)<br>• Image batch gradient accumulation<br>• Model checkpoint disk I/O<br>• CIFAR-10 training memory profiling |
**🎉 Milestone: CNN Training Unlocked**
- Students train CNNs on CIFAR-10 to 75% accuracy
- **Learning Validation**: "I understand how modern ML training works under the hood"
---
### **Phase 3: Language & Advanced Architectures (Modules 11-15)**
*Specialize → Apply: Language models and advanced techniques*
| Module | Name | What Students Build | Systems Engineering Concepts |
|--------|------|---------------------|----------------------------|
| **11** | **Tokenization** | • Character tokenizer<br>• BPE tokenizer basics<br>• Vocabulary management<br>• Padding and truncation | • Memory efficiency of token representations<br>• Vocabulary size vs model size tradeoffs<br>• Tokenization throughput optimization<br>• String processing performance |
| **12** | **Embeddings** | • Embedding layer implementation<br>• Positional encodings<br>• Learned vs fixed embeddings<br>• Embedding initialization | • Embedding table memory (vocab_size × dim)<br>• Sparse vs dense lookup operations<br>• Cache locality in embedding lookups<br>• Memory scaling with vocabulary size |
| **13** | **Attention** | • Scaled dot-product attention<br>• Multi-head attention<br>• Causal masking<br>• KV-cache implementation | • Quadratic memory scaling O(N²)<br>• Attention memory bottlenecks<br>• KV-cache memory savings<br>• Sequence length vs memory tradeoffs |
| **14** | **Transformers** | • LayerNorm for transformers<br>• Transformer block<br>• Complete TinyGPT architecture<br>• Residual connections | • LayerNorm vs BatchNorm differences<br>• Layer memory accumulation<br>• Activation memory per transformer layer<br>• Residual path gradient flow |
| **15** | **Generation** | • Autoregressive text generation<br>• Sampling strategies<br>• Temperature and top-k<br>• Complete TinyGPT training | • Autoregressive generation memory<br>• KV-cache efficiency during generation<br>• Sampling algorithm performance<br>• Training vs inference memory patterns |
**🎉 Grand Finale: Complete TinyGPT Language Model**
- Students build working transformer for text generation
- **Learning Validation**: "I built a unified framework supporting both vision and language"
---
## 🚀 **Advanced Track: 5 Optional Modules (Production-Level Systems)**
*For students wanting deeper production ML systems expertise*
### **Systems Optimization Specialization (Modules 16-20)**
| Module | Name | What Students Build | Systems Engineering Concepts |
|--------|------|---------------------|----------------------------|
| **16** | **Profiling** | • Performance measurement tools<br>• Memory usage profilers<br>• Bottleneck identification<br>• System analysis frameworks | • Systematic performance analysis<br>• Memory vs compute profiling<br>• Scaling behavior measurement<br>• Performance regression detection |
| **17** | **Kernels** | • Optimized matrix multiplication<br>• Vectorized activations with NumPy<br>• Fused operations (relu+add)<br>• Parallel processing optimization | • Memory bandwidth optimization<br>• Kernel fusion benefits (2-5× speedup)<br>• Cache-friendly algorithms<br>• Vectorization techniques |
| **18** | **Compression** | • Weight pruning algorithms<br>• Basic quantization (INT16)<br>• Knowledge distillation<br>• Model size reduction tools | • 4× memory reduction techniques<br>• Structured vs unstructured sparsity<br>• Distillation training loops<br>• Accuracy vs size tradeoffs |
| **19** | **KV-Cache** | • Simple KV-cache for attention<br>• Cache hit/miss optimization<br>• Memory-efficient attention<br>• Sequence length optimization | • Memory vs computation tradeoffs<br>• Cache-aware attention algorithms<br>• O(N²) → O(N) optimization<br>• Memory allocation strategies |
| **20** | **Competition** | • Apply ALL optimizations (16-19)<br>• Multi-objective optimization<br>• Leaderboard submission system<br>• Competition across multiple metrics | • Real-world constraint optimization<br>• Multi-metric evaluation<br>• Production-ready systems thinking<br>• Competitive optimization |
**🏆 Ultimate Achievement: Production-Optimized ML Systems**
- Students optimize their framework for speed, memory, and deployment
- **Learning Validation**: "I understand how to build production-ready ML systems"
---
## 🎯 **Learning Progression & Validation**
### **Module Progression Logic**
```
Foundation (1-5): "Can I build the math?" [Build → Use]
Training (6-10): "Can I learn from data?" [Learn → Optimize]
Architectures (11-15): "Can I handle multiple modalities?" [Specialize → Apply]
Systems (16-20): "Can I optimize for production?" [Measure → Optimize]
```
### **Achievement Milestones**
- **Module 5**: Run inference on pretrained models (Foundation complete)
- **Module 10**: Train CNNs on CIFAR-10 to 75% accuracy (Vision training complete)
- **Module 15**: Generate text with TinyGPT (Language architectures complete)
- **Module 20**: Optimize framework for production constraints (Systems mastery)
### **Systems Engineering Thread Throughout**
Every module teaches both **algorithms AND systems**:
- **Memory usage patterns**: How operations scale with input size
- **Computational complexity**: O(N), O(N²), O(N³) analysis
- **Performance bottlenecks**: Where systems break under load
- **Production implications**: How real frameworks handle these challenges
---
## 📊 **Time Estimates & Scope**
### **Core Curriculum (15 modules)**
- **Time**: 4-6 hours per module = 60-90 hours total
- **Semester fit**: 15 weeks = 4-6 hours/week (realistic)
- **Outcome**: Complete ML systems engineer
### **Advanced Track (5 modules)**
- **Time**: 3-4 hours per module = 15-20 hours additional
- **Audience**: Motivated students wanting production skills
- **Outcome**: Production-ready optimization expertise
### **Total Program**
- **Core only**: Complete foundation in 15 weeks
- **With advanced**: Production expertise in 20 weeks
- **Flexibility**: Natural stopping point at Module 15
---
## 🔄 **Continuous Systems Focus**
### **Every Module Includes:**
1. **Memory Analysis**: Explicit memory profiling and optimization
2. **Performance Measurement**: Timing and complexity analysis
3. **Scaling Behavior**: How does this break with larger inputs?
4. **Production Context**: How do real systems (PyTorch/TensorFlow) handle this?
### **Cumulative Systems Knowledge**
- **Modules 1-5**: Memory-efficient operations
- **Modules 6-10**: Training memory management
- **Modules 11-15**: Attention memory scaling
- **Modules 16-20**: Production optimization techniques
---
## 🎯 **Success Metrics**
### **Student Capabilities After Core (15 modules):**
- **"I can build any neural network architecture from scratch"**
- **"I understand memory and performance implications of my code"**
- **"I can train models on real datasets like CIFAR-10"**
- **"I can extend my framework to new modalities (vision → language)"**
### **Student Capabilities After Advanced (20 modules):**
- **"I can optimize ML systems for production constraints"**
- **"I understand the engineering tradeoffs in real ML frameworks"**
- **"I can measure, profile, and systematically improve performance"**
- **"I can compete in optimization challenges using my own code"**
---
## 🚀 **Why This Approach Works**
### **Learning Through Building**
Students don't just study ML algorithms - they **build the infrastructure** that makes modern AI possible.
### **Systems Engineering Focus**
Every concept is taught through the lens of **memory, performance, and scaling** - the core of ML systems engineering.
### **Progressive Complexity**
Clear progression from basic math operations to production-optimized transformers.
### **Immediate Validation**
Students can run inference, train models, and generate text using code they built themselves.
### **Industry Relevance**
Skills transfer directly to understanding PyTorch, TensorFlow, and production ML systems.
---
**🎉 Final Achievement: Students build a complete, optimized ML framework from scratch and understand every line of code in modern AI systems.**