Fix tutorial master plan: Logical module sequence for Phase 2

- Phase 2 now: Autograd → Spatial → Optimizers → DataLoader → Training - Move Spatial (CNNs) from Phase 3 to Phase 2 Module 7 - Integrate BatchNorm into Spatial module (mirrors PyTorch patterns) - Fix milestone: CNN training achievable at end of Phase 2 (Module 10) - Phase 3 focuses on language: Tokenization → Embeddings → Attention → Transformers - Logical dependency flow: understand conv operations before optimizing them
2026-04-26 21:07:32 -05:00 · 2025-09-23 18:28:44 -04:00
parent 3edd6af0cd
commit 65f662e00b
1 changed files with 179 additions and 0 deletions
--- a/docs/tutorial-master-plan.md
+++ b/docs/tutorial-master-plan.md
@@ -0,0 +1,179 @@
+# 🎯 TinyTorch Tutorial Master Plan: Complete ML Systems Engineering
+
+## Vision Statement
+**Students build a complete ML framework from scratch, learning systems engineering through hands-on implementation. From basic tensors to production-optimized transformers, every line of code teaches both algorithms AND systems thinking.**
+
+---
+
+## 📚 **Core Curriculum: 15 Modules (Complete ML Systems Education)**
+
+### **Phase 1: Foundation (Modules 1-5)**
+*Build → Use: Mathematical foundations with immediate application*
+
+| Module | Name | What Students Build | Systems Engineering Concepts |
+|--------|------|---------------------|----------------------------|
+| **1** | **Setup** | • Virtual environment configuration<br>• Rich CLI progress tracking<br>• Memory profiler setup<br>• Testing infrastructure | • Development environment best practices<br>• Profiling and measurement tools<br>• Testing frameworks<br>• Dependency management |
+| **2** | **Tensor** | • N-dimensional Tensor class<br>• Broadcasting operations<br>• Memory views and slicing<br>• Basic math ops (+, -, *, /) | • Memory layout (row-major vs column-major)<br>• Zero-copy operations with views<br>• Cache-friendly memory access patterns<br>• Vectorization opportunities |
+| **3** | **Layers** | • Module base class<br>• Parameter management<br>• Linear/Dense layer implementation<br>• Forward/backward protocol | • Object-oriented design for ML<br>• Parameter memory overhead<br>• Matrix multiplication complexity O(N³)<br>• Cache effects in GEMM |
+| **4** | **Activations** | • ReLU, Sigmoid, Tanh, Softmax<br>• Backward passes for each<br>• In-place operations<br>• Numerical stability fixes | • In-place vs copy memory tradeoffs<br>• Numerical stability (overflow/underflow)<br>• Memory allocation patterns<br>• Why nonlinearity enables learning |
+| **5** | **Networks** | • Sequential container<br>• Multi-layer composition<br>• Weight initialization strategies<br>• Complete neural network class | • Network depth vs memory scaling<br>• Gradient flow in deep networks<br>• Initialization impact on convergence<br>• Parameter scaling with network size |
+
+**🎉 Milestone: Inference Examples Unlocked**
+- Students can run pretrained XOR, MNIST, and CIFAR-10 models
+- **Learning Validation**: "I built the mathematical foundation for all neural networks"
+
+---
+
+### **Phase 2: Vision Training (Modules 6-10)**
+*Learn → Optimize: Complete CNN training capabilities*
+
+| Module | Name | What Students Build | Systems Engineering Concepts |
+|--------|------|---------------------|----------------------------|
+| **6** | **Autograd** | • Computational graph<br>• Automatic differentiation<br>• Gradient accumulation<br>• Memory checkpointing | • Graph memory explosion O(N)<br>• Forward vs reverse mode AD<br>• Gradient checkpointing tradeoffs<br>• Memory efficient backpropagation |
+| **7** | **Spatial (CNNs)** | • Conv2d layer implementation<br>• BatchNorm for training stability<br>• MaxPool2d operations<br>• Complete CNN architectures | • Convolution complexity O(N²K²C²)<br>• Feature map memory scaling<br>• BatchNorm parameter overhead<br>• Cache-friendly convolution patterns |
+| **8** | **Optimizers** | • SGD with momentum<br>• Adam optimizer<br>• Memory buffers for conv weights<br>• Learning rate scheduling | • Adam memory cost: 3× parameters<br>• Conv weight memory scaling<br>• Momentum buffer allocation<br>• Convergence vs memory tradeoffs |
+| **9** | **DataLoader** | • Dataset abstraction class<br>• CIFAR-10 image data loader<br>• Batch sampling for CNNs<br>• Image preprocessing pipeline | • I/O bottlenecks for image data<br>• Memory vs disk tradeoffs<br>• Image batch size impact on throughput<br>• Data pipeline optimization for vision |
+| **10** | **Training** | • CNN training loops<br>• CrossEntropy loss for classification<br>• Validation on CIFAR-10<br>• Model checkpointing | • CNN memory during training (conv + BatchNorm)<br>• Image batch gradient accumulation<br>• Model checkpoint disk I/O<br>• CIFAR-10 training memory profiling |
+
+**🎉 Milestone: CNN Training Unlocked**
+- Students train CNNs on CIFAR-10 to 75% accuracy
+- **Learning Validation**: "I understand how modern ML training works under the hood"
+
+---
+
+### **Phase 3: Language & Advanced Architectures (Modules 11-15)**
+*Specialize → Apply: Language models and advanced techniques*
+
+| Module | Name | What Students Build | Systems Engineering Concepts |
+|--------|------|---------------------|----------------------------|
+| **11** | **Tokenization** | • Character tokenizer<br>• BPE tokenizer basics<br>• Vocabulary management<br>• Padding and truncation | • Memory efficiency of token representations<br>• Vocabulary size vs model size tradeoffs<br>• Tokenization throughput optimization<br>• String processing performance |
+| **12** | **Embeddings** | • Embedding layer implementation<br>• Positional encodings<br>• Learned vs fixed embeddings<br>• Embedding initialization | • Embedding table memory (vocab_size × dim)<br>• Sparse vs dense lookup operations<br>• Cache locality in embedding lookups<br>• Memory scaling with vocabulary size |
+| **13** | **Attention** | • Scaled dot-product attention<br>• Multi-head attention<br>• Causal masking<br>• KV-cache implementation | • Quadratic memory scaling O(N²)<br>• Attention memory bottlenecks<br>• KV-cache memory savings<br>• Sequence length vs memory tradeoffs |
+| **14** | **Transformers** | • LayerNorm for transformers<br>• Transformer block<br>• Complete TinyGPT architecture<br>• Residual connections | • LayerNorm vs BatchNorm differences<br>• Layer memory accumulation<br>• Activation memory per transformer layer<br>• Residual path gradient flow |
+| **15** | **Generation** | • Autoregressive text generation<br>• Sampling strategies<br>• Temperature and top-k<br>• Complete TinyGPT training | • Autoregressive generation memory<br>• KV-cache efficiency during generation<br>• Sampling algorithm performance<br>• Training vs inference memory patterns |
+
+**🎉 Grand Finale: Complete TinyGPT Language Model**
+- Students build working transformer for text generation
+- **Learning Validation**: "I built a unified framework supporting both vision and language"
+
+---
+
+## 🚀 **Advanced Track: 5 Optional Modules (Production-Level Systems)**
+
+*For students wanting deeper production ML systems expertise*
+
+### **Systems Optimization Specialization (Modules 16-20)**
+
+| Module | Name | What Students Build | Systems Engineering Concepts |
+|--------|------|---------------------|----------------------------|
+| **16** | **Profiling** | • Performance measurement tools<br>• Memory usage profilers<br>• Bottleneck identification<br>• System analysis frameworks | • Systematic performance analysis<br>• Memory vs compute profiling<br>• Scaling behavior measurement<br>• Performance regression detection |
+| **17** | **Kernels** | • Optimized matrix multiplication<br>• Vectorized activations with NumPy<br>• Fused operations (relu+add)<br>• Parallel processing optimization | • Memory bandwidth optimization<br>• Kernel fusion benefits (2-5× speedup)<br>• Cache-friendly algorithms<br>• Vectorization techniques |
+| **18** | **Compression** | • Weight pruning algorithms<br>• Basic quantization (INT16)<br>• Knowledge distillation<br>• Model size reduction tools | • 4× memory reduction techniques<br>• Structured vs unstructured sparsity<br>• Distillation training loops<br>• Accuracy vs size tradeoffs |
+| **19** | **KV-Cache** | • Simple KV-cache for attention<br>• Cache hit/miss optimization<br>• Memory-efficient attention<br>• Sequence length optimization | • Memory vs computation tradeoffs<br>• Cache-aware attention algorithms<br>• O(N²) → O(N) optimization<br>• Memory allocation strategies |
+| **20** | **Competition** | • Apply ALL optimizations (16-19)<br>• Multi-objective optimization<br>• Leaderboard submission system<br>• Competition across multiple metrics | • Real-world constraint optimization<br>• Multi-metric evaluation<br>• Production-ready systems thinking<br>• Competitive optimization |
+
+**🏆 Ultimate Achievement: Production-Optimized ML Systems**
+- Students optimize their framework for speed, memory, and deployment
+- **Learning Validation**: "I understand how to build production-ready ML systems"
+
+---
+
+## 🎯 **Learning Progression & Validation**
+
+### **Module Progression Logic**
+```
+Foundation (1-5): "Can I build the math?" [Build → Use]
+    ↓
+Training (6-10): "Can I learn from data?" [Learn → Optimize]
+    ↓
+Architectures (11-15): "Can I handle multiple modalities?" [Specialize → Apply]
+    ↓
+Systems (16-20): "Can I optimize for production?" [Measure → Optimize]
+```
+
+### **Achievement Milestones**
+- **Module 5**: Run inference on pretrained models (Foundation complete)
+- **Module 10**: Train CNNs on CIFAR-10 to 75% accuracy (Vision training complete)
+- **Module 15**: Generate text with TinyGPT (Language architectures complete)
+- **Module 20**: Optimize framework for production constraints (Systems mastery)
+
+### **Systems Engineering Thread Throughout**
+Every module teaches both **algorithms AND systems**:
+- **Memory usage patterns**: How operations scale with input size
+- **Computational complexity**: O(N), O(N²), O(N³) analysis
+- **Performance bottlenecks**: Where systems break under load
+- **Production implications**: How real frameworks handle these challenges
+
+---
+
+## 📊 **Time Estimates & Scope**
+
+### **Core Curriculum (15 modules)**
+- **Time**: 4-6 hours per module = 60-90 hours total
+- **Semester fit**: 15 weeks = 4-6 hours/week (realistic)
+- **Outcome**: Complete ML systems engineer
+
+### **Advanced Track (5 modules)**  
+- **Time**: 3-4 hours per module = 15-20 hours additional
+- **Audience**: Motivated students wanting production skills
+- **Outcome**: Production-ready optimization expertise
+
+### **Total Program**
+- **Core only**: Complete foundation in 15 weeks
+- **With advanced**: Production expertise in 20 weeks
+- **Flexibility**: Natural stopping point at Module 15
+
+---
+
+## 🔄 **Continuous Systems Focus**
+
+### **Every Module Includes:**
+1. **Memory Analysis**: Explicit memory profiling and optimization
+2. **Performance Measurement**: Timing and complexity analysis
+3. **Scaling Behavior**: How does this break with larger inputs?
+4. **Production Context**: How do real systems (PyTorch/TensorFlow) handle this?
+
+### **Cumulative Systems Knowledge**
+- **Modules 1-5**: Memory-efficient operations
+- **Modules 6-10**: Training memory management  
+- **Modules 11-15**: Attention memory scaling
+- **Modules 16-20**: Production optimization techniques
+
+---
+
+## 🎯 **Success Metrics**
+
+### **Student Capabilities After Core (15 modules):**
+- **"I can build any neural network architecture from scratch"**
+- **"I understand memory and performance implications of my code"**
+- **"I can train models on real datasets like CIFAR-10"**
+- **"I can extend my framework to new modalities (vision → language)"**
+
+### **Student Capabilities After Advanced (20 modules):**
+- **"I can optimize ML systems for production constraints"**
+- **"I understand the engineering tradeoffs in real ML frameworks"**
+- **"I can measure, profile, and systematically improve performance"**
+- **"I can compete in optimization challenges using my own code"**
+
+---
+
+## 🚀 **Why This Approach Works**
+
+### **Learning Through Building**
+Students don't just study ML algorithms - they **build the infrastructure** that makes modern AI possible.
+
+### **Systems Engineering Focus**
+Every concept is taught through the lens of **memory, performance, and scaling** - the core of ML systems engineering.
+
+### **Progressive Complexity**
+Clear progression from basic math operations to production-optimized transformers.
+
+### **Immediate Validation**
+Students can run inference, train models, and generate text using code they built themselves.
+
+### **Industry Relevance**
+Skills transfer directly to understanding PyTorch, TensorFlow, and production ML systems.
+
+---
+
+**🎉 Final Achievement: Students build a complete, optimized ML framework from scratch and understand every line of code in modern AI systems.**