# 🎯 TinyTorch Tutorial Master Plan: Complete ML Systems Engineering

## Vision Statement

**Students build a complete ML framework from scratch, learning systems engineering through hands-on implementation. From basic tensors to production-optimized transformers, every line of code teaches both algorithms AND systems thinking.**

---

## 📚 **Core Curriculum: 15 Modules (Complete ML Systems Education)**

### **Phase 1: Foundation (Modules 1-5)**
*Build → Use: Mathematical foundations with immediate application*

| Module | Name | What Students Build | Systems Engineering Concepts |
|--------|------|---------------------|----------------------------|
| **1** | **Setup** | • Virtual environment configuration<br>• Rich CLI progress tracking<br>• Memory profiler setup<br>• Testing infrastructure | • Development environment best practices<br>• Profiling and measurement tools<br>• Testing frameworks<br>• Dependency management |
| **2** | **Tensor** | • N-dimensional Tensor class<br>• Broadcasting operations<br>• Memory views and slicing<br>• Basic math ops (+, -, *, /) | • Memory layout (row-major vs column-major)<br>• Zero-copy operations with views<br>• Cache-friendly memory access patterns<br>• Vectorization opportunities |
| **3** | **Layers** | • Module base class<br>• Parameter management<br>• Linear/Dense layer implementation<br>• Forward/backward protocol | • Object-oriented design for ML<br>• Parameter memory overhead<br>• Matrix multiplication complexity O(N³)<br>• Cache effects in GEMM |
| **4** | **Activations** | • ReLU, Sigmoid, Tanh, Softmax<br>• Backward passes for each<br>• In-place operations<br>• Numerical stability fixes | • In-place vs copy memory tradeoffs<br>• Numerical stability (overflow/underflow)<br>• Memory allocation patterns<br>• Why nonlinearity enables learning |
| **5** | **Networks** | • Sequential container<br>• Multi-layer composition<br>• Weight initialization strategies<br>• Complete neural network class | • Network depth vs memory scaling<br>• Gradient flow in deep networks<br>• Initialization impact on convergence<br>• Parameter scaling with network size |

**🎉 Milestone: Inference Examples Unlocked**
- Students can run pretrained XOR, MNIST, and CIFAR-10 models
- **Learning Validation**: "I built the mathematical foundation for all neural networks"

---

### **Phase 2: Vision Training (Modules 6-10)**
*Learn → Optimize: Complete CNN training capabilities*

| Module | Name | What Students Build | Systems Engineering Concepts |
|--------|------|---------------------|----------------------------|
| **6** | **Autograd** | • Computational graph<br>• Automatic differentiation<br>• Gradient accumulation<br>• Memory checkpointing | • Graph memory growth: O(N) in recorded operations<br>• Forward vs reverse mode AD<br>• Gradient checkpointing tradeoffs<br>• Memory efficient backpropagation |
| **7** | **Spatial (CNNs)** | • Conv2d layer implementation<br>• BatchNorm for training stability<br>• MaxPool2d operations<br>• Complete CNN architectures | • Convolution complexity O(N²K²C²)<br>• Feature map memory scaling<br>• BatchNorm parameter overhead<br>• Cache-friendly convolution patterns |
| **8** | **Optimizers** | • SGD with momentum<br>• Adam optimizer<br>• Memory buffers for conv weights<br>• Learning rate scheduling | • Adam memory cost: 3× parameters (weights + two moment buffers)<br>• Conv weight memory scaling<br>• Momentum buffer allocation<br>• Convergence vs memory tradeoffs |
| **9** | **DataLoader** | • Dataset abstraction class<br>• CIFAR-10 image data loader<br>• Batch sampling for CNNs<br>• Image preprocessing pipeline | • I/O bottlenecks for image data<br>• Memory vs disk tradeoffs<br>• Image batch size impact on throughput<br>• Data pipeline optimization for vision |
| **10** | **Training** | • CNN training loops<br>• CrossEntropy loss for classification<br>• Validation on CIFAR-10<br>• Model checkpointing | • CNN memory during training (conv + BatchNorm)<br>• Image batch gradient accumulation<br>• Model checkpoint disk I/O<br>• CIFAR-10 training memory profiling |

**🎉 Milestone: CNN Training Unlocked**
- Students train CNNs on CIFAR-10 to 75% accuracy
- **Learning Validation**: "I understand how modern ML training works under the hood"

---

### **Phase 3: Language & Advanced Architectures (Modules 11-15)**
*Specialize → Apply: Language models and advanced techniques*

| Module | Name | What Students Build | Systems Engineering Concepts |
|--------|------|---------------------|----------------------------|
| **11** | **Tokenization** | • Character tokenizer<br>• BPE tokenizer basics<br>• Vocabulary management<br>• Padding and truncation | • Memory efficiency of token representations<br>• Vocabulary size vs model size tradeoffs<br>• Tokenization throughput optimization<br>• String processing performance |
| **12** | **Embeddings** | • Embedding layer implementation<br>• Positional encodings<br>• Learned vs fixed embeddings<br>• Embedding initialization | • Embedding table memory (vocab_size × dim)<br>• Sparse vs dense lookup operations<br>• Cache locality in embedding lookups<br>• Memory scaling with vocabulary size |
| **13** | **Attention** | • Scaled dot-product attention<br>• Multi-head attention<br>• Causal masking<br>• KV-cache implementation | • Quadratic memory scaling O(N²)<br>• Attention memory bottlenecks<br>• KV-cache memory cost vs compute savings<br>• Sequence length vs memory tradeoffs |
| **14** | **Transformers** | • LayerNorm for transformers<br>• Transformer block<br>• Complete TinyGPT architecture<br>• Residual connections | • LayerNorm vs BatchNorm differences<br>• Layer memory accumulation<br>• Activation memory per transformer layer<br>• Residual path gradient flow |
| **15** | **Generation** | • Autoregressive text generation<br>• Sampling strategies<br>• Temperature and top-k<br>• Complete TinyGPT training | • Autoregressive generation memory<br>• KV-cache efficiency during generation<br>• Sampling algorithm performance<br>• Training vs inference memory patterns |

**🎉 Grand Finale: Complete TinyGPT Language Model**
- Students build a working transformer for text generation
- **Learning Validation**: "I built a unified framework supporting both vision and language"

---

## 🚀 **Advanced Track: 5 Optional Modules (Production-Level Systems)**
*For students wanting deeper production ML systems expertise*

### **Systems Optimization Specialization (Modules 16-20)**

| Module | Name | What Students Build | Systems Engineering Concepts |
|--------|------|---------------------|----------------------------|
| **16** | **Profiling** | • Performance measurement tools<br>• Memory usage profilers<br>• Bottleneck identification<br>• System analysis frameworks | • Systematic performance analysis<br>• Memory vs compute profiling<br>• Scaling behavior measurement<br>• Performance regression detection |
| **17** | **Kernels** | • Optimized matrix multiplication<br>• Vectorized activations with NumPy<br>• Fused operations (relu+add)<br>• Parallel processing optimization | • Memory bandwidth optimization<br>• Kernel fusion benefits (2-5× speedup)<br>• Cache-friendly algorithms<br>• Vectorization techniques |
| **18** | **Compression** | • Weight pruning algorithms<br>• Basic quantization (INT16)<br>• Knowledge distillation<br>• Model size reduction tools | • 4× memory reduction techniques<br>• Structured vs unstructured sparsity<br>• Distillation training loops<br>• Accuracy vs size tradeoffs |
| **19** | **KV-Cache** | • Simple KV-cache for attention<br>• Cache hit/miss optimization<br>• Memory-efficient attention<br>• Sequence length optimization | • Memory vs computation tradeoffs<br>• Cache-aware attention algorithms<br>• O(N²) → O(N) attention compute per generated token<br>• Memory allocation strategies |
| **20** | **Competition** | • Apply ALL optimizations (16-19)<br>• Multi-objective optimization<br>• Leaderboard submission system<br>• Competition across multiple metrics | • Real-world constraint optimization<br>• Multi-metric evaluation<br>• Production-ready systems thinking<br>• Competitive optimization |

**🏆 Ultimate Achievement: Production-Optimized ML Systems**
- Students optimize their framework for speed, memory, and deployment
- **Learning Validation**: "I understand how to build production-ready ML systems"

---

## 🎯 **Learning Progression & Validation**

### **Module Progression Logic**
```
Foundation (1-5): "Can I build the math?" [Build → Use]
    ↓
Training (6-10): "Can I learn from data?" [Learn → Optimize]
    ↓
Architectures (11-15): "Can I handle multiple modalities?" [Specialize → Apply]
    ↓
Systems (16-20): "Can I optimize for production?" [Measure → Optimize]
```

### **Achievement Milestones**
- **Module 5**: Run inference on pretrained models (Foundation complete)
- **Module 10**: Train CNNs on CIFAR-10 to 75% accuracy (Vision training complete)
- **Module 15**: Generate text with TinyGPT (Language architectures complete)
- **Module 20**: Optimize framework for production constraints (Systems mastery)

### **Systems Engineering Thread Throughout**
Every module teaches both **algorithms AND systems**:
- **Memory usage patterns**: How operations scale with input size
- **Computational complexity**: O(N), O(N²), O(N³) analysis
- **Performance bottlenecks**: Where systems break under load
- **Production implications**: How real frameworks handle these challenges

---

## 📊 **Time Estimates & Scope**

### **Core Curriculum (15 modules)**
- **Time**: 4-6 hours per module = 60-90 hours total
- **Semester fit**: 15 weeks = 4-6 hours/week (realistic)
- **Outcome**: Complete ML systems engineer

### **Advanced Track (5 modules)**
- **Time**: 3-4 hours per module = 15-20 hours additional
- **Audience**: Motivated students wanting production skills
- **Outcome**: Production-ready optimization expertise

### **Total Program**
- **Core only**: Complete foundation in 15 weeks
- **With advanced**: Production expertise in 20 weeks
- **Flexibility**: Natural stopping point at Module 15

---

## 🔄 **Continuous Systems Focus**

### **Every Module Includes:**
1. **Memory Analysis**: Explicit memory profiling and optimization
2. **Performance Measurement**: Timing and complexity analysis
3. **Scaling Behavior**: How does this break with larger inputs?
4. **Production Context**: How do real systems (PyTorch/TensorFlow) handle this?

### **Cumulative Systems Knowledge**
- **Modules 1-5**: Memory-efficient operations
- **Modules 6-10**: Training memory management
- **Modules 11-15**: Attention memory scaling
- **Modules 16-20**: Production optimization techniques

---

## 🎯 **Success Metrics**

### **Student Capabilities After Core (15 modules):**
- **"I can build any neural network architecture from scratch"**
- **"I understand memory and performance implications of my code"**
- **"I can train models on real datasets like CIFAR-10"**
- **"I can extend my framework to new modalities (vision → language)"**

### **Student Capabilities After Advanced (20 modules):**
- **"I can optimize ML systems for production constraints"**
- **"I understand the engineering tradeoffs in real ML frameworks"**
- **"I can measure, profile, and systematically improve performance"**
- **"I can compete in optimization challenges using my own code"**

---

## 🚀 **Why This Approach Works**

### **Learning Through Building**
Students don't just study ML algorithms - they **build the infrastructure** that makes modern AI possible.

### **Systems Engineering Focus**
Every concept is taught through the lens of **memory, performance, and scaling** - the core of ML systems engineering.

### **Progressive Complexity**
Clear progression from basic math operations to production-optimized transformers.

### **Immediate Validation**
Students can run inference, train models, and generate text using code they built themselves.

### **Industry Relevance**
Skills transfer directly to understanding PyTorch, TensorFlow, and production ML systems.

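
As a small illustration of the memory and scaling analysis described above, the sketch below does the back-of-the-envelope arithmetic behind three figures from the module tables: Adam's 3× parameter memory, embedding table size (vocab_size × dim), and the O(N²) attention score matrix compared with a KV-cache that grows linearly with sequence length. It is not TinyTorch code: the function names and example sizes are hypothetical, every value is assumed to be float32 (4 bytes), and the 3× figure is taken to mean weights plus Adam's two moment buffers, with gradients left out.

```python
"""Back-of-the-envelope memory estimates (hypothetical helpers, not TinyTorch API)."""

BYTES_PER_FLOAT32 = 4  # assumption: every value is stored as float32


def adam_memory_bytes(num_params: int) -> int:
    """Weights plus Adam's two moment buffers: 3x parameter memory.

    Gradients are not counted; including them would give 4x.
    """
    return 3 * num_params * BYTES_PER_FLOAT32


def embedding_table_bytes(vocab_size: int, dim: int) -> int:
    """One row of `dim` floats per vocabulary entry."""
    return vocab_size * dim * BYTES_PER_FLOAT32


def attention_scores_bytes(seq_len: int, num_heads: int = 1, batch: int = 1) -> int:
    """One (seq_len x seq_len) score matrix per head: quadratic in seq_len."""
    return batch * num_heads * seq_len * seq_len * BYTES_PER_FLOAT32


def kv_cache_bytes(seq_len: int, dim: int, num_layers: int, batch: int = 1) -> int:
    """Cached keys and values: two (seq_len x dim) tensors per layer, linear in seq_len."""
    return 2 * batch * num_layers * seq_len * dim * BYTES_PER_FLOAT32


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


if __name__ == "__main__":
    # Example sizes below are made up purely for illustration.
    print(f"Adam state, 1M params:   {to_mib(adam_memory_bytes(1_000_000)):.2f} MiB")
    print(f"Embedding 50k x 128:     {to_mib(embedding_table_bytes(50_000, 128)):.2f} MiB")
    for n in (512, 1024, 2048):
        scores = to_mib(attention_scores_bytes(n, num_heads=4))
        cache = to_mib(kv_cache_bytes(n, dim=128, num_layers=4))
        print(f"seq_len={n:5d}  scores={scores:8.2f} MiB  kv_cache={cache:6.2f} MiB")
```

Under these assumptions, doubling the sequence length quadruples the score-matrix estimate but only doubles the KV-cache estimate, which is the kind of scaling contrast the Attention and KV-Cache modules ask students to measure for themselves.
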

---

**🎉 Final Achievement: Students build a complete, optimized ML framework from scratch and understand every line of code in modern AI systems.**