# πŸš€ TinyTorch Final Module Plan: 17 Modules to ML Systems Mastery

## Overview: Four Learning Phases

- **Phase 1: Foundation (Modules 1-5)** β†’ Unlock Inference Examples
- **Phase 2: Training & Vision (Modules 6-10)** β†’ Unlock CNN Training
- **Phase 3: Language (Modules 11-14)** β†’ Unlock TinyGPT
- **Phase 4: Systems Optimization (Modules 15-17)** β†’ Unlock Competition

---

## πŸ“š Phase 1: Foundation - "Look What You Can Already Do!"

### Module 01: Setup

**What Students Build:**
- Virtual environment configuration
- Rich CLI for beautiful progress tracking
- Testing infrastructure
- Development tools (debugger, profiler stubs)

**Systems Concepts:**
- Development environment best practices
- Dependency management
- Testing frameworks

### Module 02: Tensor

**What Students Build:**
- N-dimensional array class
- Broadcasting operations
- Memory-efficient views and slicing
- Basic math operations (+, -, *, /)

**Systems Concepts:**
- Memory layout (row-major vs column-major)
- Cache efficiency
- Vectorization opportunities
- O(1) vs O(N) operations

### Module 03: Activations

**What Students Build:**
- ReLU, Sigmoid, Tanh, Softmax
- Backward pass for each activation
- Numerical stability (LogSoftmax)

**Systems Concepts:**
- Numerical stability (overflow/underflow; sketched below)
- Computational complexity per activation
- Memory requirements (in-place vs copy)
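To make the stability bullet concrete, here is a minimal NumPy sketch of the max-subtraction (log-sum-exp) trick; function names and shapes are illustrative, not the module's required API:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtracting the row max makes every
    exponent <= 0, so np.exp can never overflow."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def log_softmax(x):
    """LogSoftmax via log-sum-exp: computes log probabilities directly,
    avoiding log(softmax(x)), which underflows for very negative logits."""
    m = np.max(x, axis=-1, keepdims=True)
    return x - m - np.log(np.sum(np.exp(x - m), axis=-1, keepdims=True))

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp(1000) is inf
print(softmax(logits))      # [0.09003057 0.24472847 0.66524096]
print(log_softmax(logits))  # [-2.40760596 -1.40760596 -0.40760596]
```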
### Module 04: Layers

**What Students Build:**
- Module base class
- Parameter management
- Forward/backward protocol
- Layer composition patterns

**Systems Concepts:**
- Object-oriented design for ML
- Memory management for parameters
- Modular architecture benefits

### Module 05: Networks (Dense)

**What Students Build:**
- Linear/Dense layer
- Sequential container
- Basic neural network class
- Weight initialization

**Systems Concepts:**
- Matrix multiplication complexity (naive NΓ—N matmul is O(NΒ³); a Linear layer forward costs O(batch Γ— in_features Γ— out_features))
- Parameter memory scaling
- Why initialization matters

**πŸŽ‰ UNLOCK: Inference Examples!**
- Run pretrained XOR network
- Run pretrained MNIST classifier
- Run pretrained CIFAR-10 CNN
- Students see their code actually works!

---

## πŸ“š Phase 2: Training & Vision - "Now Train Your Own!"

### Module 06: DataLoader

**What Students Build:**
- Dataset abstraction
- Batch sampling
- Shuffling and iteration
- CIFAR-10 loader

**Systems Concepts:**
- I/O bottlenecks
- Memory vs disk tradeoffs
- Prefetching and pipelining

### Module 07: Autograd

**What Students Build:**
- Computational graph
- Automatic differentiation
- Gradient accumulation
- Backward pass automation

**Systems Concepts:**
- Graph memory consumption
- Forward vs reverse mode AD
- Gradient checkpointing concepts

### Module 08: Optimizers

**What Students Build:**
- SGD with momentum
- Adam optimizer
- Learning rate scheduling
- Gradient clipping

**Systems Concepts:**
- Memory usage (Adam keeps two extra moment buffers per parameter, so roughly 3Γ— parameter memory)
- Convergence rates
- Numerical stability in updates

### Module 09: Training

**What Students Build:**
- Training loop
- Loss functions (MSE, CrossEntropy)
- Validation and metrics
- Checkpointing

**Systems Concepts:**
- Memory during training
- Gradient accumulation for large batches
- Disk I/O for checkpoints

### Module 10: Spatial (CNN)

**What Students Build:**
- Conv2d layer
- Pooling operations
- CNN architectures
- Image augmentation

**Systems Concepts:**
- Convolution complexity: O(NΒ² KΒ² C_in C_out) for an NΓ—N image with KΓ—K kernels
- Memory footprint of feature maps
- Cache-friendly implementations

**πŸŽ‰ UNLOCK: CNN Training!**
- Train CNN on CIFAR-10
- Achieve 75% accuracy milestone
- Visualize learned features

---

## πŸ“š Phase 3: Language - "From Vision to Language!"

### Module 11: Tokenization

**What Students Build:**
- Character tokenizer
- BPE tokenizer basics
- Vocabulary management
- Padding and truncation

**Systems Concepts:**
- Memory efficiency of token representations
- Vocabulary size tradeoffs
- Tokenization speed considerations

### Module 12: Embeddings

**What Students Build:**
- Embedding layer
- Positional encodings
- Learned vs fixed embeddings
- Embedding initialization

**Systems Concepts:**
- Embedding table memory (vocab_size Γ— dim)
- Sparse vs dense operations
- Cache locality in lookups

### Module 13: Attention

**What Students Build:**
- Scaled dot-product attention (first sketch below)
- Multi-head attention
- Causal masking
- KV-cache basics (second sketch below)

**Systems Concepts:**
- O(NΒ²) attention complexity
- Memory bottlenecks in attention
- Why KV-cache matters
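Since attention is the systems heart of Phase 3, here is a minimal single-head NumPy sketch of scaled dot-product attention with causal masking (illustrative names and shapes, not the module's required API); the (N, N) score matrix it materializes is exactly the O(NΒ²) cost noted above:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V: (seq_len, d_k). The (seq_len, seq_len) score matrix below is
    the source of attention's O(N^2) time and memory cost."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (N, N) pairwise scores
    mask = np.triu(np.ones_like(scores), k=1)   # 1s strictly above diagonal
    scores = np.where(mask == 1, -1e9, scores)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise stable softmax
    return weights @ V                          # (N, d_k)

rng = np.random.default_rng(0)
N, d_k = 8, 16
Q, K, V = rng.normal(size=(3, N, d_k))
print(causal_attention(Q, K, V).shape)  # (8, 16)
```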
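Continuing the same theme, a toy view of why the KV-cache matters: during autoregressive generation each step reuses the cached K and V rows of earlier tokens, so step t computes one (1, t) score row instead of a full (t, t) matrix. Names like `decode_step` and `kv_cache` are hypothetical:

```python
import numpy as np

d_k = 16
rng = np.random.default_rng(0)
kv_cache = {"K": np.empty((0, d_k)), "V": np.empty((0, d_k))}

def decode_step(q_new, k_new, v_new, cache):
    """One autoregressive decode step. q_new, k_new, v_new: (1, d_k)
    projections of the newest token. The cache grows by one row per step,
    so attending at step t costs O(t) instead of recomputing all keys."""
    cache["K"] = np.concatenate([cache["K"], k_new])
    cache["V"] = np.concatenate([cache["V"], v_new])
    scores = q_new @ cache["K"].T / np.sqrt(d_k)  # (1, t): one query row
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over cached keys
    return weights @ cache["V"]                   # (1, d_k)

for _ in range(4):  # generate four tokens
    q, k, v = rng.normal(size=(3, 1, d_k))
    out = decode_step(q, k, v, kv_cache)
print(kv_cache["K"].shape)  # (4, 16): one cached row per generated token
```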
### Module 14: Transformers

**What Students Build:**
- LayerNorm
- Transformer block
- Full GPT architecture
- Residual connections

**Systems Concepts:**
- Layer normalization stability
- Residual path gradient flow
- Transformer memory scaling

**πŸŽ‰ UNLOCK: TinyGPT!**
- Train character-level language model
- Generate text
- Compare with vision models

---

## πŸ”₯ Phase 4: Systems Optimization - "Make It Fast, Make It Small!"

### Module 15: Kernels

**What Students Build:**
- Fused operations (e.g., fused_relu_add)
- Matrix multiplication optimization
- Custom CUDA-like kernels (in NumPy)
- Operator fusion patterns

**Why Universal:**
- Works for MLPs, CNNs, and Transformers
- Reduces memory bandwidth usage
- Speeds up any model architecture

**Systems Concepts:**
- Memory bandwidth vs compute bound
- Kernel fusion benefits
- Cache optimization
- Vectorization with NumPy

**Performance Gains:**
- 2-5Γ— speedup from fusion
- Memory bandwidth reduction
- Works on CPU (NumPy vectorization)

### Module 16: Compression

**What Students Build:**
- Quantization (INT8, INT4)
- Pruning (magnitude, structured)
- Knowledge distillation setup
- Model size reduction

**Why Universal:**
- Quantize any model (MLP/CNN/GPT)
- Prune any architecture
- Distill large to small

**Systems Concepts:**
- Precision vs accuracy tradeoffs
- Structured vs unstructured sparsity
- Compression ratios
- Inference speedup from quantization

**Performance Gains:**
- 4Γ— size reduction (FP32 β†’ INT8; a minimal sketch closes this plan)
- 2Γ— inference speedup
- 90% sparsity possible

### Module 17: Competition - "The Grand Finale!"

**What Students Build:**
- KV-cache for transformers
- Dynamic batching
- Mixed precision training
- Model ensemble techniques
- All optimizations combined!

**Competition Elements:**
- **Leaderboard**: Real-time ranking
- **Metrics**: Accuracy, speed, model size
- **Constraints**: Max 10MB model, <100ms inference
- **Tasks**: CIFAR-10, MNIST, TinyGPT generation

**Systems Concepts:**
- KV-cache memory management
- Batch size vs latency tradeoffs
- Optimization stacking
- Production deployment considerations

**πŸ† GRAND FINALE:**
- Students submit optimized models
- Automatic evaluation on hidden test set
- Leaderboard shows:
  - Accuracy scores
  - Inference time
  - Model size
  - Memory usage
- Winners announced for:
  - Best accuracy
  - Fastest inference
  - Smallest model
  - Best accuracy/size ratio

---

## 🎯 Why This Structure Works

### Progressive Unlocking
1. **Modules 1-5**: Build foundation β†’ Unlock inference (immediate gratification)
2. **Modules 6-10**: Add training β†’ Unlock CNN training (real achievement)
3. **Modules 11-14**: Add language β†’ Unlock TinyGPT (wow factor)
4. **Modules 15-17**: Optimize everything β†’ Competition (epic finale)

### Universal Optimizations (Modules 15-17)
- **Not** architecture-specific
- Work on MLPs, CNNs, and Transformers
- Real production techniques
- Measurable improvements

### Competition as Culmination
- Uses EVERYTHING students built
- Competitive element drives engagement
- Multiple winning categories (not just accuracy)
- Shows real ML engineering tradeoffs
- Students optimize their own code!

### High Note Ending
- Module 15: "Make it fast!" (kernels)
- Module 16: "Make it small!" (compression)
- Module 17: "Make it production-ready!" (competition)
- Final message: "You built a complete ML framework and optimized it for production!"

---

## πŸ“Š Module Complexity Progression

```
Complexity: β–β–‚β–ƒβ–„β–„β–…β–…β–†β–†β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
Modules:    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
            └─Found.β”€β”˜β””Trainingβ”˜ β””β”€Languageβ”€β”˜β””Systemsβ”˜
Unlocks:            ↑         ↑           ↑        ↑
                Inference    CNN      TinyGPT  Competition
```

---

## 🏁 Student Journey Summary

**Week 1-2**: Foundation (Modules 1-5)
- "I built tensors and layers!"
- "I can run pretrained models!"

**Week 3-4**: Training (Modules 6-10)
- "I built autograd from scratch!"
- "I trained a CNN to 75% accuracy!"

**Week 5-6**: Language (Modules 11-14)
- "I built attention mechanisms!"
- "I have a working GPT!"

**Week 7**: Systems (Modules 15-17)
- "I optimized everything!"
- "I'm on the leaderboard!"
- "I built a complete, optimized ML framework!"

**Final Achievement**: "I didn't just learn ML algorithms - I built the entire infrastructure, optimized it for production, and competed against my peers. I understand ML systems engineering!"
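Finally, to ground Module 16's "4Γ— size reduction (FP32 β†’ INT8)" figure, a minimal sketch of symmetric post-training quantization (one assumed scheme among several; function names are hypothetical):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map FP32 weights onto integers
    in [-127, 127]. Storing int8 instead of float32 is the 4x saving."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values for computation."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes // q.nbytes)                           # 4 (float32 -> int8)
print(float(np.abs(w - dequantize(q, scale)).max()))  # small rounding error
```

Real INT8 inference keeps the weights in int8 and folds the scale into the matmul; this sketch only demonstrates the storage saving and the rounding error behind the precision vs accuracy tradeoff.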