# 🎯 TinyTorch Tutorial Master Plan: Complete ML Systems Engineering
## Vision Statement
Students build a complete ML framework from scratch, learning systems engineering through hands-on implementation. From basic tensors to production-optimized transformers, every line of code teaches both algorithms AND systems thinking.
## 📚 Core Curriculum: 15 Modules (Complete ML Systems Education)
### Phase 1: Foundation (Modules 1-5)
*Build → Use: Mathematical foundations with immediate application*
| Module | Name | What Students Build | Systems Engineering Concepts |
|---|---|---|---|
| 1 | Setup | • Virtual environment configuration • Rich CLI progress tracking • Memory profiler setup • Testing infrastructure | • Development environment best practices • Profiling and measurement tools • Testing frameworks • Dependency management |
| 2 | Tensor | • N-dimensional Tensor class • Broadcasting operations • Memory views and slicing • Basic math ops (+, -, *, /) | • Memory layout (row-major vs column-major) • Zero-copy operations with views • Cache-friendly memory access patterns • Vectorization opportunities |
| 3 | Layers | • Module base class • Parameter management • Linear/Dense layer implementation • Forward/backward protocol | • Object-oriented design for ML • Parameter memory overhead • Matrix multiplication complexity O(N³) • Cache effects in GEMM |
| 4 | Activations | • ReLU, Sigmoid, Tanh, Softmax • Backward passes for each • In-place operations • Numerical stability fixes | • In-place vs copy memory tradeoffs • Numerical stability (overflow/underflow) • Memory allocation patterns • Why nonlinearity enables learning |
| 5 | Networks | • Sequential container • Multi-layer composition • Weight initialization strategies • Complete neural network class | • Network depth vs memory scaling • Gradient flow in deep networks • Initialization impact on convergence • Parameter scaling with network size |
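To make the end of Phase 1 concrete, here is a minimal sketch of a Module 3-style dense layer in NumPy. The `Linear` name, constructor signature, and `forward` method are illustrative assumptions, not TinyTorch's actual API:

```python
import numpy as np

class Linear:
    """Minimal dense layer: y = x @ W + b (illustrative sketch only)."""

    def __init__(self, in_features, out_features):
        # Naive small-random init; Module 5 covers principled schemes.
        self.W = np.random.randn(in_features, out_features) * 0.01
        self.b = np.zeros(out_features)

    def forward(self, x):
        # x: (batch, in_features) -> (batch, out_features).
        # This matmul is the GEMM whose cache behavior Module 3 analyzes.
        return x @ self.W + self.b

layer = Linear(784, 128)
print(layer.forward(np.random.randn(32, 784)).shape)  # (32, 128)
```

Module 6's autograd later supplies the backward half of the forward/backward protocol automatically.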
**🎉 Milestone: Inference Examples Unlocked**
- Students can run pretrained XOR, MNIST, and CIFAR-10 models
- Learning Validation: "I built the mathematical foundation for all neural networks"
### Phase 2: Vision Training (Modules 6-10)
*Learn → Optimize: Complete CNN training capabilities*
| Module | Name | What Students Build | Systems Engineering Concepts |
|---|---|---|---|
| 6 | Autograd | • Computational graph • Automatic differentiation • Gradient accumulation • Memory checkpointing | • Graph memory growth O(N) • Forward vs reverse mode AD • Gradient checkpointing tradeoffs • Memory-efficient backpropagation |
| 7 | Spatial (CNNs) | • Conv2d layer implementation • BatchNorm for training stability • MaxPool2d operations • Complete CNN architectures | • Convolution complexity O(N²K²C²) • Feature map memory scaling • BatchNorm parameter overhead • Cache-friendly convolution patterns |
| 8 | Optimizers | • SGD with momentum • Adam optimizer • Memory buffers for conv weights • Learning rate scheduling | • Adam memory cost: 3× parameters • Conv weight memory scaling • Momentum buffer allocation • Convergence vs memory tradeoffs |
| 9 | DataLoader | • Dataset abstraction class • CIFAR-10 image data loader • Batch sampling for CNNs • Image preprocessing pipeline | • I/O bottlenecks for image data • Memory vs disk tradeoffs • Image batch size impact on throughput • Data pipeline optimization for vision |
| 10 | Training | • CNN training loops • CrossEntropy loss for classification • Validation on CIFAR-10 • Model checkpointing | • CNN memory during training (conv + BatchNorm) • Image batch gradient accumulation • Model checkpoint disk I/O • CIFAR-10 training memory profiling |
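As a flavor of what Module 8 builds, here is a hedged sketch of a single SGD-with-momentum step; the function name and list-based API are hypothetical:

```python
import numpy as np

def sgd_momentum_step(params, grads, buffers, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update (hypothetical API, in the spirit of Module 8)."""
    for p, g, v in zip(params, grads, buffers):
        v *= momentum      # decay the running velocity in place
        v += g             # fold in the current gradient
        p -= lr * v        # in-place parameter update, no extra allocation

# Each parameter needs a same-shape velocity buffer -- the momentum memory
# overhead the table calls out (Adam adds a second moment buffer on top).
W = np.random.randn(784, 128)
gW = np.random.randn(784, 128)   # stand-in gradient
vW = np.zeros_like(W)            # momentum buffer: +1x parameter memory
sgd_momentum_step([W], [gW], [vW])
```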
**🎉 Milestone: CNN Training Unlocked**
- Students train CNNs on CIFAR-10 to 75% accuracy
- Learning Validation: "I understand how modern ML training works under the hood"
### Phase 3: Language & Advanced Architectures (Modules 11-15)
*Specialize → Apply: Language models and advanced techniques*
| Module | Name | What Students Build | Systems Engineering Concepts |
|---|---|---|---|
| 11 | Tokenization | • Character tokenizer • BPE tokenizer basics • Vocabulary management • Padding and truncation | • Memory efficiency of token representations • Vocabulary size vs model size tradeoffs • Tokenization throughput optimization • String processing performance |
| 12 | Embeddings | • Embedding layer implementation • Positional encodings • Learned vs fixed embeddings • Embedding initialization | • Embedding table memory (vocab_size × dim) • Sparse vs dense lookup operations • Cache locality in embedding lookups • Memory scaling with vocabulary size |
| 13 | Attention | • Scaled dot-product attention • Multi-head attention • Causal masking • KV-cache implementation | • Quadratic memory scaling O(N²) • Attention memory bottlenecks • KV-cache memory savings • Sequence length vs memory tradeoffs |
| 14 | Transformers | • LayerNorm for transformers • Transformer block • Complete TinyGPT architecture • Residual connections | • LayerNorm vs BatchNorm differences • Layer memory accumulation • Activation memory per transformer layer • Residual path gradient flow |
| 15 | Generation | • Autoregressive text generation • Sampling strategies • Temperature and top-k • Complete TinyGPT training | • Autoregressive generation memory • KV-cache efficiency during generation • Sampling algorithm performance • Training vs inference memory patterns |
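The centerpiece of Module 13 is scaled dot-product attention. A minimal NumPy sketch (illustrative names, single head) makes the O(N²) score matrix visible:

```python
import numpy as np

def attention(q, k, v, causal=True):
    """Scaled dot-product attention sketch. q, k, v: (seq_len, d) arrays."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)   # (seq_len, seq_len): the O(N^2) memory term
    if causal:
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
        scores[future] = -np.inf    # mask out attention to future tokens
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

q = k = v = np.random.randn(16, 64)
print(attention(q, k, v).shape)  # (16, 64)
```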
**🎉 Grand Finale: Complete TinyGPT Language Model**
- Students build a working transformer for text generation
- Learning Validation: "I built a unified framework supporting both vision and language"
## 🚀 Advanced Track: 5 Optional Modules (Production-Level Systems)
*For students wanting deeper production ML systems expertise*
### Systems Optimization Specialization (Modules 16-20)
| Module | Name | What Students Build | Systems Engineering Concepts |
|---|---|---|---|
| 16 | Profiling | • Performance measurement tools • Memory usage profilers • Bottleneck identification • System analysis frameworks | • Systematic performance analysis • Memory vs compute profiling • Scaling behavior measurement • Performance regression detection |
| 17 | Kernels | • Optimized matrix multiplication • Vectorized activations with NumPy • Fused operations (relu+add) • Parallel processing optimization | • Memory bandwidth optimization • Kernel fusion benefits (2-5× speedup) • Cache-friendly algorithms • Vectorization techniques |
| 18 | Compression | • Weight pruning algorithms • Basic quantization (INT16) • Knowledge distillation • Model size reduction tools | • 2× memory reduction from INT16 (4× with INT8) • Structured vs unstructured sparsity • Distillation training loops • Accuracy vs size tradeoffs |
| 19 | KV-Cache | • Simple KV-cache for attention • Cache hit/miss optimization • Memory-efficient attention • Sequence length optimization | • Memory vs computation tradeoffs • Cache-aware attention algorithms • O(N²) → O(N) optimization • Memory allocation strategies |
| 20 | Competition | • Apply ALL optimizations (16-19) • Multi-objective optimization • Leaderboard submission system • Competition across multiple metrics | • Real-world constraint optimization • Multi-metric evaluation • Production-ready systems thinking • Competitive optimization |
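To see the Module 17 fusion idea in plain NumPy (a rough approximation only: true kernel fusion happens inside compiled kernels, and speedups vary by hardware), compare a naive relu+add against a buffer-reusing version:

```python
import time
import numpy as np

def relu_add(x, y):
    # Naive version: np.maximum allocates a temporary, `+` allocates another.
    return np.maximum(x, 0) + y

def relu_add_fused(x, y, out):
    # "Fused" in spirit: reuse one preallocated buffer, then add in place.
    np.maximum(x, 0, out=out)
    out += y
    return out

x, y = np.random.randn(4096, 4096), np.random.randn(4096, 4096)
out = np.empty_like(x)
for name, fn in [("naive", lambda: relu_add(x, y)),
                 ("fused", lambda: relu_add_fused(x, y, out))]:
    start = time.perf_counter()
    fn()
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```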
**🏆 Ultimate Achievement: Production-Optimized ML Systems**
- Students optimize their framework for speed, memory, and deployment
- Learning Validation: "I understand how to build production-ready ML systems"
## 🎯 Learning Progression & Validation
### Module Progression Logic
```
Foundation (1-5):      "Can I build the math?"               [Build → Use]
          ↓
Training (6-10):       "Can I learn from data?"              [Learn → Optimize]
          ↓
Architectures (11-15): "Can I handle multiple modalities?"   [Specialize → Apply]
          ↓
Systems (16-20):       "Can I optimize for production?"      [Measure → Optimize]
```
### Achievement Milestones
- Module 5: Run inference on pretrained models (Foundation complete)
- Module 10: Train CNNs on CIFAR-10 to 75% accuracy (Vision training complete)
- Module 15: Generate text with TinyGPT (Language architectures complete)
- Module 20: Optimize framework for production constraints (Systems mastery)
### Systems Engineering Thread Throughout
Every module teaches both algorithms AND systems:
- Memory usage patterns: How operations scale with input size
- Computational complexity: O(N), O(N²), O(N³) analysis (see the timing probe after this list)
- Performance bottlenecks: Where systems break under load
- Production implications: How real frameworks handle these challenges
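The complexity thread can be made empirical with a few lines of NumPy; timings below are illustrative, and BLAS threading blurs the constant factors:

```python
import time
import numpy as np

# Doubling N should roughly 8x the dense-matmul time if the O(N^3) term dominates.
for n in (256, 512, 1024, 2048):
    a = np.random.randn(n, n)
    start = time.perf_counter()
    a @ a
    print(f"N={n:4d}: {time.perf_counter() - start:.4f}s")
```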
## 📊 Time Estimates & Scope
### Core Curriculum (15 modules)
- Time: 4-6 hours per module = 60-90 hours total
- Semester fit: 15 weeks = 4-6 hours/week (realistic)
- Outcome: Complete ML systems engineer
### Advanced Track (5 modules)
- Time: 3-4 hours per module = 15-20 hours additional
- Audience: Motivated students wanting production skills
- Outcome: Production-ready optimization expertise
### Total Program
- Core only: Complete foundation in 15 weeks
- With advanced: Production expertise in 20 weeks
- Flexibility: Natural stopping point at Module 15
## 🔄 Continuous Systems Focus
### Every Module Includes:
- Memory Analysis: Explicit memory profiling and optimization (see the sketch after this list)
- Performance Measurement: Timing and complexity analysis
- Scaling Behavior: How does this break with larger inputs?
- Production Context: How do real systems (PyTorch/TensorFlow) handle this?
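For example, the kind of back-of-envelope memory analysis each module encourages (shapes here are hypothetical):

```python
import numpy as np

# One batch of hypothetical CIFAR-10-scale feature maps, float32, NCHW layout:
# 128 * 64 * 32 * 32 * 4 bytes = 32 MiB for a single layer's activations.
activations = np.zeros((128, 64, 32, 32), dtype=np.float32)
print(f"{activations.nbytes / 2**20:.1f} MiB")  # 32.0 MiB
```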
### Cumulative Systems Knowledge
- Modules 1-5: Memory-efficient operations
- Modules 6-10: Training memory management
- Modules 11-15: Attention memory scaling
- Modules 16-20: Production optimization techniques
## 🎯 Success Metrics
### Student Capabilities After Core (15 modules):
- "I can build any neural network architecture from scratch"
- "I understand memory and performance implications of my code"
- "I can train models on real datasets like CIFAR-10"
- "I can extend my framework to new modalities (vision → language)"
### Student Capabilities After Advanced (20 modules):
- "I can optimize ML systems for production constraints"
- "I understand the engineering tradeoffs in real ML frameworks"
- "I can measure, profile, and systematically improve performance"
- "I can compete in optimization challenges using my own code"
## 🚀 Why This Approach Works
### Learning Through Building
Students don't just study ML algorithms; they build the infrastructure that makes modern AI possible.
### Systems Engineering Focus
Every concept is taught through the lens of memory, performance, and scaling: the core of ML systems engineering.
### Progressive Complexity
Clear progression from basic math operations to production-optimized transformers.
### Immediate Validation
Students can run inference, train models, and generate text using code they built themselves.
### Industry Relevance
Skills transfer directly to understanding PyTorch, TensorFlow, and production ML systems.
**🎉 Final Achievement:** Students build a complete, optimized ML framework from scratch and understand every line of it, along with the engineering behind modern AI systems.