🚀 TinyTorch Final Module Plan: 17 Modules to ML Systems Mastery

Overview: Four Learning Phases

Phase 1: Foundation (Modules 1-5) → Unlock Inference Examples
Phase 2: Training & Vision (Modules 6-10) → Unlock CNN Training
Phase 3: Language (Modules 11-14) → Unlock TinyGPT
Phase 4: Systems Optimization (Modules 15-17) → Unlock Competition


📚 Phase 1: Foundation - "Look What You Can Already Do!"

Module 01: Setup

What Students Build:

  • Virtual environment configuration
  • Rich CLI for beautiful progress tracking
  • Testing infrastructure
  • Development tools (debugger, profiler stubs)

Systems Concepts:

  • Development environment best practices
  • Dependency management
  • Testing frameworks

Module 02: Tensor

What Students Build:

  • N-dimensional array class
  • Broadcasting operations
  • Memory-efficient views and slicing
  • Basic math operations (+, -, *, /)

Systems Concepts:

  • Memory layout (row-major vs column-major)
  • Cache efficiency
  • Vectorization opportunities
  • O(1) vs O(N) operations
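
To make the memory and broadcasting ideas concrete, here is a minimal sketch of what such a tensor wrapper could look like. The class and attribute names are illustrative, not the actual TinyTorch API:

```python
import numpy as np

class Tensor:
    """Minimal N-D array wrapper backed by a row-major NumPy buffer."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)

    @property
    def shape(self):
        return self.data.shape

    def __add__(self, other):
        # NumPy broadcasting aligns shapes, e.g. (3, 4) + (4,) -> (3, 4)
        other = other.data if isinstance(other, Tensor) else other
        return Tensor(self.data + other)

    def __getitem__(self, idx):
        # Slicing returns a view of the same buffer: O(1), no copy
        return Tensor(self.data[idx])

x = Tensor(np.ones((3, 4)))
y = x + Tensor(np.arange(4))   # broadcast add
print(y.shape)                  # (3, 4)
```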

Module 03: Activations

What Students Build:

  • ReLU, Sigmoid, Tanh, Softmax
  • Backward pass for each activation
  • Numerical stability (LogSoftmax)

Systems Concepts:

  • Numerical stability (overflow/underflow)
  • Computational complexity per activation
  • Memory requirements (in-place vs copy)
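
A sketch of the numerical-stability idea, assuming plain NumPy arrays; subtracting the row maximum before exponentiating is the standard trick that keeps exp() from overflowing:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_backward(grad_out, x):
    # Gradient flows only where the forward input was positive
    return grad_out * (x > 0)

def log_softmax(x, axis=-1):
    # Shift by the max so the largest exponent is exp(0) = 1
    shifted = x - np.max(x, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

logits = np.array([1000.0, 1001.0, 1002.0])
print(log_softmax(logits))  # finite values, no overflow
```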

Module 04: Layers

What Students Build:

  • Module base class
  • Parameter management
  • Forward/backward protocol
  • Layer composition patterns

Systems Concepts:

  • Object-oriented design for ML
  • Memory management for parameters
  • Modular architecture benefits
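
A minimal sketch of the base-class pattern, with illustrative names (`Parameter`, `Module`); the real module likely adds more bookkeeping:

```python
import numpy as np

class Parameter:
    """A trainable array plus a slot for its gradient."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)

class Module:
    """Base class: collects Parameters recursively and defines
    the forward protocol that every layer implements."""
    def parameters(self):
        params = []
        for value in vars(self).values():
            if isinstance(value, Parameter):
                params.append(value)
            elif isinstance(value, Module):   # recurse into submodules
                params.extend(value.parameters())
        return params

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        raise NotImplementedError
```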

Module 05: Networks (Dense)

What Students Build:

  • Linear/Dense layer
  • Sequential container
  • Basic neural network class
  • Weight initialization

Systems Concepts:

  • Matrix multiplication complexity: O(N²) per sample for an N×N weight matrix, O(N³) for an N-sample batch (see the sketch below)
  • Parameter memory scaling
  • Why initialization matters
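
A compact sketch of a dense layer plus a Sequential container, using He initialization as one reasonable choice (it keeps activation variance roughly constant across layers). This version omits the Module/Parameter machinery for brevity:

```python
import numpy as np

class Linear:
    """Dense layer: y = x @ W + b."""
    def __init__(self, in_features, out_features):
        scale = np.sqrt(2.0 / in_features)   # He initialization
        self.W = np.random.randn(in_features, out_features).astype(np.float32) * scale
        self.b = np.zeros(out_features, dtype=np.float32)

    def __call__(self, x):
        return x @ self.W + self.b

class Sequential:
    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

net = Sequential(Linear(784, 128), Linear(128, 10))
out = net(np.random.randn(32, 784).astype(np.float32))  # (32, 10)
```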

🎉 UNLOCK: Inference Examples!

  • Run pretrained XOR network
  • Run pretrained MNIST classifier
  • Run pretrained CIFAR-10 CNN
  • Students see their code actually works!

📚 Phase 2: Training & Vision - "Now Train Your Own!"

Module 06: DataLoader

What Students Build:

  • Dataset abstraction
  • Batch sampling
  • Shuffling and iteration
  • CIFAR-10 loader

Systems Concepts:

  • I/O bottlenecks
  • Memory vs disk tradeoffs
  • Prefetching and pipelining
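
A minimal batching-and-shuffling sketch over an in-memory dataset; a real loader adds disk I/O and prefetching on top of this iteration pattern:

```python
import numpy as np

class DataLoader:
    """Yields shuffled (X, y) minibatches from in-memory arrays."""
    def __init__(self, X, y, batch_size=64, shuffle=True):
        self.X, self.y = X, y
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        idx = np.arange(len(self.X))
        if self.shuffle:
            np.random.shuffle(idx)   # fresh order every epoch
        for start in range(0, len(idx), self.batch_size):
            batch = idx[start:start + self.batch_size]
            yield self.X[batch], self.y[batch]
```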

Module 07: Autograd

What Students Build:

  • Computational graph
  • Automatic differentiation
  • Gradient accumulation
  • Backward pass automation

Systems Concepts:

  • Graph memory consumption
  • Forward vs reverse mode AD
  • Gradient checkpointing concepts
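
A scalar, micrograd-style sketch of reverse-mode autodiff. The real module generalizes this to tensors, but the structure is the same: record the graph during the forward pass, then apply the chain rule in reverse topological order (note the `+=` that implements gradient accumulation):

```python
class Value:
    """Scalar reverse-mode autodiff node."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(xy)/dx = y
            other.grad += self.data * out.grad   # d(xy)/dy = x
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):                 # topological sort of the graph
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):      # chain rule, outputs to inputs
            v._backward()

x, y = Value(3.0), Value(4.0)
z = x * y
z.backward()
print(x.grad, y.grad)  # 4.0 3.0
```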

Module 08: Optimizers

What Students Build:

  • SGD with momentum
  • Adam optimizer
  • Learning rate scheduling
  • Gradient clipping

Systems Concepts:

  • Memory usage (Adam keeps two moment buffers per parameter, so ~3× parameter memory!)
  • Convergence rates
  • Numerical stability in updates
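
An Adam sketch that makes the memory cost visible: the `m` and `v` buffers each mirror the parameters, hence roughly 3× parameter memory. The interface is illustrative, assuming Parameter-like objects with `.data` and `.grad`:

```python
import numpy as np

class Adam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = params
        self.lr, (self.b1, self.b2), self.eps = lr, betas, eps
        self.m = [np.zeros_like(p.data) for p in params]  # 1st-moment buffers
        self.v = [np.zeros_like(p.data) for p in params]  # 2nd-moment buffers
        self.t = 0

    def step(self):
        self.t += 1
        for p, m, v in zip(self.params, self.m, self.v):
            m[:] = self.b1 * m + (1 - self.b1) * p.grad
            v[:] = self.b2 * v + (1 - self.b2) * p.grad ** 2
            m_hat = m / (1 - self.b1 ** self.t)   # bias correction
            v_hat = v / (1 - self.b2 ** self.t)
            p.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```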

Module 09: Training

What Students Build:

  • Training loop
  • Loss functions (MSE, CrossEntropy)
  • Validation and metrics
  • Checkpointing

Systems Concepts:

  • Memory during training
  • Gradient accumulation for large batches
  • Disk I/O for checkpoints
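
A numerically stable cross-entropy sketch in NumPy; the training loop itself would wire this together with the autograd and optimizer modules built earlier:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood over a batch (stable log-softmax)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5, -1.0]])
print(cross_entropy(logits, np.array([0])))  # small: class 0 is favored
```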

Module 10: Spatial (CNN)

What Students Build:

  • Conv2d layer
  • Pooling operations
  • CNN architectures
  • Image augmentation

Systems Concepts:

  • Convolution complexity O(N²K²C²) for an N×N feature map, K×K kernel, and C input/output channels (made explicit in the loop sketch below)
  • Memory footprint of feature maps
  • Cache-friendly implementations
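
A deliberately naive convolution sketch that makes the complexity visible; vectorized or im2col implementations trade extra memory for much better cache behavior and speed:

```python
import numpy as np

def conv2d_naive(x, w):
    """Valid convolution. x: (C_in, H, W), w: (C_out, C_in, K, K).
    Three explicit loops plus a vectorized sum over the C_in*K*K window
    make the O(H*W*K^2*C_in*C_out) cost explicit."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1), dtype=x.dtype)
    for co in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[co, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[co])
    return out

x = np.random.randn(3, 8, 8)      # 3 channels, 8x8 image
w = np.random.randn(4, 3, 3, 3)   # 4 output channels, 3x3 kernels
print(conv2d_naive(x, w).shape)   # (4, 6, 6)
```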

🎉 UNLOCK: CNN Training!

  • Train CNN on CIFAR-10
  • Achieve 75% accuracy milestone
  • Visualize learned features

📚 Phase 3: Language - "From Vision to Language!"

Module 11: Tokenization

What Students Build:

  • Character tokenizer
  • BPE tokenizer basics
  • Vocabulary management
  • Padding and truncation

Systems Concepts:

  • Memory efficiency of token representations
  • Vocabulary size tradeoffs
  • Tokenization speed considerations
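
A character-tokenizer sketch; BPE follows the same encode/decode interface but with a learned merge table instead of single characters:

```python
class CharTokenizer:
    """Character-level tokenizer: vocabulary is every unique char."""
    def __init__(self, text):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, s):
        return [self.stoi[ch] for ch in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(ids, tok.decode(ids))
```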

Module 12: Embeddings

What Students Build:

  • Embedding layer
  • Positional encodings
  • Learned vs fixed embeddings
  • Embedding initialization

Systems Concepts:

  • Embedding table memory (vocab_size × dim)
  • Sparse vs dense operations
  • Cache locality in lookups
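
A sketch of embedding lookup and fixed sinusoidal positions (the memory comment assumes FP32); lookup is just fancy indexing into the table:

```python
import numpy as np

vocab_size, dim = 1000, 64
table = np.random.randn(vocab_size, dim).astype(np.float32) * 0.02
# Memory: vocab_size * dim * 4 bytes = 256 KB here; GPT-2's 50257 x 768
# table is ~154 MB in FP32.
token_ids = np.array([5, 17, 42])
embedded = table[token_ids]          # (3, 64): one row per token

def sinusoidal_positions(seq_len, dim):
    """Fixed positional encodings, original-Transformer style."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    enc = np.zeros((seq_len, dim), dtype=np.float32)
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

x = embedded + sinusoidal_positions(3, dim)  # token + position info
```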

Module 13: Attention

What Students Build:

  • Scaled dot-product attention
  • Multi-head attention
  • Causal masking
  • KV-cache basics

Systems Concepts:

  • O(N²) attention complexity
  • Memory bottlenecks in attention
  • Why KV-cache matters
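
A single-head causal attention sketch; the (N, N) score matrix is exactly where the quadratic time and memory cost lives:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.
    Q, K, V: (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (N, N) matrix
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)              # block future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V

N, d = 8, 16
out = causal_attention(*(np.random.randn(N, d) for _ in range(3)))
print(out.shape)  # (8, 16)
```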

Module 14: Transformers

What Students Build:

  • LayerNorm
  • Transformer block
  • Full GPT architecture
  • Residual connections

Systems Concepts:

  • Layer normalization stability
  • Residual path gradient flow
  • Transformer memory scaling
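
A sketch of LayerNorm and a pre-norm block; `attn` and `mlp` stand in for the sub-layers built in earlier modules, and the residual additions are what give gradients a direct path through deep stacks:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features to zero mean / unit variance,
    then rescale; keeps activations in a stable range at any depth."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def transformer_block(x, attn, mlp, ln_params):
    """Pre-norm block: normalize, transform, add back the residual."""
    (g1, b1), (g2, b2) = ln_params
    x = x + attn(layer_norm(x, g1, b1))   # attention sub-layer + residual
    x = x + mlp(layer_norm(x, g2, b2))    # feed-forward sub-layer + residual
    return x
```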

🎉 UNLOCK: TinyGPT!

  • Train character-level language model
  • Generate text
  • Compare with vision models

🔥 Phase 4: Systems Optimization - "Make It Fast, Make It Small!"

Module 15: Kernels

What Students Build:

  • Fused operations (e.g., fused_relu_add)
  • Matrix multiplication optimization
  • Custom CUDA-like kernels (in NumPy)
  • Operator fusion patterns

Why Universal:

  • Works for MLPs, CNNs, and Transformers
  • Reduces memory bandwidth usage
  • Speeds up any model architecture

Systems Concepts:

  • Memory bandwidth vs compute bound
  • Kernel fusion benefits
  • Cache optimization
  • Vectorization with NumPy

Performance Gains:

  • 2-5× speedup from fusion
  • Memory bandwidth reduction
  • Works on CPU (NumPy vectorization)
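
A NumPy approximation of fusion: true kernel fusion needs a compiler or a hand-written kernel, but eliminating temporaries already demonstrates the memory-traffic effect this module targets:

```python
import numpy as np

def relu_add_unfused(x, y):
    tmp = np.maximum(x, 0.0)   # pass 1: read x, allocate and write tmp
    return tmp + y             # pass 2: read tmp and y, write the result

def relu_add_fused(x, y):
    # One buffer instead of two: compute max into `out`, then add in place.
    # Drops the intermediate allocation and one full read/write of the array.
    out = np.maximum(x, 0.0)
    out += y
    return out
```

On large arrays the in-place version is typically measurably faster, because this op is memory-bandwidth bound rather than compute bound.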

Module 16: Compression

What Students Build:

  • Quantization (INT8, INT4)
  • Pruning (magnitude, structured)
  • Knowledge distillation setup
  • Model size reduction

Why Universal:

  • Quantize any model (MLP/CNN/GPT)
  • Prune any architecture
  • Distill large to small

Systems Concepts:

  • Precision vs accuracy tradeoffs
  • Structured vs unstructured sparsity
  • Compression ratios
  • Inference speedup from quantization

Performance Gains:

  • 4× size reduction (FP32 → INT8)
  • 2× inference speedup
  • 90% sparsity possible
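
A symmetric per-tensor INT8 quantization sketch showing the 4× size reduction (assumes a nonzero weight tensor):

```python
import numpy as np

def quantize_int8(w):
    """Map FP32 weights to INT8 with one scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()
print(q.nbytes / w.nbytes, err)   # 0.25 (4x smaller), small max error
```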

Module 17: Competition - "The Grand Finale!"

What Students Build:

  • KV-cache for transformers
  • Dynamic batching
  • Mixed precision training
  • Model ensemble techniques
  • All optimizations combined!

Competition Elements:

  • Leaderboard: Real-time ranking
  • Metrics: Accuracy, speed, model size
  • Constraints: Max 10MB model, <100ms inference
  • Tasks: CIFAR-10, MNIST, TinyGPT generation

Systems Concepts:

  • KV-cache memory management
  • Batch size vs latency tradeoffs
  • Optimization stacking
  • Production deployment considerations
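
A minimal KV-cache sketch: during generation, append each new token's keys and values instead of recomputing them for the whole prefix at every step. Names are illustrative:

```python
import numpy as np

class KVCache:
    """Preallocated per-layer cache of keys and values."""
    def __init__(self, max_len, d_k):
        self.K = np.zeros((max_len, d_k), dtype=np.float32)
        self.V = np.zeros((max_len, d_k), dtype=np.float32)
        self.len = 0

    def append(self, k_new, v_new):
        self.K[self.len] = k_new
        self.V[self.len] = v_new
        self.len += 1
        # Return views over the filled prefix: no copy, no recomputation
        return self.K[:self.len], self.V[:self.len]
```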

🏆 GRAND FINALE:

  • Students submit optimized models
  • Automatic evaluation on hidden test set
  • Leaderboard shows:
    • Accuracy scores
    • Inference time
    • Model size
    • Memory usage
  • Winners announced for:
    • Best accuracy
    • Fastest inference
    • Smallest model
    • Best accuracy/size ratio

🎯 Why This Structure Works

Progressive Unlocking

  1. Modules 1-5: Build foundation → Unlock inference (immediate gratification)
  2. Modules 6-10: Add training → Unlock CNN training (real achievement)
  3. Modules 11-14: Add language → Unlock TinyGPT (wow factor)
  4. Modules 15-17: Optimize everything → Competition (epic finale)

Universal Optimizations (Modules 15-17)

  • Not architecture-specific
  • Work on MLPs, CNNs, and Transformers
  • Real production techniques
  • Measurable improvements

Competition as Culmination

  • Uses EVERYTHING students built
  • Competitive element drives engagement
  • Multiple winning categories (not just accuracy)
  • Shows real ML engineering tradeoffs
  • Students optimize their own code!

High Note Ending

  • Module 15: "Make it fast!" (kernels)
  • Module 16: "Make it small!" (compression)
  • Module 17: "Make it production-ready!" (competition)
  • Final message: "You built a complete ML framework and optimized it for production!"

📊 Module Complexity Progression

Complexity:  ▁▂▃▃▄▄▅▅▆▆▇▇█████
Modules:     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
             └─Found.─┘└Training┘└─Language─┘└Systems┘
Unlocks:             ↑         ↑           ↑        ↑
                 Inference    CNN      TinyGPT Competition

🏁 Student Journey Summary

Week 1-2: Foundation (Modules 1-5)

  • "I built tensors and layers!"
  • "I can run pretrained models!"

Week 3-4: Training (Modules 6-10)

  • "I built autograd from scratch!"
  • "I trained a CNN to 75% accuracy!"

Week 5-6: Language (Modules 11-14)

  • "I built attention mechanisms!"
  • "I have a working GPT!"

Week 7: Systems (Modules 15-17)

  • "I optimized everything!"
  • "I'm on the leaderboard!"
  • "I built a complete, optimized ML framework!"

Final Achievement: "I didn't just learn ML algorithms - I built the entire infrastructure, optimized it for production, and competed against my peers. I understand ML systems engineering!"