🚀 TinyTorch Final Module Plan: 17 Modules to ML Systems Mastery

Overview: Four Learning Phases

Phase 1: Foundation (Modules 1-5) → Unlock Inference Examples
Phase 2: Training & Vision (Modules 6-10) → Unlock CNN Training
Phase 3: Language (Modules 11-14) → Unlock TinyGPT
Phase 4: Systems Optimization (Modules 15-17) → Unlock Competition


📚 Phase 1: Foundation - "Look What You Can Already Do!"

Module 01: Setup

What Students Build:

  • Virtual environment configuration
  • Rich CLI for beautiful progress tracking
  • Testing infrastructure
  • Development tools (debugger, profiler stubs)

Systems Concepts:

  • Development environment best practices
  • Dependency management
  • Testing frameworks

Module 02: Tensor

What Students Build:

  • N-dimensional array class
  • Broadcasting operations
  • Memory-efficient views and slicing
  • Basic math operations (+, -, *, /)

Systems Concepts:

  • Memory layout (row-major vs column-major)
  • Cache efficiency
  • Vectorization opportunities
  • O(1) vs O(N) operations
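
To make the memory and broadcasting ideas concrete, here is a minimal sketch of what such a tensor wrapper could look like. The class and attribute names are illustrative, not the actual TinyTorch API:

```python
import numpy as np

class Tensor:
    """Minimal N-D array wrapper backed by a row-major NumPy buffer."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)

    @property
    def shape(self):
        return self.data.shape

    def __add__(self, other):
        # NumPy broadcasting aligns shapes, e.g. (3, 4) + (4,) -> (3, 4)
        other = other.data if isinstance(other, Tensor) else other
        return Tensor(self.data + other)

    def __getitem__(self, idx):
        # Slicing returns a view of the same buffer: O(1), no copy
        return Tensor(self.data[idx])

x = Tensor(np.ones((3, 4)))
y = x + Tensor(np.arange(4))   # broadcast add
print(y.shape)                  # (3, 4)
```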

Module 03: Activations

What Students Build:

  • ReLU, Sigmoid, Tanh, Softmax
  • Backward pass for each activation
  • Numerical stability (LogSoftmax)

Systems Concepts:

  • Numerical stability (overflow/underflow)
  • Computational complexity per activation
  • Memory requirements (in-place vs copy)
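
A sketch of the numerical-stability idea, assuming plain NumPy arrays; subtracting the row maximum before exponentiating is the standard trick that keeps exp() from overflowing:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_backward(grad_out, x):
    # Gradient flows only where the forward input was positive
    return grad_out * (x > 0)

def log_softmax(x, axis=-1):
    # Shift by the max so the largest exponent is exp(0) = 1
    shifted = x - np.max(x, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

logits = np.array([1000.0, 1001.0, 1002.0])
print(log_softmax(logits))  # finite values, no overflow
```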

Module 04: Layers

What Students Build:

  • Module base class
  • Parameter management
  • Forward/backward protocol
  • Layer composition patterns

Systems Concepts:

  • Object-oriented design for ML
  • Memory management for parameters
  • Modular architecture benefits
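
A minimal sketch of the base-class pattern, with illustrative names (`Parameter`, `Module`); the real module likely adds more bookkeeping:

```python
import numpy as np

class Parameter:
    """A trainable array plus a slot for its gradient."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)

class Module:
    """Base class: collects Parameters recursively and defines
    the forward protocol that every layer implements."""
    def parameters(self):
        params = []
        for value in vars(self).values():
            if isinstance(value, Parameter):
                params.append(value)
            elif isinstance(value, Module):   # recurse into submodules
                params.extend(value.parameters())
        return params

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        raise NotImplementedError
```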

Module 05: Networks (Dense)

What Students Build:

  • Linear/Dense layer
  • Sequential container
  • Basic neural network class
  • Weight initialization

Systems Concepts:

  • Matrix multiplication complexity: O(N²) per sample for an N×N weight matrix, O(N³) for an N-sample batch (see the sketch below)
  • Parameter memory scaling
  • Why initialization matters
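
A compact sketch of a dense layer plus a Sequential container, using He initialization as one reasonable choice (it keeps activation variance roughly constant across layers). This version omits the Module/Parameter machinery for brevity:

```python
import numpy as np

class Linear:
    """Dense layer: y = x @ W + b."""
    def __init__(self, in_features, out_features):
        scale = np.sqrt(2.0 / in_features)   # He initialization
        self.W = np.random.randn(in_features, out_features).astype(np.float32) * scale
        self.b = np.zeros(out_features, dtype=np.float32)

    def __call__(self, x):
        return x @ self.W + self.b

class Sequential:
    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

net = Sequential(Linear(784, 128), Linear(128, 10))
out = net(np.random.randn(32, 784).astype(np.float32))  # (32, 10)
```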

🎉 UNLOCK: Inference Examples!

  • Run pretrained XOR network
  • Run pretrained MNIST classifier
  • Run pretrained CIFAR-10 CNN
  • Students see their code actually works!

📚 Phase 2: Training & Vision - "Now Train Your Own!"

Module 06: DataLoader

What Students Build:

  • Dataset abstraction
  • Batch sampling
  • Shuffling and iteration
  • CIFAR-10 loader

Systems Concepts:

  • I/O bottlenecks
  • Memory vs disk tradeoffs
  • Prefetching and pipelining
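
A minimal batching-and-shuffling sketch over an in-memory dataset; a real loader adds disk I/O and prefetching on top of this iteration pattern:

```python
import numpy as np

class DataLoader:
    """Yields shuffled (X, y) minibatches from in-memory arrays."""
    def __init__(self, X, y, batch_size=64, shuffle=True):
        self.X, self.y = X, y
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        idx = np.arange(len(self.X))
        if self.shuffle:
            np.random.shuffle(idx)   # fresh order every epoch
        for start in range(0, len(idx), self.batch_size):
            batch = idx[start:start + self.batch_size]
            yield self.X[batch], self.y[batch]
```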

Module 07: Autograd

What Students Build:

  • Computational graph
  • Automatic differentiation
  • Gradient accumulation
  • Backward pass automation

Systems Concepts:

  • Graph memory consumption
  • Forward vs reverse mode AD
  • Gradient checkpointing concepts
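
A scalar, micrograd-style sketch of reverse-mode autodiff. The real module generalizes this to tensors, but the structure is the same: record the graph during the forward pass, then apply the chain rule in reverse topological order (note the `+=` that implements gradient accumulation):

```python
class Value:
    """Scalar reverse-mode autodiff node."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(xy)/dx = y
            other.grad += self.data * out.grad   # d(xy)/dy = x
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):                 # topological sort of the graph
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):      # chain rule, outputs to inputs
            v._backward()

x, y = Value(3.0), Value(4.0)
z = x * y
z.backward()
print(x.grad, y.grad)  # 4.0 3.0
```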

Module 08: Optimizers

What Students Build:

  • SGD with momentum
  • Adam optimizer
  • Learning rate scheduling
  • Gradient clipping

Systems Concepts:

  • Memory usage (Adam keeps two moment buffers per parameter, so ~3× parameter memory!)
  • Convergence rates
  • Numerical stability in updates
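
An Adam sketch that makes the memory cost visible: the `m` and `v` buffers each mirror the parameters, hence roughly 3× parameter memory. The interface is illustrative, assuming Parameter-like objects with `.data` and `.grad`:

```python
import numpy as np

class Adam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = params
        self.lr, (self.b1, self.b2), self.eps = lr, betas, eps
        self.m = [np.zeros_like(p.data) for p in params]  # 1st-moment buffers
        self.v = [np.zeros_like(p.data) for p in params]  # 2nd-moment buffers
        self.t = 0

    def step(self):
        self.t += 1
        for p, m, v in zip(self.params, self.m, self.v):
            m[:] = self.b1 * m + (1 - self.b1) * p.grad
            v[:] = self.b2 * v + (1 - self.b2) * p.grad ** 2
            m_hat = m / (1 - self.b1 ** self.t)   # bias correction
            v_hat = v / (1 - self.b2 ** self.t)
            p.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```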

Module 09: Training

What Students Build:

  • Training loop
  • Loss functions (MSE, CrossEntropy)
  • Validation and metrics
  • Checkpointing

Systems Concepts:

  • Memory during training
  • Gradient accumulation for large batches
  • Disk I/O for checkpoints
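
A numerically stable cross-entropy sketch in NumPy; the training loop itself would wire this together with the autograd and optimizer modules built earlier:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood over a batch (stable log-softmax)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5, -1.0]])
print(cross_entropy(logits, np.array([0])))  # small: class 0 is favored
```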

Module 10: Spatial (CNN)

What Students Build:

  • Conv2d layer
  • Pooling operations
  • CNN architectures
  • Image augmentation

Systems Concepts:

  • Convolution complexity O(N²K²C²) for an N×N feature map, K×K kernel, and C input/output channels (made explicit in the loop sketch below)
  • Memory footprint of feature maps
  • Cache-friendly implementations
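
A deliberately naive convolution sketch that makes the complexity visible; vectorized or im2col implementations trade extra memory for much better cache behavior and speed:

```python
import numpy as np

def conv2d_naive(x, w):
    """Valid convolution. x: (C_in, H, W), w: (C_out, C_in, K, K).
    Three explicit loops plus a vectorized sum over the C_in*K*K window
    make the O(H*W*K^2*C_in*C_out) cost explicit."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1), dtype=x.dtype)
    for co in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[co, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[co])
    return out

x = np.random.randn(3, 8, 8)      # 3 channels, 8x8 image
w = np.random.randn(4, 3, 3, 3)   # 4 output channels, 3x3 kernels
print(conv2d_naive(x, w).shape)   # (4, 6, 6)
```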

🎉 UNLOCK: CNN Training!

  • Train CNN on CIFAR-10
  • Achieve 75% accuracy milestone
  • Visualize learned features

📚 Phase 3: Language - "From Vision to Language!"

Module 11: Tokenization

What Students Build:

  • Character tokenizer
  • BPE tokenizer basics
  • Vocabulary management
  • Padding and truncation

Systems Concepts:

  • Memory efficiency of token representations
  • Vocabulary size tradeoffs
  • Tokenization speed considerations
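
A character-tokenizer sketch; BPE follows the same encode/decode interface but with a learned merge table instead of single characters:

```python
class CharTokenizer:
    """Character-level tokenizer: vocabulary is every unique char."""
    def __init__(self, text):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, s):
        return [self.stoi[ch] for ch in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(ids, tok.decode(ids))
```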

Module 12: Embeddings

What Students Build:

  • Embedding layer
  • Positional encodings
  • Learned vs fixed embeddings
  • Embedding initialization

Systems Concepts:

  • Embedding table memory (vocab_size × dim)
  • Sparse vs dense operations
  • Cache locality in lookups
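
A sketch of embedding lookup and fixed sinusoidal positions (the memory comment assumes FP32); lookup is just fancy indexing into the table:

```python
import numpy as np

vocab_size, dim = 1000, 64
table = np.random.randn(vocab_size, dim).astype(np.float32) * 0.02
# Memory: vocab_size * dim * 4 bytes = 256 KB here; GPT-2's 50257 x 768
# table is ~154 MB in FP32.
token_ids = np.array([5, 17, 42])
embedded = table[token_ids]          # (3, 64): one row per token

def sinusoidal_positions(seq_len, dim):
    """Fixed positional encodings, original-Transformer style."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    enc = np.zeros((seq_len, dim), dtype=np.float32)
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

x = embedded + sinusoidal_positions(3, dim)  # token + position info
```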

Module 13: Attention

What Students Build:

  • Scaled dot-product attention
  • Multi-head attention
  • Causal masking
  • KV-cache basics

Systems Concepts:

  • O(N²) attention complexity
  • Memory bottlenecks in attention
  • Why KV-cache matters
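
A single-head causal attention sketch; the (N, N) score matrix is exactly where the quadratic time and memory cost lives:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.
    Q, K, V: (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (N, N) matrix
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)              # block future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V

N, d = 8, 16
out = causal_attention(*(np.random.randn(N, d) for _ in range(3)))
print(out.shape)  # (8, 16)
```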

Module 14: Transformers

What Students Build:

  • LayerNorm
  • Transformer block
  • Full GPT architecture
  • Residual connections

Systems Concepts:

  • Layer normalization stability
  • Residual path gradient flow
  • Transformer memory scaling
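
A sketch of LayerNorm and a pre-norm block; `attn` and `mlp` stand in for the sub-layers built in earlier modules, and the residual additions are what give gradients a direct path through deep stacks:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features to zero mean / unit variance,
    then rescale; keeps activations in a stable range at any depth."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def transformer_block(x, attn, mlp, ln_params):
    """Pre-norm block: normalize, transform, add back the residual."""
    (g1, b1), (g2, b2) = ln_params
    x = x + attn(layer_norm(x, g1, b1))   # attention sub-layer + residual
    x = x + mlp(layer_norm(x, g2, b2))    # feed-forward sub-layer + residual
    return x
```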

🎉 UNLOCK: TinyGPT!

  • Train character-level language model
  • Generate text
  • Compare with vision models

🔥 Phase 4: Systems Optimization - "Make It Fast, Make It Small!"

Module 15: Kernels

What Students Build:

  • Fused operations (e.g., fused_relu_add)
  • Matrix multiplication optimization
  • Custom CUDA-like kernels (in NumPy)
  • Operator fusion patterns

Why Universal:

  • Works for MLPs, CNNs, and Transformers
  • Reduces memory bandwidth usage
  • Speeds up any model architecture

Systems Concepts:

  • Memory bandwidth vs compute bound
  • Kernel fusion benefits
  • Cache optimization
  • Vectorization with NumPy

Performance Gains:

  • 2-5× speedup from fusion
  • Memory bandwidth reduction
  • Works on CPU (NumPy vectorization)
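
A NumPy approximation of fusion: true kernel fusion needs a compiler or a hand-written kernel, but eliminating temporaries already demonstrates the memory-traffic effect this module targets:

```python
import numpy as np

def relu_add_unfused(x, y):
    tmp = np.maximum(x, 0.0)   # pass 1: read x, allocate and write tmp
    return tmp + y             # pass 2: read tmp and y, write the result

def relu_add_fused(x, y):
    # One buffer instead of two: compute max into `out`, then add in place.
    # Drops the intermediate allocation and one full read/write of the array.
    out = np.maximum(x, 0.0)
    out += y
    return out
```

On large arrays the in-place version is typically measurably faster, because this op is memory-bandwidth bound rather than compute bound.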

Module 16: Compression

What Students Build:

  • Quantization (INT8, INT4)
  • Pruning (magnitude, structured)
  • Knowledge distillation setup
  • Model size reduction

Why Universal:

  • Quantize any model (MLP/CNN/GPT)
  • Prune any architecture
  • Distill large to small

Systems Concepts:

  • Precision vs accuracy tradeoffs
  • Structured vs unstructured sparsity
  • Compression ratios
  • Inference speedup from quantization

Performance Gains:

  • 4× size reduction (FP32 → INT8)
  • 2× inference speedup
  • 90% sparsity possible
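
A symmetric per-tensor INT8 quantization sketch showing the 4× size reduction (assumes a nonzero weight tensor):

```python
import numpy as np

def quantize_int8(w):
    """Map FP32 weights to INT8 with one scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()
print(q.nbytes / w.nbytes, err)   # 0.25 (4x smaller), small max error
```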

Module 17: Competition - "The Grand Finale!"

What Students Build:

  • KV-cache for transformers
  • Dynamic batching
  • Mixed precision training
  • Model ensemble techniques
  • All optimizations combined!

Competition Elements:

  • Leaderboard: Real-time ranking
  • Metrics: Accuracy, speed, model size
  • Constraints: Max 10MB model, <100ms inference
  • Tasks: CIFAR-10, MNIST, TinyGPT generation

Systems Concepts:

  • KV-cache memory management
  • Batch size vs latency tradeoffs
  • Optimization stacking
  • Production deployment considerations
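
A minimal KV-cache sketch: during generation, append each new token's keys and values instead of recomputing them for the whole prefix at every step. Names are illustrative:

```python
import numpy as np

class KVCache:
    """Preallocated per-layer cache of keys and values."""
    def __init__(self, max_len, d_k):
        self.K = np.zeros((max_len, d_k), dtype=np.float32)
        self.V = np.zeros((max_len, d_k), dtype=np.float32)
        self.len = 0

    def append(self, k_new, v_new):
        self.K[self.len] = k_new
        self.V[self.len] = v_new
        self.len += 1
        # Return views over the filled prefix: no copy, no recomputation
        return self.K[:self.len], self.V[:self.len]
```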

🏆 GRAND FINALE:

  • Students submit optimized models
  • Automatic evaluation on hidden test set
  • Leaderboard shows:
    • Accuracy scores
    • Inference time
    • Model size
    • Memory usage
  • Winners announced for:
    • Best accuracy
    • Fastest inference
    • Smallest model
    • Best accuracy/size ratio

🎯 Why This Structure Works

Progressive Unlocking

  1. Modules 1-5: Build foundation → Unlock inference (immediate gratification)
  2. Modules 6-10: Add training → Unlock CNN training (real achievement)
  3. Modules 11-14: Add language → Unlock TinyGPT (wow factor)
  4. Modules 15-17: Optimize everything → Competition (epic finale)

Universal Optimizations (Modules 15-17)

  • Not architecture-specific
  • Work on MLPs, CNNs, and Transformers
  • Real production techniques
  • Measurable improvements

Competition as Culmination

  • Uses EVERYTHING students built
  • Competitive element drives engagement
  • Multiple winning categories (not just accuracy)
  • Shows real ML engineering tradeoffs
  • Students optimize their own code!

High Note Ending

  • Module 15: "Make it fast!" (kernels)
  • Module 16: "Make it small!" (compression)
  • Module 17: "Make it production-ready!" (competition)
  • Final message: "You built a complete ML framework and optimized it for production!"

📊 Module Complexity Progression

Complexity:  ▁▂▃▃▄▄▅▅▆▆▇▇█████
Modules:     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
             └─Found.─┘└Training┘└─Language─┘└Systems┘
Unlocks:             ↑         ↑           ↑        ↑
                 Inference    CNN      TinyGPT Competition

🏁 Student Journey Summary

Week 1-2: Foundation (Modules 1-5)

  • "I built tensors and layers!"
  • "I can run pretrained models!"

Week 3-4: Training (Modules 6-10)

  • "I built autograd from scratch!"
  • "I trained a CNN to 75% accuracy!"

Week 5-6: Language (Modules 11-14)

  • "I built attention mechanisms!"
  • "I have a working GPT!"

Week 7: Systems (Modules 15-17)

  • "I optimized everything!"
  • "I'm on the leaderboard!"
  • "I built a complete, optimized ML framework!"

Final Achievement: "I didn't just learn ML algorithms - I built the entire infrastructure, optimized it for production, and competed against my peers. I understand ML systems engineering!"