# 🎯 TinyTorch Tutorial Master Plan: Complete ML Systems Engineering
## Vision Statement
Students build a complete ML framework from scratch, learning systems engineering through hands-on implementation. From basic tensors to production-optimized transformers, every line of code teaches both algorithms AND systems thinking.
## 📚 Core Curriculum: 15 Modules (Complete ML Systems Education)
### Phase 1: Foundation (Modules 1-5)
*Build → Use: Mathematical foundations with immediate application*
| Module | Name | What Students Build | Systems Engineering Concepts |
|---|---|---|---|
| 1 | Setup | • Virtual environment configuration • Rich CLI progress tracking • Memory profiler setup • Testing infrastructure | • Development environment best practices • Profiling and measurement tools • Testing frameworks • Dependency management |
| 2 | Tensor | • N-dimensional Tensor class • Broadcasting operations • Memory views and slicing • Basic math ops (+, -, *, /) | • Memory layout (row-major vs column-major) • Zero-copy operations with views • Cache-friendly memory access patterns • Vectorization opportunities |
| 3 | Layers | • Module base class • Parameter management • Linear/Dense layer implementation • Forward/backward protocol | • Object-oriented design for ML • Parameter memory overhead • Matrix multiplication complexity O(N³) • Cache effects in GEMM |
| 4 | Activations | • ReLU, Sigmoid, Tanh, Softmax • Backward passes for each • In-place operations • Numerical stability fixes | • In-place vs copy memory tradeoffs • Numerical stability (overflow/underflow) • Memory allocation patterns • Why nonlinearity enables learning |
| 5 | Networks | • Sequential container • Multi-layer composition • Weight initialization strategies • Complete neural network class | • Network depth vs memory scaling • Gradient flow in deep networks • Initialization impact on convergence • Parameter scaling with network size |
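To make the end of Phase 1 concrete, here is a minimal sketch of a Module 3-style dense layer in NumPy. The `Linear` name, constructor signature, and `forward` method are illustrative assumptions, not TinyTorch's actual API:

```python
import numpy as np

class Linear:
    """Minimal dense layer: y = x @ W + b (illustrative sketch only)."""

    def __init__(self, in_features, out_features):
        # Naive small-random init; Module 5 covers principled schemes.
        self.W = np.random.randn(in_features, out_features) * 0.01
        self.b = np.zeros(out_features)

    def forward(self, x):
        # x: (batch, in_features) -> (batch, out_features).
        # This matmul is the GEMM whose cache behavior Module 3 analyzes.
        return x @ self.W + self.b

layer = Linear(784, 128)
print(layer.forward(np.random.randn(32, 784)).shape)  # (32, 128)
```

Module 6's autograd later supplies the backward half of the forward/backward protocol automatically.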
**🎉 Milestone: Inference Examples Unlocked**
- Students can run pretrained XOR, MNIST, and CIFAR-10 models
- Learning Validation: "I built the mathematical foundation for all neural networks"
### Phase 2: Vision Training (Modules 6-10)
*Learn → Optimize: Complete CNN training capabilities*
| Module | Name | What Students Build | Systems Engineering Concepts |
|---|---|---|---|
| 6 | Autograd | • Computational graph • Automatic differentiation • Gradient accumulation • Memory checkpointing | • Graph memory growth O(N) • Forward vs reverse mode AD • Gradient checkpointing tradeoffs • Memory-efficient backpropagation |
| 7 | Spatial (CNNs) | • Conv2d layer implementation • BatchNorm for training stability • MaxPool2d operations • Complete CNN architectures | • Convolution complexity O(N²K²C²) • Feature map memory scaling • BatchNorm parameter overhead • Cache-friendly convolution patterns |
| 8 | Optimizers | • SGD with momentum • Adam optimizer • Memory buffers for conv weights • Learning rate scheduling | • Adam memory cost: 3× parameters • Conv weight memory scaling • Momentum buffer allocation • Convergence vs memory tradeoffs |
| 9 | DataLoader | • Dataset abstraction class • CIFAR-10 image data loader • Batch sampling for CNNs • Image preprocessing pipeline | • I/O bottlenecks for image data • Memory vs disk tradeoffs • Image batch size impact on throughput • Data pipeline optimization for vision |
| 10 | Training | • CNN training loops • CrossEntropy loss for classification • Validation on CIFAR-10 • Model checkpointing | • CNN memory during training (conv + BatchNorm) • Image batch gradient accumulation • Model checkpoint disk I/O • CIFAR-10 training memory profiling |
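As a flavor of what Module 8 builds, here is a hedged sketch of a single SGD-with-momentum step; the function name and list-based API are hypothetical:

```python
import numpy as np

def sgd_momentum_step(params, grads, buffers, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update (hypothetical API, in the spirit of Module 8)."""
    for p, g, v in zip(params, grads, buffers):
        v *= momentum      # decay the running velocity in place
        v += g             # fold in the current gradient
        p -= lr * v        # in-place parameter update, no extra allocation

# Each parameter needs a same-shape velocity buffer -- the momentum memory
# overhead the table calls out (Adam adds a second moment buffer on top).
W = np.random.randn(784, 128)
gW = np.random.randn(784, 128)   # stand-in gradient
vW = np.zeros_like(W)            # momentum buffer: +1x parameter memory
sgd_momentum_step([W], [gW], [vW])
```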
**🎉 Milestone: CNN Training Unlocked**
- Students train CNNs on CIFAR-10 to 75% accuracy
- Learning Validation: "I understand how modern ML training works under the hood"
### Phase 3: Language & Advanced Architectures (Modules 11-15)
*Specialize → Apply: Language models and advanced techniques*
| Module | Name | What Students Build | Systems Engineering Concepts |
|---|---|---|---|
| 11 | Tokenization | • Character tokenizer • BPE tokenizer basics • Vocabulary management • Padding and truncation | • Memory efficiency of token representations • Vocabulary size vs model size tradeoffs • Tokenization throughput optimization • String processing performance |
| 12 | Embeddings | • Embedding layer implementation • Positional encodings • Learned vs fixed embeddings • Embedding initialization | • Embedding table memory (vocab_size × dim) • Sparse vs dense lookup operations • Cache locality in embedding lookups • Memory scaling with vocabulary size |
| 13 | Attention | • Scaled dot-product attention • Multi-head attention • Causal masking • KV-cache implementation | • Quadratic memory scaling O(N²) • Attention memory bottlenecks • KV-cache memory savings • Sequence length vs memory tradeoffs |
| 14 | Transformers | • LayerNorm for transformers • Transformer block • Complete TinyGPT architecture • Residual connections | • LayerNorm vs BatchNorm differences • Layer memory accumulation • Activation memory per transformer layer • Residual path gradient flow |
| 15 | Generation | • Autoregressive text generation • Sampling strategies • Temperature and top-k • Complete TinyGPT training | • Autoregressive generation memory • KV-cache efficiency during generation • Sampling algorithm performance • Training vs inference memory patterns |
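The centerpiece of Module 13 is scaled dot-product attention. A minimal NumPy sketch (illustrative names, single head) makes the O(N²) score matrix visible:

```python
import numpy as np

def attention(q, k, v, causal=True):
    """Scaled dot-product attention sketch. q, k, v: (seq_len, d) arrays."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)   # (seq_len, seq_len): the O(N^2) memory term
    if causal:
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
        scores[future] = -np.inf    # mask out attention to future tokens
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

q = k = v = np.random.randn(16, 64)
print(attention(q, k, v).shape)  # (16, 64)
```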
**🎉 Grand Finale: Complete TinyGPT Language Model**
- Students build a working transformer for text generation
- Learning Validation: "I built a unified framework supporting both vision and language"
## 🚀 Advanced Track: 5 Optional Modules (Production-Level Systems)
*For students wanting deeper production ML systems expertise*
### Systems Optimization Specialization (Modules 16-20)
| Module | Name | What Students Build | Systems Engineering Concepts |
|---|---|---|---|
| 16 | Profiling | • Performance measurement tools • Memory usage profilers • Bottleneck identification • System analysis frameworks | • Systematic performance analysis • Memory vs compute profiling • Scaling behavior measurement • Performance regression detection |
| 17 | Kernels | • Optimized matrix multiplication • Vectorized activations with NumPy • Fused operations (relu+add) • Parallel processing optimization | • Memory bandwidth optimization • Kernel fusion benefits (2-5× speedup) • Cache-friendly algorithms • Vectorization techniques |
| 18 | Compression | • Weight pruning algorithms • Basic quantization (INT16) • Knowledge distillation • Model size reduction tools | • 2× memory reduction from INT16 (4× with INT8) • Structured vs unstructured sparsity • Distillation training loops • Accuracy vs size tradeoffs |
| 19 | KV-Cache | • Simple KV-cache for attention • Cache hit/miss optimization • Memory-efficient attention • Sequence length optimization | • Memory vs computation tradeoffs • Cache-aware attention algorithms • O(N²) → O(N) optimization • Memory allocation strategies |
| 20 | Competition | • Apply ALL optimizations (16-19) • Multi-objective optimization • Leaderboard submission system • Competition across multiple metrics | • Real-world constraint optimization • Multi-metric evaluation • Production-ready systems thinking • Competitive optimization |
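To see the Module 17 fusion idea in plain NumPy (a rough approximation only: true kernel fusion happens inside compiled kernels, and speedups vary by hardware), compare a naive relu+add against a buffer-reusing version:

```python
import time
import numpy as np

def relu_add(x, y):
    # Naive version: np.maximum allocates a temporary, `+` allocates another.
    return np.maximum(x, 0) + y

def relu_add_fused(x, y, out):
    # "Fused" in spirit: reuse one preallocated buffer, then add in place.
    np.maximum(x, 0, out=out)
    out += y
    return out

x, y = np.random.randn(4096, 4096), np.random.randn(4096, 4096)
out = np.empty_like(x)
for name, fn in [("naive", lambda: relu_add(x, y)),
                 ("fused", lambda: relu_add_fused(x, y, out))]:
    start = time.perf_counter()
    fn()
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```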
**🏆 Ultimate Achievement: Production-Optimized ML Systems**
- Students optimize their framework for speed, memory, and deployment
- Learning Validation: "I understand how to build production-ready ML systems"
## 🎯 Learning Progression & Validation
### Module Progression Logic
```
Foundation (1-5):      "Can I build the math?"               [Build → Use]
          ↓
Training (6-10):       "Can I learn from data?"              [Learn → Optimize]
          ↓
Architectures (11-15): "Can I handle multiple modalities?"   [Specialize → Apply]
          ↓
Systems (16-20):       "Can I optimize for production?"      [Measure → Optimize]
```
### Achievement Milestones
- Module 5: Run inference on pretrained models (Foundation complete)
- Module 10: Train CNNs on CIFAR-10 to 75% accuracy (Vision training complete)
- Module 15: Generate text with TinyGPT (Language architectures complete)
- Module 20: Optimize framework for production constraints (Systems mastery)
### Systems Engineering Thread Throughout
Every module teaches both algorithms AND systems:
- Memory usage patterns: How operations scale with input size
- Computational complexity: O(N), O(N²), O(N³) analysis (see the timing probe after this list)
- Performance bottlenecks: Where systems break under load
- Production implications: How real frameworks handle these challenges
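The complexity thread can be made empirical with a few lines of NumPy; timings below are illustrative, and BLAS threading blurs the constant factors:

```python
import time
import numpy as np

# Doubling N should roughly 8x the dense-matmul time if the O(N^3) term dominates.
for n in (256, 512, 1024, 2048):
    a = np.random.randn(n, n)
    start = time.perf_counter()
    a @ a
    print(f"N={n:4d}: {time.perf_counter() - start:.4f}s")
```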
## 📊 Time Estimates & Scope
### Core Curriculum (15 modules)
- Time: 4-6 hours per module = 60-90 hours total
- Semester fit: 15 weeks = 4-6 hours/week (realistic)
- Outcome: Complete ML systems engineer
### Advanced Track (5 modules)
- Time: 3-4 hours per module = 15-20 hours additional
- Audience: Motivated students wanting production skills
- Outcome: Production-ready optimization expertise
### Total Program
- Core only: Complete foundation in 15 weeks
- With advanced: Production expertise in 20 weeks
- Flexibility: Natural stopping point at Module 15
## 🔄 Continuous Systems Focus
### Every Module Includes:
- Memory Analysis: Explicit memory profiling and optimization (see the sketch after this list)
- Performance Measurement: Timing and complexity analysis
- Scaling Behavior: How does this break with larger inputs?
- Production Context: How do real systems (PyTorch/TensorFlow) handle this?
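For example, the kind of back-of-envelope memory analysis each module encourages (shapes here are hypothetical):

```python
import numpy as np

# One batch of hypothetical CIFAR-10-scale feature maps, float32, NCHW layout:
# 128 * 64 * 32 * 32 * 4 bytes = 32 MiB for a single layer's activations.
activations = np.zeros((128, 64, 32, 32), dtype=np.float32)
print(f"{activations.nbytes / 2**20:.1f} MiB")  # 32.0 MiB
```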
### Cumulative Systems Knowledge
- Modules 1-5: Memory-efficient operations
- Modules 6-10: Training memory management
- Modules 11-15: Attention memory scaling
- Modules 16-20: Production optimization techniques
## 🎯 Success Metrics
### Student Capabilities After Core (15 modules):
- "I can build any neural network architecture from scratch"
- "I understand memory and performance implications of my code"
- "I can train models on real datasets like CIFAR-10"
- "I can extend my framework to new modalities (vision → language)"
### Student Capabilities After Advanced (20 modules):
- "I can optimize ML systems for production constraints"
- "I understand the engineering tradeoffs in real ML frameworks"
- "I can measure, profile, and systematically improve performance"
- "I can compete in optimization challenges using my own code"
## 🚀 Why This Approach Works
### Learning Through Building
Students don't just study ML algorithms; they build the infrastructure that makes modern AI possible.
### Systems Engineering Focus
Every concept is taught through the lens of memory, performance, and scaling: the core of ML systems engineering.
### Progressive Complexity
Clear progression from basic math operations to production-optimized transformers.
### Immediate Validation
Students can run inference, train models, and generate text using code they built themselves.
### Industry Relevance
Skills transfer directly to understanding PyTorch, TensorFlow, and production ML systems.
**🎉 Final Achievement:** Students build a complete, optimized ML framework from scratch and understand every line of it, along with the engineering behind modern AI systems.