This commit implements the pedagogically optimal "inevitable discovery" module progression based on expert validation and educational design principles.

## Module Reordering Summary

**Previous Order (Problems)**:
- 05_losses → 06_autograd → 07_dataloader → 08_optimizers → 09_spatial → 10_training
- Issues: autograd before optimizers, DataLoader before training, scattered dependencies

**New Order (Beautiful Progression)**:
- 05_losses → 06_optimizers → 07_autograd → 08_training → 09_spatial → 10_dataloader
- Benefit: each module creates an inevitable need for the next

## Pedagogical Flow Achieved

- **05_losses** → "Need systematic weight updates" → **06_optimizers**
- **06_optimizers** → "Need automatic gradients" → **07_autograd**
- **07_autograd** → "Need systematic training" → **08_training**
- **08_training** → "MLPs hit limits on images" → **09_spatial**
- **09_spatial** → "Training is too slow" → **10_dataloader**

## Technical Changes

### Module Directory Renaming

- `06_autograd` → `07_autograd`
- `07_dataloader` → `10_dataloader`
- `08_optimizers` → `06_optimizers`
- `10_training` → `08_training`
- `09_spatial` → `09_spatial` (no change)

### System Integration Updates

- **MODULE_TO_CHECKPOINT mapping**: updated in tito/commands/export.py
- **Test directories**: renamed module_XX directories to match the new numbers
- **Documentation**: updated all references in MD files and agent configurations
- **CLI integration**: updated next-steps suggestions for the proper flow

### Agent Configuration Updates

- **Quality Assurance**: updated module audit status with the new numbers
- **Module Developer**: updated work tracking with the new sequence
- **Documentation**: updated MASTER_PLAN_OF_RECORD.md with the new progression

## Educational Benefits

1. **Inevitable Discovery**: each module naturally leads to the next
2. **Cognitive Load**: concepts are introduced exactly when needed
3. **Motivation**: students understand WHY each tool is necessary
4. **Synthesis**: everything flows toward complete ML systems understanding
5. **Professional Alignment**: matches real ML engineering workflows

## Quality Assurance

- ✅ All CLI commands still function
- ✅ Checkpoint system mappings updated
- ✅ Documentation consistency maintained
- ✅ Test directory structure aligned
- ✅ Agent configurations synchronized

**Impact**: This reordering transforms TinyTorch from a collection of modules into a coherent educational journey where each step naturally motivates the next, creating optimal conditions for deep understanding of ML systems.
# Module 15: Hardware Acceleration and Kernel Optimization

## Overview

This module teaches hardware acceleration principles through hands-on implementation of optimized kernels that deliver measurable performance improvements. Students learn to identify hardware bottlenecks, implement cache-friendly algorithms, and build systems that apply optimizations automatically.
## Learning Objectives

By the end of this module, students will be able to:

- **Understand performance bottlenecks**: identify why naive implementations are slow and where optimization opportunities exist
- **Implement cache-friendly algorithms**: build blocked matrix multiplication that leverages the CPU cache hierarchy
- **Optimize memory access patterns**: create vectorized operations with contiguous memory access
- **Build transparent backend systems**: design automatic dispatch between naive and optimized implementations
- **Measure real speedups**: quantify performance improvements and understand when optimizations matter
## Key Concepts

### Hardware Reality: Cache is King

Modern CPU performance is dominated by memory access patterns, not raw computation speed:

- **L1 cache**: ~32 KB, 1-2 cycles (fastest)
- **L2 cache**: ~256 KB, 3-10 cycles
- **L3 cache**: ~8 MB, 10-20 cycles
- **RAM**: gigabytes, 100-300 cycles (slowest)

The key insight: keeping data in cache, and accessing memory in cache-friendly patterns, yields dramatic speedups.
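To make this concrete, here is a minimal, self-contained demonstration (an illustrative script, not part of the module's codebase): both sums below read exactly the same number of float64 values, but one walks memory contiguously while the other lands on a fresh cache line for every element.

```python
import time
import numpy as np

a = np.random.rand(1 << 24)  # ~134 MB of float64, far larger than any CPU cache

def best_time(fn, repeats=5):
    """Best wall-clock time over several runs, to reduce timer noise."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

# Both calls sum exactly 2**20 elements; only the access pattern differs.
contiguous = best_time(lambda: a[: 1 << 20].sum())  # adjacent elements
strided = best_time(lambda: a[::16].sum())          # one element per 128 bytes
print(f"contiguous: {contiguous:.4f}s  strided: {strided:.4f}s  "
      f"({strided / contiguous:.1f}x slower)")
```

The arithmetic is identical in both cases; the entire gap comes from cache misses.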
## What You'll Build

### 1. Performance Benchmarking Tools

- Scientific measurement infrastructure for quantifying speedups
- Automated timing with statistical analysis (a minimal sketch follows this list)
- Memory usage profiling and operation counting
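A minimal sketch of that measurement infrastructure, using nothing beyond the standard library; the `benchmark` helper and its signature are illustrative, not the module's real API.

```python
import statistics
import time

def benchmark(fn, *args, warmup=2, repeats=10):
    """Time fn(*args) and return (mean, stdev, best) in seconds."""
    for _ in range(warmup):  # warm up caches before measuring
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples), min(samples)

# Usage: run competing implementations on identical inputs and compare,
# e.g. benchmark(np.matmul, A, B) vs benchmark(my_kernel, A, B).
```

Reporting the best time alongside the mean and standard deviation helps separate real speedups from scheduler noise.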
### 2. Optimized Kernels

- **Blocked matrix multiplication**: cache-friendly algorithm showing 2-5x speedups (sketched after this list)
- **Vectorized operations**: memory-optimized implementations with 10-100x improvements
- **In-place operations**: reduced memory allocation overhead
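A compact sketch of the blocking (tiling) idea, with one honest caveat: pure-Python tiling cannot beat NumPy's `matmul`, which already blocks internally, so the per-tile products below delegate to NumPy. What the sketch shows is the access pattern a low-level blocked kernel exploits.

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Compute A @ B one cache-sized tile at a time.

    Each (block x block) tile of A, B, and C fits in cache, so every
    loaded value is reused many times before it is evicted.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # NumPy slicing clips at the array edge, so ragged
                # trailing tiles are handled automatically.
                C[i:i + block, j:j + block] += (
                    A[i:i + block, p:p + block] @ B[p:p + block, j:j + block]
                )
    return C
```

The block size is a tunable: small enough that three tiles fit in L1 or L2 cache, large enough to amortize loop overhead.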
### 3. Backend System

- Abstract `ComputeBackend` interface for pluggable implementations
- Automatic dispatch based on problem size and hardware characteristics
- Transparent optimization without changing user code (see the sketch after this list)
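One way the dispatch might look. The module names a `ComputeBackend` interface; the method set, backend classes, and size threshold below are illustrative assumptions.

```python
from abc import ABC, abstractmethod
import numpy as np

class ComputeBackend(ABC):
    """Pluggable implementation of tensor operations."""
    @abstractmethod
    def matmul(self, A: np.ndarray, B: np.ndarray) -> np.ndarray: ...

class NaiveBackend(ComputeBackend):
    def matmul(self, A, B):
        # Reference triple loop: easy to verify, very slow on large inputs.
        n, k = A.shape
        _, m = B.shape
        C = np.zeros((n, m), dtype=A.dtype)
        for i in range(n):
            for j in range(m):
                for p in range(k):
                    C[i, j] += A[i, p] * B[p, j]
        return C

class OptimizedBackend(ComputeBackend):
    def matmul(self, A, B):
        return A @ B  # delegates to NumPy's BLAS-backed kernel

def dispatch_matmul(A, B, threshold=64):
    """Choose a backend from problem size; caller code never changes."""
    backend = OptimizedBackend() if A.shape[0] >= threshold else NaiveBackend()
    return backend.matmul(A, B)
```

Because callers only see `dispatch_matmul`, new backends (blocked, GPU, fused) can be added without touching user code.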
### 4. Competition Framework

- Kernel submission and benchmarking system (one plausible wiring is sketched below)
- Quantitative performance comparisons with leaderboards
- Educational framework for optimization challenges
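A sketch of how such a framework could be wired up; the decorator name and ranking scheme are hypothetical, not the module's actual competition API.

```python
import time

KERNELS = {}

def submit(name):
    """Decorator that registers a kernel under a student-chosen name."""
    def register(fn):
        KERNELS[name] = fn
        return fn
    return register

@submit("numpy_baseline")
def numpy_matmul(A, B):
    return A @ B

def leaderboard(A, B, repeats=5):
    """Rank every submitted kernel by its best time on identical inputs."""
    scores = {}
    for name, fn in KERNELS.items():
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(A, B)
            times.append(time.perf_counter() - start)
        scores[name] = min(times)
    for rank, (name, t) in enumerate(sorted(scores.items(), key=lambda kv: kv[1]), 1):
        print(f"{rank}. {name}: {t * 1e3:.2f} ms")
```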
## Performance Improvements Demonstrated

Students will achieve and measure these real speedups:

- **Cache-friendly blocking**: 2-5x speedup from optimized memory access patterns
- **Vectorization**: 10-100x speedup from eliminating Python loop overhead (contrasted with a loop version in the sketch below)
- **In-place operations**: 1.5-2x improvement from reduced memory allocation
- **Automatic dispatch**: near-optimal performance across different problem sizes
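The sketch below contrasts the loop, vectorized, and in-place variants on one element-wise operation (ReLU, chosen purely for illustration); exact ratios depend on hardware.

```python
import numpy as np

x = np.random.rand(1_000_000)

def relu_loop(x):
    # Python-level loop: interpreter overhead on every single element.
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] if x[i] > 0 else 0.0
    return out

def relu_vectorized(x):
    # One call over contiguous memory: compiled, SIMD-friendly inner loop.
    return np.maximum(x, 0.0)

def relu_inplace(x):
    # Same math, but the result overwrites x: no fresh allocation.
    np.maximum(x, 0.0, out=x)
    return x
```

Timed with the `benchmark` helper sketched earlier, the vectorized version typically lands orders of magnitude ahead of the loop, and the in-place variant adds a further constant-factor gain on allocation-heavy workloads.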
## Systems Thinking Focus

This module emphasizes understanding optimization through systems principles.

### Optimization Priorities (Most → Least Impact)

1. **Algorithmic complexity**: O(N³) → O(N²) matters more than 2x constant factors (see the arithmetic after this list)
2. **Memory access patterns**: cache-friendly algorithms enable 2-10x speedups
3. **Vectorization**: SIMD instructions and avoiding Python loops, 5-50x
4. **Memory management**: minimizing allocations and using in-place operations, 1.5-3x
5. **Hardware utilization**: CPU → GPU for large parallel operations, 10-100x
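The back-of-envelope arithmetic behind the first priority, for an assumed size of N = 1024:

```python
# Purely illustrative: at N = 1024 the gap between O(N^3) and O(N^2) work
# is a factor of N -- no 2x constant-factor tweak can close it.
N = 1024
print(f"O(N^3): {N**3:>13,} operations")  # 1,073,741,824
print(f"O(N^2): {N**2:>13,} operations")  # 1,048,576 (a 1024x difference)
```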
### When to Optimize vs. When Not To

- ✅ **Optimize**: proven bottlenecks, poor algorithmic complexity, large data, cache-unfriendly patterns
- ❌ **Don't optimize**: already using optimized libraries, small data, I/O bottlenecks, non-critical code
## Real-World Context

### How ML Frameworks Apply These Principles

- **PyTorch/TensorFlow**: rely on optimized BLAS libraries (cuBLAS, MKL)
- **Memory layouts**: cache-friendly data arrangements (NCHW vs. NHWC, illustrated after this list)
- **Vectorization**: batch processing and SIMD instruction utilization
- **GPU kernels**: parallel operations for large tensor computations
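A small standalone illustration (not framework code) of why layout matters: the same image batch stored as NCHW versus NHWC puts different dimensions adjacent in memory, so the same per-channel reduction is contiguous in one layout and strided in the other.

```python
import numpy as np

nchw = np.random.rand(32, 3, 224, 224)                   # batch, channel, H, W
nhwc = np.ascontiguousarray(nchw.transpose(0, 2, 3, 1))  # batch, H, W, channel

# Channel 0 is one contiguous block per image in NCHW...
mean_nchw = nchw[:, 0].mean()
# ...but every third element in NHWC, touching ~3x the cache lines.
mean_nhwc = nhwc[..., 0].mean()
```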
### Where User Optimization Matters
- Custom operations not in standard libraries
- Data preprocessing and augmentation pipelines
- Memory management for large models
- Distributed training communication patterns
## Educational Approach

### Pedagogical Structure

1. **Measure first**: establish performance baselines with scientific benchmarking
2. **Understand why**: implement naive versions to see why they are slow
3. **Optimize systematically**: build cache-friendly and vectorized improvements
4. **Automate selection**: create systems that choose optimal implementations
5. **Compete and compare**: a framework for quantitative optimization challenges
### Key Learning Insights

- Memory access patterns dominate performance far more than raw computation
- Existing optimized libraries (NumPy, BLAS) are extremely well engineered
- Hardware awareness (caches, vectorization) enables dramatic improvements
- Competition frameworks make optimization learning engaging and quantifiable
## Prerequisites

- **Module 2**: tensor operations and NumPy fundamentals
- **Module 4**: linear layers and matrix multiplication
- **Algorithmic complexity**: basic familiarity with big-O notation
- **Systems thinking**: interest in understanding how software meets hardware
## Time Commitment

Estimated time: 3-4 hours

- Understanding concepts and the cache hierarchy: 30 minutes
- Implementing optimized kernels: 2 hours
- Building the backend system: 1 hour
- Competition framework and analysis: 30 minutes
## Assessment

Students demonstrate mastery through:

- **Blocked matrix multiplication**: implement a cache-friendly algorithm with measurable speedups
- **Vectorized operations**: build optimized implementations that avoid Python loops
- **Backend architecture**: create a transparent system for automatic optimization
- **Performance analysis**: measure optimizations and explain the principles behind them scientifically
- **Systems understanding**: apply optimization thinking to real ML system challenges
## Connection to ML Systems

This module directly prepares students to understand:

- How PyTorch and TensorFlow achieve their performance internally
- Why GPU acceleration matters for large neural networks
- Where optimization effort provides real value in production systems
- How to make informed trade-offs between performance and development time

Students learn to think like performance engineers: understand the hardware, measure scientifically, optimize systematically, and focus effort where it matters most.