# Module 15: Hardware Acceleration and Kernel Optimization

## Overview

This module teaches hardware acceleration principles through hands-on implementation of optimized kernels that demonstrate real performance improvements. Students learn to identify hardware bottlenecks, implement cache-friendly algorithms, and build systems that apply optimizations automatically.

## Learning Objectives

By the end of this module, students will be able to:

1. **Understand Performance Bottlenecks**: Identify why naive implementations are slow and where optimization opportunities exist
2. **Implement Cache-Friendly Algorithms**: Build blocked matrix multiplication that leverages the CPU cache hierarchy
3. **Optimize Memory Access Patterns**: Create vectorized operations with contiguous memory access
4. **Build Transparent Backend Systems**: Design automatic dispatch between naive and optimized implementations
5. **Measure Real Speedups**: Quantify performance improvements and understand when optimizations matter

## Key Concepts

### Hardware Reality: Cache is King

Modern CPU performance is dominated by memory access patterns, not raw computation speed:

- **L1 Cache**: ~32 KB, 1-2 cycles (fastest)
- **L2 Cache**: ~256 KB, 3-10 cycles
- **L3 Cache**: ~8 MB, 10-20 cycles
- **RAM**: gigabytes, 100-300 cycles (slowest)

The key insight: keeping data in cache and accessing memory in cache-friendly patterns provides dramatic speedups.
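
To see this effect directly, the snippet below (a standalone demo, not part of the module's codebase) sums a large row-major NumPy array twice: traversing by rows reads contiguous memory, while traversing by columns makes strided reads that miss cache far more often. Exact timings depend on your CPU and cache sizes.

```python
import time

import numpy as np

x = np.random.rand(4096, 4096)  # C-ordered: each row is contiguous in memory

start = time.perf_counter()
row_total = sum(x[i, :].sum() for i in range(x.shape[0]))  # contiguous reads
row_time = time.perf_counter() - start

start = time.perf_counter()
col_total = sum(x[:, j].sum() for j in range(x.shape[1]))  # strided reads
col_time = time.perf_counter() - start

print(f"rows: {row_time:.3f}s  columns: {col_time:.3f}s")
```

On most machines the column traversal is noticeably slower even though both loops do identical arithmetic; only the memory access pattern differs.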

## What You'll Build

### 1. Performance Benchmarking Tools

- Scientific measurement infrastructure for quantifying speedups
- Automated timing with statistical analysis
- Memory usage profiling and operation counting
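
As a rough sketch of what this measurement infrastructure can look like (the helper below is illustrative, not the module's actual API), a benchmark should warm up first and report statistics over repeated runs rather than trust a single timing:

```python
import statistics
import time

def benchmark(fn, *args, repeats=10, warmup=2):
    """Time fn(*args) repeatedly and report summary statistics.

    Illustrative sketch; the module's real benchmarking tools may differ.
    """
    for _ in range(warmup):  # warm caches and any lazy initialization
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples) if repeats > 1 else 0.0,
        "best_s": min(samples),
    }
```

Reporting the median alongside the spread guards against one-off scheduler noise skewing a single measurement.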

### 2. Optimized Kernels

- **Blocked Matrix Multiplication**: Cache-friendly algorithm showing 2-5x speedups
- **Vectorized Operations**: Memory-optimized implementations with 10-100x improvements
- **In-place Operations**: Reduce memory allocation overhead
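
The sketch below shows the core idea behind cache blocking for the first of these kernels. The function name and default block size are illustrative; the best block size depends on your cache sizes and should be found empirically.

```python
import numpy as np

def blocked_matmul(a, b, block=64):
    """Cache-blocked matrix multiply (sketch).

    Works on block-sized tiles so that each tile of a, b, and the output
    stays resident in cache while it is being reused.
    """
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # One tile pair; slicing clamps at the edges automatically,
                # and @ keeps the innermost work vectorized.
                c[i:i + block, j:j + block] += (
                    a[i:i + block, p:p + block] @ b[p:p + block, j:j + block]
                )
    return c
```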

### 3. Backend System

- Abstract `ComputeBackend` interface for pluggable implementations
- Automatic dispatch based on problem size and hardware characteristics
- Transparent optimization without changing user code
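
A minimal sketch of how these pieces might fit together, assuming NumPy arrays (the concrete classes and the size threshold below are illustrative, not the module's actual implementation):

```python
import numpy as np

class ComputeBackend:
    """Interface for pluggable matmul implementations (sketch)."""

    def matmul(self, a, b):
        raise NotImplementedError

class NaiveBackend(ComputeBackend):
    def matmul(self, a, b):
        # Triple loop: correct, but interpreter-bound and cache-unfriendly.
        n, k = a.shape
        _, m = b.shape
        c = np.zeros((n, m), dtype=a.dtype)
        for i in range(n):
            for j in range(m):
                for p in range(k):
                    c[i, j] += a[i, p] * b[p, j]
        return c

class OptimizedBackend(ComputeBackend):
    def matmul(self, a, b):
        return a @ b  # delegate to BLAS through NumPy

def matmul(a, b, threshold=64):
    """Transparent dispatch: callers never pick a backend explicitly.

    The size threshold is a made-up heuristic for illustration.
    """
    backend = OptimizedBackend() if max(a.shape + b.shape) >= threshold else NaiveBackend()
    return backend.matmul(a, b)
```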

### 4. Competition Framework

- Kernel submission and benchmarking system
- Quantitative performance comparisons with leaderboards
- Educational framework for optimization challenges

## Performance Improvements Demonstrated

Students will achieve and measure these real speedups:

- **Cache-friendly blocking**: 2-5x speedup from optimized memory access patterns
- **Vectorization**: 10-100x speedup from eliminating Python loop overhead
- **In-place operations**: 1.5-2x improvement from reduced memory allocation
- **Automatic dispatch**: Optimal performance across different problem sizes
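
To see where numbers like these come from, compare three ReLU variants (a hypothetical example; the actual ratios depend on array size and hardware). Timing them with a helper like the one sketched earlier makes the gaps concrete.

```python
import numpy as np

def relu_loop(v):
    # Python loop: every element access crosses the interpreter boundary.
    out = np.empty_like(v)
    for i in range(len(v)):
        out[i] = v[i] if v[i] > 0 else 0.0
    return out

def relu_vectorized(v):
    # One NumPy call: a contiguous, SIMD-friendly sweep over the data.
    return np.maximum(v, 0.0)

def relu_inplace(v):
    # Same sweep, but writes into v instead of allocating a new array.
    np.maximum(v, 0.0, out=v)
    return v
```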

## Systems Thinking Focus

This module emphasizes understanding optimization through systems principles:

### Optimization Priorities (Most → Least Impact)

1. **Algorithmic Complexity**: O(N³) → O(N²) matters more than 2x constant factors
2. **Memory Access Patterns**: Cache-friendly algorithms enable 2-10x speedups
3. **Vectorization**: SIMD instructions and avoiding Python loops (5-50x)
4. **Memory Management**: Minimizing allocations and using in-place operations (1.5-3x)
5. **Hardware Utilization**: Moving from CPU to GPU for large parallel operations (10-100x)

### When to Optimize vs When Not To

- ✅ **Optimize**: Proven bottlenecks, poor algorithmic complexity, large data, cache-unfriendly patterns
- ❌ **Don't Optimize**: Already using optimized libraries, small data, I/O bottlenecks, non-critical code

## Real-World Context

### How ML Frameworks Apply These Principles

- **PyTorch/TensorFlow**: Use optimized BLAS libraries (cuBLAS, MKL)
- **Memory Layouts**: Cache-friendly data arrangements (NCHW vs NHWC)
- **Vectorization**: Batch processing and SIMD instruction utilization
- **GPU Kernels**: Parallel operations for large tensor computations
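
To make the memory-layout point above concrete, here is a small NumPy illustration (tensor sizes are hypothetical) of why layout conversions are a real cost: a transpose only changes strides, so kernels that require contiguous input must pay for a physical copy.

```python
import numpy as np

# NCHW activations: batch, channels, height, width.
x_nchw = np.random.rand(32, 3, 224, 224)

# Reinterpreting as NHWC via transpose changes strides, not the data...
x_nhwc_view = x_nchw.transpose(0, 2, 3, 1)
print(x_nhwc_view.flags["C_CONTIGUOUS"])  # False: a strided view

# ...so producing genuinely contiguous NHWC data requires a copy.
x_nhwc = np.ascontiguousarray(x_nhwc_view)
print(x_nhwc.flags["C_CONTIGUOUS"])  # True: data physically rearranged
```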

### Where User Optimization Matters

- Custom operations not in standard libraries
- Data preprocessing and augmentation pipelines
- Memory management for large models
- Distributed training communication patterns

## Educational Approach

### Pedagogical Structure

1. **Measure First**: Establish performance baselines with scientific benchmarking
2. **Understand Why**: Implement naive versions to see why they're slow
3. **Optimize Systematically**: Build cache-friendly and vectorized improvements
4. **Automate Selection**: Create systems that choose optimal implementations
5. **Compete and Compare**: Framework for quantitative optimization challenges

### Key Learning Insights

- Memory access patterns dominate performance over pure computation
- Existing optimized libraries (NumPy, BLAS) are extremely well-engineered
- Hardware awareness (cache, vectorization) enables dramatic improvements
- Competition frameworks make optimization learning engaging and quantifiable

## Prerequisites

- **Module 2**: Tensor operations and NumPy fundamentals
- **Module 4**: Linear layers and matrix multiplication understanding
- **Algorithmic Complexity**: Basic understanding of big-O notation
- **Systems Thinking**: Interest in understanding how software meets hardware

## Time Commitment

**Estimated Time**: 3-4 hours

- Understanding concepts and the cache hierarchy: 30 minutes
- Implementing optimized kernels: 2 hours
- Building the backend system: 1 hour
- Competition framework and analysis: 30 minutes

## Assessment

Students demonstrate mastery through:

1. **Blocked Matrix Multiplication**: Implement a cache-friendly algorithm with measurable speedups
2. **Vectorized Operations**: Build optimized implementations that avoid Python loops
3. **Backend Architecture**: Create a transparent system for automatic optimization
4. **Performance Analysis**: Measure and explain optimization principles scientifically
5. **Systems Understanding**: Apply optimization thinking to real ML system challenges

## Connection to ML Systems

This module directly prepares students for understanding:

- How PyTorch and TensorFlow achieve performance internally
- Why GPU acceleration matters for large neural networks
- Where optimization efforts provide real value in production systems
- How to make informed decisions about performance vs. development-time trade-offs
Students learn to think like performance engineers: understand the hardware, measure scientifically, optimize systematically, and focus efforts where they matter most.