🚀 Module 11: Kernels - Hardware-Aware Optimization
📊 Module Info
- Difficulty: ⭐⭐⭐⭐⭐ Expert
- Time Estimate: 8-10 hours
- Prerequisites: All previous modules (00-10), especially Compression
- Next Steps: Benchmarking, MLOps modules
Bridge the gap between algorithmic optimization and hardware-level performance engineering
🎯 Learning Objectives
After completing this module, you will:
- Understand how to implement custom ML operations beyond NumPy
- Apply SIMD vectorization and CPU optimization techniques
- Optimize memory layout and access patterns for cache efficiency
- Implement GPU-style parallel computing concepts
- Build comprehensive performance profiling and benchmarking tools
- Create hardware-optimized operations for quantized and pruned models
🔗 Connection to Previous Modules
What You Already Know
- Compression (Module 10): What to optimize (model size, computation)
- Layers (Module 03): Basic matrix multiplication with `matmul()`
- Training (Module 09): High-level optimization workflows
- Networks (Module 04): How operations compose into architectures
The Performance Gap
At this point you understand algorithmic optimization, but not yet hardware optimization:
- ✅ Algorithmic: Pruning, quantization, knowledge distillation
- ❌ Hardware: Memory layout, vectorization, parallel processing
🧠 Build → Use → Optimize
This module follows the "Build → Use → Optimize" pedagogical framework:
1. Build: Custom Operations
- Move beyond NumPy's black-box implementations
- Implement specialized matrix multiplication and activations
- Understand the computational patterns underlying ML
2. Use: Performance Optimization
- Apply SIMD vectorization for CPU optimization
- Implement cache-friendly memory layouts
- Build GPU-style parallel computing concepts
3. Optimize: Real-World Integration
- Profile and benchmark performance improvements
- Integrate with compressed models from Module 10
- Bridge to production deployment considerations
📚 What You'll Build
Step 1: Understanding Custom Operations
```python
# Build on TinyTorch's proven implementations
def matmul_baseline(A, B):
    # Use TinyTorch's reliable matmul as the correctness baseline
    return matmul(A.data, B.data)
```
Step 2: SIMD Vectorization
```python
import numpy as np

# CPU optimization with vector operations
def vectorized_relu(x):
    # SIMD-optimized activation via NumPy's vectorized maximum
    return np.maximum(0, x)

def vectorized_operations(x, y):
    # Element-wise operations that map onto SIMD instructions
    return {
        'multiply': x * y,
        'add': x + y,
        'squared_diff': (x - y) ** 2,
    }
```
Step 3: Memory Layout Optimization
```python
# Cache-friendly data structures
def cache_friendly_matmul(A, B, block_size=32):
    # Blocked matrix multiplication for better cache utilization
    ...  # tile-by-tile accumulation, sketched below
```
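A minimal runnable sketch of the blocked approach (one standard way to implement it; the version you build in `kernels_dev.py` may differ):

```python
import numpy as np

def cache_friendly_matmul(A, B, block_size=32):
    # Accumulate C tile by tile; each tile of A, B, and C is small
    # enough to stay resident in cache while it is being reused.
    n, k, m = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, m))
    for i in range(0, n, block_size):
        for p in range(0, k, block_size):
            for j in range(0, m, block_size):
                C[i:i+block_size, j:j+block_size] += (
                    A[i:i+block_size, p:p+block_size]
                    @ B[p:p+block_size, j:j+block_size]
                )
    return C
```

NumPy slicing clips at array bounds, so shapes that don't divide evenly by `block_size` need no special cases; `block_size` is a tunable you pick to roughly match your L1/L2 cache.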
Step 4: GPU-Style Parallel Computing
```python
from concurrent.futures import ThreadPoolExecutor

# Parallel processing patterns
def parallel_relu(x, num_workers=4):
    # Multi-core CPU utilization: split into chunks, apply ReLU per chunk
    chunks = np.array_split(x, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(lambda chunk: np.maximum(0, chunk), chunks))
    return np.concatenate(results)

def parallel_batch_processing(batch_data, operation, num_workers=4):
    # Process multiple tensors simultaneously
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(operation, batch_data))
```
Step 5: Performance Profiling
```python
# Measure and optimize performance
profiler = SimpleProfiler()
result, metrics = profiler.profile(kernel_function, *args)
print(f"Wall time: {metrics['wall_time']:.4f}s")
```
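As a mental model, a minimal `SimpleProfiler` might look like the following sketch (an assumption for illustration; the class you build in `kernels_dev.py` may track more metrics):

```python
import time
import tracemalloc

class SimpleProfiler:
    # Minimal profiler sketch: wall time and peak memory for one call.
    def profile(self, fn, *args, **kwargs):
        tracemalloc.start()
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        wall_time = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return result, {'wall_time': wall_time, 'peak_memory': peak}
```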
Step 6: Compressed Model Kernels
```python
# Hardware-optimized operations for compressed models
def quantized_matmul(A, B, scale_A=1.0, scale_B=1.0):
    # INT8 matrix multiplication: accumulate in int32, then rescale to float
    acc = A.astype(np.int32) @ B.astype(np.int32)
    return acc * (scale_A * scale_B)

def quantized_relu(x, scale=1.0):
    # Integer-domain ReLU: clamping at zero commutes with dequantization
    return np.maximum(0, x)
```
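A quick end-to-end sanity check (illustrative; the `quantize` helper below is hypothetical, not part of the module's API):

```python
import numpy as np

def quantize(x, scale):
    # Hypothetical helper: symmetric int8 quantization
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

A, B = np.random.randn(16, 16), np.random.randn(16, 16)
scale_A = np.abs(A).max() / 127
scale_B = np.abs(B).max() / 127

approx = quantized_matmul(quantize(A, scale_A), quantize(B, scale_B),
                          scale_A=scale_A, scale_B=scale_B)
print(np.abs(approx - A @ B).max())  # small quantization error
```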
🎓 Learning Path
Foundation Level: Understanding Implementation
- See what happens inside NumPy operations
- Build on TinyTorch's proven components
- Debug performance bottlenecks
Intermediate Level: CPU Optimization
- Apply vectorization techniques
- Optimize memory access patterns
- Understand cache behavior
Advanced Level: Parallel Computing
- Implement GPU-style parallel algorithms
- Profile and benchmark performance
- Integrate with compressed models
Expert Level: Production Integration
- Build kernels for real deployment scenarios
- Optimize for specific hardware targets
- Connect to MLOps and monitoring systems
🔧 Technical Skills Developed
Low-Level Programming
- Manual memory management
- Understanding of CPU architecture
- Assembly-level optimization concepts
Performance Engineering
- Profiling and benchmarking with SimpleProfiler
- Bottleneck identification
- Performance optimization strategies
Parallel Computing
- Thread-level parallelism with ThreadPoolExecutor
- SIMD vectorization principles
- GPU computing concepts
Systems Integration
- Hardware-software co-design
- Production deployment considerations
- Real-world performance constraints
🎯 Real-World Applications
Production ML Systems
- Custom kernels for edge deployment
- Hardware-specific optimizations
- Real-time inference requirements
Research and Development
- Prototype new operations
- Benchmark algorithm improvements
- Understand performance trade-offs
MLOps and Deployment
- Optimize for specific hardware
- Monitor performance in production
- Scale to distributed systems
🚀 Getting Started
Prerequisites Check
- ✅ Complete all previous modules (00-10)
- ✅ Understand compression techniques
- ✅ Familiar with NumPy operations
- ✅ Basic understanding of computer architecture
Development Setup
```bash
# Navigate to the kernels module
cd modules/source/11_kernels

# Work in the development file
code kernels_dev.py

# Or work in the Jupyter notebook
jupyter notebook kernels_dev.ipynb
```
📖 Module Structure
```
modules/source/11_kernels/
├── kernels_dev.py      # Main development file (work here!)
├── kernels_dev.ipynb   # Jupyter notebook version
├── README.md           # This file
└── module.yaml         # Module metadata
```
🧪 Testing Your Implementation
Inline Testing
```python
# All tests are inline within kernels_dev.py
def test_matmul_baseline():
    # Test baseline matrix multiplication
    pass

def test_vectorized_operations():
    # Test SIMD vectorization
    pass

def test_cache_friendly_matmul():
    # Test cache optimization
    pass

def test_parallel_processing():
    # Test parallel computing
    pass

def test_performance_profiling():
    # Test profiling tools
    pass

def test_compressed_kernels():
    # Test quantized operations
    pass

def final_performance_test():
    # Comprehensive performance comparison
    pass
```
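As a concrete example, one of these tests might assert exact agreement with the sequential reference (a sketch; the actual inline tests live in `kernels_dev.py`):

```python
import numpy as np

def test_parallel_processing():
    # Parallel ReLU must agree exactly with the single-threaded version
    x = np.random.randn(10_000)
    expected = np.maximum(0, x)
    np.testing.assert_array_equal(parallel_relu(x, num_workers=4), expected)
```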
Performance Benchmarking
```python
# Run comprehensive performance tests
final_performance_test()
```
🎯 Success Criteria
You've mastered hardware-aware optimization when:
- ✅ Can implement custom ML operations building on TinyTorch components
- ✅ Understand CPU optimization techniques (SIMD, caching)
- ✅ Can profile and benchmark performance improvements
- ✅ Successfully integrate with compressed models
- ✅ Bridge algorithmic and hardware optimization
🔍 Common Challenges
Performance Debugging
- Use SimpleProfiler to identify bottlenecks
- Understand the difference between algorithmic and implementation efficiency
- Learn to read performance metrics
Hardware Complexity
- Start with CPU optimization before GPU concepts
- Focus on understanding principles, not memorizing details
- Use abstractions to manage complexity
Integration Complexity
- Test kernels independently before integration
- Verify correctness before optimizing for performance (see the sketch below)
- Maintain compatibility with existing TinyTorch components
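That verify-then-profile habit can be encoded directly, reusing names from the steps above (illustrative):

```python
import numpy as np

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)

# 1. Correctness first: the blocked kernel must match the reference result
np.testing.assert_allclose(cache_friendly_matmul(A, B), A @ B, rtol=1e-10)

# 2. Only then measure performance
profiler = SimpleProfiler()
_, metrics = profiler.profile(cache_friendly_matmul, A, B)
print(f"Wall time: {metrics['wall_time']:.4f}s")
```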
🚀 What's Next
After completing this module, you're ready for:
- Module 12: Benchmarking - Systematic performance measurement
- Module 13: MLOps - Production deployment and monitoring
📊 Performance Insights
Performance Hierarchy
```
Python loops:       1x speed     (baseline, interpreted)
NumPy operations:   ~10x speed   (vectorized)
Optimized kernels:  ~100x speed  (hardware-aware)
GPU kernels:        ~1000x speed (massive parallelism)
```
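You can verify the first two rungs yourself (a sketch; exact ratios depend on your machine):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

start = time.perf_counter()
total = 0.0
for v in x:            # interpreted loop: per-element bytecode dispatch
    total += v
t_loop = time.perf_counter() - start

start = time.perf_counter()
total = x.sum()        # one vectorized C call over contiguous memory
t_numpy = time.perf_counter() - start

print(f"loop: {t_loop:.3f}s  numpy: {t_numpy:.4f}s  speedup: {t_loop / t_numpy:.0f}x")
```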
Memory Hierarchy
```
CPU registers:  ~1 cycle     (fastest, tiny)
L1 cache:       ~3 cycles    (fast, small)
L2 cache:       ~10 cycles   (medium, medium)
L3 cache:       ~40 cycles   (slow, large)
Main memory:    ~200+ cycles (slowest, huge)
```
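The hierarchy is visible even from Python: gathering the same elements in random order defeats the prefetcher and turns most reads into cache misses (a sketch; the gap depends on your cache sizes):

```python
import time
import numpy as np

n = 10_000_000
data = np.random.rand(n)        # ~80 MB, far larger than L3
seq_idx = np.arange(n)
rand_idx = np.random.permutation(n)

start = time.perf_counter()
_ = data[seq_idx].sum()         # sequential gather: streams through memory
t_seq = time.perf_counter() - start

start = time.perf_counter()
_ = data[rand_idx].sum()        # random gather: a cache miss per element
t_rand = time.perf_counter() - start

print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")
```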
Real-World Impact
Hardware-aware optimization can yield order-of-magnitude wins; illustrative examples:
- Training time: 10 hours → 1 hour
- Inference cost: $1000/month → $100/month
- Energy use: up to 90% reduction
🏆 What Students Build
By the end of this module, students have implemented:
- `matmul_baseline()` - Reliable matrix multiplication built on TinyTorch
- `vectorized_relu()` - SIMD-optimized ReLU activation
- `vectorized_operations()` - Vectorized element-wise operations
- `cache_friendly_matmul()` - Blocked matrix multiplication
- `parallel_relu()` - Multi-core CPU utilization
- `parallel_batch_processing()` - Batch processing with worker threads
- `quantized_matmul()` - INT8 matrix multiplication
- `quantized_relu()` - Integer-domain ReLU
- Performance profiling - Benchmarking with `SimpleProfiler`
- Final performance test - Comprehensive comparison of all implementations
Students understand how modern ML frameworks like PyTorch (2000+ CUDA kernels) and TensorFlow (XLA compiler) achieve their performance through hardware-aware optimization.