Files
TinyTorch/modules/source/12_kernels/README.md
Vijay Janapa Reddi 74f5c140f7 Update module numbering from 00-13 to 01-14 and refresh tagline
- Updated all module references to start from 01 instead of 00
- Changed tagline to 'Build your own ML framework. Start small. Go deep.'
- Added educational foundation section linking to ML Systems book
- Updated README, documentation, CLI examples, and prerequisites
- Regenerated book content with consistent numbering throughout
- Maintains 14 modules total but with natural numbering (01-14)
2025-07-15 21:11:07 -04:00

148 lines
5.6 KiB
Markdown

# 🚀 Module 11: Kernels - Hardware-Aware Optimization
## 📊 Module Info
- **Difficulty**: ⭐⭐⭐⭐⭐ Expert
- **Time Estimate**: 8-10 hours
- **Prerequisites**: All previous modules (01-11), especially Compression
- **Next Steps**: Benchmarking, MLOps modules
**Bridge the gap between algorithmic optimization and hardware-level performance engineering**
## 🎯 Learning Objectives
After completing this module, you will:
- Understand how to implement custom ML operations beyond NumPy
- Apply SIMD vectorization and CPU optimization techniques
- Optimize memory layout and access patterns for cache efficiency
- Implement GPU-style parallel computing concepts
- Build comprehensive performance profiling and benchmarking tools
- Create hardware-optimized operations for quantized and pruned models
## 🔗 Connection to Previous Modules
### What You Already Know
- **Compression (Module 10)**: *What* to optimize (model size, computation)
- **Layers (Module 03)**: Basic matrix multiplication with `matmul()`
- **Training (Module 09)**: High-level optimization workflows
- **Networks (Module 04)**: How operations compose into architectures
### The Performance Gap
Students understand **algorithmic optimization** but not **hardware optimization**:
-**Algorithmic**: Pruning, quantization, knowledge distillation
-**Hardware**: Memory layout, vectorization, parallel processing
## 🧠 Build → Use → Optimize
This module follows the **"Build → Use → Optimize"** pedagogical framework:
### 1. **Build**: Custom Operations
- Move beyond NumPy's black box implementations
- Implement specialized matrix multiplication and activations
- Understand the computational patterns underlying ML
### 2. **Use**: Performance Optimization
- Apply SIMD vectorization for CPU optimization
- Implement cache-friendly memory layouts
- Build GPU-style parallel computing concepts
### 3. **Optimize**: Real-World Integration
- Profile and benchmark performance improvements
- Integrate with compressed models from Module 10
- Apply systematic evaluation to validate optimizations
## 📚 What You'll Build
### Core Operations
- **Matrix Multiplication**: Custom `matmul_baseline()` and `cache_friendly_matmul()`
- **Activation Functions**: Vectorized `vectorized_relu()` and `parallel_relu()`
- **Batch Processing**: `parallel_batch_processing()` with multiprocessing
- **Quantized Operations**: `quantized_matmul()` and `quantized_relu()`
### Performance Tools
- **Profiling**: `profile_operation()` for detailed timing analysis
- **Benchmarking**: `benchmark_operation()` for statistical validation
- **Memory Analysis**: Cache-friendly data layout optimization
- **Parallel Computing**: Multi-core processing patterns
## 🛠️ Key Components
### Hardware-Optimized Operations
- **Purpose**: Implement custom ML operations with hardware awareness
- **Methods**: `matmul_baseline()`, `vectorized_relu()`, `cache_friendly_matmul()`
- **Learning**: Understanding computational patterns beyond NumPy
### Parallel Processing Framework
- **Purpose**: Multi-core optimization for batch operations
- **Methods**: `parallel_batch_processing()`, `parallel_relu()`
- **Learning**: GPU-style parallel computing concepts
### Quantization Integration
- **Purpose**: Hardware-optimized operations for compressed models
- **Methods**: `quantized_matmul()`, `quantized_relu()`
- **Learning**: Bridging compression and performance optimization
### Performance Profiling
- **Purpose**: Systematic measurement and validation of optimizations
- **Methods**: `profile_operation()`, `benchmark_operation()`
- **Learning**: Evidence-based performance engineering
## 🌟 Real-World Applications
### Industry Examples
- **Google TPUs**: Custom hardware for ML operations
- **Intel MKL**: Optimized math libraries for CPU performance
- **NVIDIA cuDNN**: GPU-accelerated neural network operations
- **Apple Neural Engine**: Hardware-specific ML acceleration
### Performance Patterns
- **Memory Layout**: Row-major vs column-major access patterns
- **Vectorization**: SIMD instructions for parallel computation
- **Cache Optimization**: Data locality for memory hierarchy
- **Parallel Processing**: Multi-core utilization strategies
## 🚀 Getting Started
### Prerequisites Check
```bash
tito test --module compression # Should pass
tito status --module 11_kernels # Check module status
```
### Development Workflow
```bash
cd modules/source/11_kernels
jupyter notebook kernels_dev.py # or edit directly
```
### Testing Your Implementation
```bash
# Test inline (within notebook)
# Run comprehensive tests
tito test --module kernels
```
## 📖 Module Structure
```
modules/source/11_kernels/
├── kernels_dev.py # Main development file
├── README.md # This overview
└── module.yaml # Module configuration
```
## 🔗 Integration Points
### Input from Previous Modules
- **Tensor operations** → Custom implementations
- **Compressed models** → Hardware-optimized execution
- **Training workflows** → Performance-critical operations
### Output to Next Modules
- **Benchmarking** → Operations to evaluate systematically
- **MLOps** → Production-ready optimized operations
- **Complete system** → End-to-end performance optimization
## 🎓 Educational Philosophy
This module bridges the gap between **algorithmic understanding** and **systems performance**. Students learn that optimization isn't just about better algorithms—it's about understanding how algorithms interact with hardware to achieve real-world performance gains.
By the end, you'll think like a **performance engineer**, not just a machine learning practitioner.