# 🚀 Module 11: Kernels - Hardware-Aware Optimization

## 📊 Module Info
- **Difficulty**: ⭐⭐⭐⭐⭐ Expert
- **Time Estimate**: 8-10 hours
- **Prerequisites**: All previous modules (01-10), especially Compression
- **Next Steps**: Benchmarking, MLOps modules

**Bridge the gap between algorithmic optimization and hardware-level performance engineering**

## 🎯 Learning Objectives

After completing this module, you will:
- Understand how to implement custom ML operations beyond NumPy
- Apply SIMD vectorization and CPU optimization techniques
- Optimize memory layout and access patterns for cache efficiency
- Implement GPU-style parallel computing concepts
- Build comprehensive performance profiling and benchmarking tools
- Create hardware-optimized operations for quantized and pruned models

## 🔗 Connection to Previous Modules

### What You Already Know
- **Compression (Module 10)**: *What* to optimize (model size, computation)
- **Layers (Module 03)**: Basic matrix multiplication with `matmul()`
- **Training (Module 09)**: High-level optimization workflows
- **Networks (Module 04)**: How operations compose into architectures

### The Performance Gap
Students understand **algorithmic optimization** but not **hardware optimization**:
- ✅ **Algorithmic**: Pruning, quantization, knowledge distillation
- ❌ **Hardware**: Memory layout, vectorization, parallel processing

## 🧠 Build → Use → Optimize

This module follows the **"Build → Use → Optimize"** pedagogical framework:

### 1. **Build**: Custom Operations
- Move beyond NumPy's black-box implementations
- Implement specialized matrix multiplication and activations
- Understand the computational patterns underlying ML

### 2. **Use**: Performance Optimization
- Apply SIMD vectorization for CPU optimization
- Implement cache-friendly memory layouts
- Apply GPU-style parallel computing concepts

### 3. **Optimize**: Real-World Integration
- Profile and benchmark performance improvements
- Integrate with compressed models from Module 10
- Apply systematic evaluation to validate optimizations

## 📚 What You'll Build

### Core Operations
- **Matrix Multiplication**: Custom `matmul_baseline()` and `cache_friendly_matmul()`
- **Activation Functions**: Vectorized `vectorized_relu()` and `parallel_relu()`
- **Batch Processing**: `parallel_batch_processing()` with multiprocessing
- **Quantized Operations**: `quantized_matmul()` and `quantized_relu()`

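
To make the matrix multiplication pair concrete, here is a minimal sketch of what a baseline and a blocked (tiled) version might look like. The function names come from the list above, but the signatures, the `block` parameter, and the bodies are assumptions for illustration, not the module's actual code.

```python
import numpy as np

def matmul_baseline(a, b):
    """Naive triple-loop multiply; a reference point for later optimizations."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            total = 0.0
            for p in range(k):
                total += a[i, p] * b[p, j]
            out[i, j] = total
    return out

def cache_friendly_matmul(a, b, block=32):
    """Blocked multiply: accumulate tile products so each tile can stay in cache.

    `block` is a hypothetical tuning knob; a good value depends on cache size.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.float64)
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for p0 in range(0, k, block):
                # Slices past the array edge are clipped by NumPy automatically.
                out[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, p0:p0 + block] @ b[p0:p0 + block, j0:j0 + block]
                )
    return out
```

In pure Python the blocked version mostly benefits from delegating each tile to NumPy; the cache effect itself is easier to observe in a compiled language, so treat this as a structural illustration of tiling rather than a speed claim.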
### Performance Tools
- **Profiling**: `profile_operation()` for detailed timing analysis
- **Benchmarking**: `benchmark_operation()` for statistical validation
- **Memory Analysis**: Cache-friendly data layout optimization
- **Parallel Computing**: Multi-core processing patterns

## 🛠️ Key Components

### Hardware-Optimized Operations
- **Purpose**: Implement custom ML operations with hardware awareness
- **Methods**: `matmul_baseline()`, `vectorized_relu()`, `cache_friendly_matmul()`
- **Learning**: Understanding computational patterns beyond NumPy

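
As a rough illustration of what "vectorized" means for an activation function, the sketch below contrasts an element-by-element Python loop with a single whole-array call. `vectorized_relu()` is named in the list above; `relu_loop()` is a hypothetical helper added only for comparison, and both bodies are assumptions.

```python
import numpy as np

def relu_loop(x):
    """Element-by-element ReLU in Python; slow, but shows the scalar pattern."""
    values = [v if v > 0 else 0.0 for v in x.ravel()]
    return np.array(values, dtype=x.dtype).reshape(x.shape)

def vectorized_relu(x):
    """Whole-array ReLU: one call that NumPy runs in compiled code,
    which can use SIMD instructions where the build and CPU allow it."""
    return np.maximum(x, 0)
```

Both functions return the same values; the difference is that the second hands the entire array to a compiled loop instead of paying interpreter overhead per element.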
### Parallel Processing Framework
- **Purpose**: Multi-core optimization for batch operations
- **Methods**: `parallel_batch_processing()`, `parallel_relu()`
- **Learning**: GPU-style parallel computing concepts

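
One way to picture the parallel pattern is to split the work into chunks, apply the same operation to each chunk concurrently, and stitch the results back together. The sketch below uses a thread pool from `concurrent.futures` rather than the `multiprocessing` module, purely so that it runs unchanged in notebooks and scripts; that substitution, along with the signatures and the `max_workers` and `num_chunks` parameters, is an assumption and not the module's actual design.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def parallel_batch_processing(batches, operation, max_workers=4):
    """Apply `operation` to every batch concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(operation, batches))

def parallel_relu(x, num_chunks=4):
    """Split along axis 0, apply ReLU per chunk, then concatenate the pieces."""
    chunks = np.array_split(x, num_chunks)
    results = parallel_batch_processing(
        chunks, lambda c: np.maximum(c, 0), max_workers=num_chunks
    )
    return np.concatenate(results)
```

For an operation as cheap as ReLU, the scheduling overhead can outweigh any gain, so it is worth measuring before assuming the parallel version is faster.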
### Quantization Integration
- **Purpose**: Hardware-optimized operations for compressed models
- **Methods**: `quantized_matmul()`, `quantized_relu()`
- **Learning**: Bridging compression and performance optimization

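
The sketch below shows one common way quantized operations can be arranged: symmetric per-tensor int8 quantization with a zero point of 0, integer accumulation in int32, and a single floating-point rescale at the end. The `quantize()` helper is hypothetical, and the signatures of `quantized_matmul()` and `quantized_relu()` are assumptions; the module may use a different scheme.

```python
import numpy as np

def quantize(x):
    """Hypothetical helper: symmetric int8 quantization, returns (values, scale)."""
    qmax = 127
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def quantized_matmul(qa, scale_a, qb, scale_b):
    """Multiply int8 matrices with int32 accumulation, then rescale to float."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # widen to avoid int8 overflow
    return acc.astype(np.float32) * (scale_a * scale_b)

def quantized_relu(q):
    """With a zero point of 0, ReLU on quantized values is a plain integer max."""
    return np.maximum(q, 0)
```

A typical call would quantize both operands first, for example `qa, sa = quantize(a)` and `qb, sb = quantize(b)`, and then compare `quantized_matmul(qa, sa, qb, sb)` against `a @ b` to see how much accuracy the int8 path gives up.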
### Performance Profiling
- **Purpose**: Systematic measurement and validation of optimizations
- **Methods**: `profile_operation()`, `benchmark_operation()`
- **Learning**: Evidence-based performance engineering

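
A minimal sketch of the two profiling helpers might look like the following: one function times a single call, and the other repeats the measurement after a few warm-up runs and summarizes the timings. The signatures, the `warmup` and `repeats` parameters, and the shape of the returned dictionary are assumptions made for illustration.

```python
import statistics
import time

def profile_operation(operation, *args, **kwargs):
    """Time a single call; returns (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = operation(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def benchmark_operation(operation, *args, warmup=2, repeats=10, **kwargs):
    """Run warm-up calls, then collect `repeats` timings and summarize them."""
    for _ in range(warmup):
        operation(*args, **kwargs)  # warm caches before measuring
    timings = [profile_operation(operation, *args, **kwargs)[1] for _ in range(repeats)]
    return {
        "runs": repeats,
        "min": min(timings),
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        # stdev needs at least two samples
        "stdev": statistics.stdev(timings) if repeats > 1 else 0.0,
    }
```

Reporting the median and spread alongside the mean makes it easier to tell a real speedup from run-to-run noise.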
## 🌟 Real-World Applications

### Industry Examples
- **Google TPUs**: Custom hardware for ML operations
- **Intel MKL**: Optimized math libraries for CPU performance
- **NVIDIA cuDNN**: GPU-accelerated neural network operations
- **Apple Neural Engine**: Hardware-specific ML acceleration

### Performance Patterns
- **Memory Layout**: Row-major vs column-major access patterns
- **Vectorization**: SIMD instructions for parallel computation
- **Cache Optimization**: Data locality for memory hierarchy
- **Parallel Processing**: Multi-core utilization strategies

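
The row-major versus column-major distinction can be inspected directly through NumPy strides, which give the number of bytes to step along each axis. This small example is an illustration added here, not part of the module; the variable names are arbitrary.

```python
import numpy as np

# Row-major (C order): elements of a row sit next to each other in memory.
c_arr = np.arange(12, dtype=np.float64).reshape(4, 3)
# Column-major (Fortran order): elements of a column sit next to each other.
f_arr = np.asfortranarray(c_arr)

c_strides = c_arr.strides  # (24, 8): next row is 24 bytes away, next column 8
f_strides = f_arr.strides  # (8, 32): next row is 8 bytes away, next column 32

# A row of the C-ordered array is one contiguous run; a column is not.
row_is_contiguous = c_arr[0].flags["C_CONTIGUOUS"]        # True
column_is_contiguous = c_arr[:, 0].flags["C_CONTIGUOUS"]  # False
```

Looping over the axis with the smallest stride touches memory sequentially, which is the access pattern caches reward.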
## 🚀 Getting Started

### Prerequisites Check
```bash
tito test --module compression  # Should pass
tito status --module 11_kernels  # Check module status
```

### Development Workflow
```bash
cd modules/source/11_kernels
jupyter notebook kernels_dev.py  # or edit directly
```

### Testing Your Implementation
```bash
# Test inline (within notebook)
# Run comprehensive tests
tito test --module kernels
```

## 📖 Module Structure
```
modules/source/11_kernels/
├── kernels_dev.py          # Main development file
├── README.md               # This overview
└── module.yaml             # Module configuration
```

## 🔗 Integration Points

### Input from Previous Modules
- **Tensor operations** → Custom implementations
- **Compressed models** → Hardware-optimized execution
- **Training workflows** → Performance-critical operations

### Output to Next Modules
- **Benchmarking** → Operations to evaluate systematically
- **MLOps** → Production-ready optimized operations
- **Complete system** → End-to-end performance optimization

## 🎓 Educational Philosophy

This module bridges the gap between **algorithmic understanding** and **systems performance**. Students learn that optimization isn't just about better algorithms; it's about understanding how algorithms interact with hardware to achieve real-world performance gains.

By the end, you'll think like a **performance engineer**, not just a machine learning practitioner.