Files
TinyTorch/modules/source/11_compression/README.md
Vijay Janapa Reddi 74f5c140f7 Update module numbering from 00-13 to 01-14 and refresh tagline
- Updated all module references to start from 01 instead of 00
- Changed tagline to 'Build your own ML framework. Start small. Go deep.'
- Added educational foundation section linking to ML Systems book
- Updated README, documentation, CLI examples, and prerequisites
- Regenerated book content with consistent numbering throughout
- Maintains 14 modules total but with natural numbering (01-14)
2025-07-15 21:11:07 -04:00

178 lines
7.2 KiB
Markdown

# Module 10: Compression & Optimization
## Overview
This module teaches students to make neural networks smaller, faster, and more efficient for real-world deployment. Students implement four core compression techniques and learn to balance accuracy with efficiency.
## Learning Goals
- Understand model size and deployment constraints in real systems
- Implement magnitude-based pruning to remove unimportant weights
- Master quantization for 75% memory reduction (FP32 → INT8)
- Build knowledge distillation for training compact models
- Create structured pruning to optimize network architectures
- Compare compression techniques and their trade-offs
## Educational Flow
### Step 1: Understanding Model Size
- **Concept**: Parameter counting and memory footprint analysis
- **Implementation**: `CompressionMetrics` class for model analysis
- **Learning**: Foundation for compression decision-making
### Step 2: Magnitude-Based Pruning
- **Concept**: Remove weights with smallest absolute values
- **Implementation**: `prune_weights_by_magnitude()` and sparsity calculation
- **Learning**: Sparsity patterns and accuracy vs compression trade-offs
### Step 3: Quantization Experiments
- **Concept**: Reduce precision from FP32 to INT8 for memory efficiency
- **Implementation**: `quantize_layer_weights()` with scale/offset mapping
- **Learning**: Numerical precision impact on model performance
### Step 4: Knowledge Distillation
- **Concept**: Large models teach small models through soft targets
- **Implementation**: `DistillationLoss` with temperature scaling
- **Learning**: Advanced training techniques for compact models
### Step 5: Structured Pruning
- **Concept**: Remove entire neurons/channels, not just weights
- **Implementation**: `prune_layer_neurons()` with importance scoring
- **Learning**: Architecture optimization and cascade effects
### Step 6: Comprehensive Comparison
- **Concept**: Combine techniques for maximum efficiency
- **Implementation**: Integrated compression pipeline
- **Learning**: Systems thinking for production deployment
## Key Components
### CompressionMetrics
- **Purpose**: Analyze model size and parameter distribution
- **Methods**: `count_parameters()`, `calculate_model_size()`, `analyze_weight_distribution()`
- **Usage**: Foundation for compression target selection
### Pruning Functions
- **Purpose**: Remove unimportant weights and neurons
- **Methods**: `prune_weights_by_magnitude()`, `prune_model_by_magnitude()`, `calculate_sparsity()`
- **Usage**: Reduce model size while maintaining performance
### Quantization Functions
- **Purpose**: Reduce memory usage through lower precision
- **Methods**: `quantize_layer_weights()`, `dequantize_layer_weights()`
- **Usage**: 75% memory reduction for mobile deployment
### Knowledge Distillation
- **Purpose**: Train compact models with teacher guidance
- **Methods**: `DistillationLoss`, `train_with_distillation()`
- **Usage**: Achieve better small model performance
### Structured Pruning
- **Purpose**: Remove entire neurons for actual speedup
- **Methods**: `prune_layer_neurons()`, `compute_neuron_importance()`
- **Usage**: Architecture optimization and hardware efficiency
## Real-World Applications
### Mobile AI Deployment
- **Constraint**: Models must be < 10MB for smartphone apps
- **Solution**: Combine pruning and quantization for 90% size reduction
- **Examples**: Google Translate offline, mobile camera AI
### Edge Computing
- **Constraint**: Severe memory and compute limitations
- **Solution**: Structured pruning for actual inference speedup
- **Examples**: IoT sensors, smart cameras, voice assistants
### Cost Optimization
- **Constraint**: Expensive cloud inference at scale
- **Solution**: Reduce model size for lower compute costs
- **Examples**: Production recommendation systems, search engines
### Battery Efficiency
- **Constraint**: Wearable devices need long battery life
- **Solution**: Quantization and pruning for energy savings
- **Examples**: Smartwatches, fitness trackers, AR glasses
## Industry Connections
### MobileNet Architecture
- **Concept**: Depthwise separable convolutions for efficiency
- **Connection**: Structured optimization for mobile deployment
- **Learning**: Architecture design affects compression potential
### DistilBERT
- **Concept**: 60% smaller than BERT with 97% performance
- **Connection**: Knowledge distillation for language models
- **Learning**: Teacher-student training for different domains
### TinyML Movement
- **Concept**: ML on microcontrollers (< 1MB models)
- **Connection**: Extreme compression for embedded systems
- **Learning**: Efficiency requirements for edge deployment
### Neural Architecture Search
- **Concept**: Automated model design for efficiency
- **Connection**: Structured pruning as architecture optimization
- **Learning**: Automated techniques for compression
## Assessment Criteria
### Technical Implementation (40%)
- Correctly implement 4 compression techniques
- Handle edge cases and error conditions
- Provide comprehensive statistics and analysis
### Understanding Trade-offs (30%)
- Explain accuracy vs efficiency spectrum
- Identify appropriate techniques for different constraints
- Analyze compression effectiveness quantitatively
### Real-World Application (30%)
- Connect compression to deployment scenarios
- Understand hardware and system constraints
- Design compression strategies for specific use cases
## Next Steps
### Module 11: Kernels
- **Connection**: Hardware-aware optimization builds on compression
- **Skills**: GPU kernels, SIMD operations, memory optimization
- **Application**: Implement efficient compressed model inference
### Module 12: Benchmarking
- **Connection**: Measure compression effectiveness systematically
- **Skills**: Performance profiling, accuracy measurement, A/B testing
- **Application**: Evaluate compression trade-offs in production
### Module 13: MLOps
- **Connection**: Deploy compressed models in production systems
- **Skills**: Model versioning, monitoring, continuous optimization
- **Application**: Production-ready compressed model deployment
## File Structure
```
10_compression/
├── compression_dev.py # Main development notebook
├── module.yaml # Module configuration
├── README.md # This file
└── tests/ # Additional test files (if needed)
```
## Getting Started
1. **Review Dependencies**: Ensure modules 01, 02, 04, 05, 10 are complete
2. **Open Development File**: `compression_dev.py`
3. **Follow Educational Flow**: Work through Steps 1-6 sequentially
4. **Test Thoroughly**: Run all inline tests as you progress
5. **Export to Package**: Use `tito export 10_compression` when complete
## Key Takeaways
Students completing this module will:
- **Understand** the efficiency requirements of production AI systems
- **Implement** four essential compression techniques from scratch
- **Analyze** accuracy vs efficiency trade-offs quantitatively
- **Apply** compression strategies to real neural networks
- **Connect** compression to mobile, edge, and production deployment
- **Prepare** for advanced optimization and production deployment modules
This module bridges the gap between research-quality models and production-ready AI systems, teaching the essential skills for deploying AI in resource-constrained environments.