mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-04 17:31:08 -05:00
- Updated all module references to start from 01 instead of 00 - Changed tagline to 'Build your own ML framework. Start small. Go deep.' - Added educational foundation section linking to ML Systems book - Updated README, documentation, CLI examples, and prerequisites - Regenerated book content with consistent numbering throughout - Maintains 14 modules total but with natural numbering (01-14)
178 lines
7.2 KiB
Markdown
178 lines
7.2 KiB
Markdown
# Module 10: Compression & Optimization
|
|
|
|
## Overview
|
|
This module teaches students to make neural networks smaller, faster, and more efficient for real-world deployment. Students implement four core compression techniques and learn to balance accuracy with efficiency.
|
|
|
|
## Learning Goals
|
|
- Understand model size and deployment constraints in real systems
|
|
- Implement magnitude-based pruning to remove unimportant weights
|
|
- Master quantization for 75% memory reduction (FP32 → INT8)
|
|
- Build knowledge distillation for training compact models
|
|
- Create structured pruning to optimize network architectures
|
|
- Compare compression techniques and their trade-offs
|
|
|
|
## Educational Flow
|
|
|
|
### Step 1: Understanding Model Size
|
|
- **Concept**: Parameter counting and memory footprint analysis
|
|
- **Implementation**: `CompressionMetrics` class for model analysis
|
|
- **Learning**: Foundation for compression decision-making
|
|
|
|
### Step 2: Magnitude-Based Pruning
|
|
- **Concept**: Remove weights with smallest absolute values
|
|
- **Implementation**: `prune_weights_by_magnitude()` and sparsity calculation
|
|
- **Learning**: Sparsity patterns and accuracy vs compression trade-offs
|
|
|
|
### Step 3: Quantization Experiments
|
|
- **Concept**: Reduce precision from FP32 to INT8 for memory efficiency
|
|
- **Implementation**: `quantize_layer_weights()` with scale/offset mapping
|
|
- **Learning**: Numerical precision impact on model performance
|
|
|
|
### Step 4: Knowledge Distillation
|
|
- **Concept**: Large models teach small models through soft targets
|
|
- **Implementation**: `DistillationLoss` with temperature scaling
|
|
- **Learning**: Advanced training techniques for compact models
|
|
|
|
### Step 5: Structured Pruning
|
|
- **Concept**: Remove entire neurons/channels, not just weights
|
|
- **Implementation**: `prune_layer_neurons()` with importance scoring
|
|
- **Learning**: Architecture optimization and cascade effects
|
|
|
|
### Step 6: Comprehensive Comparison
|
|
- **Concept**: Combine techniques for maximum efficiency
|
|
- **Implementation**: Integrated compression pipeline
|
|
- **Learning**: Systems thinking for production deployment
|
|
|
|
## Key Components
|
|
|
|
### CompressionMetrics
|
|
- **Purpose**: Analyze model size and parameter distribution
|
|
- **Methods**: `count_parameters()`, `calculate_model_size()`, `analyze_weight_distribution()`
|
|
- **Usage**: Foundation for compression target selection
|
|
|
|
### Pruning Functions
|
|
- **Purpose**: Remove unimportant weights and neurons
|
|
- **Methods**: `prune_weights_by_magnitude()`, `prune_model_by_magnitude()`, `calculate_sparsity()`
|
|
- **Usage**: Reduce model size while maintaining performance
|
|
|
|
### Quantization Functions
|
|
- **Purpose**: Reduce memory usage through lower precision
|
|
- **Methods**: `quantize_layer_weights()`, `dequantize_layer_weights()`
|
|
- **Usage**: 75% memory reduction for mobile deployment
|
|
|
|
### Knowledge Distillation
|
|
- **Purpose**: Train compact models with teacher guidance
|
|
- **Methods**: `DistillationLoss`, `train_with_distillation()`
|
|
- **Usage**: Achieve better small model performance
|
|
|
|
### Structured Pruning
|
|
- **Purpose**: Remove entire neurons for actual speedup
|
|
- **Methods**: `prune_layer_neurons()`, `compute_neuron_importance()`
|
|
- **Usage**: Architecture optimization and hardware efficiency
|
|
|
|
## Real-World Applications
|
|
|
|
### Mobile AI Deployment
|
|
- **Constraint**: Models must be < 10MB for smartphone apps
|
|
- **Solution**: Combine pruning and quantization for 90% size reduction
|
|
- **Examples**: Google Translate offline, mobile camera AI
|
|
|
|
### Edge Computing
|
|
- **Constraint**: Severe memory and compute limitations
|
|
- **Solution**: Structured pruning for actual inference speedup
|
|
- **Examples**: IoT sensors, smart cameras, voice assistants
|
|
|
|
### Cost Optimization
|
|
- **Constraint**: Expensive cloud inference at scale
|
|
- **Solution**: Reduce model size for lower compute costs
|
|
- **Examples**: Production recommendation systems, search engines
|
|
|
|
### Battery Efficiency
|
|
- **Constraint**: Wearable devices need long battery life
|
|
- **Solution**: Quantization and pruning for energy savings
|
|
- **Examples**: Smartwatches, fitness trackers, AR glasses
|
|
|
|
## Industry Connections
|
|
|
|
### MobileNet Architecture
|
|
- **Concept**: Depthwise separable convolutions for efficiency
|
|
- **Connection**: Structured optimization for mobile deployment
|
|
- **Learning**: Architecture design affects compression potential
|
|
|
|
### DistilBERT
|
|
- **Concept**: 60% smaller than BERT with 97% performance
|
|
- **Connection**: Knowledge distillation for language models
|
|
- **Learning**: Teacher-student training for different domains
|
|
|
|
### TinyML Movement
|
|
- **Concept**: ML on microcontrollers (< 1MB models)
|
|
- **Connection**: Extreme compression for embedded systems
|
|
- **Learning**: Efficiency requirements for edge deployment
|
|
|
|
### Neural Architecture Search
|
|
- **Concept**: Automated model design for efficiency
|
|
- **Connection**: Structured pruning as architecture optimization
|
|
- **Learning**: Automated techniques for compression
|
|
|
|
## Assessment Criteria
|
|
|
|
### Technical Implementation (40%)
|
|
- Correctly implement 4 compression techniques
|
|
- Handle edge cases and error conditions
|
|
- Provide comprehensive statistics and analysis
|
|
|
|
### Understanding Trade-offs (30%)
|
|
- Explain accuracy vs efficiency spectrum
|
|
- Identify appropriate techniques for different constraints
|
|
- Analyze compression effectiveness quantitatively
|
|
|
|
### Real-World Application (30%)
|
|
- Connect compression to deployment scenarios
|
|
- Understand hardware and system constraints
|
|
- Design compression strategies for specific use cases
|
|
|
|
## Next Steps
|
|
|
|
### Module 11: Kernels
|
|
- **Connection**: Hardware-aware optimization builds on compression
|
|
- **Skills**: GPU kernels, SIMD operations, memory optimization
|
|
- **Application**: Implement efficient compressed model inference
|
|
|
|
### Module 12: Benchmarking
|
|
- **Connection**: Measure compression effectiveness systematically
|
|
- **Skills**: Performance profiling, accuracy measurement, A/B testing
|
|
- **Application**: Evaluate compression trade-offs in production
|
|
|
|
### Module 13: MLOps
|
|
- **Connection**: Deploy compressed models in production systems
|
|
- **Skills**: Model versioning, monitoring, continuous optimization
|
|
- **Application**: Production-ready compressed model deployment
|
|
|
|
## File Structure
|
|
```
|
|
10_compression/
|
|
├── compression_dev.py # Main development notebook
|
|
├── module.yaml # Module configuration
|
|
├── README.md # This file
|
|
└── tests/ # Additional test files (if needed)
|
|
```
|
|
|
|
## Getting Started
|
|
|
|
1. **Review Dependencies**: Ensure modules 01, 02, 04, 05, 10 are complete
|
|
2. **Open Development File**: `compression_dev.py`
|
|
3. **Follow Educational Flow**: Work through Steps 1-6 sequentially
|
|
4. **Test Thoroughly**: Run all inline tests as you progress
|
|
5. **Export to Package**: Use `tito export 10_compression` when complete
|
|
|
|
## Key Takeaways
|
|
|
|
Students completing this module will:
|
|
- **Understand** the efficiency requirements of production AI systems
|
|
- **Implement** four essential compression techniques from scratch
|
|
- **Analyze** accuracy vs efficiency trade-offs quantitatively
|
|
- **Apply** compression strategies to real neural networks
|
|
- **Connect** compression to mobile, edge, and production deployment
|
|
- **Prepare** for advanced optimization and production deployment modules
|
|
|
|
This module bridges the gap between research-quality models and production-ready AI systems, teaching the essential skills for deploying AI in resource-constrained environments. |