Module 10: Compression & Optimization
Overview
This module teaches students to make neural networks smaller, faster, and more efficient for real-world deployment. Students implement four core compression techniques and learn to balance accuracy with efficiency.
Learning Goals
- Understand model size and deployment constraints in real systems
- Implement magnitude-based pruning to remove unimportant weights
- Master quantization for 75% memory reduction (FP32 → INT8)
- Build knowledge distillation for training compact models
- Create structured pruning to optimize network architectures
- Compare compression techniques and their trade-offs
Educational Flow
Step 1: Understanding Model Size
- Concept: Parameter counting and memory footprint analysis
- Implementation: `CompressionMetrics` class for model analysis (see the sketch below)
- Learning: Foundation for compression decision-making
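A minimal sketch of the analysis `CompressionMetrics` performs, assuming model weights are plain NumPy arrays (the layer shapes and helper signatures here are illustrative, not the module's exact API):

```python
import numpy as np

def count_parameters(weights):
    """Total number of scalar parameters across all weight arrays."""
    return sum(w.size for w in weights)

def calculate_model_size_mb(weights, bytes_per_param=4):
    """Memory footprint in MB; FP32 stores each parameter in 4 bytes."""
    return count_parameters(weights) * bytes_per_param / (1024 ** 2)

# Hypothetical two-layer MLP: 784x128 and 128x10 weight matrices
layers = [np.random.randn(784, 128), np.random.randn(128, 10)]
print(count_parameters(layers))         # 101632 parameters
print(calculate_model_size_mb(layers))  # ~0.39 MB at FP32
```

Per-layer parameter counts show where compression effort pays off: here the first layer holds roughly 99% of the weights.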
Step 2: Magnitude-Based Pruning
- Concept: Remove weights with smallest absolute values
- Implementation: `prune_weights_by_magnitude()` and sparsity calculation (sketched below)
- Learning: Sparsity patterns and accuracy vs. compression trade-offs
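A sketch of how `prune_weights_by_magnitude()` and the sparsity calculation might look in NumPy (the exact signatures are assumptions):

```python
import numpy as np

def prune_weights_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |value|."""
    threshold = np.quantile(np.abs(weights), sparsity)
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

def calculate_sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return float(np.mean(weights == 0.0))

w = np.random.randn(128, 64)
w_pruned = prune_weights_by_magnitude(w, sparsity=0.9)
print(calculate_sparsity(w_pruned))  # ~0.9
```

Note that zeroed weights still occupy memory in a dense array; the savings come from sparse storage formats or, later, structured pruning.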
Step 3: Quantization Experiments
- Concept: Reduce precision from FP32 to INT8 for memory efficiency
- Implementation: `quantize_layer_weights()` with scale/offset mapping (sketched below)
- Learning: Numerical precision impact on model performance
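A sketch of affine scale/offset quantization, assuming an unsigned 8-bit scheme (the module's exact convention may differ):

```python
import numpy as np

def quantize_layer_weights(weights):
    """Map FP32 weights onto 256 integer levels via scale and offset."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0
    offset = w_min
    q = np.round((weights - offset) / scale).astype(np.uint8)
    return q, scale, offset

def dequantize_layer_weights(q, scale, offset):
    """Approximately reconstruct the original FP32 weights."""
    return q.astype(np.float32) * scale + offset

w = np.random.randn(128, 64).astype(np.float32)
q, scale, offset = quantize_layer_weights(w)
w_hat = dequantize_layer_weights(q, scale, offset)
print(np.abs(w - w_hat).max())  # worst-case error is about scale / 2
# FP32 uses 4 bytes per weight, INT8 uses 1: a 75% memory reduction.
```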
Step 4: Knowledge Distillation
- Concept: Large models teach small models through soft targets
- Implementation: `DistillationLoss` with temperature scaling (sketched below)
- Learning: Advanced training techniques for compact models
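A sketch of the soft-target term at the heart of `DistillationLoss` (the full loss typically blends this with the ordinary hard-label cross-entropy; that blending weight is omitted here):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between temperature-softened teacher and student outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # T^2 scaling keeps gradient magnitudes comparable across temperatures
    return -np.mean(np.sum(p_teacher * np.log(p_student + 1e-9), axis=-1)) * temperature ** 2
```

Higher temperatures flatten the teacher's distribution, exposing the relative probabilities it assigns to wrong classes, which is the signal the student learns from.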
Step 5: Structured Pruning
- Concept: Remove entire neurons/channels, not just weights
- Implementation: `prune_layer_neurons()` with importance scoring (sketched below)
- Learning: Architecture optimization and cascade effects
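A sketch of structured pruning on a dense layer, assuming weights are shaped (inputs, outputs) and neuron importance is the L2 norm of each output column (one common choice):

```python
import numpy as np

def compute_neuron_importance(weights):
    """Score each output neuron by the L2 norm of its incoming weights."""
    return np.linalg.norm(weights, axis=0)

def prune_layer_neurons(weights, keep_ratio):
    """Keep only the highest-importance output neurons (whole columns)."""
    importance = compute_neuron_importance(weights)
    n_keep = max(1, int(weights.shape[1] * keep_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return weights[:, keep]

w = np.random.randn(784, 128)
w_small = prune_layer_neurons(w, keep_ratio=0.5)
print(w_small.shape)  # (784, 64): a genuinely smaller matrix, hence real speedup
```

Unlike magnitude pruning, the result is a smaller dense matrix, so it runs faster on ordinary hardware without sparse kernels.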
Step 6: Comprehensive Comparison
- Concept: Combine techniques for maximum efficiency
- Implementation: Integrated compression pipeline (see the sketch after this list)
- Learning: Systems thinking for production deployment
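A hypothetical end-to-end pipeline, reusing the helpers sketched in Steps 2, 3, and 5 (prune structure first, then sparsify, then quantize):

```python
import numpy as np

def compress_layer(weights, keep_ratio=0.5, sparsity=0.5):
    """Structured prune -> magnitude prune -> quantize, in that order."""
    w = prune_layer_neurons(weights, keep_ratio)   # fewer neurons (Step 5)
    w = prune_weights_by_magnitude(w, sparsity)    # sparse remainder (Step 2)
    return quantize_layer_weights(w)               # 1 byte per weight (Step 3)

w = np.random.randn(784, 128).astype(np.float32)
q, scale, offset = compress_layer(w)
print(f"{w.nbytes / 2**20:.3f} MB -> {q.nbytes / 2**20:.3f} MB")  # ~8x smaller
```

The ordering matters: structured pruning changes the architecture, so it should happen before the surviving weights are sparsified and quantized.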
Key Components
CompressionMetrics
- Purpose: Analyze model size and parameter distribution
- Methods: `count_parameters()`, `calculate_model_size()`, `analyze_weight_distribution()`
- Usage: Foundation for compression target selection
Pruning Functions
- Purpose: Remove unimportant weights and neurons
- Methods: `prune_weights_by_magnitude()`, `prune_model_by_magnitude()`, `calculate_sparsity()`
- Usage: Reduce model size while maintaining performance (a whole-model sketch follows below)
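A minimal sketch of how `prune_model_by_magnitude()` might extend the single-layer helper from Step 2 across a whole model (per-layer thresholds here; a single global threshold is another common design):

```python
import numpy as np

def prune_model_by_magnitude(layer_weights, sparsity):
    """Apply magnitude pruning to every layer and report overall sparsity.
    Reuses prune_weights_by_magnitude / calculate_sparsity from Step 2."""
    pruned = [prune_weights_by_magnitude(w, sparsity) for w in layer_weights]
    total = sum(w.size for w in pruned)
    zeros = sum(int(np.sum(w == 0.0)) for w in pruned)
    return pruned, zeros / total  # size-weighted overall sparsity
```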
Quantization Functions
- Purpose: Reduce memory usage through lower precision
- Methods: `quantize_layer_weights()`, `dequantize_layer_weights()`
- Usage: 75% memory reduction for mobile deployment
Knowledge Distillation
- Purpose: Train compact models with teacher guidance
- Methods: `DistillationLoss`, `train_with_distillation()`
- Usage: Achieve better small-model performance
Structured Pruning
- Purpose: Remove entire neurons for actual speedup
- Methods: `prune_layer_neurons()`, `compute_neuron_importance()`
- Usage: Architecture optimization and hardware efficiency (cascade handling sketched below)
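A sketch of the cascade effect: dropping output neurons from one layer means the next layer's matching input rows must go too (layer shapes are illustrative):

```python
import numpy as np

def prune_neurons_with_cascade(w_this, w_next, keep_ratio):
    """Drop low-importance output columns of one layer and the
    corresponding input rows of the next, keeping shapes consistent."""
    importance = np.linalg.norm(w_this, axis=0)        # score output neurons
    n_keep = max(1, int(w_this.shape[1] * keep_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return w_this[:, keep], w_next[keep, :]

w1 = np.random.randn(784, 128)
w2 = np.random.randn(128, 10)
w1s, w2s = prune_neurons_with_cascade(w1, w2, keep_ratio=0.5)
print(w1s.shape, w2s.shape)  # (784, 64) (64, 10)
```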
Real-World Applications
Mobile AI Deployment
- Constraint: Models must be < 10MB for smartphone apps
- Solution: Combine pruning and quantization for 90% size reduction
- Examples: Google Translate offline, mobile camera AI
Edge Computing
- Constraint: Severe memory and compute limitations
- Solution: Structured pruning for actual inference speedup
- Examples: IoT sensors, smart cameras, voice assistants
Cost Optimization
- Constraint: Expensive cloud inference at scale
- Solution: Reduce model size for lower compute costs
- Examples: Production recommendation systems, search engines
Battery Efficiency
- Constraint: Wearable devices need long battery life
- Solution: Quantization and pruning for energy savings
- Examples: Smartwatches, fitness trackers, AR glasses
Industry Connections
MobileNet Architecture
- Concept: Depthwise separable convolutions for efficiency
- Connection: Structured optimization for mobile deployment
- Learning: Architecture design affects compression potential
DistilBERT
- Concept: 40% smaller and 60% faster than BERT while retaining 97% of its performance
- Connection: Knowledge distillation for language models
- Learning: Teacher-student training for different domains
TinyML Movement
- Concept: ML on microcontrollers (< 1MB models)
- Connection: Extreme compression for embedded systems
- Learning: Efficiency requirements for edge deployment
Neural Architecture Search
- Concept: Automated model design for efficiency
- Connection: Structured pruning as architecture optimization
- Learning: Automated techniques for compression
Assessment Criteria
Technical Implementation (40%)
- Correctly implement 4 compression techniques
- Handle edge cases and error conditions
- Provide comprehensive statistics and analysis
Understanding Trade-offs (30%)
- Explain accuracy vs efficiency spectrum
- Identify appropriate techniques for different constraints
- Analyze compression effectiveness quantitatively
Real-World Application (30%)
- Connect compression to deployment scenarios
- Understand hardware and system constraints
- Design compression strategies for specific use cases
Next Steps
Module 11: Kernels
- Connection: Hardware-aware optimization builds on compression
- Skills: GPU kernels, SIMD operations, memory optimization
- Application: Implement efficient compressed model inference
Module 12: Benchmarking
- Connection: Measure compression effectiveness systematically
- Skills: Performance profiling, accuracy measurement, A/B testing
- Application: Evaluate compression trade-offs in production
Module 13: MLOps
- Connection: Deploy compressed models in production systems
- Skills: Model versioning, monitoring, continuous optimization
- Application: Production-ready compressed model deployment
File Structure
10_compression/
├── compression_dev.py # Main development notebook
├── module.yaml # Module configuration
├── README.md # This file
└── tests/ # Additional test files (if needed)
Getting Started
- Review Dependencies: Ensure modules 00, 01, 03, 04, 09 are complete
- Open Development File: `compression_dev.py`
- Follow Educational Flow: Work through Steps 1-6 sequentially
- Test Thoroughly: Run all inline tests as you progress
- Export to Package: Use `tito export 10_compression` when complete
Key Takeaways
Students completing this module will:
- Understand the efficiency requirements of production AI systems
- Implement four essential compression techniques from scratch
- Analyze accuracy vs efficiency trade-offs quantitatively
- Apply compression strategies to real neural networks
- Connect compression to mobile, edge, and production deployment
- Prepare for advanced optimization and production deployment modules
This module bridges the gap between research-quality models and production-ready AI systems, teaching the essential skills for deploying AI in resource-constrained environments.