# ๐Ÿ”ฅ Module: Compression ## ๐Ÿ“Š Module Info - **Difficulty**: โญโญโญโญ Expert - **Time Estimate**: 8-10 hours - **Prerequisites**: Networks, Training modules - **Next Steps**: Kernels, MLOps modules Build model compression systems that make neural networks smaller, faster, and more efficient for real-world deployment. This module teaches the optimization techniques that bridge the gap between research-quality models and production-ready AI systems. ## ๐ŸŽฏ Learning Objectives By the end of this module, you will be able to: - **Understand deployment constraints**: Analyze model size, memory usage, and computational requirements for real-world systems - **Implement pruning techniques**: Build magnitude-based and structured pruning to remove unimportant weights and neurons - **Master quantization methods**: Reduce memory usage by 75% through FP32 โ†’ INT8 precision reduction - **Apply knowledge distillation**: Train compact models using larger teacher models for better performance - **Design compression strategies**: Combine techniques optimally for different deployment scenarios and constraints ## ๐Ÿง  Build โ†’ Use โ†’ Optimize This module follows TinyTorch's **Build โ†’ Use โ†’ Optimize** framework: 1. **Build**: Implement pruning, quantization, knowledge distillation, and structured optimization from engineering principles 2. **Use**: Apply compression techniques to real neural networks with accuracy vs efficiency analysis 3. **Optimize**: Combine compression methods strategically for production deployment scenarios with specific constraints ## ๐Ÿ“š What You'll Build ### Model Compression Analysis System ```python # Comprehensive model analysis for compression planning metrics = CompressionMetrics() # Analyze original model original_size = metrics.calculate_model_size(model) param_count = metrics.count_parameters(model) weight_dist = metrics.analyze_weight_distribution(model) print(f"Original model: {original_size:.2f} MB, {param_count:,} parameters") print(f"Weight distribution: mean={weight_dist['mean']:.4f}, std={weight_dist['std']:.4f}") ``` ### Pruning Systems for Model Sparsity ```python # Magnitude-based pruning: remove smallest weights pruned_model = prune_model_by_magnitude(model, sparsity=0.5) # Remove 50% of weights sparsity = calculate_sparsity(pruned_model) print(f"Achieved sparsity: {sparsity:.2%}") # Structured pruning: remove entire neurons/channels optimized_model = prune_layer_neurons(model, layer_idx=0, neurons_to_remove=32) print(f"Removed 32 neurons from layer 0") # Sparsity analysis and performance impact original_acc = evaluate_model(model, test_loader) pruned_acc = evaluate_model(pruned_model, test_loader) print(f"Accuracy: {original_acc:.4f} โ†’ {pruned_acc:.4f} ({pruned_acc-original_acc:+.4f})") ``` ### Quantization for Memory Efficiency ```python # Quantize model weights from FP32 to INT8 quantized_model = quantize_model_weights(model) compressed_size = metrics.calculate_model_size(quantized_model) print(f"Size reduction: {original_size:.2f} MB โ†’ {compressed_size:.2f} MB") print(f"Compression ratio: {original_size/compressed_size:.1f}x smaller") # Test quantization impact on accuracy quantized_acc = evaluate_model(quantized_model, test_loader) print(f"Quantization accuracy impact: {quantized_acc-original_acc:+.4f}") ``` ### Knowledge Distillation for Compact Models ```python # Train small model using large teacher model teacher_model = load_pretrained_large_model() student_model = create_compact_model(compression_ratio=0.25) # 4x smaller # Distillation training with temperature scaling distillation_loss = DistillationLoss(temperature=4.0, alpha=0.7) # Training loop with teacher guidance for batch_inputs, batch_labels in train_loader: teacher_outputs = teacher_model(batch_inputs) student_outputs = student_model(batch_inputs) # Combined loss: distillation + task loss loss = distillation_loss(student_outputs, teacher_outputs, batch_labels) loss.backward() optimizer.step() print(f"Student model size: {metrics.calculate_model_size(student_model):.2f} MB") print(f"Student accuracy: {evaluate_model(student_model, test_loader):.4f}") ``` ### Comprehensive Compression Pipeline ```python # End-to-end compression with multiple techniques def compress_for_mobile_deployment(model, target_size_mb=5.0): """Compress model for mobile deployment under 5MB constraint""" # Step 1: Structured pruning for architecture optimization model = prune_redundant_neurons(model, importance_threshold=0.1) # Step 2: Magnitude-based pruning for sparsity model = prune_model_by_magnitude(model, sparsity=0.6) # Step 3: Quantization for memory reduction model = quantize_model_weights(model) # Step 4: Verify size constraint final_size = CompressionMetrics().calculate_model_size(model) print(f"Final compressed model: {final_size:.2f} MB") return model mobile_model = compress_for_mobile_deployment(trained_model) ``` ## ๐Ÿš€ Getting Started ### Prerequisites Ensure you have mastered the training foundation: ```bash # Activate TinyTorch environment source bin/activate-tinytorch.sh # Verify prerequisite modules tito test --module networks tito test --module training ``` ### Development Workflow 1. **Open the development file**: `modules/source/11_compression/compression_dev.py` 2. **Implement compression metrics**: Build model analysis tools for size and parameter counting 3. **Create pruning algorithms**: Implement magnitude-based and structured pruning techniques 4. **Build quantization system**: Add FP32 โ†’ INT8 weight quantization with scale/offset mapping 5. **Add knowledge distillation**: Implement teacher-student training for compact models 6. **Export and verify**: `tito export --module compression && tito test --module compression` ## ๐Ÿงช Testing Your Implementation ### Comprehensive Test Suite Run the full test suite to verify compression system functionality: ```bash # TinyTorch CLI (recommended) tito test --module compression # Direct pytest execution python -m pytest tests/ -k compression -v ``` ### Test Coverage Areas - โœ… **Compression Metrics**: Verify accurate model size and parameter analysis - โœ… **Pruning Algorithms**: Test magnitude-based and structured pruning correctness - โœ… **Quantization System**: Ensure proper FP32 โ†” INT8 conversion and accuracy preservation - โœ… **Knowledge Distillation**: Verify teacher-student training and loss computation - โœ… **Integrated Compression**: Test combined techniques on real neural networks ### Inline Testing & Compression Analysis The module includes comprehensive compression validation and performance analysis: ```python # Example inline test output ๐Ÿ”ฌ Unit Test: Model compression metrics... โœ… Parameter counting accurate โœ… Model size calculation correct โœ… Weight distribution analysis working ๐Ÿ“ˆ Progress: Compression Analysis โœ“ # Pruning validation ๐Ÿ”ฌ Unit Test: Magnitude-based pruning... โœ… Smallest weights identified correctly โœ… Sparsity calculation accurate โœ… Model functionality preserved ๐Ÿ“ˆ Progress: Pruning Systems โœ“ # Quantization testing ๐Ÿ”ฌ Unit Test: Weight quantization... โœ… FP32 โ†’ INT8 conversion correct โœ… Dequantization recovers values โœ… 75% memory reduction achieved ๐Ÿ“ˆ Progress: Quantization โœ“ ``` ### Manual Testing Examples ```python from compression_dev import CompressionMetrics, prune_model_by_magnitude, quantize_model_weights from networks_dev import Sequential from layers_dev import Dense from activations_dev import ReLU # Create test model model = Sequential([ Dense(784, 128), ReLU(), Dense(128, 64), ReLU(), Dense(64, 10) ]) # Analyze original model metrics = CompressionMetrics() original_size = metrics.calculate_model_size(model) original_params = metrics.count_parameters(model) print(f"Original: {original_size:.2f} MB, {original_params:,} parameters") # Test pruning pruned_model = prune_model_by_magnitude(model, sparsity=0.5) pruned_size = metrics.calculate_model_size(pruned_model) print(f"After 50% pruning: {pruned_size:.2f} MB ({original_size/pruned_size:.1f}x smaller)") # Test quantization quantized_model = quantize_model_weights(model) quantized_size = metrics.calculate_model_size(quantized_model) print(f"After quantization: {quantized_size:.2f} MB ({original_size/quantized_size:.1f}x smaller)") ``` ## ๐ŸŽฏ Key Concepts ### Real-World Applications - **Mobile AI**: Smartphone apps require models under 10MB for fast download and inference - **Edge Computing**: IoT devices have severe memory constraints requiring aggressive compression - **Cloud Cost Optimization**: Reducing model size decreases inference costs at scale - **Autonomous Systems**: Real-time requirements demand efficient models for safety-critical applications ### Compression Techniques - **Magnitude-based Pruning**: Remove weights with smallest absolute values to create sparse networks - **Structured Pruning**: Remove entire neurons/channels for actual hardware speedup benefits - **Quantization**: Reduce precision from FP32 to INT8 for 75% memory reduction - **Knowledge Distillation**: Transfer knowledge from large teacher to small student models ### Production Deployment Considerations - **Hardware Constraints**: Different devices have different memory, compute, and energy limitations - **Accuracy vs Efficiency Trade-offs**: Balancing model performance with deployment requirements - **Inference Speed**: Compression techniques that actually improve runtime performance - **Model Serving**: Considerations for batch processing, latency, and throughput ### Systems Engineering Patterns - **Compression Pipeline Design**: Sequential application of techniques for maximum benefit - **Performance Profiling**: Measuring actual improvements in memory, speed, and energy usage - **Quality Assurance**: Maintaining model accuracy while achieving compression targets - **Deployment Validation**: Testing compressed models in realistic production scenarios ## ๐ŸŽ‰ Ready to Build? You're about to master the optimization techniques that make AI practical for real-world deployment! From the smartphone in your pocket to autonomous vehicles, they all depend on compressed models that balance intelligence with efficiency. This module teaches you the systems engineering that separates research prototypes from production AI. You'll learn to think like a deployment engineer, balancing accuracy against constraints and building systems that work in the real world. Take your time, understand the trade-offs, and enjoy building AI that actually ships! ```{grid} 3 :gutter: 3 :margin: 2 {grid-item-card} ๐Ÿš€ Launch Builder :link: https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/11_compression/compression_dev.py :class-title: text-center :class-body: text-center Interactive development environment {grid-item-card} ๐Ÿ““ Open in Colab :link: https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/11_compression/compression_dev.ipynb :class-title: text-center :class-body: text-center Google Colab notebook {grid-item-card} ๐Ÿ‘€ View Source :link: https://github.com/VJProductions/TinyTorch/blob/main/modules/source/11_compression/compression_dev.py :class-title: text-center :class-body: text-center Browse the code on GitHub ```