Module 10: Compression & Optimization

Overview

This module teaches students to make neural networks smaller, faster, and more efficient for real-world deployment. Students implement four core compression techniques and learn to balance accuracy with efficiency.

Learning Goals

  • Understand model size and deployment constraints in real systems
  • Implement magnitude-based pruning to remove unimportant weights
  • Master quantization for 75% memory reduction (FP32 → INT8)
  • Build knowledge distillation for training compact models
  • Create structured pruning to optimize network architectures
  • Compare compression techniques and their trade-offs

Educational Flow

Step 1: Understanding Model Size

  • Concept: Parameter counting and memory footprint analysis
  • Implementation: CompressionMetrics class for model analysis (sketched below)
  • Learning: Foundation for compression decision-making
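
To make the measurements concrete, here is a minimal sketch that treats a model as a plain list of NumPy weight arrays. The actual CompressionMetrics class in compression_dev.py is richer than this, so treat the function bodies as illustrative only:

```python
import numpy as np

# Minimal sketch: a "model" is just a list of NumPy weight arrays.
# The module's CompressionMetrics class goes further than this.

def count_parameters(weights):
    """Total number of scalar parameters across all weight arrays."""
    return sum(w.size for w in weights)

def calculate_model_size(weights, bytes_per_param=4):
    """Model size in MB, assuming FP32 storage (4 bytes per parameter)."""
    return count_parameters(weights) * bytes_per_param / (1024 ** 2)

# A small two-layer MLP: 784 -> 128 -> 10
weights = [np.random.randn(784, 128), np.random.randn(128, 10)]
print(count_parameters(weights))                  # 101632 parameters
print(f"{calculate_model_size(weights):.2f} MB")  # ~0.39 MB in FP32
```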

Step 2: Magnitude-Based Pruning

  • Concept: Remove weights with smallest absolute values
  • Implementation: prune_weights_by_magnitude() and sparsity calculation (sketched below)
  • Learning: Sparsity patterns and accuracy vs compression trade-offs
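
One possible shape for these functions uses a quantile threshold over absolute weight values; this is a sketch of the idea, not the module's reference solution:

```python
import numpy as np

def prune_weights_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights until `sparsity` is reached."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold   # keep only the large weights
    return weights * mask

def calculate_sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return float(np.mean(weights == 0))

w = np.random.randn(256, 256)
pruned = prune_weights_by_magnitude(w, sparsity=0.8)
print(calculate_sparsity(pruned))  # ~0.8
```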

Step 3: Quantization Experiments

  • Concept: Reduce precision from FP32 to INT8 for memory efficiency
  • Implementation: quantize_layer_weights() with scale/offset mapping (sketched below)
  • Learning: Numerical precision impact on model performance
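
The scale/offset mapping can be sketched as affine quantization: the FP32 range [min, max] is mapped linearly onto 256 integer codes. The function names match the bullets above, but the bodies are illustrative:

```python
import numpy as np

def quantize_layer_weights(weights):
    """Map FP32 weights to 8-bit codes with an affine scale/offset.

    Sketch only; assumes weights.max() > weights.min().
    """
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0   # one 8-bit step in FP32 units
    offset = w_min                    # the value represented by code 0
    q = np.round((weights - offset) / scale).astype(np.uint8)
    return q, scale, offset

def dequantize_layer_weights(q, scale, offset):
    """Recover approximate FP32 weights from the 8-bit codes."""
    return q.astype(np.float32) * scale + offset

w = np.random.randn(128, 64).astype(np.float32)
q, scale, offset = quantize_layer_weights(w)
w_hat = dequantize_layer_weights(q, scale, offset)
print(q.nbytes / w.nbytes)       # 0.25 -> the 75% memory reduction
print(np.abs(w - w_hat).max())   # rounding error, on the order of scale/2
```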

Step 4: Knowledge Distillation

  • Concept: Large models teach small models through soft targets
  • Implementation: DistillationLoss with temperature scaling (sketched below)
  • Learning: Advanced training techniques for compact models
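
A sketch of the soft-target loss described above, following Hinton et al.'s temperature scaling; the module's DistillationLoss class may also blend in a hard-label cross-entropy term:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between teacher and student soft targets (sketch).

    Scaled by T**2 so gradient magnitudes stay consistent across
    temperatures, as in Hinton et al. (2015).
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    ce = -np.sum(p_teacher * np.log(p_student + 1e-9), axis=-1)
    return float(np.mean(ce)) * T ** 2

teacher = np.array([[5.0, 1.0, -2.0]])   # confident large model
student = np.array([[2.0, 0.5, -1.0]])   # less confident small model
print(distillation_loss(student, teacher))
```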

Step 5: Structured Pruning

  • Concept: Remove entire neurons/channels, not just weights
  • Implementation: prune_layer_neurons() with importance scoring (sketched below)
  • Learning: Architecture optimization and cascade effects
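
One common importance score is the L2 norm of each neuron's weights. The sketch below assumes weights are stored as [inputs, outputs] matrices, so removing output neuron j deletes column j of one layer and row j of the next layer (the cascade effect named above):

```python
import numpy as np

def compute_neuron_importance(weights):
    """Score each output neuron by the L2 norm of its weight column."""
    return np.linalg.norm(weights, axis=0)

def prune_layer_neurons(weights, next_weights, keep_ratio=0.5):
    """Drop the least important output neurons of one layer (sketch).

    Removing a neuron shrinks this layer's columns AND the matching
    rows of the next layer, so both matrices must change together.
    """
    n_keep = max(1, int(weights.shape[1] * keep_ratio))
    importance = compute_neuron_importance(weights)
    keep = np.sort(np.argsort(importance)[-n_keep:])  # indices to retain
    return weights[:, keep], next_weights[keep, :]

w1 = np.random.randn(784, 128)   # layer 1: 784 -> 128
w2 = np.random.randn(128, 10)    # layer 2: 128 -> 10
w1_p, w2_p = prune_layer_neurons(w1, w2, keep_ratio=0.25)
print(w1_p.shape, w2_p.shape)    # (784, 32) (32, 10)
```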

Step 6: Comprehensive Comparison

  • Concept: Combine techniques for maximum efficiency
  • Implementation: Integrated compression pipeline (sketched below)
  • Learning: Systems thinking for production deployment
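
A hypothetical end-to-end pipeline combining the Step 2 and Step 3 sketches: prune first, quantize the surviving weights, then measure what was saved:

```python
import numpy as np

# Illustrative pipeline, not the module's reference code:
# magnitude pruning followed by affine 8-bit quantization.

def compress_layer(weights, sparsity=0.6):
    # Magnitude pruning: zero the smallest |w| until `sparsity` is reached.
    threshold = np.quantile(np.abs(weights), sparsity)
    pruned = weights * (np.abs(weights) >= threshold)
    # Affine 8-bit quantization of the pruned tensor.
    w_min, w_max = pruned.min(), pruned.max()
    scale = (w_max - w_min) / 255.0
    q = np.round((pruned - w_min) / scale).astype(np.uint8)
    return q, scale, w_min

w = np.random.randn(512, 512).astype(np.float32)
q, scale, offset = compress_layer(w)
print(f"dense INT8 bytes: {q.nbytes / w.nbytes:.2f}x FP32")  # 0.25
# A sparse storage format would also skip the zeroed 60%, so the
# end-to-end footprint can approach 0.4 * 0.25 = 10% of the original.
```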

Key Components

CompressionMetrics

  • Purpose: Analyze model size and parameter distribution
  • Methods: count_parameters(), calculate_model_size(), analyze_weight_distribution()
  • Usage: Foundation for compression target selection

Pruning Functions

  • Purpose: Remove unimportant weights and neurons
  • Methods: prune_weights_by_magnitude(), prune_model_by_magnitude(), calculate_sparsity()
  • Usage: Reduce model size while maintaining performance

Quantization Functions

  • Purpose: Reduce memory usage through lower precision
  • Methods: quantize_layer_weights(), dequantize_layer_weights()
  • Usage: 75% memory reduction for mobile deployment

Knowledge Distillation

  • Purpose: Train compact models with teacher guidance
  • Methods: DistillationLoss, train_with_distillation()
  • Usage: Achieve better small model performance

Structured Pruning

  • Purpose: Remove entire neurons for actual speedup
  • Methods: prune_layer_neurons(), compute_neuron_importance()
  • Usage: Architecture optimization and hardware efficiency

Real-World Applications

Mobile AI Deployment

  • Constraint: Models typically need to stay under ~10MB for smartphone apps
  • Solution: Combine pruning and quantization for 90% size reduction
  • Examples: Google Translate offline, mobile camera AI
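
As a sanity check on the 90% figure above: pruning 60% of weights and quantizing the survivors from FP32 to INT8 leaves 0.40 × 0.25 = 10% of the original bytes, i.e. a 90% reduction, assuming a sparse storage format that actually skips the zeroed weights.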

Edge Computing

  • Constraint: Severe memory and compute limitations
  • Solution: Structured pruning for actual inference speedup
  • Examples: IoT sensors, smart cameras, voice assistants

Cost Optimization

  • Constraint: Expensive cloud inference at scale
  • Solution: Reduce model size for lower compute costs
  • Examples: Production recommendation systems, search engines

Battery Efficiency

  • Constraint: Wearable devices need long battery life
  • Solution: Quantization and pruning for energy savings
  • Examples: Smartwatches, fitness trackers, AR glasses

Industry Connections

MobileNet Architecture

  • Concept: Depthwise separable convolutions for efficiency
  • Connection: Structured optimization for mobile deployment
  • Learning: Architecture design affects compression potential

DistilBERT

  • Concept: 40% smaller than BERT while retaining 97% of its performance
  • Connection: Knowledge distillation for language models
  • Learning: Teacher-student training for different domains

TinyML Movement

  • Concept: ML on microcontrollers (< 1MB models)
  • Connection: Extreme compression for embedded systems
  • Learning: Efficiency requirements for edge deployment

Neural Architecture Search

  • Concept: Automated model design for efficiency
  • Connection: Structured pruning as architecture optimization
  • Learning: Automated techniques for compression

Assessment Criteria

Technical Implementation (40%)

  • Correctly implement 4 compression techniques
  • Handle edge cases and error conditions
  • Provide comprehensive statistics and analysis

Understanding Trade-offs (30%)

  • Explain the accuracy vs efficiency spectrum
  • Identify appropriate techniques for different constraints
  • Analyze compression effectiveness quantitatively

Real-World Application (30%)

  • Connect compression to deployment scenarios
  • Understand hardware and system constraints
  • Design compression strategies for specific use cases

Next Steps

Module 11: Kernels

  • Connection: Hardware-aware optimization builds on compression
  • Skills: GPU kernels, SIMD operations, memory optimization
  • Application: Implement efficient compressed model inference

Module 12: Benchmarking

  • Connection: Measure compression effectiveness systematically
  • Skills: Performance profiling, accuracy measurement, A/B testing
  • Application: Evaluate compression trade-offs in production

Module 13: MLOps

  • Connection: Deploy compressed models in production systems
  • Skills: Model versioning, monitoring, continuous optimization
  • Application: Production-ready compressed model deployment

File Structure

10_compression/
├── compression_dev.py       # Main development notebook
├── module.yaml              # Module configuration
├── README.md               # This file
└── tests/                  # Additional test files (if needed)

Getting Started

  1. Review Dependencies: Ensure modules 01, 02, 04, 05, 10 are complete
  2. Open Development File: compression_dev.py
  3. Follow Educational Flow: Work through Steps 1-6 sequentially
  4. Test Thoroughly: Run all inline tests as you progress
  5. Export to Package: Use tito export 10_compression when complete

Key Takeaways

Students completing this module will:

  • Understand the efficiency requirements of production AI systems
  • Implement four essential compression techniques from scratch
  • Analyze accuracy vs efficiency trade-offs quantitatively
  • Apply compression strategies to real neural networks
  • Connect compression to mobile, edge, and production deployment
  • Prepare for advanced optimization and production deployment modules

This module bridges the gap between research-quality models and production-ready AI systems, teaching the essential skills for deploying AI in resource-constrained environments.