| title | description | difficulty | time_estimate | prerequisites | next_steps | learning_objectives |
|---|---|---|---|---|---|---|
| Compression - Pruning and Model Compression | Prune unnecessary weights and compress models for deployment | 3 | 5-6 hours | | | |
# 16. Compression
⚡ OPTIMIZATION TIER | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours
## Overview
Compress neural networks through pruning (removing weights) and combining with quantization. This module implements techniques to achieve 10-50× compression with minimal accuracy loss, enabling deployment on resource-constrained devices.
## Learning Objectives
By completing this module, you will be able to:
- Implement magnitude-based pruning to identify and remove unimportant weights
- Design structured pruning strategies (channel pruning, layer-wise) for actual speedups
- Apply iterative pruning with fine-tuning to maintain model accuracy
- Combine pruning with quantization for maximum compression (50-100× possible)
- Measure compression ratios and verify inference speedup vs accuracy trade-offs
## Why This Matters
### Production Context
Compression enables practical deployment:
- BERT Distillation (DistilBERT): 40% smaller, 60% faster, 97% accuracy retention
- MobileNet: Structured pruning + quantization for mobile deployment
- Lottery Ticket Hypothesis: sparse subnetworks exist that train to full-network accuracy
- GPT-3 Distillation: Smaller models approaching GPT-3 performance
### Historical Context
- Pre-2015: Limited compression work; models small enough for hardware
- 2015-2017: Magnitude pruning (Han et al.); Lottery Ticket Hypothesis
- 2018-2020: Structured pruning; distillation; BERT compression
- 2020+: Extreme compression (100×); sparse transformers; efficient architectures
Compression is now standard for deployment, not optional.
## Implementation Guide
### Core Techniques
#### Magnitude Pruning
- Sort weights by absolute value
- Remove smallest X% (typically 50-90%)
- Fine-tune remaining weights
- Can achieve 10× compression with <1% accuracy loss
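The steps above can be sketched in NumPy (a minimal illustration; `magnitude_prune` and its signature are hypothetical, not this module's actual API, and fine-tuning is omitted):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the `sparsity` fraction of weights with smallest |value|."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)          # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold     # keep only weights above it
    return weights * mask, mask

w = np.random.randn(256, 256)
pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"sparsity: {1 - mask.mean():.1%}")  # ~90% of entries are zero
```

In a real pipeline the mask is reapplied after each fine-tuning step so that pruned weights stay at zero.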
#### Structured Pruning
- Remove entire channels/neurons
- Achieves actual speedup (vs unstructured sparsity)
- Typically 2-5× compression
- Larger accuracy impact than unstructured pruning at the same ratio
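One common strategy is selecting whole output channels by L2 norm (an illustrative NumPy sketch; `prune_channels` is a hypothetical helper):

```python
import numpy as np

def prune_channels(weight, keep_ratio=0.5):
    """Keep the output channels (rows) with the largest L2 norm.

    Unlike unstructured pruning, the returned matrix is genuinely
    smaller, so dense matmul gets a real wall-clock speedup.
    """
    norms = np.linalg.norm(weight.reshape(weight.shape[0], -1), axis=1)
    n_keep = max(1, int(keep_ratio * weight.shape[0]))
    keep = np.sort(np.argsort(norms)[-n_keep:])  # strongest channels, in order
    return weight[keep], keep

w = np.random.randn(64, 128)             # 64 output channels
w_small, kept = prune_channels(w, keep_ratio=0.5)
print(w_small.shape)                     # (32, 128)
```

Note that any layer consuming this output must drop the matching input channels — propagating that shape change through the network is a large part of implementing structured pruning.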
#### Iterative Pruning
- Prune gradually (10% at a time)
- Fine-tune after each pruning step
- Better accuracy than one-shot pruning
- Higher training cost than one-shot pruning
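The schedule above might look like this (an illustrative sketch; `fine_tune` stands in for a few epochs of masked training and defaults to the identity here):

```python
import numpy as np

def magnitude_mask(w, sparsity):
    """Boolean mask that removes the `sparsity` fraction of smallest weights."""
    flat = np.abs(w).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return np.ones(w.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.abs(w) > threshold

def iterative_prune(w, target=0.9, step=0.1, fine_tune=lambda w: w):
    """Raise sparsity by `step` per round, fine-tuning between rounds."""
    sparsity = 0.0
    mask = np.ones(w.shape, dtype=bool)
    while sparsity + 1e-9 < target:
        sparsity = min(sparsity + step, target)
        mask = magnitude_mask(w, sparsity)  # already-zeroed weights prune first
        w = fine_tune(w * mask) * mask      # keep pruned weights at zero
    return w, mask

w_pruned, mask = iterative_prune(np.random.randn(100, 100))
print(f"final sparsity: {1 - mask.mean():.1%}")
```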
#### Pruning + Quantization
- Prune 90% of weights → 10× reduction
- Quantize FP32 → INT8 → 4× reduction
- Combined: 40× compression
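The arithmetic behind the 40× figure — the two ratios multiply (this ignores sparse-index storage overhead, which eats into real-world savings):

```python
params = 110e6                 # e.g. a BERT-base-sized model (illustrative)
fp32_bytes = params * 4        # dense FP32 baseline

kept_fraction = 1 - 0.9        # prune 90% of weights -> 10x fewer values
bytes_per_weight = 1           # FP32 (4 B) -> INT8 (1 B) -> 4x smaller values

compressed_bytes = params * kept_fraction * bytes_per_weight
ratio = fp32_bytes / compressed_bytes
print(f"{ratio:.0f}x compression")   # 40x
```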
## Testing

```bash
tito export 16_compression
tito test 16_compression
```
## Where This Code Lives

```
tinytorch/
├── compression/
│   └── prune.py
└── __init__.py
```
## Systems Thinking Questions

1. Lottery Ticket Hypothesis: Why can pruned networks retrain to full accuracy? What does this say about overparameterization?
2. Structured vs Unstructured: Unstructured pruning achieves better compression but no speedup. Why? When is sparse computation actually faster?
3. Distillation vs Pruning: Both compress models. When would you use each? Can you combine them?
## Real-World Connections

- DistilBERT: 40% smaller BERT with 97% performance
- MobileNetV2: Efficient architectures + pruning for mobile
- NVIDIA TensorRT: Automatic pruning + quantization for deployment
## What's Next?
In Module 17: Memoization, you'll learn computational reuse:
- KV-caching for transformers
- Eliminate redundant computation
- 10-15× speedup for autoregressive generation
- Memory-compute trade-offs
Ready to compress models? Open `modules/16_compression/compression_dev.py` and start implementing.