---
title: Compression - Pruning and Model Compression
description: Prune unnecessary weights and compress models for deployment
difficulty: 3
time_estimate: 5-6 hours
prerequisites:
  - Quantization
next_steps:
  - Acceleration
learning_objectives:
  - Implement magnitude-based pruning to remove unimportant weights
  - Design structured pruning strategies (channel, layer-wise)
  - Apply iterative pruning with fine-tuning for accuracy preservation
  - Combine pruning with quantization for maximum compression
  - Measure compression ratios and inference speedups
---

# 16. Compression

OPTIMIZATION TIER | Difficulty: 3/4 | Time: 5-6 hours

## Overview

Compress neural networks by pruning unimportant weights and by combining pruning with quantization. This module implements techniques that achieve 10-50× compression with minimal accuracy loss, enabling deployment on resource-constrained devices.

## Learning Objectives

By completing this module, you will be able to:

  1. Implement magnitude-based pruning to identify and remove unimportant weights
  2. Design structured pruning strategies (channel pruning, layer-wise) for actual speedups
  3. Apply iterative pruning with fine-tuning to maintain model accuracy
  4. Combine pruning with quantization for maximum compression (50-100× possible)
  5. Measure compression ratios and verify inference speedup vs accuracy trade-offs

## Why This Matters

### Production Context

Compression enables practical deployment:

- BERT Distillation (DistilBERT): 40% smaller, 60% faster, 97% accuracy retention
- MobileNet: structured pruning + quantization for mobile deployment
- Lottery Ticket Hypothesis: sparse subnetworks can be retrained to match dense accuracy
- GPT-3 distillation: smaller models approaching GPT-3 performance

### Historical Context

- Pre-2015: limited compression work; models were small enough for the hardware
- 2015-2017: magnitude pruning and Deep Compression (Han et al.)
- 2018-2020: Lottery Ticket Hypothesis (Frankle & Carbin); structured pruning; distillation; BERT compression
- 2020+: extreme compression (100×); sparse transformers; efficient architectures

Compression is now standard for deployment, not optional.

## Implementation Guide

### Core Techniques

#### Magnitude Pruning

- Sort weights by absolute value
- Remove the smallest X% (typically 50-90%)
- Fine-tune the remaining weights
- Can achieve 10× compression with <1% accuracy loss (see the sketch below)
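
A minimal NumPy sketch of the selection step, assuming a dense weight matrix. The function name and signature are illustrative, not TinyTorch's actual API:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float):
    """Zero out the smallest-magnitude `sparsity` fraction of weights.

    Returns the pruned weights and a binary mask (1 = kept). The mask is
    reapplied during fine-tuning so pruned weights stay at zero.
    """
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights)
    # Threshold = magnitude of the k-th smallest |weight|.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

# Prune 90% of a random weight matrix.
w = np.random.randn(256, 128).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"sparsity achieved: {1 - mask.mean():.2%}")  # ~90.00%
```

Note that the pruned matrix keeps its original shape; only the values are zeroed. That is why unstructured pruning shrinks storage but does not, by itself, speed up dense matrix multiplies.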

#### Structured Pruning

- Remove entire channels/neurons
- Achieves actual speedup (vs unstructured sparsity)
- Typically 2-5× compression
- Larger accuracy impact (see the sketch below)
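
A sketch of channel pruning for a linear layer whose weights have shape `(out_features, in_features)`. The names are again illustrative, and any layer consuming this output would need its matching input channels removed as well:

```python
import numpy as np

def prune_channels(w: np.ndarray, b: np.ndarray, keep_ratio: float):
    """Drop the output channels (rows) with the smallest L2 norms.

    The weight matrix actually shrinks, so dense hardware does
    proportionally less work -- a real speedup, unlike unstructured
    sparsity.
    """
    n_keep = max(1, int(round(keep_ratio * w.shape[0])))
    norms = np.linalg.norm(w, axis=1)             # importance per channel
    keep = np.sort(np.argsort(norms)[-n_keep:])   # strongest channels, original order
    return w[keep], b[keep], keep

# A 64-channel layer pruned to 32 channels: 2x fewer parameters and FLOPs.
w = np.random.randn(64, 128).astype(np.float32)
b = np.zeros(64, dtype=np.float32)
w_small, b_small, kept = prune_channels(w, b, keep_ratio=0.5)
print(w.shape, "->", w_small.shape)  # (64, 128) -> (32, 128)
```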

#### Iterative Pruning

- Prune gradually (e.g. 10% at a time)
- Fine-tune after each pruning step
- Better accuracy than one-shot pruning
- Higher training cost (sketched below)
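
A schedule sketch, with `fine_tune` as a hypothetical stand-in for a training loop that multiplies its updates by the mask so pruned weights stay at zero:

```python
import numpy as np

def iterative_prune(weights, target_sparsity, steps, fine_tune):
    """Ramp sparsity up gradually, fine-tuning after every step."""
    mask = np.ones_like(weights)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps   # e.g. 10%, 20%, ..., 90%
        k = int(sparsity * weights.size)
        if k == 0:
            continue
        # Re-rank all weights; already-pruned ones have magnitude 0 and stay pruned.
        threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
        mask = (np.abs(weights) > threshold).astype(weights.dtype)
        weights = fine_tune(weights * mask, mask)   # recover accuracy
    return weights, mask

# Illustration only: a no-op "fine-tune" ramps a matrix to 90% sparsity in 9 steps.
w = np.random.randn(256, 128).astype(np.float32)
w_final, mask = iterative_prune(w, 0.9, steps=9, fine_tune=lambda w, m: w)
print(f"final sparsity: {1 - mask.mean():.2%}")
```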

#### Pruning + Quantization

- Prune 90% of weights → 10× reduction
- Quantize FP32 → INT8 → 4× reduction
- Combined: 40× compression (worked out below)
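
The arithmetic as a back-of-the-envelope check; it ignores the index overhead of real sparse storage formats, which cost a few extra bits per surviving weight:

```python
params = 10_000_000        # e.g. a 10M-parameter model
fp32_bytes = params * 4    # dense FP32: 40 MB

kept = params // 10        # prune 90% -> keep 1 weight in 10
int8_bytes = kept * 1      # quantize survivors FP32 -> INT8: 4x per weight

print(f"{fp32_bytes / int8_bytes:.0f}x compression")  # 40x
```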

## Testing

```bash
tito export 16_compression
tito test 16_compression
```

## Where This Code Lives

```text
tinytorch/
├── compression/
│   └── prune.py
└── __init__.py
```

## Systems Thinking Questions

1. Lottery Ticket Hypothesis: Why can pruned networks be retrained to full accuracy? What does this say about overparameterization?

2. Structured vs Unstructured: Unstructured pruning achieves better compression ratios but little or no speedup on dense hardware. Why? When is sparse computation actually faster?

3. Distillation vs Pruning: Both compress models. When would you use each? Can you combine them?

## Real-World Connections

- DistilBERT: 40% smaller BERT with 97% of its performance
- MobileNetV2: efficient architectures plus pruning for mobile
- NVIDIA TensorRT: automatic pruning and quantization for deployment

## What's Next?

In Module 17: Memoization, you'll learn computational reuse:

- KV-caching for transformers
- Eliminating redundant computation
- 10-15× speedups for autoregressive generation
- Memory-compute trade-offs

Ready to compress models? Open `modules/16_compression/compression_dev.py` and start implementing.