---
title: Compression - Pruning and Model Compression
description: Prune unnecessary weights and compress models for deployment
difficulty: 3
time_estimate: 5-6 hours
prerequisites:
  - Quantization
next_steps:
  - Acceleration
learning_objectives:
  - Implement magnitude-based pruning to remove unimportant weights
  - Design structured pruning strategies (channel, layer-wise)
  - Apply iterative pruning with fine-tuning for accuracy preservation
  - Combine pruning with quantization for maximum compression
  - Measure compression ratios and inference speedups
---

# 16. Compression

OPTIMIZATION TIER | Difficulty: 3/4 | Time: 5-6 hours

## Overview

Compress neural networks by pruning unimportant weights and by combining pruning with quantization. This module implements techniques that achieve 10-50× compression with minimal accuracy loss, enabling deployment on resource-constrained devices.

## Learning Objectives

By completing this module, you will be able to:

  1. Implement magnitude-based pruning to identify and remove unimportant weights
  2. Design structured pruning strategies (channel pruning, layer-wise) for actual speedups
  3. Apply iterative pruning with fine-tuning to maintain model accuracy
  4. Combine pruning with quantization for maximum compression (50-100× possible)
  5. Measure compression ratios and verify inference speedup vs accuracy trade-offs

## Why This Matters

### Production Context

Compression enables practical deployment:

- BERT Distillation (DistilBERT): 40% smaller, 60% faster, 97% accuracy retention
- MobileNet: structured pruning + quantization for mobile deployment
- Lottery Ticket Hypothesis: sparse subnetworks can be retrained to match dense accuracy
- GPT-3 distillation: smaller models approaching GPT-3 performance

### Historical Context

- Pre-2015: limited compression work; models were small enough for the hardware
- 2015-2017: magnitude pruning and Deep Compression (Han et al.)
- 2018-2020: Lottery Ticket Hypothesis (Frankle & Carbin); structured pruning; distillation; BERT compression
- 2020+: extreme compression (100×); sparse transformers; efficient architectures

Compression is now standard for deployment, not optional.

## Implementation Guide

### Core Techniques

#### Magnitude Pruning

- Sort weights by absolute value
- Remove the smallest X% (typically 50-90%)
- Fine-tune the remaining weights
- Can achieve 10× compression with <1% accuracy loss (see the sketch below)
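
A minimal NumPy sketch of the selection step, assuming a dense weight matrix. The function name and signature are illustrative, not TinyTorch's actual API:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float):
    """Zero out the smallest-magnitude `sparsity` fraction of weights.

    Returns the pruned weights and a binary mask (1 = kept). The mask is
    reapplied during fine-tuning so pruned weights stay at zero.
    """
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights)
    # Threshold = magnitude of the k-th smallest |weight|.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

# Prune 90% of a random weight matrix.
w = np.random.randn(256, 128).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"sparsity achieved: {1 - mask.mean():.2%}")  # ~90.00%
```

Note that the pruned matrix keeps its original shape; only the values are zeroed. That is why unstructured pruning shrinks storage but does not, by itself, speed up dense matrix multiplies.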

#### Structured Pruning

- Remove entire channels/neurons
- Achieves actual speedup (vs unstructured sparsity)
- Typically 2-5× compression
- Larger accuracy impact (see the sketch below)
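
A sketch of channel pruning for a linear layer whose weights have shape `(out_features, in_features)`. The names are again illustrative, and any layer consuming this output would need its matching input channels removed as well:

```python
import numpy as np

def prune_channels(w: np.ndarray, b: np.ndarray, keep_ratio: float):
    """Drop the output channels (rows) with the smallest L2 norms.

    The weight matrix actually shrinks, so dense hardware does
    proportionally less work -- a real speedup, unlike unstructured
    sparsity.
    """
    n_keep = max(1, int(round(keep_ratio * w.shape[0])))
    norms = np.linalg.norm(w, axis=1)             # importance per channel
    keep = np.sort(np.argsort(norms)[-n_keep:])   # strongest channels, original order
    return w[keep], b[keep], keep

# A 64-channel layer pruned to 32 channels: 2x fewer parameters and FLOPs.
w = np.random.randn(64, 128).astype(np.float32)
b = np.zeros(64, dtype=np.float32)
w_small, b_small, kept = prune_channels(w, b, keep_ratio=0.5)
print(w.shape, "->", w_small.shape)  # (64, 128) -> (32, 128)
```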

#### Iterative Pruning

- Prune gradually (e.g. 10% at a time)
- Fine-tune after each pruning step
- Better accuracy than one-shot pruning
- Higher training cost (sketched below)
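
A schedule sketch, with `fine_tune` as a hypothetical stand-in for a training loop that multiplies its updates by the mask so pruned weights stay at zero:

```python
import numpy as np

def iterative_prune(weights, target_sparsity, steps, fine_tune):
    """Ramp sparsity up gradually, fine-tuning after every step."""
    mask = np.ones_like(weights)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps   # e.g. 10%, 20%, ..., 90%
        k = int(sparsity * weights.size)
        if k == 0:
            continue
        # Re-rank all weights; already-pruned ones have magnitude 0 and stay pruned.
        threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
        mask = (np.abs(weights) > threshold).astype(weights.dtype)
        weights = fine_tune(weights * mask, mask)   # recover accuracy
    return weights, mask

# Illustration only: a no-op "fine-tune" ramps a matrix to 90% sparsity in 9 steps.
w = np.random.randn(256, 128).astype(np.float32)
w_final, mask = iterative_prune(w, 0.9, steps=9, fine_tune=lambda w, m: w)
print(f"final sparsity: {1 - mask.mean():.2%}")
```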

#### Pruning + Quantization

- Prune 90% of weights → 10× reduction
- Quantize FP32 → INT8 → 4× reduction
- Combined: 40× compression (worked out below)
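
The arithmetic as a back-of-the-envelope check; it ignores the index overhead of real sparse storage formats, which cost a few extra bits per surviving weight:

```python
params = 10_000_000        # e.g. a 10M-parameter model
fp32_bytes = params * 4    # dense FP32: 40 MB

kept = params // 10        # prune 90% -> keep 1 weight in 10
int8_bytes = kept * 1      # quantize survivors FP32 -> INT8: 4x per weight

print(f"{fp32_bytes / int8_bytes:.0f}x compression")  # 40x
```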

## Testing

```bash
tito export 16_compression
tito test 16_compression
```

## Where This Code Lives

```text
tinytorch/
├── compression/
│   └── prune.py
└── __init__.py
```

## Systems Thinking Questions

1. Lottery Ticket Hypothesis: Why can pruned networks be retrained to full accuracy? What does this say about overparameterization?

2. Structured vs Unstructured: Unstructured pruning achieves better compression ratios but little or no speedup on dense hardware. Why? When is sparse computation actually faster?

3. Distillation vs Pruning: Both compress models. When would you use each? Can you combine them?

## Real-World Connections

- DistilBERT: 40% smaller BERT with 97% of its performance
- MobileNetV2: efficient architectures plus pruning for mobile
- NVIDIA TensorRT: automatic pruning and quantization for deployment

## What's Next?

In Module 17: Memoization, you'll learn computational reuse:

- KV-caching for transformers
- Eliminating redundant computation
- 10-15× speedups for autoregressive generation
- Memory-compute trade-offs

Ready to compress models? Open `modules/16_compression/compression_dev.py` and start implementing.