mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-07 09:37:31 -05:00
- Analyzed current TinyTorch foundation (modules 00-09) - Identified compression opportunities in Dense/CNN parameters - Ranked 4 compression techniques by educational value: 1. Magnitude-based pruning (★★★★★) - builds on weight matrices 2. Quantization FP32→INT8 (★★★★) - builds on tensor operations 3. Knowledge distillation (★★★★) - builds on training pipeline 4. Structured pruning (★★★) - builds on architecture design Educational progression: - Step 1: Parameter analysis and model size understanding - Step 2: Weight pruning with sparsity visualization - Step 3: Quantization experiments with bit-width trade-offs - Step 4: Teacher-student training with distillation loss - Step 5: Neuron removal and architecture modification - Step 6: Comprehensive technique comparison Real-world connections: - Mobile AI deployment constraints - Production ML system optimization - Research frontiers in model compression Perfect foundation for modules 11-13 (kernels, benchmarking, MLOps)
12 KiB
12 KiB
🗜️ Module 10: Compression & Optimization - Design Document
📊 Current Foundation Analysis
✅ What Students Already Know (Modules 00-09)
- Dense Layers: Weight matrices, bias vectors, Xavier initialization
- CNN Layers: 2D kernels, spatial processing, parameter sharing
- Model Architecture: Sequential composition, MLPs, CNNs
- Training Pipeline: Loss functions, optimizers, metrics, complete workflows
- Data Handling: Batch processing, DataLoader, real datasets
- Parameter Understanding: Shapes, initialization strategies, learned parameters
🎯 Compression Opportunities Identified
1. Dense Layer Parameters
- Weight matrices:
(input_size, output_size)- often largest component - Bias vectors:
(output_size,)- smaller but present in every layer - Compression potential: High - dense layers are parameter-heavy
2. CNN Parameters
- Kernels:
(kernel_height, kernel_width)- repeated across channels/filters - Compression potential: Moderate - already parameter-efficient through sharing
3. Model Architectures
- Sequential networks: Multiple layers with growing/shrinking dimensions
- Compression potential: High - architectural optimization can dramatically reduce size
🎓 Educational Compression Techniques (Ranked by Learning Value)
Priority 1: Magnitude-Based Pruning ⭐⭐⭐⭐⭐
Why this first: Builds directly on weight matrices students understand
Learning Objectives:
- Understand that not all parameters contribute equally to model performance
- Learn to identify and remove less important weights
- See the trade-off between model size and accuracy
- Experience sparsity in neural networks
Technical Implementation:
# Students will implement:
def prune_weights_by_magnitude(layer, pruning_ratio=0.5):
"""Remove smallest weights from Dense layer."""
weights = layer.weights.data
threshold = np.percentile(np.abs(weights), pruning_ratio * 100)
mask = np.abs(weights) > threshold
layer.weights.data = weights * mask
return layer
# Usage example:
dense_layer = Dense(784, 128)
compressed_layer = prune_weights_by_magnitude(dense_layer, pruning_ratio=0.3)
Educational Value:
- Immediate: See weight matrices become sparse
- Visual: Plot weight distributions before/after pruning
- Practical: Measure model size reduction and accuracy impact
- Conceptual: Understand parameter importance and redundancy
Priority 2: Quantization (FP32 → INT8) ⭐⭐⭐⭐
Why second: Builds on tensor operations students understand
Learning Objectives:
- Understand numerical precision trade-offs in ML
- Learn how reducing bits per parameter saves memory
- Experience the accuracy vs efficiency spectrum
- Connect to real mobile/edge deployment constraints
Technical Implementation:
# Students will implement:
def quantize_layer_weights(layer, bits=8):
"""Quantize layer weights to lower precision."""
weights = layer.weights.data
# Find min/max for quantization range
w_min, w_max = weights.min(), weights.max()
# Quantize to bits precision
scale = (w_max - w_min) / (2**bits - 1)
quantized = np.round((weights - w_min) / scale)
# Convert back to float (simulation of quantized weights)
dequantized = quantized * scale + w_min
layer.weights.data = dequantized.astype(np.float32)
return layer, scale, w_min
# Usage example:
layer = Dense(100, 50)
q_layer, scale, offset = quantize_layer_weights(layer, bits=8)
Educational Value:
- Mathematical: Understand linear quantization mapping
- Practical: See dramatic memory reduction (75% for FP32→INT8)
- Performance: Measure accuracy degradation vs compression
- Real-world: Connect to mobile AI and edge deployment
Priority 3: Knowledge Distillation ⭐⭐⭐⭐
Why third: Builds on training pipeline students just mastered
Learning Objectives:
- Learn how large models can teach small models
- Understand soft targets vs hard targets
- Experience training dynamics with teacher guidance
- See how knowledge can be compressed across architectures
Technical Implementation:
# Students will implement:
class DistillationLoss:
"""Combined loss for knowledge distillation."""
def __init__(self, temperature=3.0, alpha=0.5):
self.temperature = temperature
self.alpha = alpha
self.ce_loss = CrossEntropyLoss()
def __call__(self, student_logits, teacher_logits, true_labels):
# Hard loss (standard classification)
hard_loss = self.ce_loss(student_logits, true_labels)
# Soft loss (distillation from teacher)
soft_targets = softmax(teacher_logits / self.temperature)
soft_student = softmax(student_logits / self.temperature)
soft_loss = -np.sum(soft_targets * np.log(soft_student + 1e-10))
# Combined loss
return self.alpha * hard_loss + (1 - self.alpha) * soft_loss
# Usage example:
teacher = create_mlp(784, [512, 256, 128], 10) # Large model
student = create_mlp(784, [64, 32], 10) # Small model
distill_loss = DistillationLoss(temperature=3.0)
trainer = Trainer(student, optimizer, distill_loss)
Educational Value:
- Advanced Training: Beyond standard supervised learning
- Architecture Flexibility: Different sized models with same task
- Loss Design: Custom loss functions for specific objectives
- Transfer Learning: Knowledge transfer between models
Priority 4: Structured Pruning (Layer Width Reduction) ⭐⭐⭐
Why fourth: Builds on architecture design understanding
Learning Objectives:
- Understand structured vs unstructured sparsity
- Learn to remove entire neurons/channels systematically
- See how architecture changes affect model behavior
- Experience automated neural architecture search concepts
Technical Implementation:
# Students will implement:
def prune_layer_neurons(layer, importance_scores, keep_ratio=0.7):
"""Remove least important neurons from Dense layer."""
output_size = layer.output_size
keep_count = int(output_size * keep_ratio)
# Select most important neurons
top_indices = np.argsort(importance_scores)[-keep_count:]
# Prune weights and bias
layer.weights.data = layer.weights.data[:, top_indices]
if layer.bias is not None:
layer.bias.data = layer.bias.data[top_indices]
layer.output_size = keep_count
return layer
def compute_neuron_importance(layer, data_loader):
"""Compute importance scores for each neuron."""
# Students implement activation-based importance
pass
# Usage example:
importance = compute_neuron_importance(layer, train_loader)
compressed_layer = prune_layer_neurons(layer, importance, keep_ratio=0.6)
Educational Value:
- System Architecture: Modifying network structure itself
- Importance Metrics: Different ways to measure neuron contributions
- Cascade Effects: How pruning one layer affects next layers
- AutoML Connection: Automated architecture optimization
🎯 Module Structure (Educational Progression)
Step 1: Understanding Model Size and Parameters
- Count parameters in Dense and CNN layers
- Visualize parameter distributions
- Measure memory footprint of different architectures
- Build Foundation: "What makes models large?"
Step 2: Magnitude-Based Pruning
- Implement weight pruning with different thresholds
- Visualize sparse weight matrices
- Measure accuracy vs sparsity trade-offs
- Core Technique: "Remove unimportant weights"
Step 3: Quantization Experiments
- Implement FP32 → INT8 quantization
- Measure memory savings and accuracy impact
- Explore different bit widths (16-bit, 8-bit, 4-bit)
- Efficiency Focus: "Use fewer bits per parameter"
Step 4: Knowledge Distillation
- Train teacher model on full dataset
- Implement distillation loss function
- Train student model with teacher guidance
- Advanced Training: "Large models teach small models"
Step 5: Structured Pruning
- Implement neuron importance computation
- Remove entire neurons/channels
- Handle cascade effects on subsequent layers
- Architecture Optimization: "Modify network structure"
Step 6: Comprehensive Comparison
- Apply all techniques to same base model
- Create compression vs accuracy plots
- Benchmark inference speed improvements
- Systems Integration: "Combine techniques for maximum effect"
🛠️ Implementation Strategy
Building on Existing Components
- Dense layers: Primary target for compression techniques
- Training pipeline: Framework for measuring accuracy impact
- DataLoader: Consistent evaluation across compressed models
- Metrics: Accuracy measurement for compression trade-offs
New Components to Build
- CompressionMetrics: Model size, parameter count, sparsity measurement
- PruningUtils: Weight analysis, threshold selection, mask application
- QuantizationUtils: Bit-width conversion, scale/offset computation
- DistillationTrainer: Extended trainer for teacher-student training
- ComparisonTools: Visualization and benchmarking utilities
Educational Testing Framework
- Before/After Comparisons: Size, accuracy, speed for each technique
- Visualization Tools: Weight distributions, sparsity patterns, accuracy curves
- Interactive Exploration: Students experiment with different compression ratios
- Real-World Context: Connect to mobile deployment constraints
📚 Real-World Connections
Mobile and Edge AI
- Smartphone apps need small models (< 10MB)
- Embedded devices have severe memory constraints
- Battery life affected by computation intensity
- Student Understanding: Why compression matters in practice
Production ML Systems
- Cost optimization in cloud inference
- Latency requirements for real-time applications
- Memory bandwidth limitations in data centers
- Career Relevance: Skills needed for production deployment
Research Frontiers
- Neural architecture search (NAS)
- Hardware-aware model design
- Automatic compression techniques
- Advanced Topics: Connection to cutting-edge research
🎯 Success Metrics
Educational Outcomes
- Students understand parameter importance and redundancy
- Students can trade model size for accuracy systematically
- Students connect compression to real deployment constraints
- Students gain intuition for when different techniques work best
Technical Skills
- Implement 4 different compression techniques from scratch
- Measure and visualize compression trade-offs
- Modify existing models for better efficiency
- Design compression strategies for specific constraints
Real-World Preparation
- Understanding of mobile AI constraints
- Experience with production optimization techniques
- Knowledge of compression research landscape
- Skills for model deployment and optimization roles
🚀 Why This Module Design Works
Perfect Timing
- Students just mastered training (Module 9)
- Natural next step: optimize trained models
- Builds on solid foundation of layers, networks, training
Hands-On Learning
- Every technique implemented from scratch
- Immediate visual feedback on compression effects
- Real data and models, not toy examples
Progressive Complexity
- Start simple (magnitude pruning)
- Build to advanced (knowledge distillation)
- Integrate all techniques for maximum learning
Career Relevant
- Essential skills for production ML roles
- Understanding of efficiency constraints in real systems
- Foundation for research in model optimization
Foundation for Later Modules
- Benchmarking skills prepare for Module 12
- Performance optimization mindset prepares for Module 11
- Production awareness prepares for MLOps Module 13
This compression module design builds perfectly on students' current knowledge while introducing essential production ML skills. Students will gain practical experience with the efficiency techniques that make modern AI deployment possible!