mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-05-07 09:37:31 -05:00

Vijay Janapa Reddi 4922498170 Design: Module 10 Compression comprehensive analysis

- Analyzed current TinyTorch foundation (modules 00-09)
- Identified compression opportunities in Dense/CNN parameters
- Ranked 4 compression techniques by educational value:
  1. Magnitude-based pruning (★★★★★) - builds on weight matrices
  2. Quantization FP32→INT8 (★★★★) - builds on tensor operations
  3. Knowledge distillation (★★★★) - builds on training pipeline
  4. Structured pruning (★★★) - builds on architecture design

Educational progression:
- Step 1: Parameter analysis and model size understanding
- Step 2: Weight pruning with sparsity visualization
- Step 3: Quantization experiments with bit-width trade-offs
- Step 4: Teacher-student training with distillation loss
- Step 5: Neuron removal and architecture modification
- Step 6: Comprehensive technique comparison

Real-world connections:
- Mobile AI deployment constraints
- Production ML system optimization
- Research frontiers in model compression

Perfect foundation for modules 11-13 (kernels, benchmarking, MLOps)

2025-07-14 08:35:39 -04:00

🗜️ Module 10: Compression & Optimization - Design Document

📊 Current Foundation Analysis

✅ What Students Already Know (Modules 00-09)

Dense Layers: Weight matrices, bias vectors, Xavier initialization
CNN Layers: 2D kernels, spatial processing, parameter sharing
Model Architecture: Sequential composition, MLPs, CNNs
Training Pipeline: Loss functions, optimizers, metrics, complete workflows
Data Handling: Batch processing, DataLoader, real datasets
Parameter Understanding: Shapes, initialization strategies, learned parameters

🎯 Compression Opportunities Identified

1. Dense Layer Parameters

Weight matrices: (input_size, output_size) - often largest component
Bias vectors: (output_size,) - smaller but present in every layer
Compression potential: High - dense layers are parameter-heavy

2. CNN Parameters

Kernels: (kernel_height, kernel_width) - repeated across channels/filters
Compression potential: Moderate - already parameter-efficient through sharing

3. Model Architectures

Sequential networks: Multiple layers with growing/shrinking dimensions
Compression potential: High - architectural optimization can dramatically reduce size

🎓 Educational Compression Techniques (Ranked by Learning Value)

Priority 1: Magnitude-Based Pruning ⭐⭐⭐⭐⭐

Why this first: Builds directly on weight matrices students understand

Learning Objectives:

Understand that not all parameters contribute equally to model performance
Learn to identify and remove less important weights
See the trade-off between model size and accuracy
Experience sparsity in neural networks

Technical Implementation:

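# Students will implement:
import numpy as np

def prune_weights_by_magnitude(layer, pruning_ratio=0.5):
    """Zero out the smallest-magnitude weights of a Dense layer, in place."""
    if not 0.0 <= pruning_ratio <= 1.0:
        raise ValueError("pruning_ratio must be between 0 and 1")
    weights = layer.weights.data
    # Magnitude below which the requested fraction of weights falls
    threshold = np.percentile(np.abs(weights), pruning_ratio * 100)
    # Keep only weights whose magnitude is strictly above the threshold
    mask = np.abs(weights) > threshold
    layer.weights.data = weights * mask
    return layer

# Usage example (the layer is modified in place, so both names refer to the same object):
dense_layer = Dense(784, 128)
compressed_layer = prune_weights_by_magnitude(dense_layer, pruning_ratio=0.3)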

Educational Value:

Immediate: See weight matrices become sparse
Visual: Plot weight distributions before/after pruning
Practical: Measure model size reduction and accuracy impact
Conceptual: Understand parameter importance and redundancy
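
To make the "measure" step concrete, here is a minimal sketch that applies magnitude pruning to a plain NumPy array and reports the resulting sparsity. It assumes random weights standing in for a trained 784×128 Dense layer; `magnitude_prune` and `sparsity` are hypothetical helper names, not part of TinyTorch.

```python
import numpy as np

def magnitude_prune(weights, pruning_ratio):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    threshold = np.percentile(np.abs(weights), pruning_ratio * 100)
    return weights * (np.abs(weights) > threshold)

def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(weights == 0))

# Assumption: random values stand in for trained Dense(784, 128) weights
rng = np.random.default_rng(0)
weights = rng.normal(size=(784, 128)).astype(np.float32)

for ratio in (0.3, 0.5, 0.9):
    pruned = magnitude_prune(weights, ratio)
    print(f"ratio={ratio:.1f}  sparsity={sparsity(pruned):.3f}  "
          f"nonzero={np.count_nonzero(pruned)}")
```

The measured sparsity should track the requested ratio closely. Note that a dense array with zeros in it occupies the same memory as before; size savings only appear once the zeros are stored in a sparse format or removed structurally.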

Priority 2: Quantization (FP32 → INT8) ⭐⭐⭐⭐

Why second: Builds on tensor operations students understand

Learning Objectives:

Understand numerical precision trade-offs in ML
Learn how reducing bits per parameter saves memory
Experience the accuracy vs efficiency spectrum
Connect to real mobile/edge deployment constraints

Technical Implementation:

# Students will implement:
import numpy as np

def quantize_layer_weights(layer, bits=8):
    """Simulate quantizing layer weights to `bits` of precision."""
    weights = layer.weights.data

    # Find min/max for quantization range
    w_min, w_max = weights.min(), weights.max()

    # Step size between adjacent quantization levels
    scale = (w_max - w_min) / (2**bits - 1)
    if scale == 0:
        # All weights are identical; avoid dividing by zero
        scale = 1.0
    quantized = np.round((weights - w_min) / scale)

    # Convert back to float (simulation of quantized weights)
    dequantized = quantized * scale + w_min

    layer.weights.data = dequantized.astype(np.float32)
    return layer, scale, w_min

# Usage example:
layer = Dense(100, 50)
q_layer, scale, offset = quantize_layer_weights(layer, bits=8)

Educational Value:

Mathematical: Understand linear quantization mapping
Practical: See the memory reduction (75% for FP32→INT8 once the weights are actually stored as 8-bit integers; the simulation above keeps them in float32)
Performance: Measure accuracy degradation vs compression
Real-world: Connect to mobile AI and edge deployment
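
The 75% figure can be checked directly by storing the quantized codes in an 8-bit array. The sketch below is one way to do that, assuming unsigned 8-bit codes with a min/scale offset (asymmetric linear quantization); `quantize_to_uint8` and `dequantize` are hypothetical helper names, and the random weights stand in for a real layer.

```python
import numpy as np

def quantize_to_uint8(weights):
    """Map float weights onto 256 evenly spaced levels stored as uint8 codes."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # all weights equal; any scale reproduces them
    codes = np.clip(np.round((weights - w_min) / scale), 0, 255).astype(np.uint8)
    return codes, scale, w_min

def dequantize(codes, scale, w_min):
    """Recover approximate float32 weights from the stored codes."""
    return codes.astype(np.float32) * scale + w_min

# Assumption: random values stand in for Dense(100, 50) weights
rng = np.random.default_rng(0)
weights = rng.normal(size=(100, 50)).astype(np.float32)

codes, scale, w_min = quantize_to_uint8(weights)
restored = dequantize(codes, scale, w_min)

print("float32 bytes:", weights.nbytes)
print("uint8 bytes:  ", codes.nbytes)
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

The code array is a quarter of the original size; the scale and offset add a few bytes per layer. The worst-case reconstruction error is about half a quantization step.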

Priority 3: Knowledge Distillation ⭐⭐⭐⭐

Why third: Builds on training pipeline students just mastered

Learning Objectives:

Learn how large models can teach small models
Understand soft targets vs hard targets
Experience training dynamics with teacher guidance
See how knowledge can be compressed across architectures
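
The difference between soft and hard targets is easiest to see numerically. The following sketch, with made-up teacher logits and a hypothetical helper `softmax_with_temperature`, shows how raising the temperature spreads probability mass onto the non-argmax classes while leaving the predicted class unchanged.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over the last axis after dividing the logits by `temperature`."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

# Assumption: illustrative logits for a 3-class problem
teacher_logits = np.array([6.0, 2.0, 1.0])

# Hard target: all probability on the predicted class
hard_target = np.eye(len(teacher_logits))[np.argmax(teacher_logits)]
print("hard target:", hard_target)

# Soft targets: higher temperature gives a flatter distribution
for t in (1.0, 3.0, 10.0):
    print(f"T={t:>4}:", np.round(softmax_with_temperature(teacher_logits, t), 3))
```

The relative sizes of the small probabilities carry the teacher's view of which wrong classes are most plausible, which is the extra signal a student does not get from one-hot labels.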

Technical Implementation:

# Students will implement:
import numpy as np

# softmax and CrossEntropyLoss are assumed to come from earlier TinyTorch modules
class DistillationLoss:
    """Combined loss for knowledge distillation."""
    def __init__(self, temperature=3.0, alpha=0.5):
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = CrossEntropyLoss()

    def __call__(self, student_logits, teacher_logits, true_labels):
        # Hard loss (standard classification)
        hard_loss = self.ce_loss(student_logits, true_labels)

        # Soft loss (distillation from teacher): cross-entropy against the
        # teacher's softened distribution, averaged over the batch
        soft_targets = softmax(teacher_logits / self.temperature)
        soft_student = softmax(student_logits / self.temperature)
        per_example = -np.sum(soft_targets * np.log(soft_student + 1e-10), axis=-1)
        # Scaling by T^2 keeps soft-loss gradients comparable across temperatures
        soft_loss = np.mean(per_example) * self.temperature ** 2

        # Combined loss
        return self.alpha * hard_loss + (1 - self.alpha) * soft_loss

# Usage example:
teacher = create_mlp(784, [512, 256, 128], 10)  # Large model
student = create_mlp(784, [64, 32], 10)         # Small model

distill_loss = DistillationLoss(temperature=3.0)
# The trainer must pass teacher logits to the loss in addition to predictions and labels
trainer = Trainer(student, optimizer, distill_loss)

Educational Value:

Advanced Training: Beyond standard supervised learning
Architecture Flexibility: Different sized models with same task
Loss Design: Custom loss functions for specific objectives
Transfer Learning: Knowledge transfer between models

Priority 4: Structured Pruning (Layer Width Reduction) ⭐⭐⭐

Why fourth: Builds on architecture design understanding

Learning Objectives:

Understand structured vs unstructured sparsity
Learn to remove entire neurons/channels systematically
See how architecture changes affect model behavior
Experience automated neural architecture search concepts

Technical Implementation:

# Students will implement:
import numpy as np

def prune_layer_neurons(layer, importance_scores, keep_ratio=0.7):
    """Remove least important neurons from Dense layer."""
    output_size = layer.output_size
    # Keep at least one neuron; a slice of [-0:] would otherwise keep all of them
    keep_count = max(1, int(output_size * keep_ratio))

    # Select most important neurons, preserving their original order
    top_indices = np.sort(np.argsort(importance_scores)[-keep_count:])

    # Prune weights and bias
    layer.weights.data = layer.weights.data[:, top_indices]
    if layer.bias is not None:
        layer.bias.data = layer.bias.data[top_indices]

    layer.output_size = keep_count
    # The next layer's input dimension must be pruned with the same indices
    return layer

def compute_neuron_importance(layer, data_loader):
    """Compute importance scores for each neuron."""
    # Students implement activation-based importance
    raise NotImplementedError

# Usage example:
importance = compute_neuron_importance(layer, train_loader)
compressed_layer = prune_layer_neurons(layer, importance, keep_ratio=0.6)

Educational Value:

System Architecture: Modifying network structure itself
Importance Metrics: Different ways to measure neuron contributions
Cascade Effects: How pruning one layer affects next layers
AutoML Connection: Automated architecture optimization

🎯 Module Structure (Educational Progression)

Step 1: Understanding Model Size and Parameters

Count parameters in Dense and CNN layers
Visualize parameter distributions
Measure memory footprint of different architectures
Build Foundation: "What makes models large?"
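
As a starting point for this step, parameter counts and memory footprints follow directly from layer shapes. The sketch below uses hypothetical helper names (`dense_param_count`, `conv2d_param_count`, `memory_bytes`) and assumes one bias per output unit or filter.

```python
def dense_param_count(input_size, output_size, use_bias=True):
    """One weight per input/output pair, plus one bias per output."""
    return input_size * output_size + (output_size if use_bias else 0)

def conv2d_param_count(in_channels, out_channels, kernel_height, kernel_width,
                       use_bias=True):
    """One kernel per (input channel, filter) pair, plus one bias per filter."""
    weights = out_channels * in_channels * kernel_height * kernel_width
    return weights + (out_channels if use_bias else 0)

def memory_bytes(param_count, bits_per_param=32):
    """Storage needed for the parameters alone at a given precision."""
    return param_count * bits_per_param // 8

dense_params = dense_param_count(784, 128)      # 784*128 + 128 = 100,480
conv_params = conv2d_param_count(1, 32, 3, 3)   # 32*1*3*3 + 32 = 320

print("Dense(784, 128):", dense_params, "params,",
      memory_bytes(dense_params), "bytes in FP32")
print("Conv 3x3, 1->32:", conv_params, "params,",
      memory_bytes(conv_params), "bytes in FP32")
```

A single 784→128 Dense layer holds over 300 times as many parameters as a 3×3 convolution with 32 filters on one input channel, which is why the dense layers are the primary compression target.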

Step 2: Magnitude-Based Pruning

Implement weight pruning with different thresholds
Visualize sparse weight matrices
Measure accuracy vs sparsity trade-offs
Core Technique: "Remove unimportant weights"

Step 3: Quantization Experiments

Implement FP32 → INT8 quantization
Measure memory savings and accuracy impact
Explore different bit widths (16-bit, 8-bit, 4-bit)
Efficiency Focus: "Use fewer bits per parameter"
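
A bit-width sweep might look like the following sketch, which simulates linear quantization at 16, 8, and 4 bits on random weights and reports the mean reconstruction error alongside the nominal compression ratio. `simulate_quantization` is a hypothetical helper, and the random weights are an assumption standing in for a trained layer.

```python
import numpy as np

def simulate_quantization(weights, bits):
    """Round weights to 2**bits evenly spaced levels, then map back to floats."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (2**bits - 1)
    if scale == 0:
        return weights.copy()  # all weights equal; nothing to quantize
    return np.round((weights - w_min) / scale) * scale + w_min

# Assumption: random values stand in for trained weights
rng = np.random.default_rng(0)
weights = rng.normal(size=(100, 50)).astype(np.float32)

errors = {}
for bits in (16, 8, 4):
    restored = simulate_quantization(weights, bits)
    errors[bits] = float(np.mean(np.abs(weights - restored)))
    print(f"{bits:>2}-bit  compression vs FP32: {32 / bits:.0f}x  "
          f"mean abs error: {errors[bits]:.6f}")
```

Each halving of the bit width doubles the nominal compression, while the quantization step (and so the error) grows much faster, since the number of levels drops from 65,536 to 256 to 16.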

Step 4: Knowledge Distillation

Train teacher model on full dataset
Implement distillation loss function
Train student model with teacher guidance
Advanced Training: "Large models teach small models"

Step 5: Structured Pruning

Implement neuron importance computation
Remove entire neurons/channels
Handle cascade effects on subsequent layers
Architecture Optimization: "Modify network structure"
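
The cascade effect can be sketched with two plain weight matrices: removing hidden neurons means dropping columns of the first layer's weights, the matching bias entries, and the matching rows of the next layer's weights. This example assumes the `(input_size, output_size)` weight layout described earlier and uses weight-column L2 norm as a stand-in importance score (the module itself plans activation-based importance); `prune_dense_pair` and `forward` are hypothetical helpers.

```python
import numpy as np

def prune_dense_pair(w1, b1, w2, keep_indices):
    """Drop hidden neurons from layer 1 and the matching inputs of layer 2."""
    return w1[:, keep_indices], b1[keep_indices], w2[keep_indices, :]

def forward(x, w1, b1, w2):
    """Two-layer forward pass with ReLU on the hidden layer (no output bias)."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2

rng = np.random.default_rng(0)
w1 = rng.normal(size=(8, 6))   # 8 inputs -> 6 hidden neurons
b1 = rng.normal(size=6)
w2 = rng.normal(size=(6, 3))   # 6 hidden neurons -> 3 outputs

# Assumption: weight-norm importance as a simple stand-in metric
importance = np.linalg.norm(w1, axis=0)
keep = np.sort(np.argsort(importance)[-4:])   # keep the 4 strongest neurons

w1_p, b1_p, w2_p = prune_dense_pair(w1, b1, w2, keep)
x = rng.normal(size=(5, 8))
print("pruned shapes:", w1_p.shape, b1_p.shape, w2_p.shape)
print("output shape: ", forward(x, w1_p, b1_p, w2_p).shape)
```

Pruning only the first layer would leave a shape mismatch at the second layer's input; slicing both with the same indices keeps the network runnable and gives the same output as zeroing the removed neurons' activations in the original network.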

Step 6: Comprehensive Comparison

Apply all techniques to same base model
Create compression vs accuracy plots
Benchmark inference speed improvements
Systems Integration: "Combine techniques for maximum effect"

🛠️ Implementation Strategy

Building on Existing Components

Dense layers: Primary target for compression techniques
Training pipeline: Framework for measuring accuracy impact
DataLoader: Consistent evaluation across compressed models
Metrics: Accuracy measurement for compression trade-offs

New Components to Build

CompressionMetrics: Model size, parameter count, sparsity measurement
PruningUtils: Weight analysis, threshold selection, mask application
QuantizationUtils: Bit-width conversion, scale/offset computation
DistillationTrainer: Extended trainer for teacher-student training
ComparisonTools: Visualization and benchmarking utilities

Educational Testing Framework

Before/After Comparisons: Size, accuracy, speed for each technique
Visualization Tools: Weight distributions, sparsity patterns, accuracy curves
Interactive Exploration: Students experiment with different compression ratios
Real-World Context: Connect to mobile deployment constraints

📚 Real-World Connections

Mobile and Edge AI

Smartphone apps often need small models (for example, under 10 MB)
Embedded devices have severe memory constraints
Battery life affected by computation intensity
Student Understanding: Why compression matters in practice

Production ML Systems

Cost optimization in cloud inference
Latency requirements for real-time applications
Memory bandwidth limitations in data centers
Career Relevance: Skills needed for production deployment

Research Frontiers

Neural architecture search (NAS)
Hardware-aware model design
Automatic compression techniques
Advanced Topics: Connection to cutting-edge research

🎯 Success Metrics

Educational Outcomes

Students understand parameter importance and redundancy
Students can trade model size for accuracy systematically
Students connect compression to real deployment constraints
Students gain intuition for when different techniques work best

Technical Skills

Implement 4 different compression techniques from scratch
Measure and visualize compression trade-offs
Modify existing models for better efficiency
Design compression strategies for specific constraints

Real-World Preparation

Understanding of mobile AI constraints
Experience with production optimization techniques
Knowledge of compression research landscape
Skills for model deployment and optimization roles

🚀 Why This Module Design Works

Perfect Timing

Students just mastered training (Module 9)
Natural next step: optimize trained models
Builds on solid foundation of layers, networks, training

Hands-On Learning

Every technique implemented from scratch
Immediate visual feedback on compression effects
Real data and models, not toy examples

Progressive Complexity

Start simple (magnitude pruning)
Build to advanced (knowledge distillation)
Integrate all techniques for maximum learning

Career Relevant

Essential skills for production ML roles
Understanding of efficiency constraints in real systems
Foundation for research in model optimization

Foundation for Later Modules

Benchmarking skills prepare for Module 12
Performance optimization mindset prepares for Module 11
Production awareness prepares for MLOps Module 13

This compression module design builds perfectly on students' current knowledge while introducing essential production ML skills. Students will gain practical experience with the efficiency techniques that make modern AI deployment possible!
