Files
TinyTorch/compression_module_design.md
Vijay Janapa Reddi 4922498170 Design: Module 10 Compression comprehensive analysis
- Analyzed current TinyTorch foundation (modules 00-09)
- Identified compression opportunities in Dense/CNN parameters
- Ranked 4 compression techniques by educational value:
  1. Magnitude-based pruning (★★★★★) - builds on weight matrices
  2. Quantization FP32→INT8 (★★★★) - builds on tensor operations
  3. Knowledge distillation (★★★★) - builds on training pipeline
  4. Structured pruning (★★★) - builds on architecture design

Educational progression:
- Step 1: Parameter analysis and model size understanding
- Step 2: Weight pruning with sparsity visualization
- Step 3: Quantization experiments with bit-width trade-offs
- Step 4: Teacher-student training with distillation loss
- Step 5: Neuron removal and architecture modification
- Step 6: Comprehensive technique comparison

Real-world connections:
- Mobile AI deployment constraints
- Production ML system optimization
- Research frontiers in model compression

Perfect foundation for modules 11-13 (kernels, benchmarking, MLOps)
2025-07-14 08:35:39 -04:00

12 KiB

🗜️ Module 10: Compression & Optimization - Design Document

📊 Current Foundation Analysis

What Students Already Know (Modules 00-09)

  • Dense Layers: Weight matrices, bias vectors, Xavier initialization
  • CNN Layers: 2D kernels, spatial processing, parameter sharing
  • Model Architecture: Sequential composition, MLPs, CNNs
  • Training Pipeline: Loss functions, optimizers, metrics, complete workflows
  • Data Handling: Batch processing, DataLoader, real datasets
  • Parameter Understanding: Shapes, initialization strategies, learned parameters

🎯 Compression Opportunities Identified

1. Dense Layer Parameters

  • Weight matrices: (input_size, output_size) - often largest component
  • Bias vectors: (output_size,) - smaller but present in every layer
  • Compression potential: High - dense layers are parameter-heavy

2. CNN Parameters

  • Kernels: (kernel_height, kernel_width) - repeated across channels/filters
  • Compression potential: Moderate - already parameter-efficient through sharing

3. Model Architectures

  • Sequential networks: Multiple layers with growing/shrinking dimensions
  • Compression potential: High - architectural optimization can dramatically reduce size

🎓 Educational Compression Techniques (Ranked by Learning Value)

Priority 1: Magnitude-Based Pruning

Why this first: Builds directly on weight matrices students understand

Learning Objectives:

  • Understand that not all parameters contribute equally to model performance
  • Learn to identify and remove less important weights
  • See the trade-off between model size and accuracy
  • Experience sparsity in neural networks

Technical Implementation:

# Students will implement:
def prune_weights_by_magnitude(layer, pruning_ratio=0.5):
    """Remove smallest weights from Dense layer."""
    weights = layer.weights.data
    threshold = np.percentile(np.abs(weights), pruning_ratio * 100)
    mask = np.abs(weights) > threshold
    layer.weights.data = weights * mask
    return layer

# Usage example:
dense_layer = Dense(784, 128)
compressed_layer = prune_weights_by_magnitude(dense_layer, pruning_ratio=0.3)

Educational Value:

  • Immediate: See weight matrices become sparse
  • Visual: Plot weight distributions before/after pruning
  • Practical: Measure model size reduction and accuracy impact
  • Conceptual: Understand parameter importance and redundancy

Priority 2: Quantization (FP32 → INT8)

Why second: Builds on tensor operations students understand

Learning Objectives:

  • Understand numerical precision trade-offs in ML
  • Learn how reducing bits per parameter saves memory
  • Experience the accuracy vs efficiency spectrum
  • Connect to real mobile/edge deployment constraints

Technical Implementation:

# Students will implement:
def quantize_layer_weights(layer, bits=8):
    """Quantize layer weights to lower precision."""
    weights = layer.weights.data
    
    # Find min/max for quantization range
    w_min, w_max = weights.min(), weights.max()
    
    # Quantize to bits precision
    scale = (w_max - w_min) / (2**bits - 1)
    quantized = np.round((weights - w_min) / scale)
    
    # Convert back to float (simulation of quantized weights)
    dequantized = quantized * scale + w_min
    
    layer.weights.data = dequantized.astype(np.float32)
    return layer, scale, w_min

# Usage example:
layer = Dense(100, 50)
q_layer, scale, offset = quantize_layer_weights(layer, bits=8)

Educational Value:

  • Mathematical: Understand linear quantization mapping
  • Practical: See dramatic memory reduction (75% for FP32→INT8)
  • Performance: Measure accuracy degradation vs compression
  • Real-world: Connect to mobile AI and edge deployment

Priority 3: Knowledge Distillation

Why third: Builds on training pipeline students just mastered

Learning Objectives:

  • Learn how large models can teach small models
  • Understand soft targets vs hard targets
  • Experience training dynamics with teacher guidance
  • See how knowledge can be compressed across architectures

Technical Implementation:

# Students will implement:
class DistillationLoss:
    """Combined loss for knowledge distillation."""
    def __init__(self, temperature=3.0, alpha=0.5):
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = CrossEntropyLoss()
        
    def __call__(self, student_logits, teacher_logits, true_labels):
        # Hard loss (standard classification)
        hard_loss = self.ce_loss(student_logits, true_labels)
        
        # Soft loss (distillation from teacher)
        soft_targets = softmax(teacher_logits / self.temperature)
        soft_student = softmax(student_logits / self.temperature)
        soft_loss = -np.sum(soft_targets * np.log(soft_student + 1e-10))
        
        # Combined loss
        return self.alpha * hard_loss + (1 - self.alpha) * soft_loss

# Usage example:
teacher = create_mlp(784, [512, 256, 128], 10)  # Large model
student = create_mlp(784, [64, 32], 10)         # Small model

distill_loss = DistillationLoss(temperature=3.0)
trainer = Trainer(student, optimizer, distill_loss)

Educational Value:

  • Advanced Training: Beyond standard supervised learning
  • Architecture Flexibility: Different sized models with same task
  • Loss Design: Custom loss functions for specific objectives
  • Transfer Learning: Knowledge transfer between models

Priority 4: Structured Pruning (Layer Width Reduction)

Why fourth: Builds on architecture design understanding

Learning Objectives:

  • Understand structured vs unstructured sparsity
  • Learn to remove entire neurons/channels systematically
  • See how architecture changes affect model behavior
  • Experience automated neural architecture search concepts

Technical Implementation:

# Students will implement:
def prune_layer_neurons(layer, importance_scores, keep_ratio=0.7):
    """Remove least important neurons from Dense layer."""
    output_size = layer.output_size
    keep_count = int(output_size * keep_ratio)
    
    # Select most important neurons
    top_indices = np.argsort(importance_scores)[-keep_count:]
    
    # Prune weights and bias
    layer.weights.data = layer.weights.data[:, top_indices]
    if layer.bias is not None:
        layer.bias.data = layer.bias.data[top_indices]
        
    layer.output_size = keep_count
    return layer

def compute_neuron_importance(layer, data_loader):
    """Compute importance scores for each neuron."""
    # Students implement activation-based importance
    pass

# Usage example:
importance = compute_neuron_importance(layer, train_loader)
compressed_layer = prune_layer_neurons(layer, importance, keep_ratio=0.6)

Educational Value:

  • System Architecture: Modifying network structure itself
  • Importance Metrics: Different ways to measure neuron contributions
  • Cascade Effects: How pruning one layer affects next layers
  • AutoML Connection: Automated architecture optimization

🎯 Module Structure (Educational Progression)

Step 1: Understanding Model Size and Parameters

  • Count parameters in Dense and CNN layers
  • Visualize parameter distributions
  • Measure memory footprint of different architectures
  • Build Foundation: "What makes models large?"

Step 2: Magnitude-Based Pruning

  • Implement weight pruning with different thresholds
  • Visualize sparse weight matrices
  • Measure accuracy vs sparsity trade-offs
  • Core Technique: "Remove unimportant weights"

Step 3: Quantization Experiments

  • Implement FP32 → INT8 quantization
  • Measure memory savings and accuracy impact
  • Explore different bit widths (16-bit, 8-bit, 4-bit)
  • Efficiency Focus: "Use fewer bits per parameter"

Step 4: Knowledge Distillation

  • Train teacher model on full dataset
  • Implement distillation loss function
  • Train student model with teacher guidance
  • Advanced Training: "Large models teach small models"

Step 5: Structured Pruning

  • Implement neuron importance computation
  • Remove entire neurons/channels
  • Handle cascade effects on subsequent layers
  • Architecture Optimization: "Modify network structure"

Step 6: Comprehensive Comparison

  • Apply all techniques to same base model
  • Create compression vs accuracy plots
  • Benchmark inference speed improvements
  • Systems Integration: "Combine techniques for maximum effect"

🛠️ Implementation Strategy

Building on Existing Components

  • Dense layers: Primary target for compression techniques
  • Training pipeline: Framework for measuring accuracy impact
  • DataLoader: Consistent evaluation across compressed models
  • Metrics: Accuracy measurement for compression trade-offs

New Components to Build

  1. CompressionMetrics: Model size, parameter count, sparsity measurement
  2. PruningUtils: Weight analysis, threshold selection, mask application
  3. QuantizationUtils: Bit-width conversion, scale/offset computation
  4. DistillationTrainer: Extended trainer for teacher-student training
  5. ComparisonTools: Visualization and benchmarking utilities

Educational Testing Framework

  • Before/After Comparisons: Size, accuracy, speed for each technique
  • Visualization Tools: Weight distributions, sparsity patterns, accuracy curves
  • Interactive Exploration: Students experiment with different compression ratios
  • Real-World Context: Connect to mobile deployment constraints

📚 Real-World Connections

Mobile and Edge AI

  • Smartphone apps need small models (< 10MB)
  • Embedded devices have severe memory constraints
  • Battery life affected by computation intensity
  • Student Understanding: Why compression matters in practice

Production ML Systems

  • Cost optimization in cloud inference
  • Latency requirements for real-time applications
  • Memory bandwidth limitations in data centers
  • Career Relevance: Skills needed for production deployment

Research Frontiers

  • Neural architecture search (NAS)
  • Hardware-aware model design
  • Automatic compression techniques
  • Advanced Topics: Connection to cutting-edge research

🎯 Success Metrics

Educational Outcomes

  • Students understand parameter importance and redundancy
  • Students can trade model size for accuracy systematically
  • Students connect compression to real deployment constraints
  • Students gain intuition for when different techniques work best

Technical Skills

  • Implement 4 different compression techniques from scratch
  • Measure and visualize compression trade-offs
  • Modify existing models for better efficiency
  • Design compression strategies for specific constraints

Real-World Preparation

  • Understanding of mobile AI constraints
  • Experience with production optimization techniques
  • Knowledge of compression research landscape
  • Skills for model deployment and optimization roles

🚀 Why This Module Design Works

Perfect Timing

  • Students just mastered training (Module 9)
  • Natural next step: optimize trained models
  • Builds on solid foundation of layers, networks, training

Hands-On Learning

  • Every technique implemented from scratch
  • Immediate visual feedback on compression effects
  • Real data and models, not toy examples

Progressive Complexity

  • Start simple (magnitude pruning)
  • Build to advanced (knowledge distillation)
  • Integrate all techniques for maximum learning

Career Relevant

  • Essential skills for production ML roles
  • Understanding of efficiency constraints in real systems
  • Foundation for research in model optimization

Foundation for Later Modules

  • Benchmarking skills prepare for Module 12
  • Performance optimization mindset prepares for Module 11
  • Production awareness prepares for MLOps Module 13

This compression module design builds perfectly on students' current knowledge while introducing essential production ML skills. Students will gain practical experience with the efficiency techniques that make modern AI deployment possible!