{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "7c0b2b14",
      "metadata": {
        "cell_marker": "\"\"\""
      },
      "source": [
        "# Module 18: Compression - Making Models Smaller\n",
        "\n",
        "Welcome to Module 18! You're about to build model compression techniques that make neural networks smaller and more efficient while preserving their intelligence.\n",
        "\n",
        "## 🔗 Prerequisites & Progress\n",
        "**You've Built**: Full TinyGPT pipeline with profiling, acceleration, and quantization\n",
        "**You'll Build**: Pruning (magnitude & structured), knowledge distillation, and low-rank approximation\n",
        "**You'll Enable**: Compressed models that maintain accuracy while using dramatically less storage and memory\n",
        "\n",
        "**Connection Map**:\n",
        "```\n",
        "Quantization → Compression → Benchmarking\n",
        "(precision)    (sparsity)    (evaluation)\n",
        "```\n",
        "\n",
        "## Learning Objectives\n",
        "By the end of this module, you will:\n",
        "1. Implement magnitude-based and structured pruning\n",
        "2. Build knowledge distillation for model compression\n",
        "3. Create low-rank approximations of weight matrices\n",
        "4. Measure compression ratios and sparsity levels\n",
        "5. Understand structured vs unstructured sparsity trade-offs\n",
        "\n",
        "Let's get started!\n",
        "\n",
        "## 📦 Where This Code Lives in the Final Package\n",
        "\n",
        "**Learning Side:** You work in `modules/18_compression/compression_dev.py` \n",
        "**Building Side:** Code exports to `tinytorch.optimization.compression`\n",
        "\n",
        "```python\n",
        "# How to use this module:\n",
        "from tinytorch.optimization.compression import magnitude_prune, structured_prune, measure_sparsity\n",
        "```\n",
        "\n",
        "**Why this matters:**\n",
        "- **Learning:** Complete compression system in one focused module for deep understanding\n",
        "- **Production:** Proper organization like real compression libraries with all techniques together\n",
        "- **Consistency:** All compression operations and sparsity management in optimization.compression\n",
        "- **Integration:** Works seamlessly with models and quantization for complete optimization pipeline"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "37872416",
      "metadata": {
        "lines_to_next_cell": 1,
        "nbgrader": {
          "grade": false,
          "grade_id": "imports",
          "solution": true
        }
      },
      "outputs": [],
      "source": [
        "#| default_exp optimization.compression\n",
        "#| export\n",
        "\n",
        "import numpy as np\n",
        "import copy\n",
        "from typing import List, Dict, Any, Tuple, Optional\n",
        "import time\n",
        "\n",
        "# Import from previous modules\n",
        "# Note: In the full package, these would be imports like:\n",
        "# from tinytorch.core.tensor import Tensor\n",
        "# from tinytorch.core.layers import Linear\n",
        "# For development, we'll create minimal implementations\n",
        "\n",
        "class Tensor:\n",
        "    \"\"\"Minimal Tensor class for compression development - imports from Module 01 in practice.\"\"\"\n",
        "    def __init__(self, data, requires_grad=False):\n",
        "        self.data = np.array(data)\n",
        "        self.shape = self.data.shape\n",
        "        self.size = self.data.size\n",
        "        self.requires_grad = requires_grad\n",
        "        self.grad = None\n",
        "\n",
        "    def __add__(self, other):\n",
        "        if isinstance(other, Tensor):\n",
        "            return Tensor(self.data + other.data)\n",
        "        return Tensor(self.data + other)\n",
        "\n",
        "    def __mul__(self, other):\n",
        "        if isinstance(other, Tensor):\n",
        "            return Tensor(self.data * other.data)\n",
        "        return Tensor(self.data * other)\n",
        "\n",
        "    def matmul(self, other):\n",
        "        return Tensor(np.dot(self.data, other.data))\n",
        "\n",
        "    def abs(self):\n",
        "        return Tensor(np.abs(self.data))\n",
        "\n",
        "    def sum(self, axis=None):\n",
        "        return Tensor(self.data.sum(axis=axis))\n",
        "\n",
        "    def __repr__(self):\n",
        "        return f\"Tensor(shape={self.shape})\"\n",
        "\n",
"class Linear:\n",
|
||
" \"\"\"Minimal Linear layer for compression development - imports from Module 03 in practice.\"\"\"\n",
|
||
" def __init__(self, in_features, out_features, bias=True):\n",
|
||
" self.in_features = in_features\n",
|
||
" self.out_features = out_features\n",
|
||
" # Initialize with He initialization\n",
|
||
" self.weight = Tensor(np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features))\n",
|
||
" self.bias = Tensor(np.zeros(out_features)) if bias else None\n",
|
||
"\n",
|
||
" def forward(self, x):\n",
|
||
" output = x.matmul(self.weight)\n",
|
||
" if self.bias is not None:\n",
|
||
" output = output + self.bias\n",
|
||
" return output\n",
|
||
"\n",
|
||
" def parameters(self):\n",
|
||
" params = [self.weight]\n",
|
||
" if self.bias is not None:\n",
|
||
" params.append(self.bias)\n",
|
||
" return params\n",
|
||
"\n",
|
||
"class Sequential:\n",
|
||
" \"\"\"Minimal Sequential container for model compression.\"\"\"\n",
|
||
" def __init__(self, *layers):\n",
|
||
" self.layers = list(layers)\n",
|
||
"\n",
|
||
" def forward(self, x):\n",
|
||
" for layer in self.layers:\n",
|
||
" x = layer.forward(x)\n",
|
||
" return x\n",
|
||
"\n",
|
||
" def parameters(self):\n",
|
||
" params = []\n",
|
||
" for layer in self.layers:\n",
|
||
" if hasattr(layer, 'parameters'):\n",
|
||
" params.extend(layer.parameters())\n",
|
||
" return params"
|
||
]
|
||
},
|
||
    {
      "cell_type": "markdown",
      "id": "252e20ce",
      "metadata": {
        "cell_marker": "\"\"\""
      },
      "source": [
        "## 1. Introduction: What is Model Compression?\n",
        "\n",
        "Imagine you have a massive library with millions of books, but you only reference 10% of them regularly. Model compression is like creating a curated collection that keeps the essential knowledge while dramatically reducing storage space.\n",
        "\n",
        "Model compression reduces the size and computational requirements of neural networks while preserving their intelligence. It's the bridge between powerful research models and practical deployment.\n",
        "\n",
        "### Why Compression Matters in ML Systems\n",
        "\n",
        "**The Storage Challenge:**\n",
        "- Modern language models: 100GB+ (GPT-3 scale)\n",
        "- Mobile devices: <1GB available for models\n",
        "- Edge devices: <100MB realistic limits\n",
        "- Network bandwidth: Slow downloads kill user experience\n",
        "\n",
        "**The Speed Challenge:**\n",
        "- Research models: Designed for accuracy, not efficiency\n",
        "- Production needs: Sub-second response times\n",
        "- Battery life: Energy consumption matters for mobile\n",
        "- Cost scaling: Inference costs grow with model size\n",
        "\n",
        "### The Compression Landscape\n",
        "\n",
        "```\n",
        "Neural Network Compression Techniques:\n",
        "\n",
        "┌─────────────────────────────────────────────────────────────┐\n",
        "│                    COMPRESSION METHODS                      │\n",
        "├─────────────────────────────────────────────────────────────┤\n",
        "│  WEIGHT-BASED                    │  ARCHITECTURE-BASED      │\n",
        "│  ┌─────────────────────────────┐ │ ┌─────────────────────┐  │\n",
        "│  │ Magnitude Pruning           │ │ │ Knowledge           │  │\n",
        "│  │ • Remove small weights      │ │ │ Distillation        │  │\n",
        "│  │ • 90% sparsity achievable   │ │ │ • Teacher → Student │  │\n",
        "│  │                             │ │ │ • 10x size reduction│  │\n",
        "│  │ Structured Pruning          │ │ │                     │  │\n",
        "│  │ • Remove entire channels    │ │ │ Neural Architecture │  │\n",
        "│  │ • Hardware-friendly         │ │ │ Search (NAS)        │  │\n",
        "│  │                             │ │ │ • Automated design  │  │\n",
        "│  │ Low-Rank Approximation      │ │ │                     │  │\n",
        "│  │ • Matrix factorization      │ │ │ Early Exit          │  │\n",
        "│  │ • SVD decomposition         │ │ │ • Adaptive compute  │  │\n",
        "│  └─────────────────────────────┘ │ └─────────────────────┘  │\n",
        "└─────────────────────────────────────────────────────────────┘\n",
        "```\n",
        "\n",
        "Think of compression like optimizing a recipe - you want to keep the essential ingredients that create the flavor while removing anything that doesn't contribute to the final dish."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "30325dfe",
      "metadata": {
        "cell_marker": "\"\"\""
      },
      "source": [
        "## 2. Foundations: Mathematical Background\n",
        "\n",
        "Understanding the mathematics behind compression helps us choose the right technique for each situation and predict its effects on model performance.\n",
        "\n",
        "### Magnitude-Based Pruning: The Simple Approach\n",
        "\n",
        "The core insight: small weights contribute little to the final prediction. Magnitude pruning removes weights based on their absolute values.\n",
        "\n",
        "```\n",
        "Mathematical Foundation:\n",
        "For weight w_ij in layer l:\n",
        "  If |w_ij| < threshold_l → w_ij = 0\n",
        "\n",
        "Threshold Selection:\n",
        "- Global: One threshold for entire model\n",
        "- Layer-wise: Different threshold per layer\n",
        "- Percentile-based: Remove bottom k% of weights\n",
        "\n",
        "Sparsity Calculation:\n",
        "  Sparsity = (Zero weights / Total weights) × 100%\n",
        "```\n",
"\n",
        "### Structured Pruning: Hardware-Friendly Compression\n",
        "\n",
        "Unlike magnitude pruning which creates scattered zeros, structured pruning removes entire computational units (neurons, channels, attention heads).\n",
        "\n",
        "```\n",
        "Channel Importance Metrics:\n",
        "\n",
        "Method 1: L2 Norm\n",
        "  Importance(channel_i) = ||W[:,i]||₂ = √(Σⱼ W²ⱼᵢ)\n",
        "\n",
        "Method 2: Gradient-based\n",
        "  Importance(channel_i) = |∂Loss/∂W[:,i]|\n",
        "\n",
        "Method 3: Activation-based\n",
        "  Importance(channel_i) = E[|activations_i|]\n",
        "\n",
        "Pruning Decision:\n",
        "  Remove bottom k% of channels based on importance ranking\n",
        "```\n",
        "\n",
        "### Knowledge Distillation: Learning from Teachers\n",
        "\n",
        "Knowledge distillation transfers knowledge from a large \"teacher\" model to a smaller \"student\" model. The student learns not just the correct answers, but the teacher's reasoning process.\n",
        "\n",
        "```\n",
        "Distillation Loss Function:\n",
        "  L_total = α × L_soft + (1-α) × L_hard\n",
        "\n",
        "Where:\n",
        "  L_soft = KL_divergence(σ(z_s/T), σ(z_t/T))  # Soft targets\n",
        "  L_hard = CrossEntropy(σ(z_s), y_true)       # Hard targets\n",
        "\n",
        "  σ(z/T) = Softmax with temperature T\n",
        "  z_s = Student logits, z_t = Teacher logits\n",
        "  α = Balance parameter (typically 0.7)\n",
        "  T = Temperature parameter (typically 3-5)\n",
        "\n",
        "Temperature Effect:\n",
        "  T=1: Standard softmax (sharp probabilities)\n",
        "  T>1: Softer distributions (reveals teacher's uncertainty)\n",
        "```\n",
"\n",
        "### Low-Rank Approximation: Matrix Compression\n",
        "\n",
        "Large weight matrices often have redundancy that can be captured with lower-rank approximations using Singular Value Decomposition (SVD).\n",
        "\n",
        "```\n",
        "SVD Decomposition:\n",
        "  W_{m×n} = U_{m×k} × Σ_{k×k} × V^T_{k×n}\n",
        "\n",
        "Parameter Reduction:\n",
        "  Original: m × n parameters\n",
        "  Compressed: (m × k) + k + (k × n) = k(m + n + 1) parameters\n",
        "\n",
        "  Compression achieved when: k < mn/(m+n+1)\n",
        "\n",
        "Reconstruction Error:\n",
        "  ||W - W_approx||_F = √(Σᵢ₌ₖ₊₁ʳ σᵢ²)\n",
        "\n",
        "  Where σᵢ are singular values, r = rank(W)\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ce0801cd",
      "metadata": {
        "cell_marker": "\"\"\"",
        "lines_to_next_cell": 1
      },
      "source": [
        "## 3. Sparsity Measurement - Understanding Model Density\n",
        "\n",
        "Before we can compress models, we need to understand how dense they are. Sparsity measurement tells us what percentage of weights are zero (or effectively zero).\n",
        "\n",
        "### Understanding Sparsity\n",
        "\n",
        "Sparsity is like measuring how much of a parking lot is empty. A 90% sparse model means 90% of its weights are zero - only 10% of the \"parking spaces\" are occupied.\n",
        "\n",
        "```\n",
        "Sparsity Visualization:\n",
        "\n",
        "Dense Matrix (0% sparse):          Sparse Matrix (75% sparse):\n",
        "┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐   ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐\n",
        "│ 2.1 1.3 0.8 1.9 2.4 1.1 0.7 │   │ 2.1 0.0 0.0 1.9 0.0 0.0 0.0 │\n",
        "│ 1.5 2.8 1.2 0.9 1.6 2.2 1.4 │   │ 0.0 2.8 0.0 0.0 0.0 2.2 0.0 │\n",
        "│ 0.6 1.7 2.5 1.1 0.8 1.3 2.0 │   │ 0.0 0.0 2.5 0.0 0.0 0.0 2.0 │\n",
"│ 1.9 1.0 1.6 2.3 1.8 0.9 1.2 │ │ 1.9 0.0 0.0 2.3 0.0 0.0 0.0 │\n",
|
||
"└─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘ └─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘\n",
|
||
"All weights active Only 7/28 weights active\n",
|
||
"Storage: 28 values Storage: 7 values + indices\n",
|
||
"```\n",
|
||
"\n",
        "Why this matters: Sparsity directly relates to memory savings, but achieving speedup requires special sparse computation libraries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4440ec7a",
      "metadata": {},
      "outputs": [],
      "source": [
        "def measure_sparsity(model) -> float:\n",
        "    \"\"\"\n",
        "    Calculate the percentage of zero weights in a model.\n",
        "\n",
        "    TODO: Count zero weights and total weights across all layers\n",
        "\n",
        "    APPROACH:\n",
        "    1. Iterate through all model parameters\n",
        "    2. Count zeros using np.sum(weights == 0)\n",
        "    3. Count total parameters\n",
        "    4. Return percentage: zeros / total * 100\n",
        "\n",
        "    EXAMPLE:\n",
        "    >>> model = Sequential(Linear(10, 5), Linear(5, 2))\n",
        "    >>> sparsity = measure_sparsity(model)\n",
        "    >>> print(f\"Model sparsity: {sparsity:.1f}%\")\n",
        "    Model sparsity: 0.0%  # Before pruning\n",
        "\n",
        "    HINT: Use np.sum() to count zeros efficiently\n",
        "    \"\"\"\n",
        "    ### BEGIN SOLUTION\n",
        "    total_params = 0\n",
        "    zero_params = 0\n",
        "\n",
        "    for param in model.parameters():\n",
        "        total_params += param.size\n",
        "        zero_params += np.sum(param.data == 0)\n",
        "\n",
        "    if total_params == 0:\n",
        "        return 0.0\n",
        "\n",
        "    return (zero_params / total_params) * 100.0\n",
        "    ### END SOLUTION\n",
        "\n",
        "def test_unit_measure_sparsity():\n",
        "    \"\"\"🔬 Test sparsity measurement functionality.\"\"\"\n",
        "    print(\"🔬 Unit Test: Measure Sparsity...\")\n",
        "\n",
        "    # Test with dense model\n",
        "    model = Sequential(Linear(4, 3), Linear(3, 2))\n",
        "    initial_sparsity = measure_sparsity(model)\n",
        "    assert initial_sparsity == 0.0, f\"Expected 0% sparsity, got {initial_sparsity}%\"\n",
        "\n",
        "    # Test with manually sparse model\n",
        "    model.layers[0].weight.data[0, 0] = 0\n",
        "    model.layers[0].weight.data[1, 1] = 0\n",
        "    sparse_sparsity = measure_sparsity(model)\n",
        "    assert sparse_sparsity > 0, f\"Expected >0% sparsity, got {sparse_sparsity}%\"\n",
        "\n",
        "    print(\"✅ measure_sparsity works correctly!\")\n",
        "\n",
        "test_unit_measure_sparsity()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "fc5fb46e",
      "metadata": {
        "cell_marker": "\"\"\"",
        "lines_to_next_cell": 1
      },
      "source": [
        "## 4. Magnitude-Based Pruning - Removing Small Weights\n",
        "\n",
        "Magnitude pruning is the simplest and most intuitive compression technique. It's based on the observation that weights with small magnitudes contribute little to the model's output.\n",
        "\n",
        "### How Magnitude Pruning Works\n",
        "\n",
        "Think of magnitude pruning like editing a document - you remove words that don't significantly change the meaning. In neural networks, we remove weights that don't significantly affect predictions.\n",
        "\n",
        "```\n",
        "Magnitude Pruning Process:\n",
        "\n",
        "Step 1: Collect All Weights\n",
        "┌──────────────────────────────────────────────────┐\n",
        "│ Layer 1: [2.1, 0.1, -1.8, 0.05, 3.2, -0.02]      │\n",
        "│ Layer 2: [1.5, -0.03, 2.8, 0.08, -2.1, 0.01]     │\n",
        "│ Layer 3: [0.7, 2.4, -0.06, 1.9, 0.04, -1.3]      │\n",
        "└──────────────────────────────────────────────────┘\n",
        "                        ↓\n",
        "Step 2: Calculate Magnitudes\n",
        "┌──────────────────────────────────────────────────┐\n",
        "│ Magnitudes: [2.1, 0.1, 1.8, 0.05, 3.2, 0.02,     │\n",
        "│              1.5, 0.03, 2.8, 0.08, 2.1, 0.01,    │\n",
        "│              0.7, 2.4, 0.06, 1.9, 0.04, 1.3]     │\n",
        "└──────────────────────────────────────────────────┘\n",
        "                        ↓\n",
        "Step 3: Find Threshold (e.g., 70th percentile)\n",
        "┌──────────────────────────────────────────────────┐\n",
        "│ Sorted: [0.01, 0.02, 0.03, 0.04, 0.05, 0.06,     │\n",
"│ 0.08, 0.1, 0.7, 1.3, 1.5, 1.8, │ Threshold: 0.1\n",
|
||
"│ 1.9, 2.1, 2.1, 2.4, 2.8, 3.2] │ (70% of weights removed)\n",
|
||
"└──────────────────────────────────────────────────┘\n",
|
||
" ↓\n",
|
||
"Step 4: Apply Pruning Mask\n",
|
||
"┌──────────────────────────────────────────────────┐\n",
|
||
"│ Layer 1: [2.1, 0.0, -1.8, 0.0, 3.2, 0.0] │\n",
|
||
"│ Layer 2: [1.5, 0.0, 2.8, 0.0, -2.1, 0.0] │ 70% weights → 0\n",
|
||
"│ Layer 3: [0.7, 2.4, 0.0, 1.9, 0.0, -1.3] │ 30% preserved\n",
|
||
"└──────────────────────────────────────────────────┘\n",
|
||
"\n",
|
||
"Memory Impact:\n",
|
||
"- Dense storage: 18 values\n",
|
||
"- Sparse storage: 6 values + 6 indices = 12 values (33% savings)\n",
|
||
"- Theoretical limit: 70% savings with perfect sparse format\n",
|
||
"```\n",
|
||
"\n",
        "### Why Global Thresholding Works\n",
        "\n",
        "Global thresholding treats the entire model as one big collection of weights, finding a single threshold that achieves the target sparsity across all layers.\n",
        "\n",
        "**Advantages:**\n",
        "- Simple to implement and understand\n",
        "- Preserves overall model capacity\n",
        "- Works well for uniform network architectures\n",
        "\n",
        "**Disadvantages:**\n",
        "- May over-prune some layers, under-prune others\n",
        "- Doesn't account for layer-specific importance\n",
        "- Can hurt performance if layers have very different weight distributions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d8f12c15",
      "metadata": {},
      "outputs": [],
      "source": [
        "def magnitude_prune(model, sparsity=0.9):\n",
        "    \"\"\"\n",
        "    Remove weights with smallest magnitudes to achieve target sparsity.\n",
        "\n",
        "    TODO: Implement global magnitude-based pruning\n",
        "\n",
        "    APPROACH:\n",
        "    1. Collect all weights from the model\n",
        "    2. Calculate absolute values to get magnitudes\n",
        "    3. Find threshold at desired sparsity percentile\n",
        "    4. Set weights below threshold to zero (in-place)\n",
        "\n",
        "    EXAMPLE:\n",
        "    >>> model = Sequential(Linear(100, 50), Linear(50, 10))\n",
        "    >>> original_params = sum(p.size for p in model.parameters())\n",
        "    >>> magnitude_prune(model, sparsity=0.8)\n",
        "    >>> final_sparsity = measure_sparsity(model)\n",
        "    >>> print(f\"Achieved {final_sparsity:.1f}% sparsity\")\n",
        "    Achieved 80.0% sparsity\n",
        "\n",
        "    HINTS:\n",
        "    - Use np.percentile() to find threshold\n",
        "    - Modify model parameters in-place\n",
        "    - Consider only weight matrices, not biases\n",
        "    \"\"\"\n",
        "    ### BEGIN SOLUTION\n",
        "    # Collect all weights (excluding biases)\n",
        "    all_weights = []\n",
        "    weight_params = []\n",
        "\n",
        "    for param in model.parameters():\n",
        "        # Skip biases (typically 1D)\n",
        "        if len(param.shape) > 1:\n",
        "            all_weights.extend(param.data.flatten())\n",
        "            weight_params.append(param)\n",
        "\n",
        "    if not all_weights:\n",
        "        return\n",
        "\n",
        "    # Calculate magnitude threshold\n",
        "    magnitudes = np.abs(all_weights)\n",
        "    threshold = np.percentile(magnitudes, sparsity * 100)\n",
        "\n",
        "    # Apply pruning to each weight parameter\n",
        "    for param in weight_params:\n",
        "        mask = np.abs(param.data) >= threshold\n",
        "        param.data = param.data * mask\n",
        "    ### END SOLUTION\n",
        "\n",
        "def test_unit_magnitude_prune():\n",
        "    \"\"\"🔬 Test magnitude-based pruning functionality.\"\"\"\n",
        "    print(\"🔬 Unit Test: Magnitude Prune...\")\n",
        "\n",
        "    # Create test model with known weights\n",
        "    model = Sequential(Linear(4, 3), Linear(3, 2))\n",
        "\n",
        "    # Set specific weight values for predictable testing\n",
        "    model.layers[0].weight.data = np.array([\n",
        "        [1.0, 2.0, 3.0],\n",
        "        [0.1, 0.2, 0.3],\n",
        "        [4.0, 5.0, 6.0],\n",
        "        [0.01, 0.02, 0.03]\n",
        "    ])\n",
        "\n",
        "    initial_sparsity = measure_sparsity(model)\n",
        "    assert initial_sparsity == 0.0, \"Model should start with no sparsity\"\n",
        "\n",
        "    # Apply 50% pruning\n",
        "    magnitude_prune(model, sparsity=0.5)\n",
        "    final_sparsity = measure_sparsity(model)\n",
        "\n",
        "    # Should achieve approximately 50% sparsity\n",
        "    assert 40 <= final_sparsity <= 60, f\"Expected ~50% sparsity, got {final_sparsity}%\"\n",
        "\n",
        "    # Verify largest weights survived\n",
        "    remaining_weights = model.layers[0].weight.data[model.layers[0].weight.data != 0]\n",
        "    assert len(remaining_weights) > 0, \"Some weights should remain\"\n",
        "    assert np.all(np.abs(remaining_weights) >= 0.1), \"Large weights should survive\"\n",
        "\n",
        "    print(\"✅ magnitude_prune works correctly!\")\n",
        "\n",
        "test_unit_magnitude_prune()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8ddc8e18",
      "metadata": {
        "cell_marker": "\"\"\"",
        "lines_to_next_cell": 1
      },
      "source": [
        "## 5. Structured Pruning - Hardware-Friendly Compression\n",
        "\n",
        "While magnitude pruning creates scattered zeros throughout the network, structured pruning removes entire computational units (channels, neurons, heads). This creates sparsity patterns that modern hardware can actually accelerate.\n",
        "\n",
        "### Why Structured Pruning Matters\n",
        "\n",
        "Think of the difference between removing random words from a paragraph versus removing entire sentences. Structured pruning removes entire \"sentences\" (channels) rather than random \"words\" (individual weights).\n",
        "\n",
        "```\n",
        "Unstructured vs Structured Sparsity:\n",
        "\n",
        "UNSTRUCTURED (Magnitude Pruning):\n",
        "┌─────────────────────────────────────────────┐\n",
        "│ Channel 0: [2.1, 0.0, 1.8, 0.0, 3.2]        │ ← Sparse weights\n",
        "│ Channel 1: [0.0, 2.8, 0.0, 2.1, 0.0]        │ ← Sparse weights\n",
        "│ Channel 2: [1.5, 0.0, 2.4, 0.0, 1.9]        │ ← Sparse weights\n",
        "│ Channel 3: [0.0, 1.7, 0.0, 2.0, 0.0]        │ ← Sparse weights\n",
        "└─────────────────────────────────────────────┘\n",
        "Issues: Irregular memory access, no hardware speedup\n",
        "\n",
        "STRUCTURED (Channel Pruning):\n",
        "┌─────────────────────────────────────────────┐\n",
        "│ Channel 0: [2.1, 1.3, 1.8, 0.9, 3.2]        │ ← Fully preserved\n",
        "│ Channel 1: [0.0, 0.0, 0.0, 0.0, 0.0]        │ ← Fully removed\n",
        "│ Channel 2: [1.5, 2.2, 2.4, 1.1, 1.9]        │ ← Fully preserved\n",
        "│ Channel 3: [0.0, 0.0, 0.0, 0.0, 0.0]        │ ← Fully removed\n",
        "└─────────────────────────────────────────────┘\n",
        "Benefits: Regular patterns, hardware acceleration possible\n",
        "```\n",
        "\n",
        "### Channel Importance Ranking\n",
        "\n",
        "How do we decide which channels to remove? We rank them by importance using various metrics:\n",
        "\n",
        "```\n",
        "Channel Importance Metrics:\n",
        "\n",
        "Method 1: L2 Norm (Most Common)\n",
        "  For each output channel i:\n",
        "    Importance_i = ||W[:, i]||_2 = √(Σⱼ w²ⱼᵢ)\n",
        "\n",
        "  Intuition: Channels with larger weights have bigger impact\n",
        "\n",
        "Method 2: Activation-Based\n",
        "  Importance_i = E[|activation_i|] over dataset\n",
        "\n",
        "  Intuition: Channels that activate more are more important\n",
        "\n",
        "Method 3: Gradient-Based\n",
        "  Importance_i = |∂Loss/∂W[:, i]|\n",
        "\n",
        "  Intuition: Channels with larger gradients affect loss more\n",
        "\n",
        "Ranking Process:\n",
        "  1. Calculate importance for all channels\n",
        "  2. Sort channels by importance (ascending)\n",
        "  3. Remove bottom k% (least important)\n",
        "  4. Zero out entire channels, not individual weights\n",
        "```\n",
"\n",
        "### Hardware Benefits of Structured Sparsity\n",
        "\n",
        "Structured sparsity enables real hardware acceleration because:\n",
        "\n",
        "1. **Memory Coalescing**: Accessing contiguous memory chunks is faster\n",
        "2. **SIMD Operations**: Can process multiple remaining channels in parallel\n",
        "3. **No Indexing Overhead**: Don't need to track locations of sparse weights\n",
        "4. **Cache Efficiency**: Better spatial locality of memory access"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ede3f6c9",
      "metadata": {},
      "outputs": [],
      "source": [
        "def structured_prune(model, prune_ratio=0.5):\n",
        "    \"\"\"\n",
        "    Remove entire channels/neurons based on L2 norm importance.\n",
        "\n",
        "    TODO: Implement structured pruning for Linear layers\n",
        "\n",
        "    APPROACH:\n",
        "    1. For each Linear layer, calculate L2 norm of each output channel\n",
        "    2. Rank channels by importance (L2 norm)\n",
        "    3. Remove lowest importance channels by setting to zero\n",
        "    4. This creates block sparsity that's hardware-friendly\n",
        "\n",
        "    EXAMPLE:\n",
        "    >>> model = Sequential(Linear(100, 50), Linear(50, 10))\n",
        "    >>> original_shape = model.layers[0].weight.shape\n",
        "    >>> structured_prune(model, prune_ratio=0.3)\n",
        "    >>> # 30% of channels are now completely zero\n",
        "    >>> final_sparsity = measure_sparsity(model)\n",
        "    >>> print(f\"Structured sparsity: {final_sparsity:.1f}%\")\n",
        "    Structured sparsity: 30.0%\n",
        "\n",
        "    HINTS:\n",
        "    - Calculate L2 norm along input dimension for each output channel\n",
        "    - Use np.linalg.norm(weights[:, channel]) for channel importance\n",
        "    - Set entire channels to zero (not just individual weights)\n",
        "    \"\"\"\n",
        "    ### BEGIN SOLUTION\n",
        "    for layer in model.layers:\n",
        "        if isinstance(layer, Linear) and hasattr(layer, 'weight'):\n",
        "            weight = layer.weight.data\n",
        "\n",
        "            # Calculate L2 norm for each output channel (column)\n",
        "            channel_norms = np.linalg.norm(weight, axis=0)\n",
        "\n",
        "            # Find channels to prune (lowest importance)\n",
        "            num_channels = weight.shape[1]\n",
        "            num_to_prune = int(num_channels * prune_ratio)\n",
        "\n",
        "            if num_to_prune > 0:\n",
        "                # Get indices of channels to prune (smallest norms)\n",
        "                prune_indices = np.argpartition(channel_norms, num_to_prune)[:num_to_prune]\n",
        "\n",
        "                # Zero out entire channels\n",
        "                weight[:, prune_indices] = 0\n",
        "\n",
        "                # Also zero corresponding bias elements if bias exists\n",
        "                if layer.bias is not None:\n",
        "                    layer.bias.data[prune_indices] = 0\n",
        "    ### END SOLUTION\n",
        "\n",
        "def test_unit_structured_prune():\n",
        "    \"\"\"🔬 Test structured pruning functionality.\"\"\"\n",
        "    print(\"🔬 Unit Test: Structured Prune...\")\n",
        "\n",
        "    # Create test model\n",
        "    model = Sequential(Linear(4, 6), Linear(6, 2))\n",
        "\n",
        "    # Set predictable weights for testing\n",
        "    model.layers[0].weight.data = np.array([\n",
        "        [1.0, 0.1, 2.0, 0.05, 3.0, 0.01],  # Channels with varying importance\n",
        "        [1.1, 0.11, 2.1, 0.06, 3.1, 0.02],\n",
        "        [1.2, 0.12, 2.2, 0.07, 3.2, 0.03],\n",
        "        [1.3, 0.13, 2.3, 0.08, 3.3, 0.04]\n",
        "    ])\n",
        "\n",
        "    initial_sparsity = measure_sparsity(model)\n",
        "    assert initial_sparsity == 0.0, \"Model should start with no sparsity\"\n",
        "\n",
" # Apply 33% structured pruning (2 out of 6 channels)\n",
|
||
" structured_prune(model, prune_ratio=0.33)\n",
|
||
" final_sparsity = measure_sparsity(model)\n",
|
||
"\n",
|
||
" # Check that some channels are completely zero\n",
|
||
" weight = model.layers[0].weight.data\n",
|
||
" zero_channels = np.sum(np.all(weight == 0, axis=0))\n",
|
||
" assert zero_channels >= 1, f\"Expected at least 1 zero channel, got {zero_channels}\"\n",
|
||
"\n",
|
||
" # Check that non-zero channels are completely preserved\n",
|
||
" for col in range(weight.shape[1]):\n",
|
||
" channel = weight[:, col]\n",
|
||
" assert np.all(channel == 0) or np.all(channel != 0), \"Channels should be fully zero or fully non-zero\"\n",
|
||
"\n",
|
||
" print(\"✅ structured_prune works correctly!\")\n",
|
||
"\n",
|
||
"test_unit_structured_prune()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "74c8202f",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## 6. Low-Rank Approximation - Matrix Compression Through Factorization\n",
|
||
"\n",
|
||
"Low-rank approximation discovers that large weight matrices often contain redundant information that can be captured with much smaller matrices through mathematical decomposition.\n",
|
||
"\n",
|
||
"### The Intuition Behind Low-Rank Approximation\n",
|
||
"\n",
|
||
"Imagine you're storing a massive spreadsheet where many columns are highly correlated. Instead of storing all columns separately, you could store a few \"basis\" columns and coefficients for how to combine them to recreate the original data.\n",
|
||
"\n",
|
||
"```\n",
|
||
"Low-Rank Decomposition Visualization:\n",
|
||
"\n",
|
||
"Original Matrix W (large): Factorized Form (smaller):\n",
|
||
"┌─────────────────────────┐ ┌──────┐ ┌──────────────┐\n",
|
||
"│ 2.1 1.3 0.8 1.9 2.4 │ │ 1.1 │ │ 1.9 1.2 0.7│\n",
|
||
"│ 1.5 2.8 1.2 0.9 1.6 │ ≈ │ 2.4 │ @ │ 0.6 1.2 0.5│\n",
|
||
"│ 0.6 1.7 2.5 1.1 0.8 │ │ 0.8 │ │ 1.4 2.1 0.9│\n",
|
||
"│ 1.9 1.0 1.6 2.3 1.8 │ │ 1.6 │ │ 0.5 0.6 1.1│\n",
|
||
"└─────────────────────────┘ └──────┘ └──────────────┘\n",
|
||
" W (4×5) = 20 params U (4×2)=8 + V (2×5)=10 = 18 params\n",
|
||
"\n",
|
||
"Parameter Reduction:\n",
|
||
"- Original: 4 × 5 = 20 parameters\n",
|
||
"- Compressed: (4 × 2) + (2 × 5) = 18 parameters\n",
|
||
"- Compression ratio: 18/20 = 0.9 (10% savings)\n",
|
||
"\n",
|
||
"For larger matrices, savings become dramatic:\n",
|
||
"- W (1000×1000): 1M parameters → U (1000×100) + V (100×1000): 200K parameters\n",
|
||
"- Compression ratio: 0.2 (80% savings)\n",
|
||
"```\n",
|
||
"\n",
        "### SVD: The Mathematical Foundation\n",
        "\n",
        "Singular Value Decomposition (SVD) finds the optimal low-rank approximation by identifying the most important \"directions\" in the data:\n",
        "\n",
        "```\n",
        "SVD Decomposition:\n",
        "  W = U × Σ × V^T\n",
        "\n",
        "Where:\n",
        "  U: Left singular vectors (input patterns)\n",
        "  Σ: Singular values (importance weights)\n",
        "  V^T: Right singular vectors (output patterns)\n",
        "\n",
        "Truncated SVD (Rank-k approximation):\n",
        "  W ≈ U[:,:k] × Σ[:k] × V^T[:k,:]\n",
        "\n",
        "Quality vs Compression Trade-off:\n",
        "  Higher k → Better approximation, less compression\n",
        "  Lower k → More compression, worse approximation\n",
        "\n",
        "Choosing Optimal Rank:\n",
        "  Method 1: Fixed ratio (k = ratio × min(m,n))\n",
        "  Method 2: Energy threshold (keep 90% of singular value energy)\n",
        "  Method 3: Error threshold (reconstruction error < threshold)\n",
        "```\n",
"\n",
        "### When Low-Rank Works Best\n",
        "\n",
        "Low-rank approximation works well when:\n",
        "- **Matrices are large**: Compression benefits scale with size\n",
        "- **Data has structure**: Correlated patterns enable compression\n",
        "- **Moderate accuracy loss acceptable**: Some precision traded for efficiency\n",
        "\n",
        "It works poorly when:\n",
        "- **Matrices are already small**: Overhead exceeds benefits\n",
        "- **Data is random**: No patterns to exploit\n",
        "- **High precision required**: SVD introduces approximation error"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "bdbedbf4",
      "metadata": {},
      "outputs": [],
      "source": [
        "def low_rank_approximate(weight_matrix, rank_ratio=0.5):\n",
        "    \"\"\"\n",
        "    Approximate weight matrix using low-rank decomposition (SVD).\n",
        "\n",
        "    TODO: Implement SVD-based low-rank approximation\n",
        "\n",
        "    APPROACH:\n",
        "    1. Perform SVD: W = U @ S @ V^T\n",
        "    2. Keep only top k singular values where k = rank_ratio * min(dimensions)\n",
        "    3. Reconstruct: W_approx = U[:,:k] @ diag(S[:k]) @ V[:k,:]\n",
        "    4. Return decomposed matrices for memory savings\n",
        "\n",
        "    EXAMPLE:\n",
        "    >>> weight = np.random.randn(100, 50)\n",
        "    >>> U, S, V = low_rank_approximate(weight, rank_ratio=0.3)\n",
        "    >>> # Original: 100*50 = 5000 params\n",
        "    >>> # Compressed: 100*15 + 15*50 = 2250 params (55% reduction)\n",
        "\n",
        "    HINTS:\n",
        "    - Use np.linalg.svd() for decomposition\n",
        "    - Choose k = int(rank_ratio * min(m, n))\n",
        "    - Return U[:,:k], S[:k], V[:k,:] for reconstruction\n",
        "    \"\"\"\n",
        "    ### BEGIN SOLUTION\n",
        "    m, n = weight_matrix.shape\n",
        "\n",
        "    # Perform SVD\n",
        "    U, S, V = np.linalg.svd(weight_matrix, full_matrices=False)\n",
        "\n",
        "    # Determine target rank\n",
        "    max_rank = min(m, n)\n",
        "    target_rank = max(1, int(rank_ratio * max_rank))\n",
        "\n",
        "    # Truncate to target rank\n",
        "    U_truncated = U[:, :target_rank]\n",
        "    S_truncated = S[:target_rank]\n",
        "    V_truncated = V[:target_rank, :]\n",
        "\n",
        "    return U_truncated, S_truncated, V_truncated\n",
        "    ### END SOLUTION\n",
        "\n",
        "def test_unit_low_rank_approximate():\n",
        "    \"\"\"🔬 Test low-rank approximation functionality.\"\"\"\n",
        "    print(\"🔬 Unit Test: Low-Rank Approximate...\")\n",
        "\n",
        "    # Create test weight matrix\n",
        "    original_weight = np.random.randn(20, 15)\n",
        "    original_params = original_weight.size\n",
        "\n",
        "    # Apply low-rank approximation\n",
        "    U, S, V = low_rank_approximate(original_weight, rank_ratio=0.4)\n",
        "\n",
        "    # Check dimensions\n",
        "    target_rank = int(0.4 * min(20, 15))  # min(20,15) = 15, so 0.4*15 = 6\n",
        "    assert U.shape == (20, target_rank), f\"Expected U shape (20, {target_rank}), got {U.shape}\"\n",
        "    assert S.shape == (target_rank,), f\"Expected S shape ({target_rank},), got {S.shape}\"\n",
        "    assert V.shape == (target_rank, 15), f\"Expected V shape ({target_rank}, 15), got {V.shape}\"\n",
        "\n",
        "    # Check parameter reduction\n",
        "    compressed_params = U.size + S.size + V.size\n",
        "    compression_ratio = compressed_params / original_params\n",
        "    assert compression_ratio < 1.0, f\"Should compress, but ratio is {compression_ratio}\"\n",
        "\n",
        "    # Check reconstruction quality\n",
        "    reconstructed = U @ np.diag(S) @ V\n",
        "    reconstruction_error = np.linalg.norm(original_weight - reconstructed)\n",
        "    relative_error = reconstruction_error / np.linalg.norm(original_weight)\n",
        "    assert relative_error < 0.5, f\"Reconstruction error too high: {relative_error}\"\n",
        "\n",
        "    print(\"✅ low_rank_approximate works correctly!\")\n",
        "\n",
        "test_unit_low_rank_approximate()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a51cbe39",
      "metadata": {
        "cell_marker": "\"\"\"",
        "lines_to_next_cell": 1
      },
      "source": [
        "## 7. Knowledge Distillation - Learning from Teacher Models\n",
        "\n",
        "Knowledge distillation is like having an expert teacher simplify complex concepts for a student. The large \"teacher\" model shares its knowledge with a smaller \"student\" model, achieving similar performance with far fewer parameters.\n",
        "\n",
        "### The Teacher-Student Learning Process\n",
        "\n",
        "Unlike traditional training where models learn from hard labels (cat/dog), knowledge distillation uses \"soft\" targets that contain richer information about the teacher's decision-making process.\n",
        "\n",
        "```\n",
        "Knowledge Distillation Process:\n",
        "\n",
        "                    TEACHER MODEL (Large)\n",
        "                    ┌─────────────────────┐\n",
        "Input Data ────────→│ 100M parameters     │\n",
        "                    │ 95% accuracy        │\n",
        "                    │ 500ms inference     │\n",
        "                    └─────────────────────┘\n",
        "                              │\n",
        "                              ↓ Soft Targets\n",
        "                    ┌─────────────────────┐\n",
        "                    │ Logits: [2.1, 0.3,  │\n",
        "                    │          0.8, 4.2]  │ ← Rich information\n",
        "                    └─────────────────────┘\n",
        "                              │\n",
        "                              ↓ Distillation Loss\n",
        "                    ┌─────────────────────┐\n",
        "Input Data ────────→│ STUDENT MODEL       │\n",
        "Hard Labels ───────→│ 10M parameters      │ ← 10x smaller\n",
        "                    │ 93% accuracy        │ ← 2% loss\n",
        "                    │ 50ms inference      │ ← 10x faster\n",
        "                    └─────────────────────┘\n",
        "\n",
        "Benefits:\n",
        "• Size: 10x smaller models\n",
        "• Speed: 10x faster inference\n",
        "• Accuracy: Only 2-5% degradation\n",
        "• Knowledge transfer: Student learns teacher's \"reasoning\"\n",
        "```\n",
        "\n",
        "### Temperature Scaling: Softening Decisions\n",
        "\n",
        "Temperature scaling is a key innovation that makes knowledge distillation effective. It \"softens\" the teacher's confidence, revealing uncertainty that helps the student learn.\n",
        "\n",
        "```\n",
        "Temperature Effect on Probability Distributions:\n",
        "\n",
"Without Temperature (T=1): With Temperature (T=3):\n",
|
||
"Teacher Logits: [1.0, 2.0, 0.5] Teacher Logits: [1.0, 2.0, 0.5]\n",
|
||
" ↓ ↓ ÷ 3\n",
|
||
"Softmax: [0.09, 0.67, 0.24] Logits/T: [0.33, 0.67, 0.17]\n",
|
||
" ^ ^ ^ ↓\n",
|
||
" Low High Med Softmax: [0.21, 0.42, 0.17]\n",
|
||
" ^ ^ ^\n",
|
||
"Sharp decisions (hard to learn) Soft decisions (easier to learn)\n",
|
||
"\n",
|
||
"Why Soft Targets Help:\n",
|
||
"1. Reveal teacher's uncertainty about similar classes\n",
|
||
"2. Provide richer gradients for student learning\n",
|
||
"3. Transfer knowledge about class relationships\n",
|
||
"4. Reduce overfitting to hard labels\n",
|
||
"```\n",
|
||
"\n",
        "### Loss Function Design\n",
        "\n",
        "The distillation loss balances learning from both the teacher's soft knowledge and the ground truth hard labels:\n",
        "\n",
        "```\n",
        "Combined Loss Function:\n",
        "\n",
        "L_total = α × L_soft + (1-α) × L_hard\n",
        "\n",
        "Where:\n",
        "  L_soft = KL_divergence(Student_soft, Teacher_soft)\n",
        "              │\n",
        "              └─ Measures how well student mimics teacher\n",
        "\n",
        "  L_hard = CrossEntropy(Student_predictions, True_labels)\n",
        "              │\n",
        "              └─ Ensures student still learns correct answers\n",
        "\n",
        "Balance Parameter α:\n",
        "• α = 0.7: Focus mainly on teacher (typical)\n",
        "• α = 0.9: Almost pure distillation\n",
        "• α = 0.3: Balance teacher and ground truth\n",
        "• α = 0.0: Ignore teacher (regular training)\n",
        "\n",
        "Temperature T:\n",
        "• T = 1: No softening (standard softmax)\n",
        "• T = 3-5: Good balance (typical range)\n",
        "• T = 10+: Very soft (may lose information)\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "bf1a9ab1",
      "metadata": {},
      "outputs": [],
      "source": [
        "class KnowledgeDistillation:\n",
        "    \"\"\"\n",
        "    Knowledge distillation for model compression.\n",
        "\n",
        "    Train a smaller student model to mimic a larger teacher model.\n",
        "    \"\"\"\n",
        "\n",
        "    def __init__(self, teacher_model, student_model, temperature=3.0, alpha=0.7):\n",
        "        \"\"\"\n",
        "        Initialize knowledge distillation.\n",
        "\n",
        "        TODO: Set up teacher and student models with distillation parameters\n",
        "\n",
        "        APPROACH:\n",
        "        1. Store teacher and student models\n",
        "        2. Set temperature for softening probability distributions\n",
        "        3. Set alpha for balancing hard vs soft targets\n",
        "\n",
        "        Args:\n",
        "            teacher_model: Large, pre-trained model\n",
        "            student_model: Smaller model to train\n",
        "            temperature: Softening parameter for distributions\n",
        "            alpha: Weight for soft target loss (1-alpha for hard targets)\n",
        "        \"\"\"\n",
        "        ### BEGIN SOLUTION\n",
        "        self.teacher_model = teacher_model\n",
        "        self.student_model = student_model\n",
        "        self.temperature = temperature\n",
        "        self.alpha = alpha\n",
        "        ### END SOLUTION\n",
        "\n",
        "    def distillation_loss(self, student_logits, teacher_logits, true_labels):\n",
        "        \"\"\"\n",
        "        Calculate combined distillation loss.\n",
        "\n",
        "        TODO: Implement knowledge distillation loss function\n",
        "\n",
        "        APPROACH:\n",
        "        1. Calculate hard target loss (student vs true labels)\n",
        "        2. Calculate soft target loss (student vs teacher, with temperature)\n",
        "        3. Combine losses: alpha * soft_loss + (1-alpha) * hard_loss\n",
        "\n",
        "        EXAMPLE:\n",
        "        >>> kd = KnowledgeDistillation(teacher, student)\n",
        "        >>> loss = kd.distillation_loss(student_out, teacher_out, labels)\n",
        "        >>> print(f\"Distillation loss: {loss:.4f}\")\n",
        "\n",
        "        HINTS:\n",
        "        - Use temperature to soften distributions: logits/temperature\n",
        "        - Soft targets use KL divergence or cross-entropy\n",
        "        - Hard targets use standard classification loss\n",
        "        \"\"\"\n",
        "        ### BEGIN SOLUTION\n",
        "        # Convert to numpy for this implementation\n",
        "        if hasattr(student_logits, 'data'):\n",
        "            student_logits = student_logits.data\n",
        "        if hasattr(teacher_logits, 'data'):\n",
        "            teacher_logits = teacher_logits.data\n",
        "        if hasattr(true_labels, 'data'):\n",
        "            true_labels = true_labels.data\n",
        "\n",
        "        # Soften distributions with temperature\n",
        "        student_soft = self._softmax(student_logits / self.temperature)\n",
        "        teacher_soft = self._softmax(teacher_logits / self.temperature)\n",
        "\n",
        "        # Soft target loss (KL divergence)\n",
        "        soft_loss = self._kl_divergence(student_soft, teacher_soft)\n",
        "\n",
        "        # Hard target loss (cross-entropy)\n",
        "        student_hard = self._softmax(student_logits)\n",
        "        hard_loss = self._cross_entropy(student_hard, true_labels)\n",
        "\n",
        "        # Combined loss\n",
        "        total_loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss\n",
        "\n",
        "        return total_loss\n",
        "        ### END SOLUTION\n",
        "\n",
        "    def _softmax(self, logits):\n",
        "        \"\"\"Compute softmax with numerical stability.\"\"\"\n",
        "        exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))\n",
        "        return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)\n",
        "\n",
        "    def _kl_divergence(self, p, q):\n",
        "        \"\"\"Compute KL divergence between distributions.\"\"\"\n",
        "        return np.sum(p * np.log(p / (q + 1e-8) + 1e-8))\n",
        "\n",
        "    def _cross_entropy(self, predictions, labels):\n",
        "        \"\"\"Compute cross-entropy loss.\"\"\"\n",
        "        # Simple implementation for integer labels\n",
        "        if labels.ndim == 1:\n",
        "            return -np.mean(np.log(predictions[np.arange(len(labels)), labels] + 1e-8))\n",
        "        else:\n",
        "            return -np.mean(np.sum(labels * np.log(predictions + 1e-8), axis=1))\n",
"\n",
|
||
"def test_unit_knowledge_distillation():\n",
|
||
" \"\"\"🔬 Test knowledge distillation functionality.\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Knowledge Distillation...\")\n",
|
||
"\n",
|
||
" # Create teacher and student models\n",
|
||
" teacher = Sequential(Linear(10, 20), Linear(20, 5))\n",
|
||
" student = Sequential(Linear(10, 5)) # Smaller model\n",
|
||
"\n",
|
||
" # Initialize knowledge distillation\n",
|
||
" kd = KnowledgeDistillation(teacher, student, temperature=3.0, alpha=0.7)\n",
|
||
"\n",
|
||
" # Create dummy data\n",
|
||
" input_data = Tensor(np.random.randn(8, 10)) # Batch of 8\n",
|
||
" true_labels = np.array([0, 1, 2, 3, 4, 0, 1, 2]) # Class labels\n",
|
||
"\n",
|
||
" # Forward passes\n",
|
||
" teacher_output = teacher.forward(input_data)\n",
|
||
" student_output = student.forward(input_data)\n",
|
||
"\n",
|
||
" # Calculate distillation loss\n",
|
||
" loss = kd.distillation_loss(student_output, teacher_output, true_labels)\n",
|
||
"\n",
|
||
" # Verify loss is reasonable\n",
|
||
" assert isinstance(loss, (float, np.floating)), f\"Loss should be float, got {type(loss)}\"\n",
|
||
" assert loss > 0, f\"Loss should be positive, got {loss}\"\n",
|
||
" assert not np.isnan(loss), \"Loss should not be NaN\"\n",
|
||
"\n",
|
||
" print(\"✅ knowledge_distillation works correctly!\")\n",
|
||
"\n",
|
||
"test_unit_knowledge_distillation()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bea12725",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## 8. Integration: Complete Compression Pipeline\n",
|
||
"\n",
|
||
"Now let's combine all our compression techniques into a unified system that can apply multiple methods and track their cumulative effects.\n",
|
||
"\n",
|
||
"### Compression Strategy Design\n",
|
||
"\n",
|
||
"Real-world compression often combines multiple techniques in sequence, each targeting different types of redundancy:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Multi-Stage Compression Pipeline:\n",
|
||
"\n",
|
||
"Original Model (100MB, 100% accuracy)\n",
|
||
" │\n",
|
||
" ↓ Stage 1: Magnitude Pruning (remove 80% of small weights)\n",
|
||
"Sparse Model (20MB, 98% accuracy)\n",
|
||
" │\n",
|
||
" ↓ Stage 2: Structured Pruning (remove 30% of channels)\n",
|
||
"Compact Model (14MB, 96% accuracy)\n",
|
||
" │\n",
|
||
" ↓ Stage 3: Low-Rank Approximation (compress large layers)\n",
|
||
"Factorized Model (10MB, 95% accuracy)\n",
|
||
" │\n",
|
||
" ↓ Stage 4: Knowledge Distillation (train smaller architecture)\n",
|
||
"Student Model (5MB, 93% accuracy)\n",
|
||
"\n",
|
||
"Final Result: 20x size reduction, 7% accuracy loss\n",
|
||
"```\n",
|
||
"\n",
        "### Compression Configuration\n",
        "\n",
        "Different deployment scenarios require different compression strategies:\n",
        "\n",
        "```\n",
        "Deployment Scenarios and Strategies:\n",
        "\n",
        "MOBILE APP (Aggressive compression needed):\n",
        "┌─────────────────────────────────────────┐\n",
        "│ Target: <10MB, <100ms inference         │\n",
        "│ Strategy:                               │\n",
        "│ • Magnitude pruning: 95% sparsity       │\n",
        "│ • Structured pruning: 50% channels      │\n",
        "│ • Knowledge distillation: 10x reduction │\n",
        "│ • Quantization: 8-bit weights           │\n",
        "└─────────────────────────────────────────┘\n",
        "\n",
        "EDGE DEVICE (Balanced compression):\n",
        "┌─────────────────────────────────────────┐\n",
        "│ Target: <50MB, <200ms inference         │\n",
        "│ Strategy:                               │\n",
        "│ • Magnitude pruning: 80% sparsity       │\n",
        "│ • Structured pruning: 30% channels      │\n",
        "│ • Low-rank: 50% rank reduction          │\n",
        "│ • Quantization: 16-bit weights          │\n",
        "└─────────────────────────────────────────┘\n",
        "\n",
        "CLOUD SERVICE (Minimal compression):\n",
        "┌─────────────────────────────────────────┐\n",
        "│ Target: Maintain accuracy, reduce cost  │\n",
        "│ Strategy:                               │\n",
        "│ • Magnitude pruning: 50% sparsity       │\n",
        "│ • Structured pruning: 10% channels      │\n",
        "│ • Dynamic batching optimization         │\n",
        "│ • Mixed precision inference             │\n",
        "└─────────────────────────────────────────┘\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "68de6767",
      "metadata": {},
      "outputs": [],
"source": [
|
||
"def compress_model(model, compression_config):\n",
|
||
" \"\"\"\n",
|
||
" Apply comprehensive model compression based on configuration.\n",
|
||
"\n",
|
||
" TODO: Implement complete compression pipeline\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Apply magnitude pruning if specified\n",
|
||
" 2. Apply structured pruning if specified\n",
|
||
" 3. Apply low-rank approximation if specified\n",
|
||
" 4. Return compression statistics\n",
|
||
"\n",
|
||
" EXAMPLE:\n",
|
||
" >>> config = {\n",
|
||
" ... 'magnitude_prune': 0.8,\n",
|
||
" ... 'structured_prune': 0.3,\n",
|
||
" ... 'low_rank': 0.5\n",
|
||
" ... }\n",
|
||
" >>> stats = compress_model(model, config)\n",
|
||
" >>> print(f\"Final sparsity: {stats['sparsity']:.1f}%\")\n",
|
||
" Final sparsity: 85.0%\n",
|
||
"\n",
|
||
" HINT: Apply techniques sequentially and measure results\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" original_params = sum(p.size for p in model.parameters())\n",
|
||
" original_sparsity = measure_sparsity(model)\n",
|
||
"\n",
|
||
" stats = {\n",
|
||
" 'original_params': original_params,\n",
|
||
" 'original_sparsity': original_sparsity,\n",
|
||
" 'applied_techniques': []\n",
|
||
" }\n",
|
||
"\n",
|
||
" # Apply magnitude pruning\n",
|
||
" if 'magnitude_prune' in compression_config:\n",
|
||
" sparsity = compression_config['magnitude_prune']\n",
|
||
" magnitude_prune(model, sparsity=sparsity)\n",
|
||
" stats['applied_techniques'].append(f'magnitude_prune_{sparsity}')\n",
|
||
"\n",
|
||
" # Apply structured pruning\n",
|
||
" if 'structured_prune' in compression_config:\n",
|
||
" ratio = compression_config['structured_prune']\n",
|
||
" structured_prune(model, prune_ratio=ratio)\n",
|
||
" stats['applied_techniques'].append(f'structured_prune_{ratio}')\n",
|
||
"\n",
|
||
" # Apply low-rank approximation (conceptually - would need architecture changes)\n",
|
||
" if 'low_rank' in compression_config:\n",
|
||
" ratio = compression_config['low_rank']\n",
|
||
" # For demo, we'll just record that it would be applied\n",
|
||
" stats['applied_techniques'].append(f'low_rank_{ratio}')\n",
|
||
"\n",
|
||
" # Final measurements\n",
|
||
" final_sparsity = measure_sparsity(model)\n",
|
||
" stats['final_sparsity'] = final_sparsity\n",
|
||
" stats['sparsity_increase'] = final_sparsity - original_sparsity\n",
|
||
"\n",
|
||
" return stats\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
"def test_unit_compress_model():\n",
|
||
" \"\"\"🔬 Test comprehensive model compression.\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Compress Model...\")\n",
|
||
"\n",
|
||
" # Create test model\n",
|
||
" model = Sequential(Linear(20, 15), Linear(15, 10), Linear(10, 5))\n",
|
||
"\n",
|
||
" # Define compression configuration\n",
|
||
" config = {\n",
|
||
" 'magnitude_prune': 0.7,\n",
|
||
" 'structured_prune': 0.2\n",
|
||
" }\n",
|
||
"\n",
|
||
" # Apply compression\n",
|
||
" stats = compress_model(model, config)\n",
|
||
"\n",
|
||
" # Verify statistics\n",
|
||
" assert 'original_params' in stats, \"Should track original parameter count\"\n",
|
||
" assert 'final_sparsity' in stats, \"Should track final sparsity\"\n",
|
||
" assert 'applied_techniques' in stats, \"Should track applied techniques\"\n",
|
||
"\n",
|
||
" # Verify compression was applied\n",
|
||
" assert stats['final_sparsity'] > stats['original_sparsity'], \"Sparsity should increase\"\n",
|
||
" assert len(stats['applied_techniques']) == 2, \"Should apply both techniques\"\n",
|
||
"\n",
|
||
" # Verify model still has reasonable structure\n",
|
||
" remaining_params = sum(np.count_nonzero(p.data) for p in model.parameters())\n",
|
||
" assert remaining_params > 0, \"Model should retain some parameters\"\n",
|
||
"\n",
|
||
" print(\"✅ compress_model works correctly!\")\n",
|
||
"\n",
|
||
"test_unit_compress_model()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "78b4d5fb",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## 9. Systems Analysis: Compression Performance and Trade-offs\n",
|
||
"\n",
|
||
"Understanding how compression techniques affect real-world deployment metrics like storage, memory, speed, and accuracy.\n",
|
||
"\n",
|
||
"### Compression Effectiveness Analysis\n",
|
||
"\n",
|
||
"Different techniques excel in different scenarios. Let's measure their effectiveness across various model sizes and architectures."
|
||
]
|
||
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "f8025b3f",
"metadata": {
"lines_to_next_cell": 1
},
"outputs": [],
"source": [
"def analyze_compression_ratios():\n",
" \"\"\"📊 Analyze compression ratios for different techniques.\"\"\"\n",
" print(\"📊 Analyzing Compression Ratios...\")\n",
"\n",
" # Create test models of different sizes\n",
" models = {\n",
" 'Small': Sequential(Linear(50, 30), Linear(30, 10)),\n",
" 'Medium': Sequential(Linear(200, 128), Linear(128, 64), Linear(64, 10)),\n",
" 'Large': Sequential(Linear(500, 256), Linear(256, 128), Linear(128, 10))\n",
" }\n",
"\n",
" compression_techniques = [\n",
" ('Magnitude 50%', {'magnitude_prune': 0.5}),\n",
" ('Magnitude 90%', {'magnitude_prune': 0.9}),\n",
" ('Structured 30%', {'structured_prune': 0.3}),\n",
" ('Combined', {'magnitude_prune': 0.8, 'structured_prune': 0.2})\n",
" ]\n",
"\n",
" print(f\"{'Model':<8} {'Technique':<15} {'Original':<10} {'Final':<10} {'Reduction':<10}\")\n",
" print(\"-\" * 65)\n",
"\n",
" for model_name, model in models.items():\n",
" original_params = sum(p.size for p in model.parameters())\n",
"\n",
" for tech_name, config in compression_techniques:\n",
" # Create fresh copy for each test\n",
" test_model = copy.deepcopy(model)\n",
"\n",
" # Apply compression\n",
" stats = compress_model(test_model, config)\n",
"\n",
" # Calculate compression ratio\n",
" remaining_params = sum(np.count_nonzero(p.data) for p in test_model.parameters())\n",
" reduction = (1 - remaining_params / original_params) * 100\n",
"\n",
" print(f\"{model_name:<8} {tech_name:<15} {original_params:<10} {remaining_params:<10} {reduction:<9.1f}%\")\n",
"\n",
" print(\"\\n💡 Key Insights:\")\n",
" print(\"• Magnitude pruning achieves predictable sparsity levels\")\n",
" print(\"• Structured pruning creates hardware-friendly sparsity\")\n",
" print(\"• Combined techniques offer maximum compression\")\n",
" print(\"• Larger models compress better (more redundancy)\")\n",
"\n",
"analyze_compression_ratios()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f29e9dc0",
"metadata": {},
"outputs": [],
"source": [
"def analyze_compression_speed():\n",
" \"\"\"📊 Analyze inference speed with different compression levels.\"\"\"\n",
" print(\"📊 Analyzing Compression Speed Impact...\")\n",
"\n",
" # Create test model\n",
" model = Sequential(Linear(512, 256), Linear(256, 128), Linear(128, 10))\n",
" test_input = Tensor(np.random.randn(100, 512)) # Batch of 100\n",
"\n",
" def time_inference(model, input_data, iterations=50):\n",
" \"\"\"Time model inference.\"\"\"\n",
" times = []\n",
" for _ in range(iterations):\n",
" start = time.time()\n",
" _ = model.forward(input_data)\n",
" times.append(time.time() - start)\n",
" return np.mean(times[5:]) # Skip first few for warmup\n",
"\n",
" # Test different compression levels\n",
" compression_levels = [\n",
" ('Original', {}),\n",
" ('Light Pruning', {'magnitude_prune': 0.5}),\n",
" ('Heavy Pruning', {'magnitude_prune': 0.9}),\n",
" ('Structured', {'structured_prune': 0.3}),\n",
" ('Combined', {'magnitude_prune': 0.8, 'structured_prune': 0.2})\n",
" ]\n",
"\n",
" print(f\"{'Compression':<15} {'Sparsity':<10} {'Time (ms)':<12} {'Speedup':<10}\")\n",
" print(\"-\" * 50)\n",
"\n",
" baseline_time = None\n",
"\n",
" for name, config in compression_levels:\n",
" # Create fresh model copy\n",
" test_model = copy.deepcopy(model)\n",
"\n",
" # Apply compression\n",
" if config:\n",
" compress_model(test_model, config)\n",
"\n",
" # Measure performance\n",
" sparsity = measure_sparsity(test_model)\n",
" inference_time = time_inference(test_model, test_input) * 1000 # Convert to ms\n",
"\n",
" if baseline_time is None:\n",
" baseline_time = inference_time\n",
" speedup = 1.0\n",
" else:\n",
" speedup = baseline_time / inference_time\n",
"\n",
" print(f\"{name:<15} {sparsity:<9.1f}% {inference_time:<11.2f} {speedup:<9.2f}x\")\n",
"\n",
" print(\"\\n💡 Speed Insights:\")\n",
" print(\"• Dense matrix operations show minimal speedup from unstructured sparsity\")\n",
" print(\"• Structured sparsity enables better hardware acceleration\")\n",
" print(\"• Real speedups require sparse-optimized libraries (e.g., NVIDIA 2:4 sparsity)\")\n",
" print(\"• Memory bandwidth often more important than parameter count\")\n",
"\n",
"analyze_compression_speed()"
]
},
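{
"cell_type": "markdown",
"id": "b5c6d7e8",
"metadata": {},
"source": [
"The \"minimal speedup\" result above is worth seeing concretely: dense kernels multiply the zeros at full cost, and only a sparse storage format actually skips them. A minimal sketch using SciPy's CSR format (a hypothetical illustration; SciPy is not a TinyTorch dependency):\n",
"\n",
"```python\n",
"import time\n",
"import numpy as np\n",
"import scipy.sparse as sp\n",
"\n",
"W = np.random.randn(512, 512)\n",
"W[np.abs(W) < 2.0] = 0.0             # ~95% zeros, unstructured\n",
"W_csr = sp.csr_matrix(W)             # store only the nonzero entries\n",
"x = np.random.randn(512, 100)\n",
"\n",
"start = time.time(); _ = W @ x;     dense_ms = (time.time() - start) * 1000\n",
"start = time.time(); _ = W_csr @ x; sparse_ms = (time.time() - start) * 1000\n",
"print(f\"dense: {dense_ms:.2f} ms, CSR: {sparse_ms:.2f} ms\")\n",
"```\n",
"\n",
"Whether CSR wins depends on the sparsity level and the hardware: at moderate sparsity the indexing overhead can make it slower than the dense BLAS call, which is exactly the trade-off the table above exposes."
]
},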
{
"cell_type": "markdown",
"id": "e6c5926b",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 10. Optimization Insights: Production Compression Strategy\n",
"\n",
"Understanding the real-world implications of compression choices and how to design compression strategies for different deployment scenarios.\n",
"\n",
"### Accuracy vs Compression Trade-offs\n",
"\n",
"The fundamental challenge in model compression is balancing three competing objectives: model size, inference speed, and prediction accuracy."
]
},
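{
"cell_type": "markdown",
"id": "c7d8e9f0",
"metadata": {},
"source": [
"One of the strategies compared below, knowledge distillation, rests on a single formula: a temperature-softened cross-entropy between teacher and student predictions, blended with the ordinary hard-label loss by a weight `alpha`. A minimal NumPy sketch of that standard formulation (an illustration, not the exact `KnowledgeDistillation` class built earlier in this module):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def softmax(z, T=1.0):\n",
"    z = z / T\n",
"    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability\n",
"    e = np.exp(z)\n",
"    return e / e.sum(axis=-1, keepdims=True)\n",
"\n",
"def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.8):\n",
"    # Soft targets: cross-entropy against the temperature-softened teacher,\n",
"    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.\n",
"    soft_t = softmax(teacher_logits, T)\n",
"    soft_s = softmax(student_logits, T)\n",
"    soft_loss = -np.mean(np.sum(soft_t * np.log(soft_s + 1e-9), axis=-1)) * T * T\n",
"    # Hard targets: ordinary cross-entropy against the true labels.\n",
"    probs = softmax(student_logits)\n",
"    hard_loss = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-9))\n",
"    return alpha * soft_loss + (1 - alpha) * hard_loss\n",
"```\n",
"\n",
"A higher temperature spreads the teacher's probability mass over more classes, exposing which wrong answers are almost right."
]
},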
{
"cell_type": "code",
"execution_count": null,
"id": "351bffdb",
"metadata": {},
"outputs": [],
"source": [
"def analyze_compression_accuracy_tradeoff():\n",
" \"\"\"📊 Analyze accuracy vs compression trade-offs.\"\"\"\n",
" print(\"📊 Analyzing Accuracy vs Compression Trade-offs...\")\n",
"\n",
" # Simulate accuracy degradation (in practice, would need real training/testing)\n",
" def simulate_accuracy_loss(sparsity, technique_type):\n",
" \"\"\"Simulate realistic accuracy loss patterns.\"\"\"\n",
" if technique_type == 'magnitude':\n",
" # Magnitude pruning: gradual degradation\n",
" return max(0, sparsity * 0.3 + np.random.normal(0, 0.05))\n",
" elif technique_type == 'structured':\n",
" # Structured pruning: more aggressive early loss\n",
" return max(0, sparsity * 0.5 + np.random.normal(0, 0.1))\n",
" elif technique_type == 'knowledge_distillation':\n",
" # Knowledge distillation: better preservation\n",
" return max(0, sparsity * 0.1 + np.random.normal(0, 0.02))\n",
" else:\n",
" return sparsity * 0.4\n",
"\n",
" # Test different compression strategies\n",
" strategies = [\n",
" ('Magnitude Only', 'magnitude'),\n",
" ('Structured Only', 'structured'),\n",
" ('Knowledge Distillation', 'knowledge_distillation'),\n",
" ('Combined Approach', 'combined')\n",
" ]\n",
"\n",
" sparsity_levels = np.arange(0.1, 1.0, 0.1)\n",
"\n",
" print(f\"{'Strategy':<20} {'Sparsity':<10} {'Accuracy Loss':<15}\")\n",
" print(\"-\" * 50)\n",
"\n",
" for strategy_name, strategy_type in strategies:\n",
" print(f\"\\n{strategy_name}:\")\n",
" for sparsity in sparsity_levels:\n",
" if strategy_type == 'combined':\n",
" # Combined approach uses multiple techniques\n",
" loss = min(\n",
" simulate_accuracy_loss(sparsity * 0.7, 'magnitude'),\n",
" simulate_accuracy_loss(sparsity * 0.3, 'structured')\n",
" )\n",
" else:\n",
" loss = simulate_accuracy_loss(sparsity, strategy_type)\n",
"\n",
" print(f\"{'':20} {sparsity:<9.1f} {loss:<14.3f}\")\n",
"\n",
" print(\"\\n💡 Trade-off Insights:\")\n",
" print(\"• Knowledge distillation preserves accuracy best at high compression\")\n",
" print(\"• Magnitude pruning offers gradual degradation curve\")\n",
" print(\"• Structured pruning enables hardware acceleration but higher accuracy loss\")\n",
" print(\"• Combined approaches balance multiple objectives\")\n",
" print(\"• Early stopping based on accuracy threshold is crucial\")\n",
"\n",
"analyze_compression_accuracy_tradeoff()"
]
},
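{
"cell_type": "markdown",
"id": "d9e0f1a2",
"metadata": {},
"source": [
"The last insight, stopping on an accuracy threshold, is usually realized as an iterative loop: prune a little, evaluate, and keep the last model that stayed within budget. A minimal sketch, assuming `magnitude_prune(model, sparsity)` from earlier in this module and a hypothetical `evaluate(model)` that returns accuracy on a validation set:\n",
"\n",
"```python\n",
"import copy\n",
"\n",
"def prune_within_budget(model, evaluate, max_accuracy_drop=0.01, step=0.1):\n",
"    \"\"\"Raise sparsity gradually; return the last model inside the accuracy budget.\"\"\"\n",
"    baseline = evaluate(model)\n",
"    best = copy.deepcopy(model)\n",
"    sparsity = 0.0\n",
"    while sparsity + step < 1.0:\n",
"        sparsity += step\n",
"        candidate = copy.deepcopy(best)\n",
"        magnitude_prune(candidate, sparsity)\n",
"        if baseline - evaluate(candidate) > max_accuracy_drop:\n",
"            break  # crossed the budget; keep the previous model\n",
"        best = candidate\n",
"    return best\n",
"```\n",
"\n",
"In production this loop is often interleaved with a few fine-tuning steps after each pruning round, which recovers much of the lost accuracy."
]
},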
{
"cell_type": "markdown",
"id": "8a67dffa",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 11. Module Integration Test\n",
"\n",
"Final validation that all compression techniques work together correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d51b541",
"metadata": {},
"outputs": [],
"source": [
"def test_module():\n",
" \"\"\"\n",
" Comprehensive test of entire compression module functionality.\n",
"\n",
" This final test runs before module summary to ensure:\n",
" - All unit tests pass\n",
" - Functions work together correctly\n",
" - Module is ready for integration with TinyTorch\n",
" \"\"\"\n",
" print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
" print(\"=\" * 50)\n",
"\n",
" # Run all unit tests\n",
" print(\"Running unit tests...\")\n",
" test_unit_measure_sparsity()\n",
" test_unit_magnitude_prune()\n",
" test_unit_structured_prune()\n",
" test_unit_low_rank_approximate()\n",
" test_unit_knowledge_distillation()\n",
" test_unit_compress_model()\n",
"\n",
" print(\"\\nRunning integration scenarios...\")\n",
"\n",
" # Test 1: Complete compression pipeline\n",
" print(\"🔬 Integration Test: Complete compression pipeline...\")\n",
"\n",
" # Create a realistic model\n",
" model = Sequential(\n",
" Linear(784, 512), # Input layer (like MNIST)\n",
" Linear(512, 256), # Hidden layer 1\n",
" Linear(256, 128), # Hidden layer 2\n",
" Linear(128, 10) # Output layer\n",
" )\n",
"\n",
" original_params = sum(p.size for p in model.parameters())\n",
" print(f\"Original model: {original_params:,} parameters\")\n",
"\n",
" # Apply comprehensive compression\n",
" compression_config = {\n",
" 'magnitude_prune': 0.8,\n",
" 'structured_prune': 0.3\n",
" }\n",
"\n",
" stats = compress_model(model, compression_config)\n",
" final_sparsity = measure_sparsity(model)\n",
"\n",
" # Validate compression results\n",
" assert final_sparsity > 70, f\"Expected >70% sparsity, got {final_sparsity:.1f}%\"\n",
" assert stats['sparsity_increase'] > 70, \"Should achieve significant compression\"\n",
" assert len(stats['applied_techniques']) == 2, \"Should apply both techniques\"\n",
"\n",
" print(f\"✅ Achieved {final_sparsity:.1f}% sparsity with {len(stats['applied_techniques'])} techniques\")\n",
"\n",
" # Test 2: Knowledge distillation setup\n",
" print(\"🔬 Integration Test: Knowledge distillation...\")\n",
"\n",
" teacher = Sequential(Linear(100, 200), Linear(200, 50))\n",
" student = Sequential(Linear(100, 50)) # 3x fewer parameters\n",
|
||
"\n",
|
||
" kd = KnowledgeDistillation(teacher, student, temperature=4.0, alpha=0.8)\n",
|
||
"\n",
|
||
" # Verify setup\n",
|
||
" teacher_params = sum(p.size for p in teacher.parameters())\n",
|
||
" student_params = sum(p.size for p in student.parameters())\n",
|
||
" compression_ratio = student_params / teacher_params\n",
|
||
"\n",
|
||
" assert compression_ratio < 0.5, f\"Student should be <50% of teacher size, got {compression_ratio:.2f}\"\n",
|
||
" assert kd.temperature == 4.0, \"Temperature should be set correctly\"\n",
|
||
" assert kd.alpha == 0.8, \"Alpha should be set correctly\"\n",
|
||
"\n",
|
||
" print(f\"✅ Knowledge distillation: {compression_ratio:.2f}x size reduction\")\n",
|
||
"\n",
|
||
" # Test 3: Low-rank approximation\n",
|
||
" print(\"🔬 Integration Test: Low-rank approximation...\")\n",
|
||
"\n",
|
||
" large_matrix = np.random.randn(200, 150)\n",
|
||
" U, S, V = low_rank_approximate(large_matrix, rank_ratio=0.3)\n",
|
||
"\n",
|
||
" original_size = large_matrix.size\n",
|
||
" compressed_size = U.size + S.size + V.size\n",
|
||
" compression_ratio = compressed_size / original_size\n",
|
||
"\n",
|
||
" assert compression_ratio < 0.7, f\"Should achieve compression, got ratio {compression_ratio:.2f}\"\n",
|
||
"\n",
|
||
" # Test reconstruction\n",
|
||
" reconstructed = U @ np.diag(S) @ V\n",
|
||
" error = np.linalg.norm(large_matrix - reconstructed) / np.linalg.norm(large_matrix)\n",
|
||
" assert error < 0.5, f\"Reconstruction error too high: {error:.3f}\"\n",
|
||
"\n",
|
||
" print(f\"✅ Low-rank: {compression_ratio:.2f}x compression, {error:.3f} error\")\n",
|
||
"\n",
|
||
" print(\"\\n\" + \"=\" * 50)\n",
|
||
" print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
|
||
" print(\"Run: tito module complete 18\")\n",
|
||
"\n",
|
||
"# Call the integration test\n",
|
||
"test_module()"
|
||
]
|
||
},
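{
"cell_type": "markdown",
"id": "e1f2a3b4",
"metadata": {},
"source": [
"For reference, the low-rank numbers in the test above follow directly from the storage formula: a rank-r factorization of an m × n matrix stores r(m + n + 1) values (U, S, and V) instead of mn. Assuming `rank_ratio` scales the smaller dimension, rank = 0.3 × 150 = 45 here:\n",
"\n",
"```python\n",
"m, n, r = 200, 150, 45\n",
"original = m * n              # 30,000 values\n",
"compressed = r * (m + n + 1)  # U: m*r, S: r, V: r*n -> 15,795 values\n",
"print(compressed / original)  # ~0.53, comfortably under the 0.7 bound\n",
"```\n",
"\n",
"Compression pays off only while r < mn / (m + n); past that point the factors store more values than the original matrix."
]
},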
{
"cell_type": "code",
"execution_count": null,
"id": "8445b205",
"metadata": {},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" print(\"🚀 Running Compression module...\")\n",
" test_module()\n",
" print(\"✅ Module validation complete!\")"
]
},
{
"cell_type": "markdown",
"id": "eb215fc2",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Thinking: Compression Foundations\n",
"\n",
"### Question 1: Compression Trade-offs\n",
"You implemented magnitude pruning that removes 90% of weights from a 10M parameter model.\n",
"- How many parameters remain active? _____ M parameters\n",
"- If the original model was 40MB, what's the theoretical minimum storage? _____ MB\n",
"- Why might actual speedup be less than 10x? _____________\n",
"\n",
"### Question 2: Structured vs Unstructured Sparsity\n",
"Your structured pruning removes entire channels, while magnitude pruning creates scattered zeros.\n",
"- Which enables better hardware acceleration? _____________\n",
"- Which preserves accuracy better at high sparsity? _____________\n",
"- Which creates more predictable memory access patterns? _____________\n",
"\n",
"### Question 3: Knowledge Distillation Efficiency\n",
"A teacher model has 100M parameters, student has 10M parameters, both achieve 85% accuracy.\n",
|
||
"- What's the compression ratio? _____x\n",
|
||
"- If teacher inference takes 100ms, student takes 15ms, what's the speedup? _____x\n",
|
||
"- Why is the speedup greater than the compression ratio? _____________\n",
|
||
"\n",
|
||
"### Question 4: Low-Rank Decomposition\n",
|
||
"You approximate a (512, 256) weight matrix with rank 64 using SVD.\n",
|
||
"- Original parameter count: _____ parameters\n",
|
||
"- Decomposed parameter count: _____ parameters\n",
|
||
"- Compression ratio: _____x\n",
|
||
"- At what rank does compression become ineffective? rank > _____"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0506c01f",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 🎯 MODULE SUMMARY: Compression\n",
|
||
"\n",
|
||
"Congratulations! You've built a comprehensive model compression system that can dramatically reduce model size while preserving intelligence!\n",
|
||
"\n",
|
||
"### Key Accomplishments\n",
|
||
"- Built magnitude-based and structured pruning techniques with clear sparsity patterns\n",
|
||
"- Implemented knowledge distillation for teacher-student compression with temperature scaling\n",
|
||
"- Created low-rank approximation using SVD decomposition for matrix factorization\n",
|
||
"- Developed sparsity measurement and comprehensive compression pipeline\n",
|
||
"- Analyzed compression trade-offs between size, speed, and accuracy with real measurements\n",
|
||
"- All tests pass ✅ (validated by `test_module()`)\n",
|
||
"\n",
|
||
"### Systems Insights Gained\n",
|
||
"- **Structured vs Unstructured**: Hardware-friendly sparsity patterns vs maximum compression ratios\n",
|
||
"- **Compression Cascading**: Multiple techniques compound benefits but require careful sequencing\n",
|
||
"- **Accuracy Preservation**: Knowledge distillation maintains performance better than pruning alone\n",
|
||
"- **Memory vs Speed**: Parameter reduction doesn't guarantee proportional speedup without sparse libraries\n",
|
||
"- **Deployment Strategy**: Different scenarios (mobile, edge, cloud) require different compression approaches\n",
|
||
"\n",
|
||
"### Technical Mastery\n",
|
||
"- **Sparsity Measurement**: Calculate and track zero weight percentages across models\n",
|
||
"- **Magnitude Pruning**: Global thresholding based on weight importance ranking\n",
|
||
"- **Structured Pruning**: Channel-wise removal using L2 norm importance metrics\n",
|
||
"- **Knowledge Distillation**: Teacher-student training with temperature-scaled soft targets\n",
|
||
"- **Low-Rank Approximation**: SVD-based matrix factorization for parameter reduction\n",
|
||
"- **Pipeline Integration**: Sequential application of multiple compression techniques\n",
|
||
"\n",
|
||
"### Ready for Next Steps\n",
|
||
"Your compression implementation enables efficient model deployment across diverse hardware constraints!\n",
|
||
"Export with: `tito module complete 18`\n",
|
||
"\n",
|
||
"**Next**: Module 19 will add comprehensive benchmarking to evaluate all optimization techniques together, measuring the cumulative effects of quantization, acceleration, and compression!"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|