{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "7c0b2b14",
      "metadata": {
        "cell_marker": "\"\"\""
      },
      "source": [
        "# Module 18: Compression - Making Models Smaller\n",
        "\n",
        "Welcome to Module 18! You're about to build model compression techniques that make neural networks smaller and more efficient while preserving their intelligence.\n",
        "\n",
        "## 🔗 Prerequisites & Progress\n",
        "**You've Built**: Full TinyGPT pipeline with profiling, acceleration, and quantization\n",
        "**You'll Build**: Pruning (magnitude & structured), knowledge distillation, and low-rank approximation\n",
        "**You'll Enable**: Compressed models that maintain accuracy while using dramatically less storage and memory\n",
        "\n",
        "**Connection Map**:\n",
        "```\n",
        "Quantization → Compression → Benchmarking\n",
        "(precision)    (sparsity)    (evaluation)\n",
        "```\n",
        "\n",
        "## Learning Objectives\n",
        "By the end of this module, you will:\n",
        "1. Implement magnitude-based and structured pruning\n",
        "2. Build knowledge distillation for model compression\n",
        "3. Create low-rank approximations of weight matrices\n",
        "4. Measure compression ratios and sparsity levels\n",
        "5. Understand structured vs unstructured sparsity trade-offs\n",
        "\n",
        "Let's get started!\n",
        "\n",
        "## 📦 Where This Code Lives in the Final Package\n",
        "\n",
        "**Learning Side:** You work in `modules/18_compression/compression_dev.py` \n",
        "**Building Side:** Code exports to `tinytorch.optimization.compression`\n",
        "\n",
        "```python\n",
        "# How to use this module:\n",
        "from tinytorch.optimization.compression import magnitude_prune, structured_prune, measure_sparsity\n",
        "```\n",
        "\n",
        "**Why this matters:**\n",
        "- **Learning:** Complete compression system in one focused module for deep understanding\n",
        "- **Production:** Proper organization like real compression libraries with all techniques together\n",
        "- **Consistency:** All compression operations and sparsity management in optimization.compression\n",
        "- **Integration:** Works seamlessly with models and quantization for complete optimization pipeline"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "37872416",
      "metadata": {
        "lines_to_next_cell": 1,
        "nbgrader": {
          "grade": false,
          "grade_id": "imports",
          "solution": true
        }
      },
      "outputs": [],
      "source": [
        "#| default_exp optimization.compression\n",
        "#| export\n",
        "\n",
        "import numpy as np\n",
        "import copy\n",
        "from typing import List, Dict, Any, Tuple, Optional\n",
        "import time\n",
        "\n",
        "# Import from previous modules\n",
        "# Note: In the full package, these would be imports like:\n",
        "# from tinytorch.core.tensor import Tensor\n",
        "# from tinytorch.core.layers import Linear\n",
        "# For development, we'll create minimal implementations\n",
        "\n",
        "class Tensor:\n",
        "    \"\"\"Minimal Tensor class for compression development - imports from Module 01 in practice.\"\"\"\n",
        "    def __init__(self, data, requires_grad=False):\n",
        "        self.data = np.array(data)\n",
        "        self.shape = self.data.shape\n",
        "        self.size = self.data.size\n",
        "        self.requires_grad = requires_grad\n",
        "        self.grad = None\n",
        "\n",
        "    def __add__(self, other):\n",
        "        if isinstance(other, Tensor):\n",
        "            return Tensor(self.data + other.data)\n",
        "        return Tensor(self.data + other)\n",
        "\n",
        "    def __mul__(self, other):\n",
        "        if isinstance(other, Tensor):\n",
        "            return Tensor(self.data * other.data)\n",
        "        return Tensor(self.data * other)\n",
        "\n",
        "    def matmul(self, other):\n",
        "        return Tensor(np.dot(self.data, other.data))\n",
        "\n",
        "    def abs(self):\n",
        "        return Tensor(np.abs(self.data))\n",
        "\n",
        "    def sum(self, axis=None):\n",
        "        return Tensor(self.data.sum(axis=axis))\n",
        "\n",
        "    def __repr__(self):\n",
        "        return f\"Tensor(shape={self.shape})\"\n",
        "\n",
"class Linear:\n",
|
||
" \"\"\"Minimal Linear layer for compression development - imports from Module 03 in practice.\"\"\"\n",
|
||
" def __init__(self, in_features, out_features, bias=True):\n",
|
||
" self.in_features = in_features\n",
|
||
" self.out_features = out_features\n",
|
||
" # Initialize with He initialization\n",
|
||
" self.weight = Tensor(np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features))\n",
|
||
" self.bias = Tensor(np.zeros(out_features)) if bias else None\n",
|
||
"\n",
|
||
" def forward(self, x):\n",
|
||
" output = x.matmul(self.weight)\n",
|
||
" if self.bias is not None:\n",
|
||
" output = output + self.bias\n",
|
||
" return output\n",
|
||
"\n",
|
||
" def parameters(self):\n",
|
||
" params = [self.weight]\n",
|
||
" if self.bias is not None:\n",
|
||
" params.append(self.bias)\n",
|
||
" return params\n",
|
||
"\n",
|
||
"class Sequential:\n",
|
||
" \"\"\"Minimal Sequential container for model compression.\"\"\"\n",
|
||
" def __init__(self, *layers):\n",
|
||
" self.layers = list(layers)\n",
|
||
"\n",
|
||
" def forward(self, x):\n",
|
||
" for layer in self.layers:\n",
|
||
" x = layer.forward(x)\n",
|
||
" return x\n",
|
||
"\n",
|
||
" def parameters(self):\n",
|
||
" params = []\n",
|
||
" for layer in self.layers:\n",
|
||
" if hasattr(layer, 'parameters'):\n",
|
||
" params.extend(layer.parameters())\n",
|
||
" return params"
|
||
]
|
||
},
|
||
    {
      "cell_type": "markdown",
      "id": "252e20ce",
      "metadata": {
        "cell_marker": "\"\"\""
      },
      "source": [
        "## 1. Introduction: What is Model Compression?\n",
        "\n",
        "Imagine you have a massive library with millions of books, but you only reference 10% of them regularly. Model compression is like creating a curated collection that keeps the essential knowledge while dramatically reducing storage space.\n",
        "\n",
        "Model compression reduces the size and computational requirements of neural networks while preserving their intelligence. It's the bridge between powerful research models and practical deployment.\n",
        "\n",
        "### Why Compression Matters in ML Systems\n",
        "\n",
        "**The Storage Challenge:**\n",
        "- Modern language models: 100GB+ (GPT-3 scale)\n",
        "- Mobile devices: <1GB available for models\n",
        "- Edge devices: <100MB realistic limits\n",
        "- Network bandwidth: Slow downloads kill user experience\n",
        "\n",
        "**The Speed Challenge:**\n",
        "- Research models: Designed for accuracy, not efficiency\n",
        "- Production needs: Sub-second response times\n",
        "- Battery life: Energy consumption matters for mobile\n",
        "- Cost scaling: Inference costs grow with model size\n",
        "\n",
        "### The Compression Landscape\n",
        "\n",
        "```\n",
        "Neural Network Compression Techniques:\n",
        "\n",
        "┌─────────────────────────────────────────────────────────────┐\n",
        "│                    COMPRESSION METHODS                      │\n",
        "├─────────────────────────────────────────────────────────────┤\n",
        "│  WEIGHT-BASED                    │  ARCHITECTURE-BASED      │\n",
        "│  ┌─────────────────────────────┐ │ ┌─────────────────────┐  │\n",
        "│  │ Magnitude Pruning           │ │ │ Knowledge           │  │\n",
        "│  │ • Remove small weights      │ │ │ Distillation        │  │\n",
        "│  │ • 90% sparsity achievable   │ │ │ • Teacher → Student │  │\n",
        "│  │                             │ │ │ • 10x size reduction│  │\n",
        "│  │ Structured Pruning          │ │ │                     │  │\n",
        "│  │ • Remove entire channels    │ │ │ Neural Architecture │  │\n",
        "│  │ • Hardware-friendly         │ │ │ Search (NAS)        │  │\n",
        "│  │                             │ │ │ • Automated design  │  │\n",
        "│  │ Low-Rank Approximation      │ │ │                     │  │\n",
        "│  │ • Matrix factorization      │ │ │ Early Exit          │  │\n",
        "│  │ • SVD decomposition         │ │ │ • Adaptive compute  │  │\n",
        "│  └─────────────────────────────┘ │ └─────────────────────┘  │\n",
        "└─────────────────────────────────────────────────────────────┘\n",
        "```\n",
        "\n",
        "Think of compression like optimizing a recipe - you want to keep the essential ingredients that create the flavor while removing anything that doesn't contribute to the final dish."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "30325dfe",
      "metadata": {
        "cell_marker": "\"\"\""
      },
      "source": [
        "## 2. Foundations: Mathematical Background\n",
        "\n",
        "Understanding the mathematics behind compression helps us choose the right technique for each situation and predict its effects on model performance.\n",
        "\n",
        "### Magnitude-Based Pruning: The Simple Approach\n",
        "\n",
        "The core insight: small weights contribute little to the final prediction. Magnitude pruning removes weights based on their absolute values.\n",
        "\n",
        "```\n",
        "Mathematical Foundation:\n",
        "For weight w_ij in layer l:\n",
        "  If |w_ij| < threshold_l → w_ij = 0\n",
        "\n",
        "Threshold Selection:\n",
        "- Global: One threshold for entire model\n",
        "- Layer-wise: Different threshold per layer\n",
        "- Percentile-based: Remove bottom k% of weights\n",
        "\n",
        "Sparsity Calculation:\n",
        "  Sparsity = (Zero weights / Total weights) × 100%\n",
        "```\n",
"\n",
        "### Structured Pruning: Hardware-Friendly Compression\n",
        "\n",
        "Unlike magnitude pruning which creates scattered zeros, structured pruning removes entire computational units (neurons, channels, attention heads).\n",
        "\n",
        "```\n",
        "Channel Importance Metrics:\n",
        "\n",
        "Method 1: L2 Norm\n",
        "  Importance(channel_i) = ||W[:,i]||₂ = √(Σⱼ W²ⱼᵢ)\n",
        "\n",
        "Method 2: Gradient-based\n",
        "  Importance(channel_i) = |∂Loss/∂W[:,i]|\n",
        "\n",
        "Method 3: Activation-based\n",
        "  Importance(channel_i) = E[|activations_i|]\n",
        "\n",
        "Pruning Decision:\n",
        "  Remove bottom k% of channels based on importance ranking\n",
        "```\n",
        "\n",
        "### Knowledge Distillation: Learning from Teachers\n",
        "\n",
        "Knowledge distillation transfers knowledge from a large \"teacher\" model to a smaller \"student\" model. The student learns not just the correct answers, but the teacher's reasoning process.\n",
        "\n",
        "```\n",
        "Distillation Loss Function:\n",
        "  L_total = α × L_soft + (1-α) × L_hard\n",
        "\n",
        "Where:\n",
        "  L_soft = KL_divergence(σ(z_s/T), σ(z_t/T))  # Soft targets\n",
        "  L_hard = CrossEntropy(σ(z_s), y_true)       # Hard targets\n",
        "\n",
        "  σ(z/T) = Softmax with temperature T\n",
        "  z_s = Student logits, z_t = Teacher logits\n",
        "  α = Balance parameter (typically 0.7)\n",
        "  T = Temperature parameter (typically 3-5)\n",
        "\n",
        "Temperature Effect:\n",
        "  T=1: Standard softmax (sharp probabilities)\n",
        "  T>1: Softer distributions (reveals teacher's uncertainty)\n",
        "```\n",
"\n",
        "### Low-Rank Approximation: Matrix Compression\n",
        "\n",
        "Large weight matrices often have redundancy that can be captured with lower-rank approximations using Singular Value Decomposition (SVD).\n",
        "\n",
        "```\n",
        "SVD Decomposition:\n",
        "  W_{m×n} = U_{m×k} × Σ_{k×k} × V^T_{k×n}\n",
        "\n",
        "Parameter Reduction:\n",
        "  Original: m × n parameters\n",
        "  Compressed: (m × k) + k + (k × n) = k(m + n + 1) parameters\n",
        "\n",
        "  Compression achieved when: k < mn/(m+n+1)\n",
        "\n",
        "Reconstruction Error:\n",
        "  ||W - W_approx||_F = √(Σᵢ₌ₖ₊₁ʳ σᵢ²)\n",
        "\n",
        "  Where σᵢ are singular values, r = rank(W)\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ce0801cd",
      "metadata": {
        "cell_marker": "\"\"\"",
        "lines_to_next_cell": 1
      },
      "source": [
        "## 3. Sparsity Measurement - Understanding Model Density\n",
        "\n",
        "Before we can compress models, we need to understand how dense they are. Sparsity measurement tells us what percentage of weights are zero (or effectively zero).\n",
        "\n",
        "### Understanding Sparsity\n",
        "\n",
        "Sparsity is like measuring how much of a parking lot is empty. A 90% sparse model means 90% of its weights are zero - only 10% of the \"parking spaces\" are occupied.\n",
        "\n",
        "```\n",
        "Sparsity Visualization:\n",
        "\n",
        "Dense Matrix (0% sparse):          Sparse Matrix (75% sparse):\n",
        "┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐   ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐\n",
        "│ 2.1 1.3 0.8 1.9 2.4 1.1 0.7 │   │ 2.1 0.0 0.0 1.9 0.0 0.0 0.0 │\n",
        "│ 1.5 2.8 1.2 0.9 1.6 2.2 1.4 │   │ 0.0 2.8 0.0 0.0 0.0 2.2 0.0 │\n",
        "│ 0.6 1.7 2.5 1.1 0.8 1.3 2.0 │   │ 0.0 0.0 2.5 0.0 0.0 0.0 2.0 │\n",
"│ 1.9 1.0 1.6 2.3 1.8 0.9 1.2 │ │ 1.9 0.0 0.0 2.3 0.0 0.0 0.0 │\n",
|
||
"└─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘ └─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘\n",
|
||
"All weights active Only 7/28 weights active\n",
|
||
"Storage: 28 values Storage: 7 values + indices\n",
|
||
"```\n",
|
||
"\n",
        "Why this matters: Sparsity directly relates to memory savings, but achieving speedup requires special sparse computation libraries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4440ec7a",
      "metadata": {},
      "outputs": [],
      "source": [
        "def measure_sparsity(model) -> float:\n",
        "    \"\"\"\n",
        "    Calculate the percentage of zero weights in a model.\n",
        "\n",
        "    TODO: Count zero weights and total weights across all layers\n",
        "\n",
        "    APPROACH:\n",
        "    1. Iterate through all model parameters\n",
        "    2. Count zeros using np.sum(weights == 0)\n",
        "    3. Count total parameters\n",
        "    4. Return percentage: zeros / total * 100\n",
        "\n",
        "    EXAMPLE:\n",
        "    >>> model = Sequential(Linear(10, 5), Linear(5, 2))\n",
        "    >>> sparsity = measure_sparsity(model)\n",
        "    >>> print(f\"Model sparsity: {sparsity:.1f}%\")\n",
        "    Model sparsity: 0.0%  # Before pruning\n",
        "\n",
        "    HINT: Use np.sum() to count zeros efficiently\n",
        "    \"\"\"\n",
        "    ### BEGIN SOLUTION\n",
        "    total_params = 0\n",
        "    zero_params = 0\n",
        "\n",
        "    for param in model.parameters():\n",
        "        total_params += param.size\n",
        "        zero_params += np.sum(param.data == 0)\n",
        "\n",
        "    if total_params == 0:\n",
        "        return 0.0\n",
        "\n",
        "    return (zero_params / total_params) * 100.0\n",
        "    ### END SOLUTION\n",
        "\n",
        "def test_unit_measure_sparsity():\n",
        "    \"\"\"🔬 Test sparsity measurement functionality.\"\"\"\n",
        "    print(\"🔬 Unit Test: Measure Sparsity...\")\n",
        "\n",
        "    # Test with dense model\n",
        "    model = Sequential(Linear(4, 3), Linear(3, 2))\n",
        "    initial_sparsity = measure_sparsity(model)\n",
        "    assert initial_sparsity == 0.0, f\"Expected 0% sparsity, got {initial_sparsity}%\"\n",
        "\n",
        "    # Test with manually sparse model\n",
        "    model.layers[0].weight.data[0, 0] = 0\n",
        "    model.layers[0].weight.data[1, 1] = 0\n",
        "    sparse_sparsity = measure_sparsity(model)\n",
        "    assert sparse_sparsity > 0, f\"Expected >0% sparsity, got {sparse_sparsity}%\"\n",
        "\n",
        "    print(\"✅ measure_sparsity works correctly!\")\n",
        "\n",
        "test_unit_measure_sparsity()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "fc5fb46e",
      "metadata": {
        "cell_marker": "\"\"\"",
        "lines_to_next_cell": 1
      },
      "source": [
        "## 4. Magnitude-Based Pruning - Removing Small Weights\n",
        "\n",
        "Magnitude pruning is the simplest and most intuitive compression technique. It's based on the observation that weights with small magnitudes contribute little to the model's output.\n",
        "\n",
        "### How Magnitude Pruning Works\n",
        "\n",
        "Think of magnitude pruning like editing a document - you remove words that don't significantly change the meaning. In neural networks, we remove weights that don't significantly affect predictions.\n",
        "\n",
        "```\n",
        "Magnitude Pruning Process:\n",
        "\n",
        "Step 1: Collect All Weights\n",
        "┌──────────────────────────────────────────────────┐\n",
        "│ Layer 1: [2.1, 0.1, -1.8, 0.05, 3.2, -0.02]      │\n",
        "│ Layer 2: [1.5, -0.03, 2.8, 0.08, -2.1, 0.01]     │\n",
        "│ Layer 3: [0.7, 2.4, -0.06, 1.9, 0.04, -1.3]      │\n",
        "└──────────────────────────────────────────────────┘\n",
        "                        ↓\n",
        "Step 2: Calculate Magnitudes\n",
        "┌──────────────────────────────────────────────────┐\n",
        "│ Magnitudes: [2.1, 0.1, 1.8, 0.05, 3.2, 0.02,     │\n",
        "│              1.5, 0.03, 2.8, 0.08, 2.1, 0.01,    │\n",
        "│              0.7, 2.4, 0.06, 1.9, 0.04, 1.3]     │\n",
        "└──────────────────────────────────────────────────┘\n",
        "                        ↓\n",
        "Step 3: Find Threshold (e.g., 70th percentile)\n",
        "┌──────────────────────────────────────────────────┐\n",
        "│ Sorted: [0.01, 0.02, 0.03, 0.04, 0.05, 0.06,     │\n",
"│ 0.08, 0.1, 0.7, 1.3, 1.5, 1.8, │ Threshold: 0.1\n",
|
||
"│ 1.9, 2.1, 2.1, 2.4, 2.8, 3.2] │ (70% of weights removed)\n",
|
||
"└──────────────────────────────────────────────────┘\n",
|
||
" ↓\n",
|
||
"Step 4: Apply Pruning Mask\n",
|
||
"┌──────────────────────────────────────────────────┐\n",
|
||
"│ Layer 1: [2.1, 0.0, -1.8, 0.0, 3.2, 0.0] │\n",
|
||
"│ Layer 2: [1.5, 0.0, 2.8, 0.0, -2.1, 0.0] │ 70% weights → 0\n",
|
||
"│ Layer 3: [0.7, 2.4, 0.0, 1.9, 0.0, -1.3] │ 30% preserved\n",
|
||
"└──────────────────────────────────────────────────┘\n",
|
||
"\n",
|
||
"Memory Impact:\n",
|
||
"- Dense storage: 18 values\n",
|
||
"- Sparse storage: 6 values + 6 indices = 12 values (33% savings)\n",
|
||
"- Theoretical limit: 70% savings with perfect sparse format\n",
|
||
"```\n",
|
||
"\n",
        "### Why Global Thresholding Works\n",
        "\n",
        "Global thresholding treats the entire model as one big collection of weights, finding a single threshold that achieves the target sparsity across all layers.\n",
        "\n",
        "**Advantages:**\n",
        "- Simple to implement and understand\n",
        "- Preserves overall model capacity\n",
        "- Works well for uniform network architectures\n",
        "\n",
        "**Disadvantages:**\n",
        "- May over-prune some layers, under-prune others\n",
        "- Doesn't account for layer-specific importance\n",
        "- Can hurt performance if layers have very different weight distributions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d8f12c15",
      "metadata": {},
      "outputs": [],
      "source": [
        "def magnitude_prune(model, sparsity=0.9):\n",
        "    \"\"\"\n",
        "    Remove weights with smallest magnitudes to achieve target sparsity.\n",
        "\n",
        "    TODO: Implement global magnitude-based pruning\n",
        "\n",
        "    APPROACH:\n",
        "    1. Collect all weights from the model\n",
        "    2. Calculate absolute values to get magnitudes\n",
        "    3. Find threshold at desired sparsity percentile\n",
        "    4. Set weights below threshold to zero (in-place)\n",
        "\n",
        "    EXAMPLE:\n",
        "    >>> model = Sequential(Linear(100, 50), Linear(50, 10))\n",
        "    >>> original_params = sum(p.size for p in model.parameters())\n",
        "    >>> magnitude_prune(model, sparsity=0.8)\n",
        "    >>> final_sparsity = measure_sparsity(model)\n",
        "    >>> print(f\"Achieved {final_sparsity:.1f}% sparsity\")\n",
        "    Achieved 80.0% sparsity\n",
        "\n",
        "    HINTS:\n",
        "    - Use np.percentile() to find threshold\n",
        "    - Modify model parameters in-place\n",
        "    - Consider only weight matrices, not biases\n",
        "    \"\"\"\n",
        "    ### BEGIN SOLUTION\n",
        "    # Collect all weights (excluding biases)\n",
        "    all_weights = []\n",
        "    weight_params = []\n",
        "\n",
        "    for param in model.parameters():\n",
        "        # Skip biases (typically 1D)\n",
        "        if len(param.shape) > 1:\n",
        "            all_weights.extend(param.data.flatten())\n",
        "            weight_params.append(param)\n",
        "\n",
        "    if not all_weights:\n",
        "        return\n",
        "\n",
        "    # Calculate magnitude threshold\n",
        "    magnitudes = np.abs(all_weights)\n",
        "    threshold = np.percentile(magnitudes, sparsity * 100)\n",
        "\n",
        "    # Apply pruning to each weight parameter\n",
        "    for param in weight_params:\n",
        "        mask = np.abs(param.data) >= threshold\n",
        "        param.data = param.data * mask\n",
        "    ### END SOLUTION\n",
        "\n",
        "def test_unit_magnitude_prune():\n",
        "    \"\"\"🔬 Test magnitude-based pruning functionality.\"\"\"\n",
        "    print(\"🔬 Unit Test: Magnitude Prune...\")\n",
        "\n",
        "    # Create test model with known weights\n",
        "    model = Sequential(Linear(4, 3), Linear(3, 2))\n",
        "\n",
        "    # Set specific weight values for predictable testing\n",
        "    model.layers[0].weight.data = np.array([\n",
        "        [1.0, 2.0, 3.0],\n",
        "        [0.1, 0.2, 0.3],\n",
        "        [4.0, 5.0, 6.0],\n",
        "        [0.01, 0.02, 0.03]\n",
        "    ])\n",
        "\n",
        "    initial_sparsity = measure_sparsity(model)\n",
        "    assert initial_sparsity == 0.0, \"Model should start with no sparsity\"\n",
        "\n",
        "    # Apply 50% pruning\n",
        "    magnitude_prune(model, sparsity=0.5)\n",
        "    final_sparsity = measure_sparsity(model)\n",
        "\n",
        "    # Should achieve approximately 50% sparsity\n",
        "    assert 40 <= final_sparsity <= 60, f\"Expected ~50% sparsity, got {final_sparsity}%\"\n",
        "\n",
        "    # Verify largest weights survived\n",
        "    remaining_weights = model.layers[0].weight.data[model.layers[0].weight.data != 0]\n",
        "    assert len(remaining_weights) > 0, \"Some weights should remain\"\n",
        "    assert np.all(np.abs(remaining_weights) >= 0.1), \"Large weights should survive\"\n",
        "\n",
        "    print(\"✅ magnitude_prune works correctly!\")\n",
        "\n",
        "test_unit_magnitude_prune()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8ddc8e18",
      "metadata": {
        "cell_marker": "\"\"\"",
        "lines_to_next_cell": 1
      },
      "source": [
        "## 5. Structured Pruning - Hardware-Friendly Compression\n",
        "\n",
        "While magnitude pruning creates scattered zeros throughout the network, structured pruning removes entire computational units (channels, neurons, heads). This creates sparsity patterns that modern hardware can actually accelerate.\n",
        "\n",
        "### Why Structured Pruning Matters\n",
        "\n",
        "Think of the difference between removing random words from a paragraph versus removing entire sentences. Structured pruning removes entire \"sentences\" (channels) rather than random \"words\" (individual weights).\n",
        "\n",
        "```\n",
        "Unstructured vs Structured Sparsity:\n",
        "\n",
        "UNSTRUCTURED (Magnitude Pruning):\n",
        "┌─────────────────────────────────────────────┐\n",
        "│ Channel 0: [2.1, 0.0, 1.8, 0.0, 3.2]        │ ← Sparse weights\n",
        "│ Channel 1: [0.0, 2.8, 0.0, 2.1, 0.0]        │ ← Sparse weights\n",
        "│ Channel 2: [1.5, 0.0, 2.4, 0.0, 1.9]        │ ← Sparse weights\n",
        "│ Channel 3: [0.0, 1.7, 0.0, 2.0, 0.0]        │ ← Sparse weights\n",
        "└─────────────────────────────────────────────┘\n",
        "Issues: Irregular memory access, no hardware speedup\n",
        "\n",
        "STRUCTURED (Channel Pruning):\n",
        "┌─────────────────────────────────────────────┐\n",
        "│ Channel 0: [2.1, 1.3, 1.8, 0.9, 3.2]        │ ← Fully preserved\n",
        "│ Channel 1: [0.0, 0.0, 0.0, 0.0, 0.0]        │ ← Fully removed\n",
        "│ Channel 2: [1.5, 2.2, 2.4, 1.1, 1.9]        │ ← Fully preserved\n",
        "│ Channel 3: [0.0, 0.0, 0.0, 0.0, 0.0]        │ ← Fully removed\n",
        "└─────────────────────────────────────────────┘\n",
        "Benefits: Regular patterns, hardware acceleration possible\n",
        "```\n",
        "\n",
        "### Channel Importance Ranking\n",
        "\n",
        "How do we decide which channels to remove? We rank them by importance using various metrics:\n",
        "\n",
        "```\n",
        "Channel Importance Metrics:\n",
        "\n",
        "Method 1: L2 Norm (Most Common)\n",
        "  For each output channel i:\n",
        "    Importance_i = ||W[:, i]||_2 = √(Σⱼ w²ⱼᵢ)\n",
        "\n",
        "  Intuition: Channels with larger weights have bigger impact\n",
        "\n",
        "Method 2: Activation-Based\n",
        "  Importance_i = E[|activation_i|] over dataset\n",
        "\n",
        "  Intuition: Channels that activate more are more important\n",
        "\n",
        "Method 3: Gradient-Based\n",
        "  Importance_i = |∂Loss/∂W[:, i]|\n",
        "\n",
        "  Intuition: Channels with larger gradients affect loss more\n",
        "\n",
        "Ranking Process:\n",
        "  1. Calculate importance for all channels\n",
        "  2. Sort channels by importance (ascending)\n",
        "  3. Remove bottom k% (least important)\n",
        "  4. Zero out entire channels, not individual weights\n",
        "```\n",
"\n",
        "### Hardware Benefits of Structured Sparsity\n",
        "\n",
        "Structured sparsity enables real hardware acceleration because:\n",
        "\n",
        "1. **Memory Coalescing**: Accessing contiguous memory chunks is faster\n",
        "2. **SIMD Operations**: Can process multiple remaining channels in parallel\n",
        "3. **No Indexing Overhead**: Don't need to track locations of sparse weights\n",
        "4. **Cache Efficiency**: Better spatial locality of memory access"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ede3f6c9",
      "metadata": {},
      "outputs": [],
      "source": [
        "def structured_prune(model, prune_ratio=0.5):\n",
        "    \"\"\"\n",
        "    Remove entire channels/neurons based on L2 norm importance.\n",
        "\n",
        "    TODO: Implement structured pruning for Linear layers\n",
        "\n",
        "    APPROACH:\n",
        "    1. For each Linear layer, calculate L2 norm of each output channel\n",
        "    2. Rank channels by importance (L2 norm)\n",
        "    3. Remove lowest importance channels by setting to zero\n",
        "    4. This creates block sparsity that's hardware-friendly\n",
        "\n",
        "    EXAMPLE:\n",
        "    >>> model = Sequential(Linear(100, 50), Linear(50, 10))\n",
        "    >>> original_shape = model.layers[0].weight.shape\n",
        "    >>> structured_prune(model, prune_ratio=0.3)\n",
        "    >>> # 30% of channels are now completely zero\n",
        "    >>> final_sparsity = measure_sparsity(model)\n",
        "    >>> print(f\"Structured sparsity: {final_sparsity:.1f}%\")\n",
        "    Structured sparsity: 30.0%\n",
        "\n",
        "    HINTS:\n",
        "    - Calculate L2 norm along input dimension for each output channel\n",
        "    - Use np.linalg.norm(weights[:, channel]) for channel importance\n",
        "    - Set entire channels to zero (not just individual weights)\n",
        "    \"\"\"\n",
        "    ### BEGIN SOLUTION\n",
        "    for layer in model.layers:\n",
        "        if isinstance(layer, Linear) and hasattr(layer, 'weight'):\n",
        "            weight = layer.weight.data\n",
        "\n",
        "            # Calculate L2 norm for each output channel (column)\n",
        "            channel_norms = np.linalg.norm(weight, axis=0)\n",
        "\n",
        "            # Find channels to prune (lowest importance)\n",
        "            num_channels = weight.shape[1]\n",
        "            num_to_prune = int(num_channels * prune_ratio)\n",
        "\n",
        "            if num_to_prune > 0:\n",
        "                # Get indices of channels to prune (smallest norms)\n",
        "                prune_indices = np.argpartition(channel_norms, num_to_prune)[:num_to_prune]\n",
        "\n",
        "                # Zero out entire channels\n",
        "                weight[:, prune_indices] = 0\n",
        "\n",
        "                # Also zero corresponding bias elements if bias exists\n",
        "                if layer.bias is not None:\n",
        "                    layer.bias.data[prune_indices] = 0\n",
        "    ### END SOLUTION\n",
        "\n",
        "def test_unit_structured_prune():\n",
        "    \"\"\"🔬 Test structured pruning functionality.\"\"\"\n",
        "    print(\"🔬 Unit Test: Structured Prune...\")\n",
        "\n",
        "    # Create test model\n",
        "    model = Sequential(Linear(4, 6), Linear(6, 2))\n",
        "\n",
        "    # Set predictable weights for testing\n",
        "    model.layers[0].weight.data = np.array([\n",
        "        [1.0, 0.1, 2.0, 0.05, 3.0, 0.01],  # Channels with varying importance\n",
        "        [1.1, 0.11, 2.1, 0.06, 3.1, 0.02],\n",
        "        [1.2, 0.12, 2.2, 0.07, 3.2, 0.03],\n",
        "        [1.3, 0.13, 2.3, 0.08, 3.3, 0.04]\n",
        "    ])\n",
        "\n",
        "    initial_sparsity = measure_sparsity(model)\n",
        "    assert initial_sparsity == 0.0, \"Model should start with no sparsity\"\n",
        "\n",
" # Apply 33% structured pruning (2 out of 6 channels)\n",
|
||
" structured_prune(model, prune_ratio=0.33)\n",
|
||
" final_sparsity = measure_sparsity(model)\n",
|
||
"\n",
|
||
" # Check that some channels are completely zero\n",
|
||
" weight = model.layers[0].weight.data\n",
|
||
" zero_channels = np.sum(np.all(weight == 0, axis=0))\n",
|
||
" assert zero_channels >= 1, f\"Expected at least 1 zero channel, got {zero_channels}\"\n",
|
||
"\n",
|
||
" # Check that non-zero channels are completely preserved\n",
|
||
" for col in range(weight.shape[1]):\n",
|
||
" channel = weight[:, col]\n",
|
||
" assert np.all(channel == 0) or np.all(channel != 0), \"Channels should be fully zero or fully non-zero\"\n",
|
||
"\n",
|
||
" print(\"✅ structured_prune works correctly!\")\n",
|
||
"\n",
|
||
"test_unit_structured_prune()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "74c8202f",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## 6. Low-Rank Approximation - Matrix Compression Through Factorization\n",
|
||
"\n",
|
||
"Low-rank approximation discovers that large weight matrices often contain redundant information that can be captured with much smaller matrices through mathematical decomposition.\n",
|
||
"\n",
|
||
"### The Intuition Behind Low-Rank Approximation\n",
|
||
"\n",
|
||
"Imagine you're storing a massive spreadsheet where many columns are highly correlated. Instead of storing all columns separately, you could store a few \"basis\" columns and coefficients for how to combine them to recreate the original data.\n",
|
||
"\n",
|
||
"```\n",
|
||
"Low-Rank Decomposition Visualization:\n",
|
||
"\n",
|
||
"Original Matrix W (large): Factorized Form (smaller):\n",
|
||
"┌─────────────────────────┐ ┌──────┐ ┌──────────────┐\n",
|
||
"│ 2.1 1.3 0.8 1.9 2.4 │ │ 1.1 │ │ 1.9 1.2 0.7│\n",
|
||
"│ 1.5 2.8 1.2 0.9 1.6 │ ≈ │ 2.4 │ @ │ 0.6 1.2 0.5│\n",
|
||
"│ 0.6 1.7 2.5 1.1 0.8 │ │ 0.8 │ │ 1.4 2.1 0.9│\n",
|
||
"│ 1.9 1.0 1.6 2.3 1.8 │ │ 1.6 │ │ 0.5 0.6 1.1│\n",
|
||
"└─────────────────────────┘ └──────┘ └──────────────┘\n",
|
||
" W (4×5) = 20 params U (4×2)=8 + V (2×5)=10 = 18 params\n",
|
||
"\n",
|
||
"Parameter Reduction:\n",
|
||
"- Original: 4 × 5 = 20 parameters\n",
|
||
"- Compressed: (4 × 2) + (2 × 5) = 18 parameters\n",
|
||
"- Compression ratio: 18/20 = 0.9 (10% savings)\n",
|
||
"\n",
|
||
"For larger matrices, savings become dramatic:\n",
|
||
"- W (1000×1000): 1M parameters → U (1000×100) + V (100×1000): 200K parameters\n",
|
||
"- Compression ratio: 0.2 (80% savings)\n",
|
||
"```\n",
|
||
"\n",
        "### SVD: The Mathematical Foundation\n",
        "\n",
        "Singular Value Decomposition (SVD) finds the optimal low-rank approximation by identifying the most important \"directions\" in the data:\n",
        "\n",
        "```\n",
        "SVD Decomposition:\n",
        "  W = U × Σ × V^T\n",
        "\n",
        "Where:\n",
        "  U: Left singular vectors (input patterns)\n",
        "  Σ: Singular values (importance weights)\n",
        "  V^T: Right singular vectors (output patterns)\n",
        "\n",
        "Truncated SVD (Rank-k approximation):\n",
        "  W ≈ U[:,:k] × Σ[:k] × V^T[:k,:]\n",
        "\n",
        "Quality vs Compression Trade-off:\n",
        "  Higher k → Better approximation, less compression\n",
        "  Lower k → More compression, worse approximation\n",
        "\n",
        "Choosing Optimal Rank:\n",
        "  Method 1: Fixed ratio (k = ratio × min(m,n))\n",
        "  Method 2: Energy threshold (keep 90% of singular value energy)\n",
        "  Method 3: Error threshold (reconstruction error < threshold)\n",
        "```\n",
"\n",
        "### When Low-Rank Works Best\n",
        "\n",
        "Low-rank approximation works well when:\n",
        "- **Matrices are large**: Compression benefits scale with size\n",
        "- **Data has structure**: Correlated patterns enable compression\n",
        "- **Moderate accuracy loss acceptable**: Some precision traded for efficiency\n",
        "\n",
        "It works poorly when:\n",
        "- **Matrices are already small**: Overhead exceeds benefits\n",
        "- **Data is random**: No patterns to exploit\n",
        "- **High precision required**: SVD introduces approximation error"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "bdbedbf4",
      "metadata": {},
      "outputs": [],
      "source": [
        "def low_rank_approximate(weight_matrix, rank_ratio=0.5):\n",
        "    \"\"\"\n",
        "    Approximate weight matrix using low-rank decomposition (SVD).\n",
        "\n",
        "    TODO: Implement SVD-based low-rank approximation\n",
        "\n",
        "    APPROACH:\n",
        "    1. Perform SVD: W = U @ S @ V^T\n",
        "    2. Keep only top k singular values where k = rank_ratio * min(dimensions)\n",
        "    3. Reconstruct: W_approx = U[:,:k] @ diag(S[:k]) @ V[:k,:]\n",
        "    4. Return decomposed matrices for memory savings\n",
        "\n",
        "    EXAMPLE:\n",
        "    >>> weight = np.random.randn(100, 50)\n",
        "    >>> U, S, V = low_rank_approximate(weight, rank_ratio=0.3)\n",
        "    >>> # Original: 100*50 = 5000 params\n",
        "    >>> # Compressed: 100*15 + 15*50 = 2250 params (55% reduction)\n",
        "\n",
        "    HINTS:\n",
        "    - Use np.linalg.svd() for decomposition\n",
        "    - Choose k = int(rank_ratio * min(m, n))\n",
        "    - Return U[:,:k], S[:k], V[:k,:] for reconstruction\n",
        "    \"\"\"\n",
        "    ### BEGIN SOLUTION\n",
        "    m, n = weight_matrix.shape\n",
        "\n",
        "    # Perform SVD\n",
        "    U, S, V = np.linalg.svd(weight_matrix, full_matrices=False)\n",
        "\n",
        "    # Determine target rank\n",
        "    max_rank = min(m, n)\n",
        "    target_rank = max(1, int(rank_ratio * max_rank))\n",
        "\n",
        "    # Truncate to target rank\n",
        "    U_truncated = U[:, :target_rank]\n",
        "    S_truncated = S[:target_rank]\n",
        "    V_truncated = V[:target_rank, :]\n",
        "\n",
        "    return U_truncated, S_truncated, V_truncated\n",
        "    ### END SOLUTION\n",
        "\n",
        "def test_unit_low_rank_approximate():\n",
        "    \"\"\"🔬 Test low-rank approximation functionality.\"\"\"\n",
        "    print(\"🔬 Unit Test: Low-Rank Approximate...\")\n",
        "\n",
        "    # Create test weight matrix\n",
        "    original_weight = np.random.randn(20, 15)\n",
        "    original_params = original_weight.size\n",
        "\n",
        "    # Apply low-rank approximation\n",
        "    U, S, V = low_rank_approximate(original_weight, rank_ratio=0.4)\n",
        "\n",
        "    # Check dimensions\n",
        "    target_rank = int(0.4 * min(20, 15))  # min(20,15) = 15, so 0.4*15 = 6\n",
        "    assert U.shape == (20, target_rank), f\"Expected U shape (20, {target_rank}), got {U.shape}\"\n",
        "    assert S.shape == (target_rank,), f\"Expected S shape ({target_rank},), got {S.shape}\"\n",
        "    assert V.shape == (target_rank, 15), f\"Expected V shape ({target_rank}, 15), got {V.shape}\"\n",
        "\n",
        "    # Check parameter reduction\n",
        "    compressed_params = U.size + S.size + V.size\n",
        "    compression_ratio = compressed_params / original_params\n",
        "    assert compression_ratio < 1.0, f\"Should compress, but ratio is {compression_ratio}\"\n",
        "\n",
        "    # Check reconstruction quality\n",
        "    reconstructed = U @ np.diag(S) @ V\n",
        "    reconstruction_error = np.linalg.norm(original_weight - reconstructed)\n",
        "    relative_error = reconstruction_error / np.linalg.norm(original_weight)\n",
        "    assert relative_error < 0.5, f\"Reconstruction error too high: {relative_error}\"\n",
        "\n",
        "    print(\"✅ low_rank_approximate works correctly!\")\n",
        "\n",
        "test_unit_low_rank_approximate()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a51cbe39",
      "metadata": {
        "cell_marker": "\"\"\"",
        "lines_to_next_cell": 1
      },
      "source": [
        "## 7. Knowledge Distillation - Learning from Teacher Models\n",
        "\n",
        "Knowledge distillation is like having an expert teacher simplify complex concepts for a student. The large \"teacher\" model shares its knowledge with a smaller \"student\" model, achieving similar performance with far fewer parameters.\n",
        "\n",
        "### The Teacher-Student Learning Process\n",
        "\n",
        "Unlike traditional training where models learn from hard labels (cat/dog), knowledge distillation uses \"soft\" targets that contain richer information about the teacher's decision-making process.\n",
        "\n",
        "```\n",
        "Knowledge Distillation Process:\n",
        "\n",
        "                    TEACHER MODEL (Large)\n",
        "                    ┌─────────────────────┐\n",
        "Input Data ────────→│ 100M parameters     │\n",
        "                    │ 95% accuracy        │\n",
        "                    │ 500ms inference     │\n",
        "                    └─────────────────────┘\n",
        "                              │\n",
        "                              ↓ Soft Targets\n",
        "                    ┌─────────────────────┐\n",
        "                    │ Logits: [2.1, 0.3,  │\n",
        "                    │          0.8, 4.2]  │ ← Rich information\n",
        "                    └─────────────────────┘\n",
        "                              │\n",
        "                              ↓ Distillation Loss\n",
        "                    ┌─────────────────────┐\n",
        "Input Data ────────→│ STUDENT MODEL       │\n",
        "Hard Labels ───────→│ 10M parameters      │ ← 10x smaller\n",
        "                    │ 93% accuracy        │ ← 2% loss\n",
        "                    │ 50ms inference      │ ← 10x faster\n",
        "                    └─────────────────────┘\n",
        "\n",
        "Benefits:\n",
        "• Size: 10x smaller models\n",
        "• Speed: 10x faster inference\n",
        "• Accuracy: Only 2-5% degradation\n",
        "• Knowledge transfer: Student learns teacher's \"reasoning\"\n",
        "```\n",
        "\n",
        "### Temperature Scaling: Softening Decisions\n",
        "\n",
        "Temperature scaling is a key innovation that makes knowledge distillation effective. It \"softens\" the teacher's confidence, revealing uncertainty that helps the student learn.\n",
        "\n",
        "```\n",
        "Temperature Effect on Probability Distributions:\n",
        "\n",
"Without Temperature (T=1): With Temperature (T=3):\n",
|
||
"Teacher Logits: [1.0, 2.0, 0.5] Teacher Logits: [1.0, 2.0, 0.5]\n",
|
||
" ↓ ↓ ÷ 3\n",
|
||
"Softmax: [0.09, 0.67, 0.24] Logits/T: [0.33, 0.67, 0.17]\n",
|
||
" ^ ^ ^ ↓\n",
|
||
" Low High Med Softmax: [0.21, 0.42, 0.17]\n",
|
||
" ^ ^ ^\n",
|
||
"Sharp decisions (hard to learn) Soft decisions (easier to learn)\n",
|
||
"\n",
|
||
"Why Soft Targets Help:\n",
|
||
"1. Reveal teacher's uncertainty about similar classes\n",
|
||
"2. Provide richer gradients for student learning\n",
|
||
"3. Transfer knowledge about class relationships\n",
|
||
"4. Reduce overfitting to hard labels\n",
|
||
"```\n",
|
||
"\n",
        "### Loss Function Design\n",
        "\n",
        "The distillation loss balances learning from both the teacher's soft knowledge and the ground truth hard labels:\n",
        "\n",
        "```\n",
        "Combined Loss Function:\n",
        "\n",
        "L_total = α × L_soft + (1-α) × L_hard\n",
        "\n",
        "Where:\n",
        "  L_soft = KL_divergence(Student_soft, Teacher_soft)\n",
        "              │\n",
        "              └─ Measures how well student mimics teacher\n",
        "\n",
        "  L_hard = CrossEntropy(Student_predictions, True_labels)\n",
        "              │\n",
        "              └─ Ensures student still learns correct answers\n",
        "\n",
        "Balance Parameter α:\n",
        "• α = 0.7: Focus mainly on teacher (typical)\n",
        "• α = 0.9: Almost pure distillation\n",
        "• α = 0.3: Balance teacher and ground truth\n",
        "• α = 0.0: Ignore teacher (regular training)\n",
        "\n",
        "Temperature T:\n",
        "• T = 1: No softening (standard softmax)\n",
        "• T = 3-5: Good balance (typical range)\n",
        "• T = 10+: Very soft (may lose information)\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "bf1a9ab1",
      "metadata": {},
      "outputs": [],
      "source": [
        "class KnowledgeDistillation:\n",
        "    \"\"\"\n",
        "    Knowledge distillation for model compression.\n",
        "\n",
        "    Train a smaller student model to mimic a larger teacher model.\n",
        "    \"\"\"\n",
        "\n",
        "    def __init__(self, teacher_model, student_model, temperature=3.0, alpha=0.7):\n",
        "        \"\"\"\n",
        "        Initialize knowledge distillation.\n",
        "\n",
        "        TODO: Set up teacher and student models with distillation parameters\n",
        "\n",
        "        APPROACH:\n",
        "        1. Store teacher and student models\n",
        "        2. Set temperature for softening probability distributions\n",
        "        3. Set alpha for balancing hard vs soft targets\n",
        "\n",
        "        Args:\n",
        "            teacher_model: Large, pre-trained model\n",
        "            student_model: Smaller model to train\n",
        "            temperature: Softening parameter for distributions\n",
        "            alpha: Weight for soft target loss (1-alpha for hard targets)\n",
        "        \"\"\"\n",
        "        ### BEGIN SOLUTION\n",
        "        self.teacher_model = teacher_model\n",
        "        self.student_model = student_model\n",
        "        self.temperature = temperature\n",
        "        self.alpha = alpha\n",
        "        ### END SOLUTION\n",
        "\n",
        "    def distillation_loss(self, student_logits, teacher_logits, true_labels):\n",
        "        \"\"\"\n",
        "        Calculate combined distillation loss.\n",
        "\n",
        "        TODO: Implement knowledge distillation loss function\n",
        "\n",
        "        APPROACH:\n",
        "        1. Calculate hard target loss (student vs true labels)\n",
        "        2. Calculate soft target loss (student vs teacher, with temperature)\n",
        "        3. Combine losses: alpha * soft_loss + (1-alpha) * hard_loss\n",
        "\n",
        "        EXAMPLE:\n",
        "        >>> kd = KnowledgeDistillation(teacher, student)\n",
        "        >>> loss = kd.distillation_loss(student_out, teacher_out, labels)\n",
        "        >>> print(f\"Distillation loss: {loss:.4f}\")\n",
        "\n",
        "        HINTS:\n",
        "        - Use temperature to soften distributions: logits/temperature\n",
        "        - Soft targets use KL divergence or cross-entropy\n",
        "        - Hard targets use standard classification loss\n",
        "        \"\"\"\n",
        "        ### BEGIN SOLUTION\n",
        "        # Convert to numpy for this implementation\n",
        "        if hasattr(student_logits, 'data'):\n",
        "            student_logits = student_logits.data\n",
        "        if hasattr(teacher_logits, 'data'):\n",
        "            teacher_logits = teacher_logits.data\n",
        "        if hasattr(true_labels, 'data'):\n",
        "            true_labels = true_labels.data\n",
        "\n",
        "        # Soften distributions with temperature\n",
        "        student_soft = self._softmax(student_logits / self.temperature)\n",
        "        teacher_soft = self._softmax(teacher_logits / self.temperature)\n",
        "\n",
        "        # Soft target loss (KL divergence)\n",
        "        soft_loss = self._kl_divergence(student_soft, teacher_soft)\n",
        "\n",
        "        # Hard target loss (cross-entropy)\n",
        "        student_hard = self._softmax(student_logits)\n",
        "        hard_loss = self._cross_entropy(student_hard, true_labels)\n",
        "\n",
        "        # Combined loss\n",
        "        total_loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss\n",
        "\n",
        "        return total_loss\n",
        "        ### END SOLUTION\n",
        "\n",
        "    def _softmax(self, logits):\n",
        "        \"\"\"Compute softmax with numerical stability.\"\"\"\n",
        "        exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))\n",
        "        return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)\n",
        "\n",
        "    def _kl_divergence(self, p, q):\n",
        "        \"\"\"Compute KL divergence between distributions.\"\"\"\n",
        "        return np.sum(p * np.log(p / (q + 1e-8) + 1e-8))\n",
        "\n",
        "    def _cross_entropy(self, predictions, labels):\n",
        "        \"\"\"Compute cross-entropy loss.\"\"\"\n",
        "        # Simple implementation for integer labels\n",
        "        if labels.ndim == 1:\n",
        "            return -np.mean(np.log(predictions[np.arange(len(labels)), labels] + 1e-8))\n",
        "        else:\n",
        "            return -np.mean(np.sum(labels * np.log(predictions + 1e-8), axis=1))\n",
"\n",
|
||
"def test_unit_knowledge_distillation():\n",
|
||
" \"\"\"🔬 Test knowledge distillation functionality.\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Knowledge Distillation...\")\n",
|
||
"\n",
|
||
" # Create teacher and student models\n",
|
||
" teacher = Sequential(Linear(10, 20), Linear(20, 5))\n",
|
||
" student = Sequential(Linear(10, 5)) # Smaller model\n",
|
||
"\n",
|
||
" # Initialize knowledge distillation\n",
|
||
" kd = KnowledgeDistillation(teacher, student, temperature=3.0, alpha=0.7)\n",
|
||
"\n",
|
||
" # Create dummy data\n",
|
||
" input_data = Tensor(np.random.randn(8, 10)) # Batch of 8\n",
|
||
" true_labels = np.array([0, 1, 2, 3, 4, 0, 1, 2]) # Class labels\n",
|
||
"\n",
|
||
" # Forward passes\n",
|
||
" teacher_output = teacher.forward(input_data)\n",
|
||
" student_output = student.forward(input_data)\n",
|
||
"\n",
|
||
" # Calculate distillation loss\n",
|
||
" loss = kd.distillation_loss(student_output, teacher_output, true_labels)\n",
|
||
"\n",
|
||
" # Verify loss is reasonable\n",
|
||
" assert isinstance(loss, (float, np.floating)), f\"Loss should be float, got {type(loss)}\"\n",
|
||
" assert loss > 0, f\"Loss should be positive, got {loss}\"\n",
|
||
" assert not np.isnan(loss), \"Loss should not be NaN\"\n",
|
||
"\n",
|
||
" print(\"✅ knowledge_distillation works correctly!\")\n",
|
||
"\n",
|
||
"test_unit_knowledge_distillation()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bea12725",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## 8. Integration: Complete Compression Pipeline\n",
|
||
"\n",
|
||
"Now let's combine all our compression techniques into a unified system that can apply multiple methods and track their cumulative effects.\n",
|
||
"\n",
|
||
"### Compression Strategy Design\n",
|
||
"\n",
|
||
"Real-world compression often combines multiple techniques in sequence, each targeting different types of redundancy:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Multi-Stage Compression Pipeline:\n",
|
||
"\n",
|
||
"Original Model (100MB, 100% accuracy)\n",
|
||
" │\n",
|
||
" ↓ Stage 1: Magnitude Pruning (remove 80% of small weights)\n",
|
||
"Sparse Model (20MB, 98% accuracy)\n",
|
||
" │\n",
|
||
" ↓ Stage 2: Structured Pruning (remove 30% of channels)\n",
|
||
"Compact Model (14MB, 96% accuracy)\n",
|
||
" │\n",
|
||
" ↓ Stage 3: Low-Rank Approximation (compress large layers)\n",
|
||
"Factorized Model (10MB, 95% accuracy)\n",
|
||
" │\n",
|
||
" ↓ Stage 4: Knowledge Distillation (train smaller architecture)\n",
|
||
"Student Model (5MB, 93% accuracy)\n",
|
||
"\n",
|
||
"Final Result: 20x size reduction, 7% accuracy loss\n",
|
||
"```\n",
|
||
"\n",
        "### Compression Configuration\n",
        "\n",
        "Different deployment scenarios require different compression strategies:\n",
        "\n",
        "```\n",
        "Deployment Scenarios and Strategies:\n",
        "\n",
        "MOBILE APP (Aggressive compression needed):\n",
        "┌─────────────────────────────────────────┐\n",
        "│ Target: <10MB, <100ms inference         │\n",
        "│ Strategy:                               │\n",
        "│ • Magnitude pruning: 95% sparsity       │\n",
        "│ • Structured pruning: 50% channels      │\n",
        "│ • Knowledge distillation: 10x reduction │\n",
        "│ • Quantization: 8-bit weights           │\n",
        "└─────────────────────────────────────────┘\n",
        "\n",
        "EDGE DEVICE (Balanced compression):\n",
        "┌─────────────────────────────────────────┐\n",
        "│ Target: <50MB, <200ms inference         │\n",
        "│ Strategy:                               │\n",
        "│ • Magnitude pruning: 80% sparsity       │\n",
        "│ • Structured pruning: 30% channels      │\n",
        "│ • Low-rank: 50% rank reduction          │\n",
        "│ • Quantization: 16-bit weights          │\n",
        "└─────────────────────────────────────────┘\n",
        "\n",
        "CLOUD SERVICE (Minimal compression):\n",
        "┌─────────────────────────────────────────┐\n",
        "│ Target: Maintain accuracy, reduce cost  │\n",
        "│ Strategy:                               │\n",
        "│ • Magnitude pruning: 50% sparsity       │\n",
        "│ • Structured pruning: 10% channels      │\n",
        "│ • Dynamic batching optimization         │\n",
        "│ • Mixed precision inference             │\n",
        "└─────────────────────────────────────────┘\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "68de6767",
      "metadata": {},
      "outputs": [],
"source": [
|
||
"def compress_model(model, compression_config):\n",
|
||
" \"\"\"\n",
|
||
" Apply comprehensive model compression based on configuration.\n",
|
||
"\n",
|
||
" TODO: Implement complete compression pipeline\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Apply magnitude pruning if specified\n",
|
||
" 2. Apply structured pruning if specified\n",
|
||
" 3. Apply low-rank approximation if specified\n",
|
||
" 4. Return compression statistics\n",
|
||
"\n",
|
||
" EXAMPLE:\n",
|
||
" >>> config = {\n",
|
||
" ... 'magnitude_prune': 0.8,\n",
|
||
" ... 'structured_prune': 0.3,\n",
|
||
" ... 'low_rank': 0.5\n",
|
||
" ... }\n",
|
||
" >>> stats = compress_model(model, config)\n",
|
||
" >>> print(f\"Final sparsity: {stats['sparsity']:.1f}%\")\n",
|
||
" Final sparsity: 85.0%\n",
|
||
"\n",
|
||
" HINT: Apply techniques sequentially and measure results\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" original_params = sum(p.size for p in model.parameters())\n",
|
||
" original_sparsity = measure_sparsity(model)\n",
|
||
"\n",
|
||
" stats = {\n",
|
||
" 'original_params': original_params,\n",
|
||
" 'original_sparsity': original_sparsity,\n",
|
||
" 'applied_techniques': []\n",
|
||
" }\n",
|
||
"\n",
|
||
" # Apply magnitude pruning\n",
|
||
" if 'magnitude_prune' in compression_config:\n",
|
||
" sparsity = compression_config['magnitude_prune']\n",
|
||
" magnitude_prune(model, sparsity=sparsity)\n",
|
||
" stats['applied_techniques'].append(f'magnitude_prune_{sparsity}')\n",
|
||
"\n",
|
||
" # Apply structured pruning\n",
|
||
" if 'structured_prune' in compression_config:\n",
|
||
" ratio = compression_config['structured_prune']\n",
|
||
" structured_prune(model, prune_ratio=ratio)\n",
|
||
" stats['applied_techniques'].append(f'structured_prune_{ratio}')\n",
|
||
"\n",
|
||
" # Apply low-rank approximation (conceptually - would need architecture changes)\n",
|
||
" if 'low_rank' in compression_config:\n",
|
||
" ratio = compression_config['low_rank']\n",
|
||
" # For demo, we'll just record that it would be applied\n",
|
||
" stats['applied_techniques'].append(f'low_rank_{ratio}')\n",
|
||
"\n",
|
||
" # Final measurements\n",
|
||
" final_sparsity = measure_sparsity(model)\n",
|
||
" stats['final_sparsity'] = final_sparsity\n",
|
||
" stats['sparsity_increase'] = final_sparsity - original_sparsity\n",
|
||
"\n",
|
||
" return stats\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
"def test_unit_compress_model():\n",
|
||
" \"\"\"🔬 Test comprehensive model compression.\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Compress Model...\")\n",
|
||
"\n",
|
||
" # Create test model\n",
|
||
" model = Sequential(Linear(20, 15), Linear(15, 10), Linear(10, 5))\n",
|
||
"\n",
|
||
" # Define compression configuration\n",
|
||
" config = {\n",
|
||
" 'magnitude_prune': 0.7,\n",
|
||
" 'structured_prune': 0.2\n",
|
||
" }\n",
|
||
"\n",
|
||
" # Apply compression\n",
|
||
" stats = compress_model(model, config)\n",
|
||
"\n",
|
||
" # Verify statistics\n",
|
||
" assert 'original_params' in stats, \"Should track original parameter count\"\n",
|
||
" assert 'final_sparsity' in stats, \"Should track final sparsity\"\n",
|
||
" assert 'applied_techniques' in stats, \"Should track applied techniques\"\n",
|
||
"\n",
|
||
" # Verify compression was applied\n",
|
||
" assert stats['final_sparsity'] > stats['original_sparsity'], \"Sparsity should increase\"\n",
|
||
" assert len(stats['applied_techniques']) == 2, \"Should apply both techniques\"\n",
|
||
"\n",
|
||
" # Verify model still has reasonable structure\n",
|
||
" remaining_params = sum(np.count_nonzero(p.data) for p in model.parameters())\n",
|
||
" assert remaining_params > 0, \"Model should retain some parameters\"\n",
|
||
"\n",
|
||
" print(\"✅ compress_model works correctly!\")\n",
|
||
"\n",
|
||
"test_unit_compress_model()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "78b4d5fb",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## 9. Systems Analysis: Compression Performance and Trade-offs\n",
|
||
"\n",
|
||
"Understanding how compression techniques affect real-world deployment metrics like storage, memory, speed, and accuracy.\n",
|
||
"\n",
|
||
"### Compression Effectiveness Analysis\n",
|
||
"\n",
|
||
"Different techniques excel in different scenarios. Let's measure their effectiveness across various model sizes and architectures."
|
||
]
|
||
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "f8025b3f",
"metadata": {
"lines_to_next_cell": 1
},
"outputs": [],
"source": [
"def analyze_compression_ratios():\n",
" \"\"\"📊 Analyze compression ratios for different techniques.\"\"\"\n",
" print(\"📊 Analyzing Compression Ratios...\")\n",
"\n",
" # Create test models of different sizes\n",
" models = {\n",
" 'Small': Sequential(Linear(50, 30), Linear(30, 10)),\n",
" 'Medium': Sequential(Linear(200, 128), Linear(128, 64), Linear(64, 10)),\n",
" 'Large': Sequential(Linear(500, 256), Linear(256, 128), Linear(128, 10))\n",
" }\n",
"\n",
" compression_techniques = [\n",
" ('Magnitude 50%', {'magnitude_prune': 0.5}),\n",
" ('Magnitude 90%', {'magnitude_prune': 0.9}),\n",
" ('Structured 30%', {'structured_prune': 0.3}),\n",
" ('Combined', {'magnitude_prune': 0.8, 'structured_prune': 0.2})\n",
" ]\n",
"\n",
" print(f\"{'Model':<8} {'Technique':<15} {'Original':<10} {'Final':<10} {'Reduction':<10}\")\n",
" print(\"-\" * 65)\n",
"\n",
" for model_name, model in models.items():\n",
" original_params = sum(p.size for p in model.parameters())\n",
"\n",
" for tech_name, config in compression_techniques:\n",
" # Create fresh copy for each test\n",
" test_model = copy.deepcopy(model)\n",
"\n",
" # Apply compression\n",
" stats = compress_model(test_model, config)\n",
"\n",
" # Calculate compression ratio\n",
" remaining_params = sum(np.count_nonzero(p.data) for p in test_model.parameters())\n",
" reduction = (1 - remaining_params / original_params) * 100\n",
"\n",
" print(f\"{model_name:<8} {tech_name:<15} {original_params:<10} {remaining_params:<10} {reduction:<9.1f}%\")\n",
"\n",
" print(\"\\n💡 Key Insights:\")\n",
" print(\"• Magnitude pruning achieves predictable sparsity levels\")\n",
" print(\"• Structured pruning creates hardware-friendly sparsity\")\n",
" print(\"• Combined techniques offer maximum compression\")\n",
" print(\"• Larger models compress better (more redundancy)\")\n",
"\n",
"analyze_compression_ratios()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f29e9dc0",
"metadata": {},
"outputs": [],
"source": [
"def analyze_compression_speed():\n",
" \"\"\"📊 Analyze inference speed with different compression levels.\"\"\"\n",
" print(\"📊 Analyzing Compression Speed Impact...\")\n",
"\n",
" # Create test model\n",
" model = Sequential(Linear(512, 256), Linear(256, 128), Linear(128, 10))\n",
" test_input = Tensor(np.random.randn(100, 512)) # Batch of 100\n",
"\n",
" def time_inference(model, input_data, iterations=50):\n",
" \"\"\"Time model inference.\"\"\"\n",
" times = []\n",
" for _ in range(iterations):\n",
" start = time.time()\n",
" _ = model.forward(input_data)\n",
" times.append(time.time() - start)\n",
" return np.mean(times[5:]) # Skip first few for warmup\n",
"\n",
" # Test different compression levels\n",
" compression_levels = [\n",
" ('Original', {}),\n",
" ('Light Pruning', {'magnitude_prune': 0.5}),\n",
" ('Heavy Pruning', {'magnitude_prune': 0.9}),\n",
" ('Structured', {'structured_prune': 0.3}),\n",
" ('Combined', {'magnitude_prune': 0.8, 'structured_prune': 0.2})\n",
" ]\n",
"\n",
" print(f\"{'Compression':<15} {'Sparsity':<10} {'Time (ms)':<12} {'Speedup':<10}\")\n",
" print(\"-\" * 50)\n",
"\n",
" baseline_time = None\n",
"\n",
" for name, config in compression_levels:\n",
" # Create fresh model copy\n",
" test_model = copy.deepcopy(model)\n",
"\n",
" # Apply compression\n",
" if config:\n",
" compress_model(test_model, config)\n",
"\n",
" # Measure performance\n",
" sparsity = measure_sparsity(test_model)\n",
" inference_time = time_inference(test_model, test_input) * 1000 # Convert to ms\n",
"\n",
" if baseline_time is None:\n",
" baseline_time = inference_time\n",
" speedup = 1.0\n",
" else:\n",
" speedup = baseline_time / inference_time\n",
"\n",
" print(f\"{name:<15} {sparsity:<9.1f}% {inference_time:<11.2f} {speedup:<9.2f}x\")\n",
"\n",
" print(\"\\n💡 Speed Insights:\")\n",
" print(\"• Dense matrix operations show minimal speedup from unstructured sparsity\")\n",
" print(\"• Structured sparsity enables better hardware acceleration\")\n",
" print(\"• Real speedups require sparse-optimized libraries (e.g., NVIDIA 2:4 sparsity)\")\n",
" print(\"• Memory bandwidth often more important than parameter count\")\n",
"\n",
"analyze_compression_speed()"
]
},
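{
"cell_type": "markdown",
"id": "b5c6d7e8",
"metadata": {},
"source": [
"The \"minimal speedup\" result above is worth seeing concretely: dense kernels multiply the zeros at full cost, and only a sparse storage format actually skips them. A minimal sketch using SciPy's CSR format (a hypothetical illustration; SciPy is not a TinyTorch dependency):\n",
"\n",
"```python\n",
"import time\n",
"import numpy as np\n",
"import scipy.sparse as sp\n",
"\n",
"W = np.random.randn(512, 512)\n",
"W[np.abs(W) < 2.0] = 0.0             # ~95% zeros, unstructured\n",
"W_csr = sp.csr_matrix(W)             # store only the nonzero entries\n",
"x = np.random.randn(512, 100)\n",
"\n",
"start = time.time(); _ = W @ x;     dense_ms = (time.time() - start) * 1000\n",
"start = time.time(); _ = W_csr @ x; sparse_ms = (time.time() - start) * 1000\n",
"print(f\"dense: {dense_ms:.2f} ms, CSR: {sparse_ms:.2f} ms\")\n",
"```\n",
"\n",
"Whether CSR wins depends on the sparsity level and the hardware: at moderate sparsity the indexing overhead can make it slower than the dense BLAS call, which is exactly the trade-off the table above exposes."
]
},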
{
"cell_type": "markdown",
"id": "e6c5926b",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 10. Optimization Insights: Production Compression Strategy\n",
"\n",
"Understanding the real-world implications of compression choices and how to design compression strategies for different deployment scenarios.\n",
"\n",
"### Accuracy vs Compression Trade-offs\n",
"\n",
"The fundamental challenge in model compression is balancing three competing objectives: model size, inference speed, and prediction accuracy."
]
},
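{
"cell_type": "markdown",
"id": "c7d8e9f0",
"metadata": {},
"source": [
"One of the strategies compared below, knowledge distillation, rests on a single formula: a temperature-softened cross-entropy between teacher and student predictions, blended with the ordinary hard-label loss by a weight `alpha`. A minimal NumPy sketch of that standard formulation (an illustration, not the exact `KnowledgeDistillation` class built earlier in this module):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def softmax(z, T=1.0):\n",
"    z = z / T\n",
"    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability\n",
"    e = np.exp(z)\n",
"    return e / e.sum(axis=-1, keepdims=True)\n",
"\n",
"def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.8):\n",
"    # Soft targets: cross-entropy against the temperature-softened teacher,\n",
"    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.\n",
"    soft_t = softmax(teacher_logits, T)\n",
"    soft_s = softmax(student_logits, T)\n",
"    soft_loss = -np.mean(np.sum(soft_t * np.log(soft_s + 1e-9), axis=-1)) * T * T\n",
"    # Hard targets: ordinary cross-entropy against the true labels.\n",
"    probs = softmax(student_logits)\n",
"    hard_loss = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-9))\n",
"    return alpha * soft_loss + (1 - alpha) * hard_loss\n",
"```\n",
"\n",
"A higher temperature spreads the teacher's probability mass over more classes, exposing which wrong answers are almost right."
]
},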
{
"cell_type": "code",
"execution_count": null,
"id": "351bffdb",
"metadata": {},
"outputs": [],
"source": [
"def analyze_compression_accuracy_tradeoff():\n",
" \"\"\"📊 Analyze accuracy vs compression trade-offs.\"\"\"\n",
" print(\"📊 Analyzing Accuracy vs Compression Trade-offs...\")\n",
"\n",
" # Simulate accuracy degradation (in practice, would need real training/testing)\n",
" def simulate_accuracy_loss(sparsity, technique_type):\n",
" \"\"\"Simulate realistic accuracy loss patterns.\"\"\"\n",
" if technique_type == 'magnitude':\n",
" # Magnitude pruning: gradual degradation\n",
" return max(0, sparsity * 0.3 + np.random.normal(0, 0.05))\n",
" elif technique_type == 'structured':\n",
" # Structured pruning: more aggressive early loss\n",
" return max(0, sparsity * 0.5 + np.random.normal(0, 0.1))\n",
" elif technique_type == 'knowledge_distillation':\n",
" # Knowledge distillation: better preservation\n",
" return max(0, sparsity * 0.1 + np.random.normal(0, 0.02))\n",
" else:\n",
" return sparsity * 0.4\n",
"\n",
" # Test different compression strategies\n",
" strategies = [\n",
" ('Magnitude Only', 'magnitude'),\n",
" ('Structured Only', 'structured'),\n",
" ('Knowledge Distillation', 'knowledge_distillation'),\n",
" ('Combined Approach', 'combined')\n",
" ]\n",
"\n",
" sparsity_levels = np.arange(0.1, 1.0, 0.1)\n",
"\n",
" print(f\"{'Strategy':<20} {'Sparsity':<10} {'Accuracy Loss':<15}\")\n",
" print(\"-\" * 50)\n",
"\n",
" for strategy_name, strategy_type in strategies:\n",
" print(f\"\\n{strategy_name}:\")\n",
" for sparsity in sparsity_levels:\n",
" if strategy_type == 'combined':\n",
" # Combined approach uses multiple techniques\n",
" loss = min(\n",
" simulate_accuracy_loss(sparsity * 0.7, 'magnitude'),\n",
" simulate_accuracy_loss(sparsity * 0.3, 'structured')\n",
" )\n",
" else:\n",
" loss = simulate_accuracy_loss(sparsity, strategy_type)\n",
"\n",
" print(f\"{'':20} {sparsity:<9.1f} {loss:<14.3f}\")\n",
"\n",
" print(\"\\n💡 Trade-off Insights:\")\n",
" print(\"• Knowledge distillation preserves accuracy best at high compression\")\n",
" print(\"• Magnitude pruning offers gradual degradation curve\")\n",
" print(\"• Structured pruning enables hardware acceleration but higher accuracy loss\")\n",
" print(\"• Combined approaches balance multiple objectives\")\n",
" print(\"• Early stopping based on accuracy threshold is crucial\")\n",
"\n",
"analyze_compression_accuracy_tradeoff()"
]
},
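{
"cell_type": "markdown",
"id": "d9e0f1a2",
"metadata": {},
"source": [
"The last insight, stopping on an accuracy threshold, is usually realized as an iterative loop: prune a little, evaluate, and keep the last model that stayed within budget. A minimal sketch, assuming `magnitude_prune(model, sparsity)` from earlier in this module and a hypothetical `evaluate(model)` that returns accuracy on a validation set:\n",
"\n",
"```python\n",
"import copy\n",
"\n",
"def prune_within_budget(model, evaluate, max_accuracy_drop=0.01, step=0.1):\n",
"    \"\"\"Raise sparsity gradually; return the last model inside the accuracy budget.\"\"\"\n",
"    baseline = evaluate(model)\n",
"    best = copy.deepcopy(model)\n",
"    sparsity = 0.0\n",
"    while sparsity + step < 1.0:\n",
"        sparsity += step\n",
"        candidate = copy.deepcopy(best)\n",
"        magnitude_prune(candidate, sparsity)\n",
"        if baseline - evaluate(candidate) > max_accuracy_drop:\n",
"            break  # crossed the budget; keep the previous model\n",
"        best = candidate\n",
"    return best\n",
"```\n",
"\n",
"In production this loop is often interleaved with a few fine-tuning steps after each pruning round, which recovers much of the lost accuracy."
]
},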
{
"cell_type": "markdown",
"id": "8a67dffa",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 11. Module Integration Test\n",
"\n",
"Final validation that all compression techniques work together correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d51b541",
"metadata": {},
"outputs": [],
"source": [
"def test_module():\n",
" \"\"\"\n",
" Comprehensive test of entire compression module functionality.\n",
"\n",
" This final test runs before module summary to ensure:\n",
" - All unit tests pass\n",
" - Functions work together correctly\n",
" - Module is ready for integration with TinyTorch\n",
" \"\"\"\n",
" print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
" print(\"=\" * 50)\n",
"\n",
" # Run all unit tests\n",
" print(\"Running unit tests...\")\n",
" test_unit_measure_sparsity()\n",
" test_unit_magnitude_prune()\n",
" test_unit_structured_prune()\n",
" test_unit_low_rank_approximate()\n",
" test_unit_knowledge_distillation()\n",
" test_unit_compress_model()\n",
"\n",
" print(\"\\nRunning integration scenarios...\")\n",
"\n",
" # Test 1: Complete compression pipeline\n",
" print(\"🔬 Integration Test: Complete compression pipeline...\")\n",
"\n",
" # Create a realistic model\n",
" model = Sequential(\n",
" Linear(784, 512), # Input layer (like MNIST)\n",
" Linear(512, 256), # Hidden layer 1\n",
" Linear(256, 128), # Hidden layer 2\n",
" Linear(128, 10) # Output layer\n",
" )\n",
"\n",
" original_params = sum(p.size for p in model.parameters())\n",
" print(f\"Original model: {original_params:,} parameters\")\n",
"\n",
" # Apply comprehensive compression\n",
" compression_config = {\n",
" 'magnitude_prune': 0.8,\n",
" 'structured_prune': 0.3\n",
" }\n",
"\n",
" stats = compress_model(model, compression_config)\n",
" final_sparsity = measure_sparsity(model)\n",
"\n",
" # Validate compression results\n",
" assert final_sparsity > 70, f\"Expected >70% sparsity, got {final_sparsity:.1f}%\"\n",
" assert stats['sparsity_increase'] > 70, \"Should achieve significant compression\"\n",
" assert len(stats['applied_techniques']) == 2, \"Should apply both techniques\"\n",
"\n",
" print(f\"✅ Achieved {final_sparsity:.1f}% sparsity with {len(stats['applied_techniques'])} techniques\")\n",
"\n",
" # Test 2: Knowledge distillation setup\n",
" print(\"🔬 Integration Test: Knowledge distillation...\")\n",
"\n",
" teacher = Sequential(Linear(100, 200), Linear(200, 50))\n",
" student = Sequential(Linear(100, 50)) # 3x fewer parameters\n",
|
||
"\n",
|
||
" kd = KnowledgeDistillation(teacher, student, temperature=4.0, alpha=0.8)\n",
|
||
"\n",
|
||
" # Verify setup\n",
|
||
" teacher_params = sum(p.size for p in teacher.parameters())\n",
|
||
" student_params = sum(p.size for p in student.parameters())\n",
|
||
" compression_ratio = student_params / teacher_params\n",
|
||
"\n",
|
||
" assert compression_ratio < 0.5, f\"Student should be <50% of teacher size, got {compression_ratio:.2f}\"\n",
|
||
" assert kd.temperature == 4.0, \"Temperature should be set correctly\"\n",
|
||
" assert kd.alpha == 0.8, \"Alpha should be set correctly\"\n",
|
||
"\n",
|
||
" print(f\"✅ Knowledge distillation: {compression_ratio:.2f}x size reduction\")\n",
|
||
"\n",
|
||
" # Test 3: Low-rank approximation\n",
|
||
" print(\"🔬 Integration Test: Low-rank approximation...\")\n",
|
||
"\n",
|
||
" large_matrix = np.random.randn(200, 150)\n",
|
||
" U, S, V = low_rank_approximate(large_matrix, rank_ratio=0.3)\n",
|
||
"\n",
|
||
" original_size = large_matrix.size\n",
|
||
" compressed_size = U.size + S.size + V.size\n",
|
||
" compression_ratio = compressed_size / original_size\n",
|
||
"\n",
|
||
" assert compression_ratio < 0.7, f\"Should achieve compression, got ratio {compression_ratio:.2f}\"\n",
|
||
"\n",
|
||
" # Test reconstruction\n",
|
||
" reconstructed = U @ np.diag(S) @ V\n",
|
||
" error = np.linalg.norm(large_matrix - reconstructed) / np.linalg.norm(large_matrix)\n",
|
||
" assert error < 0.5, f\"Reconstruction error too high: {error:.3f}\"\n",
|
||
"\n",
|
||
" print(f\"✅ Low-rank: {compression_ratio:.2f}x compression, {error:.3f} error\")\n",
|
||
"\n",
|
||
" print(\"\\n\" + \"=\" * 50)\n",
|
||
" print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
|
||
" print(\"Run: tito module complete 18\")\n",
|
||
"\n",
|
||
"# Call the integration test\n",
|
||
"test_module()"
|
||
]
|
||
},
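{
"cell_type": "markdown",
"id": "e1f2a3b4",
"metadata": {},
"source": [
"For reference, the low-rank numbers in the test above follow directly from the storage formula: a rank-r factorization of an m × n matrix stores r(m + n + 1) values (U, S, and V) instead of mn. Assuming `rank_ratio` scales the smaller dimension, rank = 0.3 × 150 = 45 here:\n",
"\n",
"```python\n",
"m, n, r = 200, 150, 45\n",
"original = m * n              # 30,000 values\n",
"compressed = r * (m + n + 1)  # U: m*r, S: r, V: r*n -> 15,795 values\n",
"print(compressed / original)  # ~0.53, comfortably under the 0.7 bound\n",
"```\n",
"\n",
"Compression pays off only while r < mn / (m + n); past that point the factors store more values than the original matrix."
]
},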
{
"cell_type": "code",
"execution_count": null,
"id": "8445b205",
"metadata": {},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" print(\"🚀 Running Compression module...\")\n",
" test_module()\n",
" print(\"✅ Module validation complete!\")"
]
},
{
"cell_type": "markdown",
"id": "eb215fc2",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Thinking: Compression Foundations\n",
"\n",
"### Question 1: Compression Trade-offs\n",
"You implemented magnitude pruning that removes 90% of weights from a 10M parameter model.\n",
"- How many parameters remain active? _____ M parameters\n",
"- If the original model was 40MB, what's the theoretical minimum storage? _____ MB\n",
"- Why might actual speedup be less than 10x? _____________\n",
"\n",
"### Question 2: Structured vs Unstructured Sparsity\n",
"Your structured pruning removes entire channels, while magnitude pruning creates scattered zeros.\n",
"- Which enables better hardware acceleration? _____________\n",
"- Which preserves accuracy better at high sparsity? _____________\n",
"- Which creates more predictable memory access patterns? _____________\n",
"\n",
"### Question 3: Knowledge Distillation Efficiency\n",
"A teacher model has 100M parameters, student has 10M parameters, both achieve 85% accuracy.\n",
|
||
"- What's the compression ratio? _____x\n",
|
||
"- If teacher inference takes 100ms, student takes 15ms, what's the speedup? _____x\n",
|
||
"- Why is the speedup greater than the compression ratio? _____________\n",
|
||
"\n",
|
||
"### Question 4: Low-Rank Decomposition\n",
|
||
"You approximate a (512, 256) weight matrix with rank 64 using SVD.\n",
|
||
"- Original parameter count: _____ parameters\n",
|
||
"- Decomposed parameter count: _____ parameters\n",
|
||
"- Compression ratio: _____x\n",
|
||
"- At what rank does compression become ineffective? rank > _____"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0506c01f",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 🎯 MODULE SUMMARY: Compression\n",
|
||
"\n",
|
||
"Congratulations! You've built a comprehensive model compression system that can dramatically reduce model size while preserving intelligence!\n",
|
||
"\n",
|
||
"### Key Accomplishments\n",
|
||
"- Built magnitude-based and structured pruning techniques with clear sparsity patterns\n",
|
||
"- Implemented knowledge distillation for teacher-student compression with temperature scaling\n",
|
||
"- Created low-rank approximation using SVD decomposition for matrix factorization\n",
|
||
"- Developed sparsity measurement and comprehensive compression pipeline\n",
|
||
"- Analyzed compression trade-offs between size, speed, and accuracy with real measurements\n",
|
||
"- All tests pass ✅ (validated by `test_module()`)\n",
|
||
"\n",
|
||
"### Systems Insights Gained\n",
|
||
"- **Structured vs Unstructured**: Hardware-friendly sparsity patterns vs maximum compression ratios\n",
|
||
"- **Compression Cascading**: Multiple techniques compound benefits but require careful sequencing\n",
|
||
"- **Accuracy Preservation**: Knowledge distillation maintains performance better than pruning alone\n",
|
||
"- **Memory vs Speed**: Parameter reduction doesn't guarantee proportional speedup without sparse libraries\n",
|
||
"- **Deployment Strategy**: Different scenarios (mobile, edge, cloud) require different compression approaches\n",
|
||
"\n",
|
||
"### Technical Mastery\n",
|
||
"- **Sparsity Measurement**: Calculate and track zero weight percentages across models\n",
|
||
"- **Magnitude Pruning**: Global thresholding based on weight importance ranking\n",
|
||
"- **Structured Pruning**: Channel-wise removal using L2 norm importance metrics\n",
|
||
"- **Knowledge Distillation**: Teacher-student training with temperature-scaled soft targets\n",
|
||
"- **Low-Rank Approximation**: SVD-based matrix factorization for parameter reduction\n",
|
||
"- **Pipeline Integration**: Sequential application of multiple compression techniques\n",
|
||
"\n",
|
||
"### Ready for Next Steps\n",
|
||
"Your compression implementation enables efficient model deployment across diverse hardware constraints!\n",
|
||
"Export with: `tito module complete 18`\n",
|
||
"\n",
|
||
"**Next**: Module 19 will add comprehensive benchmarking to evaluate all optimization techniques together, measuring the cumulative effects of quantization, acceleration, and compression!"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|