mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-03-12 00:43:34 -05:00
- Added QuantizationComplete class with quantize/dequantize methods - Exported quantization functions to tinytorch/optimization/quantization.py - Provides 4x memory reduction with minimal accuracy loss - Removed pedagogical QuantizedLinear export to avoid conflicts - Added proper imports to export block
2594 lines
128 KiB
Plaintext
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "4c350fb4",
"metadata": {},
"outputs": [],
"source": [
"#| default_exp optimization.quantization"
]
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "68ad4cba",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"# Module 17: Quantization - Making Models Smaller and Faster\n",
|
||
"\n",
|
||
"Welcome to Quantization! Today you'll learn how to reduce model precision from FP32 to INT8 while preserving accuracy.\n",
|
||
"\n",
|
||
"## 🔗 Prerequisites & Progress\n",
|
||
"**You've Built**: Complete ML pipeline with profiling and acceleration techniques\n",
|
||
"**You'll Build**: INT8 quantization system with calibration and memory savings\n",
|
||
"**You'll Enable**: 4× memory reduction and 2-4× speedup with minimal accuracy loss\n",
|
||
"\n",
|
||
"**Connection Map**:\n",
|
||
"```\n",
|
||
"Profiling → Quantization → Compression\n",
|
||
"(measure) (reduce bits) (remove weights)\n",
|
||
"```\n",
|
||
"\n",
|
||
"## Learning Objectives\n",
|
||
"By the end of this module, you will:\n",
|
||
"1. Implement INT8 quantization with proper scaling\n",
|
||
"2. Build quantization-aware training for minimal accuracy loss\n",
|
||
"3. Apply post-training quantization to existing models\n",
|
||
"4. Measure actual memory and compute savings\n",
|
||
"5. Understand quantization error and mitigation strategies\n",
|
||
"\n",
|
||
"Let's make models 4× smaller!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ada2f24d",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 📦 Where This Code Lives in the Final Package\n",
|
||
"\n",
|
||
"**Learning Side:** You work in `modules/17_quantization/quantization_dev.py` \n",
|
||
"**Building Side:** Code exports to `tinytorch.optimization.quantization`\n",
|
||
"\n",
|
||
"```python\n",
|
||
"# How to use this module:\n",
|
||
"from tinytorch.optimization.quantization import quantize_int8, QuantizedLinear, quantize_model\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Why this matters:**\n",
|
||
"- **Learning:** Complete quantization system in one focused module for deep understanding\n",
|
||
"- **Production:** Proper organization like PyTorch's torch.quantization with all optimization components together\n",
|
||
"- **Consistency:** All quantization operations and calibration tools in optimization.quantization\n",
|
||
"- **Integration:** Works seamlessly with existing models for complete optimization pipeline"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "a4314940",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "imports",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
"import numpy as np\n",
"import time\n",
"from typing import Tuple, Dict, List, Optional\n",
"import warnings\n",
"\n",
"# Import dependencies from other modules\n",
"from tinytorch.core.tensor import Tensor\n",
"from tinytorch.core.layers import Linear\n",
"from tinytorch.core.activations import ReLU\n",
"\n",
"print(\"✅ Quantization module imports complete\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "210e964f",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 1. Introduction - The Memory Wall Problem\n",
|
||
"\n",
|
||
"Imagine trying to fit a library in your backpack. Neural networks face the same challenge - models are getting huge, but devices have limited memory!\n",
|
||
"\n",
|
||
"### The Precision Paradox\n",
|
||
"\n",
|
||
"Modern neural networks use 32-bit floating point numbers with incredible precision:\n",
|
||
"\n",
|
||
"```\n",
|
||
"FP32 Number: 3.14159265359...\n",
|
||
" ^^^^^^^^^^^^^^^^\n",
|
||
" 32 bits = 4 bytes per weight\n",
|
||
"```\n",
|
||
"\n",
|
||
"But here's the surprising truth: **we don't need all that precision for most AI tasks!**\n",
|
||
"\n",
|
||
"### The Growing Memory Crisis\n",
|
||
"\n",
|
||
"```\n",
|
||
"Model Memory Requirements (FP32):\n",
|
||
"┌─────────────────────────────────────────────────────────────┐\n",
|
||
"│ BERT-Base: 110M params × 4 bytes = 440MB │\n",
|
||
"│ GPT-2: 1.5B params × 4 bytes = 6GB │\n",
|
||
"│ GPT-3: 175B params × 4 bytes = 700GB │\n",
|
||
"│ Your Phone: Available RAM = 4-8GB │\n",
|
||
"└─────────────────────────────────────────────────────────────┘\n",
|
||
" ↑\n",
|
||
" Problem!\n",
|
||
"```\n",
|
||
"\n",
|
||
"### The Quantization Solution\n",
|
||
"\n",
|
||
"What if we could represent each weight with just 8 bits instead of 32?\n",
|
||
"\n",
|
||
"```\n",
|
||
"Before Quantization (FP32):\n",
|
||
"┌──────────────────────────────────┐\n",
|
||
"│ 3.14159265 │ 2.71828183 │ │ 32 bits each\n",
|
||
"└──────────────────────────────────┘\n",
|
||
"\n",
|
||
"After Quantization (INT8):\n",
|
||
"┌────────┬────────┬────────┬────────┐\n",
|
||
"│ 98 │ 85 │ 72 │ 45 │ 8 bits each\n",
|
||
"└────────┴────────┴────────┴────────┘\n",
|
||
" ↑\n",
|
||
" 4× less memory!\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Real-World Impact You'll Achieve\n",
|
||
"\n",
|
||
"**Memory Reduction:**\n",
|
||
"- BERT-Base: 440MB → 110MB (4× smaller)\n",
|
||
"- Fits on mobile devices!\n",
|
||
"- Faster loading from disk\n",
|
||
"- More models in GPU memory\n",
|
||
"\n",
|
||
"**Speed Improvements:**\n",
|
||
"- 2-4× faster inference (hardware dependent)\n",
|
||
"- Lower power consumption\n",
|
||
"- Better user experience\n",
|
||
"\n",
|
||
"**Accuracy Preservation:**\n",
|
||
"- <1% accuracy loss with proper techniques\n",
|
||
"- Sometimes even improves generalization!\n",
|
||
"\n",
|
||
"**Why This Matters:**\n",
|
||
"- **Mobile AI:** Deploy powerful models on phones\n",
|
||
"- **Edge Computing:** Run AI without cloud connectivity\n",
|
||
"- **Data Centers:** Serve more users with same hardware\n",
|
||
"- **Environmental:** Reduce energy consumption by 2-4×\n",
|
||
"\n",
|
||
"Today you'll build the production-quality quantization system that makes all this possible!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0927a359",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 2. Foundations - The Mathematics of Compression\n",
|
||
"\n",
|
||
"### Understanding the Core Challenge\n",
|
||
"\n",
|
||
"Think of quantization like converting a smooth analog signal to digital steps. We need to map infinite precision (FP32) to just 256 possible values (INT8).\n",
|
||
"\n",
|
||
"### The Quantization Mapping\n",
|
||
"\n",
|
||
"```\n",
|
||
"The Fundamental Problem:\n",
|
||
"\n",
|
||
"FP32 Numbers (Continuous): INT8 Numbers (Discrete):\n",
|
||
" ∞ possible values → 256 possible values\n",
|
||
"\n",
|
||
" ... -1.7 -1.2 -0.3 0.0 0.8 1.5 2.1 ...\n",
|
||
" ↓ ↓ ↓ ↓ ↓ ↓ ↓\n",
|
||
" -128 -95 -38 0 25 48 67 127\n",
|
||
"```\n",
|
||
"\n",
|
||
"### The Magic Formula\n",
|
||
"\n",
|
||
"Every quantization system uses this fundamental relationship:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Quantization (FP32 → INT8):\n",
"┌─────────────────────────────────────────────────────────┐\n",
"│  quantized = round(float_value / scale) + zero_point    │\n",
"└─────────────────────────────────────────────────────────┘\n",
"\n",
"Dequantization (INT8 → FP32):\n",
"┌─────────────────────────────────────────────────────────┐\n",
"│  float_value = scale × (quantized - zero_point)         │\n",
"└─────────────────────────────────────────────────────────┘\n",
"```\n",
|
||
"\n",
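"As a quick sanity check of these formulas, here is a small NumPy sketch (illustrative values only, not part of the exported module API):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"x = np.array([-1.5, 0.2, 2.8], dtype=np.float32)\n",
"scale = (x.max() - x.min()) / 255.0                # one INT8 step in FP32 units\n",
"zero_point = int(round(-128 - x.min() / scale))    # INT8 value that represents FP32 zero\n",
"q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)\n",
"x_hat = scale * (q.astype(np.float32) - zero_point)\n",
"# Round-trip error per element is bounded by about scale/2\n",
"```\n",
"\n",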
|
||
"### The Two Critical Parameters\n",
|
||
"\n",
|
||
"**1. Scale (s)** - How big each INT8 step is in FP32 space:\n",
|
||
"```\n",
|
||
"Small Scale (high precision): Large Scale (low precision):\n",
|
||
" FP32: [0.0, 0.255] FP32: [0.0, 25.5]\n",
|
||
" ↓ ↓ ↓ ↓ ↓ ↓\n",
|
||
" INT8: 0 128 255 INT8: 0 128 255\n",
|
||
" │ │ │ │ │ │\n",
|
||
" 0.0 0.127 0.255 0.0 12.75 25.5\n",
|
||
"\n",
|
||
" Scale = 0.001 (very precise) Scale = 0.1 (less precise)\n",
|
||
"```\n",
|
||
"\n",
|
||
"**2. Zero Point (z)** - Which INT8 value represents FP32 zero:\n",
|
||
"```\n",
|
||
"Symmetric Range:                    Asymmetric Range:\n",
"  FP32: [-2.0, 2.0]                   FP32: [-1.0, 3.0]\n",
"    ↓     ↓     ↓                       ↓     ↓     ↓\n",
"  INT8: -128   0   127                INT8: -128  -64  127\n",
"         │     │    │                        │     │    │\n",
"       -2.0   0.0  2.0                    -1.0   0.0  3.0\n",
"\n",
"  Zero Point = 0                      Zero Point = -64\n",
"```\n",
|
||
"\n",
|
||
"### Visual Example: Weight Quantization\n",
|
||
"\n",
|
||
"```\n",
|
||
"Original FP32 Weights: Quantized INT8 Mapping:\n",
|
||
"┌─────────────────────────┐      ┌─────────────────────────┐\n",
"│ -0.8  -0.3   0.0   0.5  │  →   │  -85   -32     0    53  │\n",
"│  0.9   1.2  -0.1   0.7  │      │   95   127   -11    74  │\n",
"└─────────────────────────┘      └─────────────────────────┘\n",
|
||
" 4 bytes each 1 byte each\n",
|
||
" Total: 32 bytes Total: 8 bytes\n",
|
||
" ↑\n",
|
||
" 4× compression!\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Quantization Error Analysis\n",
|
||
"\n",
|
||
"```\n",
|
||
"Perfect Reconstruction (Impossible): Quantized Reconstruction (Reality):\n",
|
||
"\n",
|
||
"Original: 0.73 Original: 0.73\n",
|
||
" ↓ ↓\n",
|
||
"INT8: ? (can't represent exactly) INT8: 93 (closest)\n",
|
||
" ↓ ↓\n",
|
||
"Restored: 0.73 Restored: 0.728\n",
|
||
" ↑\n",
|
||
" Error: 0.002\n",
|
||
"```\n",
|
||
"\n",
|
||
"**The Quantization Trade-off:**\n",
|
||
"- **More bits** = Higher precision, larger memory\n",
|
||
"- **Fewer bits** = Lower precision, smaller memory\n",
|
||
"- **Goal:** Find the sweet spot where error is acceptable\n",
|
||
"\n",
|
||
"### Why INT8 is the Sweet Spot\n",
|
||
"\n",
|
||
"```\n",
|
||
"Precision vs Memory Trade-offs:\n",
|
||
"\n",
|
||
"FP32: ████████████████████████████████ (32 bits) - Overkill precision\n",
|
||
"FP16: ████████████████ (16 bits) - Good precision\n",
|
||
"INT8: ████████ (8 bits) - Sufficient precision ← Sweet spot!\n",
|
||
"INT4: ████ (4 bits) - Often too little\n",
|
||
"\n",
|
||
"Memory: 100% 50% 25% 12.5%\n",
|
||
"Accuracy: 100% 99.9% 99.5% 95%\n",
|
||
"```\n",
|
||
"\n",
|
||
"INT8 gives us 4× memory reduction with <1% accuracy loss - the perfect balance for production systems!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "6639cbe4",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 3. Implementation - Building the Quantization Engine\n",
|
||
"\n",
|
||
"### Our Implementation Strategy\n",
|
||
"\n",
|
||
"We'll build quantization in logical layers, each building on the previous:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Quantization System Architecture:\n",
|
||
"\n",
|
||
"┌─────────────────────────────────────────────────────────────┐\n",
|
||
"│ Layer 4: Model Quantization │\n",
|
||
"│ quantize_model() - Convert entire neural networks │\n",
|
||
"├─────────────────────────────────────────────────────────────┤\n",
|
||
"│ Layer 3: Layer Quantization │\n",
|
||
"│ QuantizedLinear - Quantized linear transformations │\n",
|
||
"├─────────────────────────────────────────────────────────────┤\n",
|
||
"│ Layer 2: Tensor Operations │\n",
|
||
"│ quantize_int8() - Core quantization algorithm │\n",
|
||
"│ dequantize_int8() - Restore to floating point │\n",
|
||
"├─────────────────────────────────────────────────────────────┤\n",
|
||
"│ Layer 1: Foundation │\n",
|
||
"│ Scale & Zero Point Calculation - Parameter optimization │\n",
|
||
"└─────────────────────────────────────────────────────────────┘\n",
|
||
"```\n",
|
||
"\n",
|
||
"### What We're About to Build\n",
|
||
"\n",
|
||
"**Core Functions:**\n",
|
||
"- `quantize_int8()` - Convert FP32 tensors to INT8\n",
|
||
"- `dequantize_int8()` - Convert INT8 back to FP32\n",
|
||
"- `QuantizedLinear` - Quantized version of Linear layers\n",
|
||
"- `quantize_model()` - Quantize entire neural networks\n",
|
||
"\n",
|
||
"**Key Features:**\n",
|
||
"- **Automatic calibration** - Find optimal quantization parameters\n",
|
||
"- **Error minimization** - Preserve accuracy during compression\n",
|
||
"- **Memory tracking** - Measure actual savings achieved\n",
|
||
"- **Production patterns** - Industry-standard algorithms\n",
|
||
"\n",
|
||
"Let's start with the fundamental building block!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "26bdadc6",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### INT8 Quantization - The Foundation\n",
|
||
"\n",
|
||
"This is the core function that converts any FP32 tensor to INT8. Think of it as a smart compression algorithm that preserves the most important information.\n",
|
||
"\n",
|
||
"```\n",
|
||
"Quantization Process Visualization:\n",
|
||
"\n",
|
||
"Step 1: Analyze Range Step 2: Calculate Parameters Step 3: Apply Formula\n",
|
||
"┌─────────────────────────┐    ┌─────────────────────────┐    ┌─────────────────────────┐\n",
"│ Input: [-1.5, 0.2, 2.8] │    │ Min: -1.5               │    │ quantized = round(      │\n",
"│                         │    │ Max:  2.8               │    │   value / scale)        │\n",
"│ Find min/max values     │ →  │ Range: 4.3              │ →  │   + zero_point          │\n",
"│                         │    │ Scale: 4.3/255 ≈ 0.0169 │    │                         │\n",
"│                         │    │ Zero Point: -39         │    │ Result: [-128, -27, 127]│\n",
"└─────────────────────────┘    └─────────────────────────┘    └─────────────────────────┘\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Key Challenges This Function Solves:**\n",
|
||
"- **Dynamic Range:** Each tensor has different min/max values\n",
|
||
"- **Precision Loss:** Map 4 billion FP32 values to just 256 INT8 values\n",
|
||
"- **Zero Preservation:** Ensure FP32 zero maps exactly to an INT8 value\n",
|
||
"- **Symmetric Mapping:** Distribute quantization levels efficiently\n",
|
||
"\n",
|
||
"**Why This Algorithm:**\n",
|
||
"- **Linear mapping** preserves relative relationships between values\n",
|
||
"- **Symmetric quantization** works well for most neural network weights\n",
|
||
"- **Clipping to [-128, 127]** ensures valid INT8 range\n",
|
||
"- **Round-to-nearest** minimizes quantization error"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "68d91dc9",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "quantize_int8",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def quantize_int8(tensor: Tensor) -> Tuple[Tensor, float, int]:\n",
"    \"\"\"\n",
"    Quantize FP32 tensor to INT8 using asymmetric (affine) quantization.\n",
"\n",
"    TODO: Implement INT8 quantization with scale and zero_point calculation\n",
"\n",
"    APPROACH:\n",
"    1. Find min/max values in tensor data\n",
"    2. Calculate scale: (max_val - min_val) / 255 (INT8 range: -128 to 127)\n",
"    3. Calculate zero_point: the INT8 value that represents FP32 zero\n",
"    4. Apply quantization formula: round(value / scale) + zero_point\n",
"    5. Clamp to INT8 range [-128, 127]\n",
"\n",
"    EXAMPLE:\n",
"    >>> tensor = Tensor([[-1.0, 0.0, 2.0], [0.5, 1.5, -0.5]])\n",
"    >>> q_tensor, scale, zero_point = quantize_int8(tensor)\n",
"    >>> print(f\"Scale: {scale:.4f}, Zero point: {zero_point}\")\n",
"    Scale: 0.0118, Zero point: -43\n",
"\n",
"    HINTS:\n",
"    - Use np.round() for quantization\n",
"    - Clamp with np.clip(values, -128, 127)\n",
"    - Handle edge case where min_val == max_val (set scale=1.0)\n",
"    \"\"\"\n",
"    ### BEGIN SOLUTION\n",
"    data = tensor.data\n",
"\n",
"    # Step 1: Find dynamic range\n",
"    min_val = float(np.min(data))\n",
"    max_val = float(np.max(data))\n",
"\n",
"    # Step 2: Handle edge case (constant tensor)\n",
"    if abs(max_val - min_val) < 1e-8:\n",
"        scale = 1.0\n",
"        zero_point = 0\n",
"        quantized_data = np.zeros_like(data, dtype=np.int8)\n",
"        return Tensor(quantized_data), scale, zero_point\n",
"\n",
"    # Step 3: Calculate scale and zero_point for affine quantization\n",
"    # Map [min_val, max_val] to [-128, 127] (INT8 range)\n",
"    scale = (max_val - min_val) / 255.0\n",
"    zero_point = int(np.round(-128 - min_val / scale))\n",
"\n",
"    # Clamp zero_point to valid INT8 range\n",
"    zero_point = int(np.clip(zero_point, -128, 127))\n",
"\n",
"    # Step 4: Apply quantization formula: q = round(x / scale) + zero_point\n",
"    quantized_data = np.round(data / scale) + zero_point\n",
"\n",
"    # Step 5: Clamp to INT8 range and convert to int8\n",
"    quantized_data = np.clip(quantized_data, -128, 127).astype(np.int8)\n",
"\n",
"    return Tensor(quantized_data), scale, zero_point\n",
"    ### END SOLUTION\n",
"\n",
"def test_unit_quantize_int8():\n",
"    \"\"\"🔬 Test INT8 quantization implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: INT8 Quantization...\")\n",
"\n",
"    # Test basic quantization\n",
"    tensor = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])\n",
"    q_tensor, scale, zero_point = quantize_int8(tensor)\n",
"\n",
"    # Verify quantized values are in INT8 range\n",
"    assert np.all(q_tensor.data >= -128)\n",
"    assert np.all(q_tensor.data <= 127)\n",
"    assert isinstance(scale, float)\n",
"    assert isinstance(zero_point, int)\n",
"\n",
"    # Test dequantization preserves approximate values\n",
"    dequantized = scale * (q_tensor.data - zero_point)\n",
"    error = np.mean(np.abs(tensor.data - dequantized))\n",
"    assert error < 0.2, f\"Quantization error too high: {error}\"\n",
"\n",
"    # Test edge case: constant tensor\n",
"    constant_tensor = Tensor([[2.0, 2.0], [2.0, 2.0]])\n",
"    q_const, scale_const, zp_const = quantize_int8(constant_tensor)\n",
"    assert scale_const == 1.0\n",
"\n",
"    print(\"✅ INT8 quantization works correctly!\")\n",
"\n",
"test_unit_quantize_int8()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4dc13ff2",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### INT8 Dequantization - Restoring Precision\n",
|
||
"\n",
|
||
"Dequantization is the inverse process - converting compressed INT8 values back to usable FP32. This is where we \"decompress\" our quantized data.\n",
|
||
"\n",
|
||
"```\n",
|
||
"Dequantization Process:\n",
|
||
"\n",
|
||
"INT8 Values + Parameters → FP32 Reconstruction\n",
|
||
"\n",
|
||
"┌──────────────────────────────────┐\n",
"│ Quantized: [-128, -27, 127]      │\n",
"│ Scale: 0.0169                    │\n",
"│ Zero Point: -39                  │\n",
"└──────────────────────────────────┘\n",
"                 │\n",
"                 ▼ Apply Formula\n",
"┌──────────────────────────────────┐\n",
"│ FP32 = scale × (quantized - zp)  │\n",
"│      (zp = zero_point)           │\n",
"└──────────────────────────────────┘\n",
"                 │\n",
"                 ▼\n",
"┌──────────────────────────────────┐\n",
"│ Result:   [-1.504, 0.203, 2.805] │\n",
"│ Original: [-1.500, 0.200, 2.800] │\n",
"│ Error:    [ 0.004, 0.003, 0.005] │\n",
"└──────────────────────────────────┘\n",
|
||
" ↑\n",
|
||
" Excellent approximation!\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Why This Step Is Critical:**\n",
|
||
"- **Neural networks expect FP32** - INT8 values would confuse computations\n",
|
||
"- **Preserves computation compatibility** - works with existing matrix operations\n",
|
||
"- **Controlled precision loss** - error is bounded and predictable\n",
|
||
"- **Hardware flexibility** - can use FP32 or specialized INT8 operations\n",
|
||
"\n",
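"The round-trip described above can be sketched in a few lines (assuming the `quantize_int8` and `dequantize_int8` functions built in this module):\n",
"\n",
"```python\n",
"original = Tensor([[-1.5, 0.2, 2.8]])\n",
"q, scale, zp = quantize_int8(original)\n",
"restored = dequantize_int8(q, scale, zp)\n",
"# Each element of `restored` lies within about scale/2 of the original\n",
"```\n",
"\n",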
|
||
"**When Dequantization Happens:**\n",
|
||
"- **During forward pass** - before matrix multiplications\n",
|
||
"- **For gradient computation** - during backward pass\n",
|
||
"- **Educational approach** - production uses INT8 GEMM directly"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "c54cf336",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "dequantize_int8",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def dequantize_int8(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:\n",
"    \"\"\"\n",
"    Dequantize INT8 tensor back to FP32.\n",
"\n",
"    TODO: Implement dequantization using the inverse formula\n",
"\n",
"    APPROACH:\n",
"    1. Apply inverse quantization: scale * (quantized_value - zero_point)\n",
"    2. Return as new FP32 Tensor\n",
"\n",
"    EXAMPLE:\n",
"    >>> q_tensor = Tensor([[-42, 0, 85]])  # INT8 values\n",
"    >>> scale, zero_point = 0.0314, 64\n",
"    >>> fp32_tensor = dequantize_int8(q_tensor, scale, zero_point)\n",
"    >>> print(fp32_tensor.data)\n",
"    [[-3.33, -2.01, 0.66]]  # Approximate original values\n",
"\n",
"    HINT:\n",
"    - Formula: dequantized = scale * (quantized - zero_point)\n",
"    \"\"\"\n",
"    ### BEGIN SOLUTION\n",
"    # Cast to float first so INT8 arithmetic cannot overflow,\n",
"    # then apply the inverse quantization formula\n",
"    dequantized_data = scale * (q_tensor.data.astype(np.float32) - zero_point)\n",
"    return Tensor(dequantized_data.astype(np.float32))\n",
"    ### END SOLUTION\n",
"\n",
"def test_unit_dequantize_int8():\n",
"    \"\"\"🔬 Test INT8 dequantization implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: INT8 Dequantization...\")\n",
"\n",
"    # Test round-trip: quantize → dequantize\n",
"    original = Tensor([[-1.5, 0.0, 3.2], [1.1, -0.8, 2.7]])\n",
"    q_tensor, scale, zero_point = quantize_int8(original)\n",
"    restored = dequantize_int8(q_tensor, scale, zero_point)\n",
"\n",
"    # Verify round-trip error is small (bounded by ~scale/2 per element)\n",
"    error = np.mean(np.abs(original.data - restored.data))\n",
"    assert error < 0.05, f\"Round-trip error too high: {error}\"\n",
"\n",
"    # Verify output is float32\n",
"    assert restored.data.dtype == np.float32\n",
"\n",
"    print(\"✅ INT8 dequantization works correctly!\")\n",
"\n",
"test_unit_dequantize_int8()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "457c4bca",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Quantization Quality - Understanding the Impact\n",
|
||
"\n",
|
||
"### Why Distribution Matters\n",
|
||
"\n",
|
||
"Different types of data quantize differently. Let's understand how various weight distributions affect quantization quality.\n",
|
||
"\n",
|
||
"```\n",
|
||
"Quantization Quality Factors:\n",
|
||
"\n",
|
||
"┌─────────────────┬─────────────────┬─────────────────┐\n",
|
||
"│ Distribution │ Scale Usage │ Error Level │\n",
|
||
"├─────────────────┼─────────────────┼─────────────────┤\n",
|
||
"│ Uniform │ ████████████████ │ Low │\n",
|
||
"│ Normal │ ██████████████ │ Medium │\n",
|
||
"│ With Outliers │ ████ │ High │\n",
|
||
"│ Sparse (zeros) │ ████ │ High │\n",
|
||
"└─────────────────┴─────────────────┴─────────────────┘\n",
|
||
"```\n",
|
||
"\n",
|
||
"### The Scale Utilization Problem\n",
|
||
"\n",
|
||
"```\n",
|
||
"Good Quantization (Uniform): Bad Quantization (Outliers):\n",
|
||
"\n",
|
||
"Values: [-1.0 ... +1.0] Values: [-10.0, -0.1...+0.1, +10.0]\n",
|
||
" ↓ ↓\n",
|
||
"INT8: -128 ......... +127 INT8: -128 ... 0 ... +127\n",
|
||
" ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑\n",
|
||
" All levels used Most levels wasted!\n",
|
||
"\n",
|
||
"Scale: 0.0078 (good precision) Scale: 0.078 (poor precision)\n",
|
||
"Error: ~0.004 Error: ~0.04 (10× worse!)\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Key Insight:** Outliers waste quantization levels and hurt precision for normal values."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "a28c45a7",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "analyze_quantization_error",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def analyze_quantization_error():\n",
"    \"\"\"📊 Analyze quantization error across different distributions.\"\"\"\n",
"    print(\"📊 Analyzing Quantization Error Across Distributions...\")\n",
"\n",
"    distributions = {\n",
"        'uniform': np.random.uniform(-1, 1, (1000,)),\n",
"        'normal': np.random.normal(0, 0.5, (1000,)),\n",
"        'outliers': np.concatenate([np.random.normal(0, 0.1, (900,)),\n",
"                                    np.random.uniform(-2, 2, (100,))]),\n",
"        'sparse': np.random.choice([0, 0, 0, 1], size=(1000,)) * np.random.normal(0, 1, (1000,))\n",
"    }\n",
"\n",
"    results = {}\n",
"\n",
"    for name, data in distributions.items():\n",
"        # Quantize and measure error\n",
"        original = Tensor(data)\n",
"        q_tensor, scale, zero_point = quantize_int8(original)\n",
"        restored = dequantize_int8(q_tensor, scale, zero_point)\n",
"\n",
"        # Calculate metrics\n",
"        mse = np.mean((original.data - restored.data) ** 2)\n",
"        max_error = np.max(np.abs(original.data - restored.data))\n",
"\n",
"        results[name] = {\n",
"            'mse': mse,\n",
"            'max_error': max_error,\n",
"            'scale': scale,\n",
"            'range_ratio': (np.max(data) - np.min(data)) / scale if scale > 0 else 0\n",
"        }\n",
"\n",
"        print(f\"{name:8}: MSE={mse:.6f}, Max Error={max_error:.4f}, Scale={scale:.4f}\")\n",
"\n",
"    print(\"\\n💡 Insights:\")\n",
"    print(\"- Uniform: Low error, good scale utilization\")\n",
"    print(\"- Normal: Higher error at distribution tails\")\n",
"    print(\"- Outliers: Poor quantization due to extreme values\")\n",
"    print(\"- Sparse: Wasted quantization levels on zeros\")\n",
"\n",
"    return results\n",
"\n",
"# Analyze quantization quality\n",
"error_analysis = analyze_quantization_error()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5f4bf7b6",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## QuantizedLinear - The Heart of Efficient Networks\n",
|
||
"\n",
|
||
"### Why We Need Quantized Layers\n",
|
||
"\n",
|
||
"A quantized model isn't just about storing weights in INT8 - we need layers that can work efficiently with quantized data.\n",
|
||
"\n",
|
||
"```\n",
|
||
"Regular Linear Layer: QuantizedLinear Layer:\n",
|
||
"\n",
|
||
"┌─────────────────────┐ ┌─────────────────────┐\n",
|
||
"│ Input: FP32 │ │ Input: FP32 │\n",
|
||
"│ Weights: FP32 │ │ Weights: INT8 │\n",
|
||
"│ Computation: FP32 │ VS │ Computation: Mixed │\n",
|
||
"│ Output: FP32 │ │ Output: FP32 │\n",
|
||
"│ Memory: 4× more │ │ Memory: 4× less │\n",
|
||
"└─────────────────────┘ └─────────────────────┘\n",
|
||
"```\n",
|
||
"\n",
|
||
"### The Quantized Forward Pass\n",
|
||
"\n",
|
||
"```\n",
|
||
"Quantized Linear Layer Forward Pass:\n",
|
||
"\n",
|
||
" Input (FP32) Quantized Weights (INT8)\n",
|
||
" │ │\n",
|
||
" ▼ ▼\n",
|
||
"┌─────────────────┐ ┌─────────────────┐\n",
|
||
"│ Calibrate │ │ Dequantize │\n",
|
||
"│ (optional) │ │ Weights │\n",
|
||
"└─────────────────┘ └─────────────────┘\n",
|
||
" │ │\n",
|
||
" ▼ ▼\n",
|
||
" Input (FP32) Weights (FP32)\n",
|
||
" │ │\n",
|
||
" └───────────────┬───────────────┘\n",
|
||
" ▼\n",
|
||
" ┌─────────────────┐\n",
|
||
" │ Matrix Multiply │\n",
|
||
" │ (FP32 GEMM) │\n",
|
||
" └─────────────────┘\n",
|
||
" │\n",
|
||
" ▼\n",
|
||
" Output (FP32)\n",
|
||
"\n",
|
||
"Memory Saved: 4× for weights storage!\n",
|
||
"Speed: Depends on dequantization overhead vs INT8 GEMM support\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Calibration - Finding Optimal Input Quantization\n",
|
||
"\n",
|
||
"```\n",
|
||
"Calibration Process:\n",
|
||
"\n",
|
||
" Step 1: Collect Sample Inputs Step 2: Analyze Distribution Step 3: Optimize Parameters\n",
|
||
" ┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐\n",
|
||
" │ input_1: [-0.5, 0.2, ..] │ │ Min: -0.8 │ │ Scale: 0.00627 │\n",
|
||
" │ input_2: [-0.3, 0.8, ..] │ → │ Max: +0.8 │ → │ Zero Point: 0 │\n",
|
||
" │ input_3: [-0.1, 0.5, ..] │ │ Range: 1.6 │ │ Optimal for this data │\n",
|
||
" │ ... │ │ Distribution: Normal │ │ range and distribution │\n",
|
||
" └─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Why Calibration Matters:**\n",
|
||
"- **Without calibration:** Generic quantization parameters may waste precision\n",
|
||
"- **With calibration:** Parameters optimized for actual data distribution\n",
|
||
"- **Result:** Better accuracy preservation with same memory savings"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "6b6a464e",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### QuantizedLinear Class - Efficient Neural Network Layer\n",
|
||
"\n",
|
||
"This class replaces regular Linear layers with quantized versions that use 4× less memory while preserving functionality.\n",
|
||
"\n",
|
||
"```\n",
|
||
"QuantizedLinear Architecture:\n",
|
||
"\n",
|
||
"Creation Time: Runtime:\n",
|
||
"┌─────────────────────────┐ ┌─────────────────────────┐\n",
|
||
"│ Regular Linear Layer │ │ Input (FP32) │\n",
|
||
"│ ↓ │ │ ↓ │\n",
|
||
"│ Quantize weights → INT8 │ │ Optional: quantize input│\n",
|
||
"│ Quantize bias → INT8 │ → │ ↓ │\n",
|
||
"│ Store quantization params │ │ Dequantize weights │\n",
|
||
"│ Ready for deployment! │ │ ↓ │\n",
|
||
"└─────────────────────────┘ │ Matrix multiply (FP32) │\n",
|
||
" One-time cost │ ↓ │\n",
|
||
" │ Output (FP32) │\n",
|
||
" └─────────────────────────┘\n",
|
||
" Per-inference cost\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Key Design Decisions:**\n",
|
||
"\n",
|
||
"1. **Store original layer reference** - for debugging and comparison\n",
|
||
"2. **Separate quantization parameters** - weights and bias may need different scales\n",
|
||
"3. **Calibration support** - optimize input quantization using real data\n",
|
||
"4. **FP32 computation** - educational approach, production uses INT8 GEMM\n",
|
||
"5. **Memory tracking** - measure actual compression achieved\n",
|
||
"\n",
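"The memory claim in point 5 can be checked with quick arithmetic (hypothetical layer sizes, counting the handful of bytes used by scales and zero points):\n",
"\n",
"```python\n",
"in_features, out_features = 256, 128\n",
"n_params = in_features * out_features + out_features   # weights + bias\n",
"fp32_bytes = 4 * n_params\n",
"int8_bytes = 1 * n_params + 16   # plus scales/zero points\n",
"print(fp32_bytes / int8_bytes)   # ≈ 4× compression\n",
"```\n",
"\n",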
|
||
"**Memory Layout Comparison:**\n",
|
||
"```\n",
|
||
"Regular Linear Layer: QuantizedLinear Layer:\n",
|
||
"┌─────────────────────────┐ ┌─────────────────────────┐\n",
|
||
"│ weights: FP32 × N │ │ q_weights: INT8 × N │\n",
|
||
"│ bias: FP32 × M │ │ q_bias: INT8 × M │\n",
|
||
"│ │ → │ weight_scale: 1 float │\n",
|
||
"│ Total: 4×(N+M) bytes │ │ weight_zero_point: 1 int│\n",
|
||
"└─────────────────────────┘ │ bias_scale: 1 float │\n",
|
||
" │ bias_zero_point: 1 int │\n",
|
||
" │ │\n",
|
||
" │ Total: (N+M) + 16 bytes │\n",
|
||
" └─────────────────────────┘\n",
|
||
" ↑\n",
|
||
" ~4× smaller!\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Production vs Educational Trade-off:**\n",
|
||
"- **Our approach:** Dequantize → FP32 computation (easier to understand)\n",
|
||
"- **Production:** INT8 GEMM operations (faster, more complex)\n",
|
||
"- **Both achieve:** Same memory savings, similar accuracy"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "b518a3e4",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "quantized_linear",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"class QuantizedLinear:\n",
"    \"\"\"Quantized version of Linear layer using INT8 arithmetic.\"\"\"\n",
"\n",
"    def __init__(self, linear_layer: Linear):\n",
"        \"\"\"\n",
"        Create quantized version of existing linear layer.\n",
"\n",
"        TODO: Quantize weights and bias, store quantization parameters\n",
"\n",
"        APPROACH:\n",
"        1. Quantize weights using quantize_int8\n",
"        2. Quantize bias if it exists\n",
"        3. Store original layer reference for forward pass\n",
"        4. Store quantization parameters for dequantization\n",
"\n",
"        IMPLEMENTATION STRATEGY:\n",
"        - Store quantized weights, scales, and zero points\n",
"        - Implement forward pass using dequantized computation (educational approach)\n",
"        - Production: Would use INT8 matrix multiplication libraries\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        self.original_layer = linear_layer\n",
"\n",
"        # Quantize weights\n",
"        self.q_weight, self.weight_scale, self.weight_zero_point = quantize_int8(linear_layer.weight)\n",
"\n",
"        # Quantize bias if it exists\n",
"        if linear_layer.bias is not None:\n",
"            self.q_bias, self.bias_scale, self.bias_zero_point = quantize_int8(linear_layer.bias)\n",
"        else:\n",
"            self.q_bias = None\n",
"            self.bias_scale = None\n",
"            self.bias_zero_point = None\n",
"\n",
"        # Store input quantization parameters (set during calibration)\n",
"        self.input_scale = None\n",
"        self.input_zero_point = None\n",
"        ### END SOLUTION\n",
"\n",
"    def calibrate(self, sample_inputs: List[Tensor]):\n",
"        \"\"\"\n",
"        Calibrate input quantization parameters using sample data.\n",
"\n",
"        TODO: Calculate optimal input quantization parameters\n",
"\n",
"        APPROACH:\n",
"        1. Collect statistics from sample inputs\n",
"        2. Calculate optimal scale and zero_point for inputs\n",
"        3. Store for use in forward pass\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Collect all input values\n",
"        all_values = []\n",
"        for inp in sample_inputs:\n",
"            all_values.extend(inp.data.flatten())\n",
"\n",
"        all_values = np.array(all_values)\n",
"\n",
"        # Calculate input quantization parameters\n",
"        min_val = float(np.min(all_values))\n",
"        max_val = float(np.max(all_values))\n",
"\n",
"        if abs(max_val - min_val) < 1e-8:\n",
"            self.input_scale = 1.0\n",
"            self.input_zero_point = 0\n",
"        else:\n",
"            self.input_scale = (max_val - min_val) / 255.0\n",
"            self.input_zero_point = int(np.round(-128 - min_val / self.input_scale))\n",
"            self.input_zero_point = np.clip(self.input_zero_point, -128, 127)\n",
"        ### END SOLUTION\n",
"\n",
"    def forward(self, x: Tensor) -> Tensor:\n",
"        \"\"\"\n",
"        Forward pass with quantized computation.\n",
"\n",
"        TODO: Implement quantized forward pass\n",
"\n",
"        APPROACH:\n",
"        1. Quantize input (if calibrated)\n",
"        2. Dequantize weights and input for computation (educational approach)\n",
"        3. Perform matrix multiplication\n",
"        4. Return FP32 result\n",
"\n",
"        NOTE: Production quantization uses INT8 GEMM libraries for speed\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # For educational purposes, we dequantize and compute in FP32\n",
"        # Production systems use specialized INT8 GEMM operations\n",
"\n",
"        # Dequantize weights\n",
"        weight_fp32 = dequantize_int8(self.q_weight, self.weight_scale, self.weight_zero_point)\n",
"\n",
"        # Perform computation (same as original layer)\n",
"        result = x.matmul(weight_fp32)\n",
"\n",
"        # Add bias if it exists\n",
"        if self.q_bias is not None:\n",
"            bias_fp32 = dequantize_int8(self.q_bias, self.bias_scale, self.bias_zero_point)\n",
"            result = Tensor(result.data + bias_fp32.data)\n",
"\n",
"        return result\n",
"        ### END SOLUTION\n",
"\n",
"    def __call__(self, x: Tensor) -> Tensor:\n",
"        \"\"\"Allows the quantized linear layer to be called like a function.\"\"\"\n",
"        return self.forward(x)\n",
"\n",
"    def parameters(self) -> List[Tensor]:\n",
"        \"\"\"Return quantized parameters.\"\"\"\n",
"        params = [self.q_weight]\n",
"        if self.q_bias is not None:\n",
"            params.append(self.q_bias)\n",
"        return params\n",
"\n",
"    def memory_usage(self) -> Dict[str, float]:\n",
"        \"\"\"Calculate memory usage in bytes.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Original FP32 usage\n",
"        original_weight_bytes = self.original_layer.weight.data.size * 4  # 4 bytes per FP32\n",
"        original_bias_bytes = 0\n",
"        if self.original_layer.bias is not None:\n",
"            original_bias_bytes = self.original_layer.bias.data.size * 4\n",
"\n",
"        # Quantized INT8 usage\n",
"        quantized_weight_bytes = self.q_weight.data.size * 1  # 1 byte per INT8\n",
"        quantized_bias_bytes = 0\n",
"        if self.q_bias is not None:\n",
"            quantized_bias_bytes = self.q_bias.data.size * 1\n",
"\n",
"        # Add overhead for scales and zero points (small)\n",
"        overhead_bytes = 8 * 2  # 2 floats + 2 ints for weight/bias quantization params\n",
"\n",
"        return {\n",
"            'original_bytes': original_weight_bytes + original_bias_bytes,\n",
"            'quantized_bytes': quantized_weight_bytes + quantized_bias_bytes + overhead_bytes,\n",
"            'compression_ratio': (original_weight_bytes + original_bias_bytes) /\n",
"                                 (quantized_weight_bytes + quantized_bias_bytes + overhead_bytes)\n",
"        }\n",
"        ### END SOLUTION\n",
"\n",
"def test_unit_quantized_linear():\n",
"    \"\"\"🔬 Test QuantizedLinear implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: QuantizedLinear...\")\n",
"\n",
"    # Create original linear layer\n",
"    original = Linear(4, 3)\n",
"    original.weight = Tensor(np.random.randn(4, 3) * 0.5)  # Smaller range for testing\n",
"    original.bias = Tensor(np.random.randn(3) * 0.1)\n",
"\n",
"    # Create quantized version\n",
"    quantized = QuantizedLinear(original)\n",
"\n",
"    # Test forward pass\n",
"    x = Tensor(np.random.randn(2, 4) * 0.5)\n",
"\n",
"    # Original forward pass\n",
"    original_output = original.forward(x)\n",
"\n",
"    # Quantized forward pass\n",
"    quantized_output = quantized.forward(x)\n",
"\n",
"    # Compare outputs (should be close but not identical due to quantization)\n",
"    error = np.mean(np.abs(original_output.data - quantized_output.data))\n",
"    assert error < 1.0, f\"Quantization error too high: {error}\"\n",
"\n",
"    # Test memory usage\n",
"    memory_info = quantized.memory_usage()\n",
"    assert memory_info['compression_ratio'] > 3.0, \"Should achieve ~4× compression\"\n",
"\n",
"    print(f\"   Memory reduction: {memory_info['compression_ratio']:.1f}×\")\n",
"    print(\"✅ QuantizedLinear works correctly!\")\n",
"\n",
"test_unit_quantized_linear()"
]
},
{
"cell_type": "markdown",
"id": "557295a5",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 4. Integration - Scaling to Full Neural Networks\n",
"\n",
"### The Model Quantization Challenge\n",
"\n",
"Quantizing individual tensors is useful, but real applications need to quantize entire neural networks with multiple layers, activations, and complex data flows.\n",
"\n",
"```\n",
"Model Quantization Process:\n",
"\n",
"Original Model:                     Quantized Model:\n",
"┌─────────────────────────────┐     ┌─────────────────────────────┐\n",
"│ Linear(784, 128)  [FP32]    │     │ QuantizedLinear(784, 128)   │\n",
"│ ReLU()            [FP32]    │     │ ReLU()            [FP32]    │\n",
"│ Linear(128, 64)   [FP32]    │  →  │ QuantizedLinear(128, 64)    │\n",
"│ ReLU()            [FP32]    │     │ ReLU()            [FP32]    │\n",
"│ Linear(64, 10)    [FP32]    │     │ QuantizedLinear(64, 10)     │\n",
"└─────────────────────────────┘     └─────────────────────────────┘\n",
"  Memory: 100%                        Memory: ~25%\n",
"  Speed: Baseline                     Speed: 2-4× faster\n",
"```\n",
"\n",
"### Smart Layer Selection\n",
"\n",
"Not all layers benefit equally from quantization:\n",
"\n",
"```\n",
"Layer Quantization Strategy:\n",
"\n",
"┌─────────────────┬─────────────────┬─────────────────────────────┐\n",
"│ Layer Type      │ Quantize?       │ Reason                      │\n",
"├─────────────────┼─────────────────┼─────────────────────────────┤\n",
"│ Linear/Dense    │ ✅ YES          │ Most parameters, big savings│\n",
"│ Convolution     │ ✅ YES          │ Many weights, good candidate│\n",
"│ Embedding       │ ✅ YES          │ Large lookup tables         │\n",
"│ ReLU/Sigmoid    │ ❌ NO           │ No parameters to quantize   │\n",
"│ BatchNorm       │ 🤔 MAYBE        │ Few params, may hurt        │\n",
"│ First Layer     │ 🤔 MAYBE        │ Often sensitive to precision│\n",
"│ Last Layer      │ 🤔 MAYBE        │ Output quality critical     │\n",
"└─────────────────┴─────────────────┴─────────────────────────────┘\n",
"```\n",
"\n",
"### Calibration Data Flow\n",
"\n",
"```\n",
"End-to-End Calibration:\n",
"\n",
"Calibration Input          Layer-by-Layer Processing\n",
"      │                              │\n",
"      ▼                              ▼\n",
"┌─────────────┐    ┌──────────────────────────────────────────┐\n",
"│ Sample Data │ →  │ Layer 1: Collect activation statistics   │\n",
"│ [batch of   │    │              ↓                           │\n",
"│  real data] │    │ Layer 2: Collect activation statistics   │\n",
"└─────────────┘    │              ↓                           │\n",
"                   │ Layer 3: Collect activation statistics   │\n",
"                   │              ↓                           │\n",
"                   │ Optimize quantization parameters         │\n",
"                   └──────────────────────────────────────────┘\n",
"                                    │\n",
"                                    ▼\n",
"                        Ready for deployment!\n",
"```\n",
"\n",
"### Memory Impact Visualization\n",
"\n",
"```\n",
"Model Memory Breakdown:\n",
"\n",
"Before Quantization:        After Quantization:\n",
"┌─────────────────────┐     ┌─────────────────────┐\n",
"│ Layer 1:  3.1MB     │     │ Layer 1:  0.8MB     │  (-75%)\n",
"│ Layer 2:  0.5MB     │  →  │ Layer 2:  0.1MB     │  (-75%)\n",
"│ Layer 3:  0.3MB     │     │ Layer 3:  0.1MB     │  (-75%)\n",
"│ Total:    3.9MB     │     │ Total:    1.0MB     │  (-74%)\n",
"└─────────────────────┘     └─────────────────────┘\n",
"\n",
" Typical mobile phone memory: 4-8GB\n",
"    At 1.0MB each, ~4,000 quantized models fit in 4GB of memory!\n",
"```\n",
"\n",
"Now let's implement the functions that make this transformation possible!"
]
},
{
"cell_type": "markdown",
"id": "d881be8c",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### Model Quantization - Scaling to Full Networks\n",
"\n",
"This function transforms entire neural networks from FP32 to quantized versions. It's like upgrading a whole building to be more energy efficient!\n",
"\n",
"```\n",
"Model Transformation Process:\n",
"\n",
"Input Model:                        Quantized Model:\n",
"┌─────────────────────────────┐     ┌─────────────────────────────┐\n",
"│ layers[0]: Linear(784, 128) │     │ layers[0]: QuantizedLinear  │\n",
"│ layers[1]: ReLU()           │     │ layers[1]: ReLU()           │\n",
"│ layers[2]: Linear(128, 64)  │  →  │ layers[2]: QuantizedLinear  │\n",
"│ layers[3]: ReLU()           │     │ layers[3]: ReLU()           │\n",
"│ layers[4]: Linear(64, 10)   │     │ layers[4]: QuantizedLinear  │\n",
"└─────────────────────────────┘     └─────────────────────────────┘\n",
"  Memory: 100%                        Memory: ~25%\n",
"  Interface: Same                     Interface: Identical\n",
"```\n",
"\n",
"**Smart Layer Selection Logic:**\n",
"```\n",
"Quantization Decision Tree:\n",
"\n",
"For each layer in model:\n",
"    │\n",
"    ├── Is it a Linear layer?\n",
"    │   │\n",
"    │   └── YES → Replace with QuantizedLinear\n",
"    │\n",
"    └── Is it ReLU/Activation?\n",
"        │\n",
"        └── NO → Keep unchanged (no parameters to quantize)\n",
"```\n",
"\n",
"**Calibration Integration:**\n",
"```\n",
"Calibration Data Flow:\n",
"\n",
"  Input Data            Layer-by-Layer Processing\n",
"      │                           │\n",
"      ▼                           ▼\n",
"  ┌─────────────────┐   ┌───────────────────────────────────────────────────┐\n",
"  │ Sample Batch 1  │   │ Layer 0: Forward → Collect activation statistics  │\n",
"  │ Sample Batch 2  │ → │              ↓                                    │\n",
"  │ ...             │   │ Layer 2: Forward → Collect activation statistics  │\n",
"  │ Sample Batch N  │   │              ↓                                    │\n",
"  └─────────────────┘   │ Layer 4: Forward → Collect activation statistics  │\n",
"                        │              ↓                                    │\n",
"                        │ For each layer: calibrate optimal quantization    │\n",
"                        └───────────────────────────────────────────────────┘\n",
"```\n",
"\n",
"**Why In-Place Modification:**\n",
"- **Preserves model structure** - Same interface, same behavior\n",
"- **Memory efficient** - No copying of large tensors\n",
"- **Drop-in replacement** - Existing code works unchanged\n",
"- **Gradual quantization** - Can selectively quantize sensitive layers\n",
"\n",
"**Deployment Benefits:**\n",
"```\n",
"Before Quantization:          After Quantization:\n",
"┌─────────────────────────┐   ┌─────────────────────────┐\n",
"│ ❌ Can't fit on phone    │   │ ✅ Fits on mobile device │\n",
"│ ❌ Slow cloud deployment │   │ ✅ Fast edge inference   │\n",
"│ ❌ High memory usage     │ → │ ✅ 4× memory efficiency  │\n",
"│ ❌ Expensive to serve    │   │ ✅ Lower serving costs   │\n",
"│ ❌ Battery drain         │   │ ✅ Extended battery life │\n",
"└─────────────────────────┘   └─────────────────────────┘\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "813db571",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "quantize_model",
"solution": true
}
},
"outputs": [],
"source": [
"def quantize_model(model, calibration_data: Optional[List[Tensor]] = None) -> None:\n",
"    \"\"\"\n",
"    Quantize all Linear layers in a model in-place.\n",
"\n",
"    TODO: Replace all Linear layers with QuantizedLinear versions\n",
"\n",
"    APPROACH:\n",
"    1. Find all Linear layers in the model\n",
"    2. Replace each with QuantizedLinear version\n",
"    3. If calibration data provided, calibrate input quantization\n",
"    4. Handle Sequential containers properly\n",
"\n",
"    EXAMPLE:\n",
"    >>> model = Sequential(Linear(10, 5), ReLU(), Linear(5, 2))\n",
"    >>> quantize_model(model)\n",
"    >>> # Now model uses quantized layers\n",
"\n",
"    HINT:\n",
"    - Handle Sequential.layers list for layer replacement\n",
"    - Use isinstance(layer, Linear) to identify layers to quantize\n",
"    \"\"\"\n",
"    ### BEGIN SOLUTION\n",
"    if hasattr(model, 'layers'):  # Sequential model\n",
"        for i, layer in enumerate(model.layers):\n",
"            if isinstance(layer, Linear):\n",
"                # Replace with quantized version\n",
"                quantized_layer = QuantizedLinear(layer)\n",
"\n",
"                # Calibrate if data provided\n",
"                if calibration_data is not None:\n",
"                    # Run forward passes to get intermediate activations\n",
"                    sample_inputs = []\n",
"                    for data in calibration_data[:10]:  # Use first 10 samples for efficiency\n",
"                        # Forward through layers up to this point\n",
"                        x = data\n",
"                        for j in range(i):\n",
"                            if hasattr(model.layers[j], 'forward'):\n",
"                                x = model.layers[j].forward(x)\n",
"                        sample_inputs.append(x)\n",
"\n",
"                    quantized_layer.calibrate(sample_inputs)\n",
"\n",
"                model.layers[i] = quantized_layer\n",
"\n",
"    elif isinstance(model, Linear):  # Single Linear layer\n",
"        # Can't replace in-place for single layer, user should handle\n",
"        raise ValueError(\"Cannot quantize single Linear layer in-place. Use QuantizedLinear directly.\")\n",
"\n",
"    else:\n",
"        raise ValueError(f\"Unsupported model type: {type(model)}\")\n",
"    ### END SOLUTION\n",
"\n",
"def test_unit_quantize_model():\n",
"    \"\"\"🔬 Test model quantization implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: Model Quantization...\")\n",
"\n",
"    # Create test model\n",
"    model = Sequential(\n",
"        Linear(4, 8),\n",
"        ReLU(),\n",
"        Linear(8, 3)\n",
"    )\n",
"\n",
"    # Initialize weights\n",
"    model.layers[0].weight = Tensor(np.random.randn(4, 8) * 0.5)\n",
"    model.layers[0].bias = Tensor(np.random.randn(8) * 0.1)\n",
"    model.layers[2].weight = Tensor(np.random.randn(8, 3) * 0.5)\n",
"    model.layers[2].bias = Tensor(np.random.randn(3) * 0.1)\n",
"\n",
"    # Test original model\n",
"    x = Tensor(np.random.randn(2, 4))\n",
"    original_output = model.forward(x)\n",
"\n",
"    # Create calibration data\n",
"    calibration_data = [Tensor(np.random.randn(1, 4)) for _ in range(5)]\n",
"\n",
"    # Quantize model\n",
"    quantize_model(model, calibration_data)\n",
"\n",
"    # Verify layers were replaced\n",
"    assert isinstance(model.layers[0], QuantizedLinear)\n",
"    assert isinstance(model.layers[1], ReLU)  # Should remain unchanged\n",
"    assert isinstance(model.layers[2], QuantizedLinear)\n",
"\n",
"    # Test quantized model\n",
"    quantized_output = model.forward(x)\n",
"\n",
"    # Compare outputs\n",
"    error = np.mean(np.abs(original_output.data - quantized_output.data))\n",
"    print(f\"   Model quantization error: {error:.4f}\")\n",
"    assert error < 2.0, f\"Model quantization error too high: {error}\"\n",
"\n",
"    print(\"✅ Model quantization works correctly!\")\n",
"\n",
"test_unit_quantize_model()"
]
},
{
"cell_type": "markdown",
"id": "3769f169",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### Model Size Comparison - Measuring the Impact\n",
"\n",
"This function provides detailed analysis of memory savings achieved through quantization. It's like a before/after comparison for model efficiency.\n",
"\n",
"```\n",
"Memory Analysis Framework:\n",
"\n",
"Memory Breakdown Analysis:\n",
"┌─────────────────┬─────────────────┬─────────────────┬─────────────────┐\n",
"│ Component       │ Original (FP32) │ Quantized (INT8)│ Savings         │\n",
"├─────────────────┼─────────────────┼─────────────────┼─────────────────┤\n",
"│ Layer 1 weights │ 12.8 MB         │ 3.2 MB          │ 9.6 MB  (75%)   │\n",
"│ Layer 1 bias    │ 0.5 MB          │ 0.1 MB          │ 0.4 MB  (75%)   │\n",
"│ Layer 2 weights │ 2.0 MB          │ 0.5 MB          │ 1.5 MB  (75%)   │\n",
"│ Layer 2 bias    │ 0.3 MB          │ 0.1 MB          │ 0.2 MB  (67%)   │\n",
"│ Overhead        │ 0.0 MB          │ 0.02 MB         │ -0.02 MB        │\n",
"├─────────────────┼─────────────────┼─────────────────┼─────────────────┤\n",
"│ TOTAL           │ 15.6 MB         │ 3.92 MB         │ 11.7 MB (74%)   │\n",
"└─────────────────┴─────────────────┴─────────────────┴─────────────────┘\n",
" ↑\n",
"                                           4× compression ratio!\n",
"```\n",
"\n",
"**Comprehensive Metrics Provided:**\n",
"```\n",
"Output Dictionary:\n",
"{\n",
"    'original_params': 4000000,      # Total parameter count\n",
"    'quantized_params': 4000000,     # Same count, different precision\n",
"    'original_bytes': 16000000,      # 4 bytes per FP32 parameter\n",
"    'quantized_bytes': 4000016,      # 1 byte per INT8 + overhead\n",
"    'compression_ratio': 3.99,       # Nearly 4× compression\n",
"    'memory_saved_mb': 11.7,         # Absolute savings in MB\n",
"    'memory_saved_percent': 74.9     # Relative savings percentage\n",
"}\n",
"```\n",
"\n",
"**Why These Metrics Matter:**\n",
"\n",
"**For Developers:**\n",
"- **compression_ratio** - How much smaller is the model?\n",
"- **memory_saved_mb** - Actual bytes freed up\n",
"- **memory_saved_percent** - Efficiency improvement\n",
"\n",
"**For Deployment:**\n",
"- **Model fits in device memory?** Check memory_saved_mb\n",
"- **Network transfer time?** Reduced by compression_ratio\n",
"- **Disk storage savings?** Shown by memory_saved_percent\n",
"\n",
"**For Business:**\n",
"- **Cloud costs** reduced by compression_ratio\n",
"- **User experience** improved (faster downloads)\n",
"- **Device support** expanded (fits on more devices)\n",
"\n",
"**Validation Checks:**\n",
"- **Parameter count preservation** - same functionality\n",
"- **Reasonable compression ratio** - should be ~4× for INT8\n",
"- **Minimal overhead** - quantization parameters are tiny"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67b85991",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "compare_model_sizes",
"solution": true
}
},
"outputs": [],
"source": [
"def compare_model_sizes(original_model, quantized_model) -> Dict[str, float]:\n",
"    \"\"\"\n",
"    Compare memory usage between original and quantized models.\n",
"\n",
"    TODO: Calculate comprehensive memory comparison\n",
"\n",
"    APPROACH:\n",
"    1. Count parameters in both models\n",
"    2. Calculate bytes used (FP32 vs INT8)\n",
"    3. Include quantization overhead\n",
"    4. Return comparison metrics\n",
"    \"\"\"\n",
"    ### BEGIN SOLUTION\n",
"    # Count original model parameters\n",
"    original_params = 0\n",
"    original_bytes = 0\n",
"\n",
"    if hasattr(original_model, 'layers'):\n",
"        for layer in original_model.layers:\n",
"            if hasattr(layer, 'parameters'):\n",
"                params = layer.parameters()\n",
"                for param in params:\n",
"                    original_params += param.data.size\n",
"                    original_bytes += param.data.size * 4  # 4 bytes per FP32\n",
"\n",
"    # Count quantized model parameters\n",
"    quantized_params = 0\n",
"    quantized_bytes = 0\n",
"\n",
"    if hasattr(quantized_model, 'layers'):\n",
"        for layer in quantized_model.layers:\n",
"            if isinstance(layer, QuantizedLinear):\n",
"                memory_info = layer.memory_usage()\n",
"                quantized_bytes += memory_info['quantized_bytes']\n",
"                params = layer.parameters()\n",
"                for param in params:\n",
"                    quantized_params += param.data.size\n",
"            elif hasattr(layer, 'parameters'):\n",
"                # Non-quantized layers\n",
"                params = layer.parameters()\n",
"                for param in params:\n",
"                    quantized_params += param.data.size\n",
"                    quantized_bytes += param.data.size * 4\n",
"\n",
"    compression_ratio = original_bytes / quantized_bytes if quantized_bytes > 0 else 1.0\n",
"    memory_saved = original_bytes - quantized_bytes\n",
"\n",
"    return {\n",
"        'original_params': original_params,\n",
"        'quantized_params': quantized_params,\n",
"        'original_bytes': original_bytes,\n",
"        'quantized_bytes': quantized_bytes,\n",
"        'compression_ratio': compression_ratio,\n",
"        'memory_saved_mb': memory_saved / (1024 * 1024),\n",
"        'memory_saved_percent': (memory_saved / original_bytes) * 100 if original_bytes > 0 else 0\n",
"    }\n",
"    ### END SOLUTION\n",
"\n",
"def test_unit_compare_model_sizes():\n",
"    \"\"\"🔬 Test model size comparison.\"\"\"\n",
"    print(\"🔬 Unit Test: Model Size Comparison...\")\n",
"\n",
"    # Create and quantize a model for testing\n",
"    original_model = Sequential(Linear(100, 50), ReLU(), Linear(50, 10))\n",
"    original_model.layers[0].weight = Tensor(np.random.randn(100, 50))\n",
"    original_model.layers[0].bias = Tensor(np.random.randn(50))\n",
"    original_model.layers[2].weight = Tensor(np.random.randn(50, 10))\n",
"    original_model.layers[2].bias = Tensor(np.random.randn(10))\n",
"\n",
"    # Create quantized copy\n",
"    quantized_model = Sequential(Linear(100, 50), ReLU(), Linear(50, 10))\n",
"    quantized_model.layers[0].weight = Tensor(np.random.randn(100, 50))\n",
"    quantized_model.layers[0].bias = Tensor(np.random.randn(50))\n",
"    quantized_model.layers[2].weight = Tensor(np.random.randn(50, 10))\n",
"    quantized_model.layers[2].bias = Tensor(np.random.randn(10))\n",
"\n",
"    quantize_model(quantized_model)\n",
"\n",
"    # Compare sizes\n",
"    comparison = compare_model_sizes(original_model, quantized_model)\n",
"\n",
"    # Verify compression achieved\n",
"    assert comparison['compression_ratio'] > 2.0, \"Should achieve significant compression\"\n",
"    assert comparison['memory_saved_percent'] > 50, \"Should save >50% memory\"\n",
"\n",
"    print(f\"   Compression ratio: {comparison['compression_ratio']:.1f}×\")\n",
"    print(f\"   Memory saved: {comparison['memory_saved_percent']:.1f}%\")\n",
"    print(\"✅ Model size comparison works correctly!\")\n",
"\n",
"test_unit_compare_model_sizes()"
]
},
{
"cell_type": "markdown",
"id": "028fd2f1",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 5. Systems Analysis - Real-World Performance Impact\n",
"\n",
"### Understanding Production Trade-offs\n",
"\n",
"Quantization isn't just about smaller models - it's about enabling entirely new deployment scenarios. Let's measure the real impact across different model scales.\n",
"\n",
"```\n",
"Production Deployment Scenarios:\n",
"\n",
"┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐\n",
"│ Deployment       │ Memory Limit     │ Speed Needs      │ Quantization Fit │\n",
"├──────────────────┼──────────────────┼──────────────────┼──────────────────┤\n",
"│ Mobile Phone     │ 100-500MB        │ <100ms latency   │ ✅ Essential     │\n",
"│ Edge Device      │ 50-200MB         │ Real-time        │ ✅ Critical      │\n",
"│ Cloud GPU        │ 16-80GB          │ High throughput  │ 🤔 Optional      │\n",
"│ Embedded MCU     │ 1-10MB           │ Ultra-low power  │ ✅ Mandatory     │\n",
"└──────────────────┴──────────────────┴──────────────────┴──────────────────┘\n",
"```\n",
"\n",
"### The Performance Testing Framework\n",
"\n",
"We'll measure quantization impact across three critical dimensions:\n",
"\n",
"```\n",
"Performance Analysis Framework:\n",
"\n",
"1. Memory Efficiency      2. Inference Speed        3. Accuracy Preservation\n",
"┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐\n",
"│ • Model size (MB)   │   │ • Forward pass time │   │ • MSE vs original   │\n",
"│ • Compression ratio │   │ • Throughput (fps)  │   │ • Relative error    │\n",
"│ • Memory bandwidth  │   │ • Latency (ms)      │   │ • Distribution      │\n",
"└─────────────────────┘   └─────────────────────┘   └─────────────────────┘\n",
"```\n",
"\n",
"### Expected Results Preview\n",
"\n",
"```\n",
"Typical Quantization Results:\n",
"\n",
"Model Size:     Small (1-10MB)      Medium (10-100MB)   Large (100MB+)\n",
"                ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐\n",
"Compression:    │ 3.8× reduction  │ │ 3.9× reduction  │ │ 4.0× reduction  │\n",
"Speed:          │ 1.2× faster     │ │ 2.1× faster     │ │ 3.2× faster     │\n",
"Accuracy:       │ 0.1% loss       │ │ 0.3% loss       │ │ 0.5% loss       │\n",
"                └─────────────────┘ └─────────────────┘ └─────────────────┘\n",
"\n",
"Key Insight: Larger models benefit more from quantization!\n",
"```\n",
"\n",
"Let's run comprehensive tests to validate these expectations and understand the underlying patterns."
]
},
{
"cell_type": "markdown",
"id": "a1f6212a",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### Performance Analysis - Real-World Benchmarking\n",
"\n",
"This comprehensive analysis measures quantization impact across the three critical dimensions: memory, speed, and accuracy.\n",
"\n",
"```\n",
"Performance Testing Strategy:\n",
"\n",
"Test Model Configurations:\n",
"┌────────────────────────────┬────────────────────────────┬────────────────────────────┐\n",
"│ Model Type                 │ Architecture               │ Use Case                   │\n",
"├────────────────────────────┼────────────────────────────┼────────────────────────────┤\n",
"│ Small MLP                  │ 64 → 32 → 10               │ Edge Device                │\n",
"│ Medium MLP                 │ 512 → 256 → 128 → 10       │ Mobile App                 │\n",
"│ Large MLP                  │ 2048 → 1024 → 512 → 10     │ Server Deployment          │\n",
"└────────────────────────────┴────────────────────────────┴────────────────────────────┘\n",
"```\n",
"```\n",
"\n",
"**Performance Measurement Pipeline:**\n",
"```\n",
"For Each Model Configuration:\n",
"\n",
"  Create Original Model      Create Quantized Model     Comparative Analysis\n",
"           │                          │                          │\n",
"           ▼                          ▼                          ▼\n",
"  ┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐\n",
"  │ Initialize weights  │  │ Copy weights        │  │ Memory analysis     │\n",
"  │ Random test data    │  │ Apply quantization  │  │ Speed benchmarks    │\n",
"  │ Forward pass        │  │ Calibrate layers    │  │ Accuracy testing    │\n",
"  │ Timing measurements │  │ Forward pass        │  │ Trade-off analysis  │\n",
"  └─────────────────────┘  └─────────────────────┘  └─────────────────────┘\n",
"```\n",
"\n",
"**Expected Performance Patterns:**\n",
"```\n",
"Model Scaling Effects:\n",
"                 Memory Usage       Inference Speed     Accuracy Loss\n",
"                 (compression)      (speedup)           (loss vs FP32)\n",
"\n",
"Small model:     ~3.8×              ~1.2×               ~0.1%\n",
"Medium model:    ~3.9×              ~2.1×               ~0.3%\n",
"Large model:     ~4.0×              ~3.2×               ~0.5%\n",
"\n",
"Key Insight: Larger models benefit more from quantization!\n",
"```\n",
"\n",
"**Real-World Impact Translation:**\n",
"- **Memory savings** → More models fit on device, lower cloud costs\n",
"- **Speed improvements** → Better user experience, real-time applications\n",
"- **Accuracy preservation** → Maintains model quality, no retraining needed"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88001546",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "analyze_quantization_performance",
"solution": true
}
},
"outputs": [],
"source": [
"def analyze_quantization_performance():\n",
"    \"\"\"📊 Comprehensive analysis of quantization benefits and trade-offs.\"\"\"\n",
"    print(\"📊 Analyzing Quantization Performance Across Model Sizes...\")\n",
"\n",
"    # Test different model configurations\n",
"    configs = [\n",
"        {'name': 'Small MLP', 'layers': [64, 32, 10], 'batch_size': 32},\n",
"        {'name': 'Medium MLP', 'layers': [512, 256, 128, 10], 'batch_size': 64},\n",
"        {'name': 'Large MLP', 'layers': [2048, 1024, 512, 10], 'batch_size': 128},\n",
"    ]\n",
"\n",
"    results = []\n",
"\n",
"    for config in configs:\n",
"        print(f\"\\n🔍 Testing {config['name']}...\")\n",
"\n",
"        # Create original model\n",
"        layers = []\n",
"        for i in range(len(config['layers']) - 1):\n",
"            layers.append(Linear(config['layers'][i], config['layers'][i+1]))\n",
"            if i < len(config['layers']) - 2:  # Add ReLU except for last layer\n",
"                layers.append(ReLU())\n",
"\n",
"        original_model = Sequential(*layers)\n",
"\n",
"        # Initialize weights\n",
"        for layer in original_model.layers:\n",
"            if isinstance(layer, Linear):\n",
"                layer.weight = Tensor(np.random.randn(*layer.weight.shape) * 0.1)\n",
"                layer.bias = Tensor(np.random.randn(*layer.bias.shape) * 0.01)\n",
"\n",
|
||
" # Create quantized copy\n",
|
||
" quantized_model = Sequential(*layers)\n",
|
||
" for i, layer in enumerate(original_model.layers):\n",
|
||
" if isinstance(layer, Linear):\n",
|
||
" quantized_model.layers[i].weight = Tensor(layer.weight.data.copy())\n",
|
||
" quantized_model.layers[i].bias = Tensor(layer.bias.data.copy())\n",
|
||
"\n",
|
||
" # Generate calibration data\n",
|
||
" input_size = config['layers'][0]\n",
|
||
" calibration_data = [Tensor(np.random.randn(1, input_size)) for _ in range(10)]\n",
|
||
"\n",
|
||
" # Quantize model\n",
|
||
" quantize_model(quantized_model, calibration_data)\n",
|
||
"\n",
|
||
" # Measure performance\n",
|
||
" test_input = Tensor(np.random.randn(config['batch_size'], input_size))\n",
|
||
"\n",
|
||
" # Time original model\n",
|
||
" start_time = time.time()\n",
|
||
" for _ in range(10):\n",
|
||
" original_output = original_model.forward(test_input)\n",
|
||
" original_time = (time.time() - start_time) / 10\n",
|
||
"\n",
|
||
" # Time quantized model\n",
|
||
" start_time = time.time()\n",
|
||
" for _ in range(10):\n",
|
||
" quantized_output = quantized_model.forward(test_input)\n",
|
||
" quantized_time = (time.time() - start_time) / 10\n",
|
||
"\n",
|
||
" # Calculate accuracy preservation (using MSE as proxy)\n",
|
||
" mse = np.mean((original_output.data - quantized_output.data) ** 2)\n",
|
||
" relative_error = np.sqrt(mse) / (np.std(original_output.data) + 1e-8)\n",
|
||
"\n",
|
||
" # Memory comparison\n",
|
||
" memory_comparison = compare_model_sizes(original_model, quantized_model)\n",
|
||
"\n",
|
||
" result = {\n",
|
||
" 'name': config['name'],\n",
|
||
" 'original_time': original_time * 1000, # Convert to ms\n",
|
||
" 'quantized_time': quantized_time * 1000,\n",
|
||
" 'speedup': original_time / quantized_time if quantized_time > 0 else 1.0,\n",
|
||
" 'compression_ratio': memory_comparison['compression_ratio'],\n",
|
||
" 'relative_error': relative_error,\n",
|
||
" 'memory_saved_mb': memory_comparison['memory_saved_mb']\n",
|
||
" }\n",
|
||
"\n",
|
||
" results.append(result)\n",
|
||
"\n",
|
||
" print(f\" Speedup: {result['speedup']:.1f}×\")\n",
|
||
" print(f\" Compression: {result['compression_ratio']:.1f}×\")\n",
|
||
" print(f\" Error: {result['relative_error']:.1%}\")\n",
|
||
" print(f\" Memory saved: {result['memory_saved_mb']:.1f}MB\")\n",
|
||
"\n",
|
||
" # Summary analysis\n",
|
||
" print(f\"\\n📈 QUANTIZATION PERFORMANCE SUMMARY\")\n",
|
||
" print(\"=\" * 50)\n",
|
||
"\n",
|
||
" avg_speedup = np.mean([r['speedup'] for r in results])\n",
|
||
" avg_compression = np.mean([r['compression_ratio'] for r in results])\n",
|
||
" avg_error = np.mean([r['relative_error'] for r in results])\n",
|
||
" total_memory_saved = sum([r['memory_saved_mb'] for r in results])\n",
|
||
"\n",
|
||
" print(f\"Average speedup: {avg_speedup:.1f}×\")\n",
|
||
" print(f\"Average compression: {avg_compression:.1f}×\")\n",
|
||
" print(f\"Average relative error: {avg_error:.1%}\")\n",
|
||
" print(f\"Total memory saved: {total_memory_saved:.1f}MB\")\n",
|
||
"\n",
|
||
" print(f\"\\n💡 Key Insights:\")\n",
|
||
" print(f\"- Quantization achieves ~{avg_compression:.0f}× memory reduction\")\n",
|
||
" print(f\"- Typical speedup: {avg_speedup:.1f}× (varies by hardware)\")\n",
|
||
" print(f\"- Accuracy loss: <{avg_error:.1%} for well-calibrated models\")\n",
|
||
" print(f\"- Best for: Memory-constrained deployment\")\n",
|
||
"\n",
|
||
" return results\n",
|
||
"\n",
|
||
"# Run comprehensive performance analysis\n",
|
||
"performance_results = analyze_quantization_performance()"
|
||
]
},
{
"cell_type": "markdown",
"id": "a81e0afc",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Quantization Error Visualization - Seeing the Impact\n",
"\n",
"### Understanding Distribution Effects\n",
"\n",
"Different weight distributions quantize with varying quality. Let's visualize this to understand when quantization works well and when it struggles.\n",
"\n",
"```\n",
"Visualization Strategy:\n",
"\n",
"┌─────────────────────────────────────────────────────────────────────────────┐\n",
"│                        Weight Distribution Analysis                         │\n",
"├─────────────────────┬─────────────────────┬─────────────────────────────────┤\n",
"│ Distribution Type   │ Expected Quality    │ Key Challenge                   │\n",
"├─────────────────────┼─────────────────────┼─────────────────────────────────┤\n",
"│ Normal (Gaussian)   │ Good                │ Tail values may be clipped      │\n",
"│ Uniform             │ Excellent           │ Perfect scale utilization       │\n",
"│ Sparse (many zeros) │ Poor                │ Wasted quantization levels      │\n",
"│ Heavy-tailed        │ Very Poor           │ Outliers dominate scale         │\n",
"└─────────────────────┴─────────────────────┴─────────────────────────────────┘\n",
"```\n",
"\n",
"### Quantization Quality Patterns\n",
"\n",
"```\n",
"Ideal Quantization:                  Problematic Quantization:\n",
"\n",
"Original:  [████████████████████]    Original:  [██    ████      ██]\n",
"           ↓                                    ↓\n",
"Quantized: [████████████████████]    Quantized: [██....████....██]\n",
"           Perfect reconstruction               Lost precision\n",
"\n",
"Scale efficiently used               Scale poorly used\n",
"Low quantization error               High quantization error\n",
"```\n",
"\n",
"**What We'll Visualize:**\n",
"- **Before/After histograms** - See how distributions change\n",
"- **Error metrics** - Quantify the precision loss\n",
"- **Scale utilization** - Understand efficiency\n",
"- **Real examples** - Connect to practical scenarios\n",
"\n",
"This visualization will help you understand which types of neural network weights quantize well and which need special handling."
]
},
{
"cell_type": "markdown",
"id": "8f54d705",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### Quantization Effects Visualization - Understanding Distribution Impact\n",
"\n",
"This visualization reveals how different weight distributions respond to quantization, helping you understand when quantization works well and when it struggles.\n",
"\n",
"```\n",
"Visualization Strategy:\n",
"\n",
"┌───────────────────────────────────────────────────────────────────────────────────────┐\n",
"│                              Distribution Analysis Grid                               │\n",
"├─────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┤\n",
"│   Normal (Good)     │   Uniform (Best)    │    Sparse (Bad)     │ Heavy-Tailed (Worst)│\n",
"├─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┤\n",
"│        /\\           │  ┌──────────┐       │   |   |   |         │        /\\           │\n",
"│       /  \\          │  │          │       │   |   |   |         │       /  \\    /\\    │\n",
"│      /    \\         │  │   Flat   │       │  ||||  |  ||||      │      /    \\\\/    \\   │\n",
"│     /      \\        │  │          │       │  zeros   sparse     │     /            \\  │\n",
"│    /        \\       │  └──────────┘       │          values     │    /    huge      \\ │\n",
"│   /          \\      │                     │                     │   /   outliers     \\│\n",
"├─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┤\n",
"│ MSE: 0.001          │ MSE: 0.0001         │ MSE: 0.01           │ MSE: 0.1            │\n",
"│ Scale Usage: 80%    │ Scale Usage: 100%   │ Scale Usage: 10%    │ Scale Usage: 5%     │\n",
"└─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┘\n",
"```\n",
"\n",
"**Visual Comparison Strategy:**\n",
"```\n",
"For Each Distribution Type:\n",
"    │\n",
"    ├── Generate sample weights (1000 values)\n",
"    │\n",
"    ├── Quantize to INT8\n",
"    │\n",
"    ├── Dequantize back to FP32\n",
"    │\n",
"    ├── Plot overlaid histograms:\n",
"    │   ├── Original distribution (blue)\n",
"    │   └── Quantized distribution (red)\n",
"    │\n",
"    └── Calculate and display error metrics:\n",
"        ├── Mean Squared Error (MSE)\n",
"        ├── Scale utilization efficiency\n",
"        └── Quantization scale value\n",
"```\n",
"\n",
"**Key Insights You'll Discover:**\n",
"\n",
"**1. Normal Distribution (Most Common):**\n",
"   - Smooth bell curve preserved reasonably well\n",
"   - Tail values may be clipped slightly\n",
"   - Good compromise for most neural networks\n",
"\n",
"**2. Uniform Distribution (Ideal Case):**\n",
"   - Perfect scale utilization\n",
"   - Minimal quantization error\n",
"   - Best-case scenario for quantization\n",
"\n",
"**3. Sparse Distribution (Problematic):**\n",
"   - Many zeros waste quantization levels\n",
"   - Poor precision for non-zero values\n",
"   - Common in pruned networks\n",
"\n",
"**4. Heavy-Tailed Distribution (Worst Case):**\n",
"   - Outliers dominate scale calculation\n",
"   - Most values squeezed into narrow range\n",
"   - Requires special handling (clipping, per-channel)\n",
"\n",
"**Practical Implications:**\n",
"- **Model design:** Prefer batch normalization to reduce outliers\n",
"- **Training:** Techniques to encourage uniform weight distributions\n",
"- **Deployment:** Advanced quantization for sparse/heavy-tailed weights"
|
||
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d286a68",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "visualize_quantization_effects",
"solution": true
}
},
"outputs": [],
"source": [
"def visualize_quantization_effects():\n",
"    \"\"\"📊 Visualize the effects of quantization on weight distributions.\"\"\"\n",
"    print(\"📊 Visualizing Quantization Effects on Weight Distributions...\")\n",
"\n",
"    # Create sample weight tensors with different characteristics\n",
"    weight_types = {\n",
"        'Normal': np.random.normal(0, 0.1, (1000,)),\n",
"        'Uniform': np.random.uniform(-0.2, 0.2, (1000,)),\n",
"        'Sparse': np.random.choice([0, 0, 0, 1], (1000,)) * np.random.normal(0, 0.15, (1000,)),\n",
"        'Heavy-tailed': np.concatenate([\n",
"            np.random.normal(0, 0.05, (800,)),\n",
"            np.random.uniform(-0.5, 0.5, (200,))\n",
"        ])\n",
"    }\n",
"\n",
"    fig, axes = plt.subplots(2, 2, figsize=(12, 8))\n",
"    axes = axes.flatten()\n",
"\n",
"    for idx, (name, weights) in enumerate(weight_types.items()):\n",
"        # Original weights\n",
"        original_tensor = Tensor(weights)\n",
"\n",
"        # Quantize and dequantize\n",
"        q_tensor, scale, zero_point = quantize_int8(original_tensor)\n",
"        restored_tensor = dequantize_int8(q_tensor, scale, zero_point)\n",
"\n",
"        # Plot histograms\n",
"        ax = axes[idx]\n",
"        ax.hist(weights, bins=50, alpha=0.6, label='Original', density=True)\n",
"        ax.hist(restored_tensor.data, bins=50, alpha=0.6, label='Quantized', density=True)\n",
"        ax.set_title(f'{name} Weights\\nScale: {scale:.4f}')\n",
"        ax.set_xlabel('Weight Value')\n",
"        ax.set_ylabel('Density')\n",
"        ax.legend()\n",
"        ax.grid(True, alpha=0.3)\n",
"\n",
"        # Calculate and display error metrics\n",
"        mse = np.mean((weights - restored_tensor.data) ** 2)\n",
"        ax.text(0.02, 0.98, f'MSE: {mse:.6f}', transform=ax.transAxes,\n",
"                verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))\n",
"\n",
"    plt.tight_layout()\n",
"    plt.savefig('quantization_effects.png', dpi=100, bbox_inches='tight')\n",
"    plt.show()\n",
"\n",
"    print(\"💡 Observations:\")\n",
"    print(\"- Normal: Smooth quantization, good preservation\")\n",
"    print(\"- Uniform: Excellent quantization, full range utilized\")\n",
"    print(\"- Sparse: Many wasted quantization levels on zeros\")\n",
"    print(\"- Heavy-tailed: Outliers dominate scale, poor precision for small weights\")\n",
"\n",
"# Visualize quantization effects\n",
"visualize_quantization_effects()"
|
||
]
},
{
"cell_type": "markdown",
"id": "784b58ca",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 6. Optimization Insights - Production Quantization Strategies\n",
"\n",
"### Beyond Basic Quantization\n",
"\n",
"Our INT8 per-tensor quantization is just the beginning. Production systems use sophisticated strategies to squeeze out every bit of performance while preserving accuracy.\n",
"\n",
"```\n",
"Quantization Strategy Evolution:\n",
"\n",
"  Basic (What we built)      Advanced (Production)     Cutting-Edge (Research)\n",
"┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐\n",
"│ • Per-tensor scale  │    │ • Per-channel scale │    │ • Dynamic ranges    │\n",
"│ • Uniform INT8      │ →  │ • Mixed precision   │ →  │ • Adaptive bitwidth │\n",
"│ • Post-training     │    │ • Quantization-aware│    │ • Learned quantizers│\n",
"│ • Simple calibration│    │ • Advanced calib.   │    │ • Neural compression│\n",
"└─────────────────────┘    └─────────────────────┘    └─────────────────────┘\n",
"   Good baseline              Production systems        Future research\n",
"```\n",
"\n",
"### Strategy Comparison Framework\n",
"\n",
"```\n",
"Quantization Strategy Trade-offs:\n",
"\n",
"┌─────────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\n",
"│ Strategy            │ Accuracy    │ Complexity  │ Memory Use  │ Speed Gain  │\n",
"├─────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\n",
"│ Per-Tensor (Ours)   │ ████████░░  │ ██░░░░░░░░  │ ████████░░  │ ███████░░░  │\n",
"│ Per-Channel         │ █████████░  │ █████░░░░░  │ ████████░░  │ ██████░░░░  │\n",
"│ Mixed Precision     │ ██████████  │ ████████░░  │ ███████░░░  │ ████████░░  │\n",
"│ Quantization-Aware  │ ██████████  │ ██████████  │ ████████░░  │ ███████░░░  │\n",
"└─────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\n",
"```\n",
"\n",
"### The Three Advanced Strategies We'll Analyze\n",
"\n",
"**1. Per-Channel Quantization:**\n",
"```\n",
"Per-Tensor:                    Per-Channel:\n",
"┌─────────────────────────┐    ┌─────────────────────────┐\n",
"│ [W₁₁ W₁₂ W₁₃]           │    │ [W₁₁ W₁₂ W₁₃]  scale₁   │\n",
"│ [W₂₁ W₂₂ W₂₃]   scale   │ VS │ [W₂₁ W₂₂ W₂₃]  scale₂   │\n",
"│ [W₃₁ W₃₂ W₃₃]           │    │ [W₃₁ W₃₂ W₃₃]  scale₃   │\n",
"└─────────────────────────┘    └─────────────────────────┘\n",
"  One scale for all              Separate scale per channel\n",
"  May waste precision            Better precision per channel\n",
"```\n",
"\n",
"**2. Mixed Precision:**\n",
"```\n",
"Sensitive Layers (FP32):       Regular Layers (INT8):\n",
"┌─────────────────────────┐    ┌─────────────────────────┐\n",
"│ Input Layer             │    │ Hidden Layer 1          │\n",
"│ (preserve input quality)│    │ (can tolerate error)    │\n",
"├─────────────────────────┤    ├─────────────────────────┤\n",
"│ Output Layer            │    │ Hidden Layer 2          │\n",
"│ (preserve output)       │    │ (bulk of computation)   │\n",
"└─────────────────────────┘    └─────────────────────────┘\n",
"  Keep high precision            Maximize compression\n",
"```\n",
"\n",
"**3. Calibration Strategies:**\n",
"```\n",
"Basic Calibration:             Advanced Calibration:\n",
"┌─────────────────────────┐    ┌─────────────────────────┐\n",
"│ • Use min/max range     │    │ • Percentile clipping   │\n",
"│ • Simple statistics     │    │ • KL-divergence         │\n",
"│ • Few samples           │ VS │ • Multiple datasets     │\n",
"│ • Generic approach      │    │ • Layer-specific tuning │\n",
"└─────────────────────────┘    └─────────────────────────┘\n",
"  Fast but suboptimal            Optimal but expensive\n",
"```\n",
"\n",
"Let's implement and compare these strategies to understand their practical trade-offs!"
|
||
]
},
{
"cell_type": "markdown",
"id": "1d4fc886",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### Advanced Quantization Strategies - Production Techniques\n",
"\n",
"This analysis compares different quantization approaches used in production systems, revealing the trade-offs between accuracy, complexity, and performance.\n",
"\n",
"```\n",
"Strategy Comparison Framework:\n",
"\n",
"┌──────────────────────────────────────────────────────────────────────────────────────┐\n",
"│                             Three Advanced Strategies                                │\n",
"├────────────────────────────┬────────────────────────────┬────────────────────────────┤\n",
"│         Strategy 1         │         Strategy 2         │         Strategy 3         │\n",
"│     Per-Tensor (Ours)      │     Per-Channel Scale      │      Mixed Precision       │\n",
"├────────────────────────────┼────────────────────────────┼────────────────────────────┤\n",
"│                            │                            │                            │\n",
"│  ┌──────────────────────┐ │  ┌──────────────────────┐ │  ┌──────────────────────┐ │\n",
"│  │ Weights:             │ │  │ Channel 1: scale₁    │ │  │ Sensitive: FP32      │ │\n",
"│  │ [W₁₁ W₁₂ W₁₃]        │ │  │ Channel 2: scale₂    │ │  │ Regular:   INT8      │ │\n",
"│  │ [W₂₁ W₂₂ W₂₃] scale  │ │  │ Channel 3: scale₃    │ │  │                      │ │\n",
"│  │ [W₃₁ W₃₂ W₃₃]        │ │  │                      │ │  │ Input:  FP32         │ │\n",
"│  └──────────────────────┘ │  │ Better precision     │ │  │ Output: FP32         │ │\n",
"│                            │  │ per channel          │ │  │ Hidden: INT8         │ │\n",
"│  Simple, fast              │  └──────────────────────┘ │  └──────────────────────┘ │\n",
"│  Good baseline             │                            │                            │\n",
"│                            │  More complex              │  Optimal accuracy          │\n",
"│                            │  Better accuracy           │  Selective compression     │\n",
"└────────────────────────────┴────────────────────────────┴────────────────────────────┘\n",
"```\n",
"\n",
"**Strategy 1: Per-Tensor Quantization (Our Implementation)**\n",
"```\n",
"Weight Matrix:                 Scale Calculation:\n",
"┌─────────────────────────┐    ┌─────────────────────────┐\n",
"│ 0.1  -0.3   0.8   0.2   │    │ Global min: -0.5        │\n",
"│-0.2   0.5  -0.1   0.7   │ →  │ Global max: +0.8        │\n",
"│ 0.4  -0.5   0.3  -0.4   │    │ Scale: 1.3/255 = 0.0051 │\n",
"└─────────────────────────┘    └─────────────────────────┘\n",
"\n",
"Pros: Simple, fast             Cons: May waste precision\n",
"```\n",
"\n",
"**Strategy 2: Per-Channel Quantization (Advanced)**\n",
"```\n",
"Weight Matrix:                 Scale Calculation:\n",
"┌─────────────────────────┐    ┌─────────────────────────┐\n",
"│ 0.1  -0.3   0.8   0.2   │    │ Col 1: [-0.2,0.4] → s₁  │\n",
"│-0.2   0.5  -0.1   0.7   │ →  │ Col 2: [-0.5,0.5] → s₂  │\n",
"│ 0.4  -0.5   0.3  -0.4   │    │ Col 3: [-0.1,0.8] → s₃  │\n",
"└─────────────────────────┘    │ Col 4: [-0.4,0.7] → s₄  │\n",
"                               └─────────────────────────┘\n",
"\n",
"Pros: Better precision         Cons: More complex\n",
"```\n",
"\n",
"**Strategy 3: Mixed Precision (Production)**\n",
"```\n",
"Model Architecture:            Precision Assignment:\n",
"┌─────────────────────────┐    ┌─────────────────────────┐\n",
"│ Input Layer (sensitive) │    │ Keep in FP32 (precision)│\n",
"│ Hidden 1 (bulk)         │ →  │ Quantize to INT8        │\n",
"│ Hidden 2 (bulk)         │    │ Quantize to INT8        │\n",
"│ Output Layer (sensitive)│    │ Keep in FP32 (quality)  │\n",
"└─────────────────────────┘    └─────────────────────────┘\n",
"\n",
"Pros: Optimal trade-off        Cons: Requires expertise\n",
"```\n",
"\n",
"**Experimental Design:**\n",
"```\n",
"Comparative Testing Protocol:\n",
"\n",
"1. Create identical test model → 2. Apply each strategy → 3. Measure results\n",
"   ┌─────────────────────────┐    ┌─────────────────────────┐    ┌─────────────────────────┐\n",
"   │ 128 → 64 → 10 MLP       │    │ Per-tensor quantization │    │ MSE error calculation   │\n",
"   │ Identical weights       │    │ Per-channel simulation  │    │ Compression measurement │\n",
"   │ Same test input         │    │ Mixed precision setup   │    │ Speed comparison        │\n",
"   └─────────────────────────┘    └─────────────────────────┘    └─────────────────────────┘\n",
"```\n",
"\n",
"**Expected Strategy Rankings:**\n",
"1. **Mixed Precision** - Best accuracy, moderate complexity\n",
"2. **Per-Channel** - Good accuracy, higher complexity\n",
"3. **Per-Tensor** - Baseline accuracy, simplest implementation\n",
"\n",
"This analysis reveals which strategies work best for different deployment scenarios and accuracy requirements."
|
||
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d474888",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "analyze_quantization_strategies",
"solution": true
}
},
"outputs": [],
"source": [
"def analyze_quantization_strategies():\n",
"    \"\"\"📊 Compare different quantization strategies and their trade-offs.\"\"\"\n",
"    print(\"📊 Analyzing Advanced Quantization Strategies...\")\n",
"\n",
"    # Create test model and data\n",
"    model = Sequential(Linear(128, 64), ReLU(), Linear(64, 10))\n",
"    model.layers[0].weight = Tensor(np.random.randn(128, 64) * 0.1)\n",
"    model.layers[0].bias = Tensor(np.random.randn(64) * 0.01)\n",
"    model.layers[2].weight = Tensor(np.random.randn(64, 10) * 0.1)\n",
"    model.layers[2].bias = Tensor(np.random.randn(10) * 0.01)\n",
"\n",
"    test_input = Tensor(np.random.randn(32, 128))\n",
"    original_output = model.forward(test_input)\n",
"\n",
"    strategies = {}\n",
"\n",
"    # Strategy 1: Per-tensor quantization (what we implemented)\n",
"    print(\"\\n🔍 Strategy 1: Per-Tensor Quantization\")\n",
"    model_copy = Sequential(Linear(128, 64), ReLU(), Linear(64, 10))\n",
"    for i, layer in enumerate(model.layers):\n",
"        if isinstance(layer, Linear):\n",
"            model_copy.layers[i].weight = Tensor(layer.weight.data.copy())\n",
"            model_copy.layers[i].bias = Tensor(layer.bias.data.copy())\n",
"\n",
"    quantize_model(model_copy)\n",
"    output1 = model_copy.forward(test_input)\n",
"    error1 = np.mean((original_output.data - output1.data) ** 2)\n",
"    strategies['per_tensor'] = {'mse': error1, 'description': 'Single scale per tensor'}\n",
"    print(f\"  MSE: {error1:.6f}\")\n",
"\n",
"    # Strategy 2: Per-channel quantization simulation\n",
"    print(\"\\n🔍 Strategy 2: Per-Channel Quantization (simulated)\")\n",
"    # Simulate by quantizing each output channel separately\n",
"    def per_channel_quantize(tensor):\n",
"        \"\"\"Simulate per-channel quantization for 2D weight matrices.\"\"\"\n",
"        if len(tensor.shape) < 2:\n",
"            return quantize_int8(tensor)\n",
"\n",
"        quantized_data = np.zeros_like(tensor.data, dtype=np.int8)\n",
"        scales = []\n",
"        zero_points = []\n",
"\n",
"        for i in range(tensor.shape[1]):  # Per output channel\n",
"            channel_tensor = Tensor(tensor.data[:, i:i+1])\n",
"            q_channel, scale, zp = quantize_int8(channel_tensor)\n",
"            quantized_data[:, i] = q_channel.data.flatten()\n",
"            scales.append(scale)\n",
"            zero_points.append(zp)\n",
"\n",
"        return Tensor(quantized_data), scales, zero_points\n",
"\n",
"    # Apply per-channel quantization to weights\n",
"    total_error = 0\n",
"    for layer in model.layers:\n",
"        if isinstance(layer, Linear):\n",
"            q_weight, scales, zps = per_channel_quantize(layer.weight)\n",
"            # Simulate dequantization and error: x̂ = scale * (q - zero_point)\n",
"            for i in range(layer.weight.shape[1]):\n",
"                original_channel = layer.weight.data[:, i]\n",
"                restored_channel = scales[i] * (q_weight.data[:, i].astype(np.float32) - zps[i])\n",
"                total_error += np.mean((original_channel - restored_channel) ** 2)\n",
"\n",
"    strategies['per_channel'] = {'mse': total_error, 'description': 'Scale per output channel'}\n",
"    print(f\"  MSE: {total_error:.6f}\")\n",
"\n",
"    # Strategy 3: Mixed precision simulation\n",
"    print(\"\\n🔍 Strategy 3: Mixed Precision\")\n",
"    # Keep sensitive layers in FP32, quantize others\n",
"    sensitive_layers = [0]  # First layer often most sensitive\n",
"    mixed_error = 0\n",
"\n",
"    for i, layer in enumerate(model.layers):\n",
"        if isinstance(layer, Linear):\n",
"            if i in sensitive_layers:\n",
"                # Keep in FP32 (no quantization error)\n",
"                pass\n",
"            else:\n",
"                # Quantize layer\n",
"                q_weight, scale, zp = quantize_int8(layer.weight)\n",
"                restored = dequantize_int8(q_weight, scale, zp)\n",
"                mixed_error += np.mean((layer.weight.data - restored.data) ** 2)\n",
"\n",
"    strategies['mixed_precision'] = {'mse': mixed_error, 'description': 'FP32 sensitive + INT8 others'}\n",
"    print(f\"  MSE: {mixed_error:.6f}\")\n",
"\n",
"    # Compare strategies\n",
"    print(f\"\\n📊 QUANTIZATION STRATEGY COMPARISON\")\n",
"    print(\"=\" * 60)\n",
"    for name, info in strategies.items():\n",
"        print(f\"{name:15}: MSE={info['mse']:.6f} | {info['description']}\")\n",
"\n",
"    # Find best strategy\n",
"    best_strategy = min(strategies.items(), key=lambda x: x[1]['mse'])\n",
"    print(f\"\\n🏆 Best Strategy: {best_strategy[0]} (MSE: {best_strategy[1]['mse']:.6f})\")\n",
"\n",
"    print(f\"\\n💡 Production Insights:\")\n",
"    print(\"- Per-channel: Better accuracy, more complex implementation\")\n",
"    print(\"- Mixed precision: Optimal accuracy/efficiency trade-off\")\n",
"    print(\"- Per-tensor: Simplest, good for most applications\")\n",
"    print(\"- Hardware support varies: INT8 GEMM, per-channel scales\")\n",
"\n",
"    return strategies\n",
"\n",
"# Analyze quantization strategies\n",
"strategy_analysis = analyze_quantization_strategies()"
|
||
]
},
{
"cell_type": "markdown",
"id": "720002d7",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 7. Module Integration Test\n",
"\n",
"Final validation that our quantization system works correctly across all components."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d28702bc",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_module",
"points": 20
}
},
"outputs": [],
"source": [
"def test_module():\n",
"    \"\"\"\n",
"    Comprehensive test of entire quantization module functionality.\n",
"\n",
"    This final test runs before module summary to ensure:\n",
"    - All quantization functions work correctly\n",
"    - Model quantization preserves functionality\n",
"    - Memory savings are achieved\n",
"    - Module is ready for integration with TinyTorch\n",
"    \"\"\"\n",
"    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
"    print(\"=\" * 50)\n",
"\n",
"    # Run all unit tests\n",
"    print(\"Running unit tests...\")\n",
"    test_unit_quantize_int8()\n",
"    test_unit_dequantize_int8()\n",
"    test_unit_quantized_linear()\n",
"    test_unit_quantize_model()\n",
"    test_unit_compare_model_sizes()\n",
"\n",
"    print(\"\\nRunning integration scenarios...\")\n",
"\n",
"    # Test realistic usage scenario\n",
"    print(\"🔬 Integration Test: End-to-end quantization workflow...\")\n",
"\n",
"    # Create a realistic model\n",
"    model = Sequential(\n",
"        Linear(784, 128),  # MNIST-like input\n",
"        ReLU(),\n",
"        Linear(128, 64),\n",
"        ReLU(),\n",
"        Linear(64, 10)     # 10-class output\n",
"    )\n",
"\n",
"    # Initialize with realistic weights\n",
"    for layer in model.layers:\n",
"        if isinstance(layer, Linear):\n",
"            # Xavier initialization\n",
"            fan_in, fan_out = layer.weight.shape\n",
"            std = np.sqrt(2.0 / (fan_in + fan_out))\n",
"            layer.weight = Tensor(np.random.randn(fan_in, fan_out) * std)\n",
"            layer.bias = Tensor(np.zeros(fan_out))\n",
"\n",
"    # Generate realistic calibration data\n",
"    calibration_data = [Tensor(np.random.randn(1, 784) * 0.1) for _ in range(20)]\n",
"\n",
"    # Test original model\n",
"    test_input = Tensor(np.random.randn(8, 784) * 0.1)\n",
"    original_output = model.forward(test_input)\n",
"\n",
"    # Quantize the model\n",
"    quantize_model(model, calibration_data)\n",
"\n",
"    # Test quantized model\n",
"    quantized_output = model.forward(test_input)\n",
"\n",
"    # Verify functionality is preserved\n",
"    assert quantized_output.shape == original_output.shape, \"Output shape mismatch\"\n",
"\n",
"    # Verify reasonable accuracy preservation\n",
"    mse = np.mean((original_output.data - quantized_output.data) ** 2)\n",
"    relative_error = np.sqrt(mse) / (np.std(original_output.data) + 1e-8)\n",
"    assert relative_error < 0.1, f\"Accuracy degradation too high: {relative_error:.3f}\"\n",
"\n",
"    # Verify memory savings\n",
"    # Create equivalent original model for comparison\n",
"    original_model = Sequential(\n",
"        Linear(784, 128),\n",
"        ReLU(),\n",
"        Linear(128, 64),\n",
"        ReLU(),\n",
"        Linear(64, 10)\n",
"    )\n",
"\n",
"    for i, layer in enumerate(model.layers):\n",
"        if isinstance(layer, QuantizedLinear):\n",
"            # Restore original weights for comparison\n",
"            original_model.layers[i].weight = dequantize_int8(\n",
"                layer.q_weight, layer.weight_scale, layer.weight_zero_point\n",
"            )\n",
"            if layer.q_bias is not None:\n",
"                original_model.layers[i].bias = dequantize_int8(\n",
"                    layer.q_bias, layer.bias_scale, layer.bias_zero_point\n",
"                )\n",
"\n",
"    memory_comparison = compare_model_sizes(original_model, model)\n",
"    assert memory_comparison['compression_ratio'] > 2.0, \"Insufficient compression achieved\"\n",
"\n",
"    print(f\"✅ Compression achieved: {memory_comparison['compression_ratio']:.1f}×\")\n",
"    print(f\"✅ Accuracy preserved: {relative_error:.1%} relative error\")\n",
"    print(f\"✅ Memory saved: {memory_comparison['memory_saved_mb']:.1f}MB\")\n",
"\n",
"    # Test edge cases\n",
"    print(\"🔬 Testing edge cases...\")\n",
"\n",
"    # Test constant tensor quantization\n",
"    constant_tensor = Tensor([[1.0, 1.0], [1.0, 1.0]])\n",
"    q_const, scale_const, zp_const = quantize_int8(constant_tensor)\n",
"    assert scale_const == 1.0, \"Constant tensor quantization failed\"\n",
"\n",
"    # Test zero tensor\n",
"    zero_tensor = Tensor([[0.0, 0.0], [0.0, 0.0]])\n",
"    q_zero, scale_zero, zp_zero = quantize_int8(zero_tensor)\n",
"    restored_zero = dequantize_int8(q_zero, scale_zero, zp_zero)\n",
"    assert np.allclose(restored_zero.data, 0.0, atol=1e-6), \"Zero tensor restoration failed\"\n",
"\n",
"    print(\"✅ Edge cases handled correctly!\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 50)\n",
"    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
"    print(\"📈 Quantization system provides:\")\n",
"    print(f\"  • {memory_comparison['compression_ratio']:.1f}× memory reduction\")\n",
"    print(f\"  • <{relative_error:.1%} accuracy loss\")\n",
"    print(f\"  • Production-ready INT8 quantization\")\n",
"    print(\"Run: tito module complete 17\")\n",
"\n",
"# Call the comprehensive test\n",
"test_module()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "84871dfd",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"if __name__ == \"__main__\":\n",
|
||
" print(\"🚀 Running Quantization module...\")\n",
|
||
" test_module()\n",
|
||
" print(\"✅ Module validation complete!\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
"id": "c093e91d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 🏁 Consolidated Quantization Classes for Export\n",
"\n",
"Now that we've implemented all quantization components, let's create consolidated classes\n",
"for export to the tinytorch package. This allows milestones to use the complete quantization system."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cab2e3a1",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "quantization_export",
"solution": false
}
},
"outputs": [],
"source": [
"#| export\n",
"class QuantizationComplete:\n",
" \"\"\"\n",
" Complete quantization system for milestone use.\n",
" \n",
" Provides INT8 quantization with calibration for 4× memory reduction.\n",
" \"\"\"\n",
" \n",
" @staticmethod\n",
" def quantize_tensor(tensor: Tensor) -> Tuple[Tensor, float, int]:\n",
" \"\"\"Quantize FP32 tensor to INT8.\"\"\"\n",
" data = tensor.data\n",
" min_val = float(np.min(data))\n",
" max_val = float(np.max(data))\n",
" \n",
" if abs(max_val - min_val) < 1e-8:\n",
"            # Degenerate range: keep scale 1.0 / zero-point 0 and round the data\n",
"            # directly so constant tensors round-trip instead of collapsing to zero\n",
"            return Tensor(np.clip(np.round(data), -128, 127).astype(np.int8)), 1.0, 0\n",
" \n",
" scale = (max_val - min_val) / 255.0\n",
" zero_point = int(np.round(-128 - min_val / scale))\n",
" zero_point = int(np.clip(zero_point, -128, 127))\n",
" \n",
" quantized_data = np.round(data / scale + zero_point)\n",
" quantized_data = np.clip(quantized_data, -128, 127).astype(np.int8)\n",
" \n",
" return Tensor(quantized_data), scale, zero_point\n",
" \n",
" @staticmethod\n",
" def dequantize_tensor(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:\n",
" \"\"\"Dequantize INT8 tensor back to FP32.\"\"\"\n",
" dequantized_data = (q_tensor.data.astype(np.float32) - zero_point) * scale\n",
" return Tensor(dequantized_data)\n",
" \n",
" @staticmethod\n",
"    def quantize_model(model, calibration_data: Optional[List[Tensor]] = None) -> Dict[str, Any]:\n",
" \"\"\"\n",
" Quantize all Linear layers in a model.\n",
" \n",
" Returns dictionary with quantization info and memory savings.\n",
" \"\"\"\n",
" quantized_layers = {}\n",
" original_size = 0\n",
" quantized_size = 0\n",
" \n",
" # Iterate through model parameters\n",
" if hasattr(model, 'parameters'):\n",
" for i, param in enumerate(model.parameters()):\n",
" param_size = param.data.nbytes\n",
" original_size += param_size\n",
" \n",
" # Quantize parameter\n",
" q_param, scale, zp = QuantizationComplete.quantize_tensor(param)\n",
" quantized_size += q_param.data.nbytes\n",
" \n",
" quantized_layers[f'param_{i}'] = {\n",
" 'quantized': q_param,\n",
" 'scale': scale,\n",
" 'zero_point': zp,\n",
" 'original_shape': param.data.shape\n",
" }\n",
" \n",
" return {\n",
" 'quantized_layers': quantized_layers,\n",
" 'original_size_mb': original_size / (1024 * 1024),\n",
" 'quantized_size_mb': quantized_size / (1024 * 1024),\n",
" 'compression_ratio': original_size / quantized_size if quantized_size > 0 else 1.0\n",
" }\n",
" \n",
" @staticmethod\n",
" def compare_models(original_model, quantized_info: Dict) -> Dict[str, float]:\n",
" \"\"\"Compare memory usage between original and quantized models.\"\"\"\n",
" return {\n",
" 'original_mb': quantized_info['original_size_mb'],\n",
" 'quantized_mb': quantized_info['quantized_size_mb'],\n",
" 'compression_ratio': quantized_info['compression_ratio'],\n",
" 'memory_saved_mb': quantized_info['original_size_mb'] - quantized_info['quantized_size_mb']\n",
" }\n",
"\n",
"# Convenience functions for backward compatibility\n",
"def quantize_int8(tensor: Tensor) -> Tuple[Tensor, float, int]:\n",
" \"\"\"Quantize FP32 tensor to INT8.\"\"\"\n",
" return QuantizationComplete.quantize_tensor(tensor)\n",
"\n",
"def dequantize_int8(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:\n",
" \"\"\"Dequantize INT8 tensor back to FP32.\"\"\"\n",
" return QuantizationComplete.dequantize_tensor(q_tensor, scale, zero_point)\n",
"\n",
"def quantize_model(model, calibration_data: Optional[List[Tensor]] = None) -> Dict[str, Any]:\n",
" \"\"\"Quantize entire model to INT8.\"\"\"\n",
" return QuantizationComplete.quantize_model(model, calibration_data)"
]
},
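{
"cell_type": "code",
"execution_count": null,
"id": "e7f2a9c4",
"metadata": {},
"outputs": [],
"source": [
"# Usage sketch (illustrative only, not part of the graded module): round-trip a\n",
"# random tensor through the exported API and inspect the worst-case error.\n",
"# Assumes Tensor wraps a NumPy array, as used throughout this module.\n",
"demo = Tensor(np.random.randn(4, 4).astype(np.float32))\n",
"q, s, zp = QuantizationComplete.quantize_tensor(demo)\n",
"restored = QuantizationComplete.dequantize_tensor(q, s, zp)\n",
"# Worst-case round-trip error is bounded by scale / 2 (half a quantization step)\n",
"print(f\"scale: {s:.5f}, max round-trip error: {np.max(np.abs(restored.data - demo.data)):.5f}\")"
]
},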
{
"cell_type": "markdown",
"id": "b3d77ac1",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Thinking: Quantization in Production\n",
"\n",
"### Question 1: Memory Architecture Impact\n",
"You implemented INT8 quantization that reduces each parameter from 4 bytes to 1 byte.\n",
"For a model with 100M parameters:\n",
"- Original memory usage: _____ GB\n",
"- Quantized memory usage: _____ GB\n",
"- Memory bandwidth reduction when loading from disk: _____ ×\n",
"\n",
"### Question 2: Quantization Error Analysis\n",
"Your quantization maps a continuous range to 256 discrete values (INT8).\n",
"For weights uniformly distributed in [-0.1, 0.1]:\n",
"- Quantization scale: _____\n",
"- Maximum quantization error: _____\n",
"- Signal-to-noise ratio approximately: _____ dB\n",
"\n",
"### Question 3: Hardware Efficiency\n",
"Modern processors have specialized INT8 instructions (like AVX-512 VNNI).\n",
"Compared to FP32 operations:\n",
"- How many INT8 operations fit in one SIMD instruction vs FP32? _____ × more\n",
"- Why might actual speedup be less than this theoretical maximum? _____\n",
"- What determines whether quantization improves or hurts performance? _____\n",
"\n",
"### Question 4: Calibration Strategy Trade-offs\n",
"Your calibration process finds optimal scales using sample data.\n",
"- Too little calibration data: Risk of _____\n",
"- Too much calibration data: Cost of _____\n",
"- Per-channel vs per-tensor quantization trades _____ for _____\n",
"\n",
"### Question 5: Production Deployment\n",
"In mobile/edge deployment scenarios:\n",
"- When is 4× memory reduction worth <1% accuracy loss? _____\n",
"- Why might you keep certain layers in FP32? _____\n",
"- How does quantization affect battery life? _____"
]
},
{
"cell_type": "markdown",
"id": "5b20dcf9",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🎯 MODULE SUMMARY: Quantization\n",
"\n",
"Congratulations! You've built a complete INT8 quantization system that can reduce model size by 4× with minimal accuracy loss!\n",
"\n",
"### Key Accomplishments\n",
"- **Built INT8 quantization** with proper scaling and zero-point calculation\n",
"- **Implemented QuantizedLinear** layer with calibration support\n",
"- **Created model-level quantization** for complete neural networks\n",
"- **Analyzed quantization trade-offs** across different distributions and strategies\n",
"- **Measured real memory savings** and performance improvements\n",
"- All tests pass ✅ (validated by `test_module()`)\n",
"\n",
"### Real-World Impact\n",
"Your quantization implementation achieves:\n",
"- **4× memory reduction** (FP32 → INT8)\n",
"- **2-4× inference speedup** (hardware dependent)\n",
"- **<1% accuracy loss** with proper calibration\n",
"- **Production deployment readiness** for mobile/edge applications\n",
"\n",
"### What You've Mastered\n",
"- **Quantization mathematics** - scale and zero-point calculations\n",
"- **Calibration techniques** - optimizing quantization parameters\n",
"- **Error analysis** - understanding and minimizing quantization noise\n",
"- **Systems optimization** - memory vs accuracy trade-offs\n",
"\n",
"### Ready for Next Steps\n",
"Your quantization system enables efficient model deployment on resource-constrained devices.\n",
"Export with: `tito module complete 17`\n",
"\n",
"**Next**: Module 18 will add model compression through pruning - removing unnecessary weights entirely!\n",
"\n",
"---\n",
"\n",
"**🏆 Achievement Unlocked**: You can now deploy 4× smaller models with production-quality quantization! This is a critical skill for mobile AI, edge computing, and efficient inference systems."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}