mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-21 19:30:47 -05:00
PROBLEM:
- nbdev requires #| export directive on EACH cell to export when using # %% markers
- Cell markers inside class definitions split classes across multiple cells
- Only partial classes were being exported to tinytorch package
- Missing matmul, arithmetic operations, and activation classes in exports

SOLUTION:
1. Removed # %% cell markers INSIDE class definitions (kept classes as single units)
2. Added #| export to imports cell at top of each module
3. Added #| export before each exportable class definition in all 20 modules
4. Added __call__ method to Sigmoid for functional usage
5. Fixed numpy import (moved to module level from __init__)

MODULES FIXED:
- 01_tensor: Tensor class with all operations (matmul, arithmetic, shape ops)
- 02_activations: Sigmoid, ReLU, Tanh, GELU, Softmax classes
- 03_layers: Linear, Dropout classes
- 04_losses: MSELoss, CrossEntropyLoss, BinaryCrossEntropyLoss classes
- 05_autograd: Function, AddBackward, MulBackward, MatmulBackward, SumBackward
- 06_optimizers: Optimizer, SGD, Adam, AdamW classes
- 07_training: CosineSchedule, Trainer classes
- 08_dataloader: Dataset, TensorDataset, DataLoader classes
- 09_spatial: Conv2d, MaxPool2d, AvgPool2d, SimpleCNN classes
- 10-20: All exportable classes in remaining modules

TESTING:
- Test functions use 'if __name__ == "__main__"' guards
- Tests run in notebooks but NOT on import
- Rosenblatt Perceptron milestone working perfectly

RESULT:
✅ All 20 modules export correctly
✅ Perceptron (1957) milestone functional
✅ Clean separation: development (modules/source) vs package (tinytorch)
2020 lines
89 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a0bea02",
   "metadata": {},
   "outputs": [],
   "source": [
    "#| default_exp optimization.acceleration\n",
    "#| export"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9ac4364",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Module 16: Acceleration - Making Models Run Faster\n",
    "\n",
    "Welcome to Module 16! You're about to master the art of neural network acceleration through vectorization, kernel fusion, and mixed precision training.\n",
    "\n",
    "## 🔗 Prerequisites & Progress\n",
    "**You've Built**: Complete training pipeline with profiling capabilities  \n",
    "**You'll Build**: Acceleration techniques including vectorization, operation fusion, and mixed precision  \n",
    "**You'll Enable**: Production-ready optimization for real-world deployment\n",
    "\n",
    "**Connection Map**:\n",
    "```\n",
    "Profiling (Module 15) → Acceleration (Module 16) → Quantization (Module 17)\n",
    "(measurement)           (optimization)             (precision reduction)\n",
    "```\n",
    "\n",
    "## Learning Objectives\n",
    "By the end of this module, you will:\n",
    "1. Implement vectorized operations for maximum throughput\n",
    "2. Create fused operations to reduce memory bandwidth\n",
    "3. Build mixed precision training for memory efficiency\n",
    "4. Understand the relationship between compute and memory bandwidth\n",
    "5. Analyze acceleration trade-offs in production systems\n",
    "\n",
    "Let's optimize for speed!\n",
    "\n",
    "## 📦 Where This Code Lives in the Final Package\n",
    "\n",
    "**Learning Side:** You work in `modules/16_acceleration/acceleration_dev.py`  \n",
    "**Building Side:** Code exports to `tinytorch.optimization.acceleration`\n",
    "\n",
    "```python\n",
    "# How to use this module:\n",
    "from tinytorch.optimization.acceleration import vectorized_matmul, fused_gelu, MixedPrecisionTrainer\n",
    "```\n",
    "\n",
    "**Why this matters:**\n",
    "- **Learning:** Complete acceleration system in one focused module for deep understanding\n",
    "- **Production:** Proper organization like PyTorch's `torch.amp` and `torch.jit` with optimization components\n",
    "- **Consistency:** All acceleration operations and mixed precision training in `optimization.acceleration`\n",
    "- **Integration:** Works seamlessly with profiling for complete performance optimization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59fd81f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import time\n",
    "from typing import Dict, List, Tuple, Optional, Any, Union\n",
    "import warnings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e350bf3e",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 1. Introduction - The Performance Challenge\n",
    "\n",
    "Modern neural networks face two fundamental bottlenecks that limit their speed:\n",
    "\n",
    "### The Two Enemies of Performance\n",
    "\n",
    "**1. Compute Bound Operations:**\n",
    "```\n",
    "CPU/GPU Cores:  [====BUSY====] [====BUSY====] [====BUSY====]\n",
    "Memory Bus:     [---idle---]   [---idle---]   [---idle---]\n",
    "\n",
    "When: Matrix multiplication, convolutions\n",
    "Solution: Vectorization, better algorithms\n",
    "```\n",
    "\n",
    "**2. Memory Bound Operations:**\n",
    "```\n",
    "CPU/GPU Cores:  [--idle--]  [--idle--]  [--idle--]\n",
    "Memory Bus:     [========SATURATED========]\n",
    "\n",
    "When: Element-wise operations, small tensors\n",
    "Solution: Kernel fusion, memory layout optimization\n",
    "```\n",
    "\n",
    "### The Roofline Model - Your Performance Compass\n",
    "\n",
    "Every processor has two fundamental limits: its peak compute rate and its memory bandwidth.\n",
    "\n",
    "```\n",
    "Performance │                  Compute Bound Region\n",
    "(GFLOPS)    │             ┌─────────────────────────\n",
    "            │            ╱│    Peak Performance\n",
    "            │           ╱ │\n",
    "            │          ╱  │\n",
    "            │         ╱   │\n",
    "            │        ╱    │\n",
    "            │       ╱     │\n",
    "            │      ╱      │\n",
    "            │     ╱       │\n",
    "            │    ╱ Memory │\n",
    "            │   ╱  Bound  │\n",
    "            │  ╱   Region │\n",
    "            └─╱───────────┴───────────────────────── Arithmetic Intensity\n",
    "              Low         │ High                      (FLOPs/Byte)\n",
    "```\n",
    "\n",
    "**Key Insight**: Find where each of your operations sits on this graph. Operations on the slope are limited by memory bandwidth; operations on the flat roof are limited by compute.\n",
    "\n",
    "### Why This Module Matters\n",
    "\n",
    "Representative gains (actual numbers depend on hardware, tensor sizes, and workload):\n",
    "- **2-5× speedup** from vectorization\n",
    "- **30-50% memory reduction** from mixed precision on activation-heavy models\n",
    "- **2-3× throughput** from kernel fusion\n",
    "- **10× scaling improvement** for large models"
   ]
  },
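  {
   "cell_type": "markdown",
   "id": "roofline-worked-example",
   "metadata": {},
   "source": [
    "As a rough worked illustration of the roofline idea above, attainable performance $P$ is capped by whichever is smaller: the peak compute rate $P_{\\text{peak}}$, or the memory bandwidth $B$ multiplied by the arithmetic intensity $I$. The hardware numbers below are illustrative assumptions, not measurements of any particular chip:\n",
    "\n",
    "$$P(I) = \\min\\left(P_{\\text{peak}},\\; B \\cdot I\\right), \\qquad I_{\\text{ridge}} = \\frac{P_{\\text{peak}}}{B}$$\n",
    "\n",
    "$$P_{\\text{peak}} = 10{,}000\\ \\text{GFLOPS},\\quad B = 500\\ \\text{GB/s} \\;\\Rightarrow\\; I_{\\text{ridge}} = 20\\ \\text{FLOPs/byte}$$\n",
    "\n",
    "$$I = 0.1:\\; P = 500 \\times 0.1 = 50\\ \\text{GFLOPS (memory bound)}, \\qquad I = 100:\\; P = \\min(10{,}000,\\ 50{,}000) = 10{,}000\\ \\text{GFLOPS (compute bound)}$$\n",
    "\n",
    "Under these assumed numbers, an operation to the left of the ridge point gains nothing from faster arithmetic; only moving fewer bytes helps."
   ]
  },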
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c8b7618",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "tensor-import",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "# Import required dependencies\n",
    "### BEGIN SOLUTION\n",
    "# Import tensor from our implementation\n",
    "import sys\n",
    "import os\n",
    "sys.path.append('/Users/VJ/GitHub/TinyTorch')\n",
    "\n",
    "try:\n",
    "    # Import from the modules directory structure\n",
    "    import importlib.util\n",
    "    spec = importlib.util.spec_from_file_location(\"tensor_dev\", \"/Users/VJ/GitHub/TinyTorch/modules/01_tensor/tensor_dev.py\")\n",
    "    tensor_module = importlib.util.module_from_spec(spec)\n",
    "    spec.loader.exec_module(tensor_module)\n",
    "    Tensor = tensor_module.Tensor\n",
    "except (ImportError, OSError):\n",
    "    # Fallback for testing. OSError also covers the FileNotFoundError that\n",
    "    # exec_module raises when the hard-coded path above does not exist.\n",
    "    class Tensor:\n",
    "        def __init__(self, data, requires_grad=False):\n",
    "            self.data = np.array(data, dtype=np.float32)\n",
    "            self.shape = self.data.shape\n",
    "            self.requires_grad = requires_grad\n",
    "            self.grad = None\n",
    "\n",
    "        def __add__(self, other):\n",
    "            return Tensor(self.data + other.data)\n",
    "\n",
    "        def __mul__(self, other):\n",
    "            return Tensor(self.data * other.data)\n",
    "\n",
    "        def matmul(self, other):\n",
    "            return Tensor(np.dot(self.data, other.data))\n",
    "\n",
    "        def reshape(self, *shape):\n",
    "            return Tensor(self.data.reshape(shape))\n",
    "\n",
    "        def sum(self, axis=None):\n",
    "            return Tensor(self.data.sum(axis=axis))\n",
    "\n",
    "        def backward(self):\n",
    "            pass\n",
    "### END SOLUTION"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a445584",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 2. Foundations - Vectorization: From Loops to Lightning\n",
    "\n",
    "### The SIMD Revolution\n",
    "\n",
    "Modern processors can execute **Single Instruction, Multiple Data** operations:\n",
    "\n",
    "```\n",
    "Traditional Loop (Scalar):            SIMD Vectorized:\n",
    "for i in range(4):        ┌─────┐     ┌─────┬─────┬─────┬─────┐\n",
    "    c[i] = a[i] + b[i]    │ ALU │  →  │ALU 0│ALU 1│ALU 2│ALU 3│\n",
    "                          └─────┘     └─────┴─────┴─────┴─────┘\n",
    "                         1 element     4 elements per cycle\n",
    "                         per cycle\n",
    "```\n",
    "\n",
    "### Memory Access Patterns: The Hidden Performance Killer\n",
    "\n",
    "```\n",
    "Sequential Access (FAST):\n",
    "Memory: [A][B][C][D][E][F][G][H]\n",
    "Access:  ↓  ↓  ↓  ↓ → Cache friendly\n",
    "\n",
    "Strided Access (SLOWER):\n",
    "Memory: [A][ ][B][ ][C][ ][D][ ]\n",
    "Access:  ↓     ↓     ↓     ↓ → Cache misses\n",
    "\n",
    "Random Access (SLOWEST):\n",
    "Memory: [A][B][C][D][E][F][G][H]\n",
    "Access:  ↓     ↑  ↓        ↑ → Cache chaos\n",
    "```\n",
    "\n",
    "### Matrix Multiplication: The King of Vectorization\n",
    "\n",
    "Matrix multiplication is **perfectly suited** for vectorization:\n",
    "\n",
    "```\n",
    "Matrix A (M×K) × Matrix B (K×N) = Matrix C (M×N)\n",
    "\n",
    "Computation Pattern:\n",
    "┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐\n",
    "│ a₁₁ a₁₂ a₁₃ a₁₄ │ × │ b₁₁ b₁₂ b₁₃ b₁₄ │ = │ c₁₁ c₁₂ c₁₃ c₁₄ │\n",
    "│ a₂₁ a₂₂ a₂₃ a₂₄ │   │ b₂₁ b₂₂ b₂₃ b₂₄ │   │ c₂₁ c₂₂ c₂₃ c₂₄ │\n",
    "│ a₃₁ a₃₂ a₃₃ a₃₄ │   │ b₃₁ b₃₂ b₃₃ b₃₄ │   │ c₃₁ c₃₂ c₃₃ c₃₄ │\n",
    "│ a₄₁ a₄₂ a₄₃ a₄₄ │   │ b₄₁ b₄₂ b₄₃ b₄₄ │   │ c₄₁ c₄₂ c₄₃ c₄₄ │\n",
    "└─────────────────┘   └─────────────────┘   └─────────────────┘\n",
    "\n",
    "For c₁₁: Row₁ · Column₁ = a₁₁×b₁₁ + a₁₂×b₂₁ + a₁₃×b₃₁ + a₁₄×b₄₁\n",
    "                                            ↑\n",
    "                                      VECTORIZABLE!\n",
    "```\n",
    "\n",
    "**Why vectorization wins:**\n",
    "- **High arithmetic intensity**: 2N³ FLOPs on only 3N² matrix elements (for N×N matrices)\n",
    "- **Predictable memory access**: Regular row and column strides that BLAS can block for the cache\n",
    "- **Parallelizable**: Independent dot products\n",
    "- **Cache-friendly**: Data reuse in inner loops"
   ]
  },
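  {
   "cell_type": "markdown",
   "id": "matmul-intensity-derivation",
   "metadata": {},
   "source": [
    "To make the arithmetic-intensity argument concrete, here is a short worked estimate. It assumes square N×N `float32` matrices (4 bytes per element) and an idealized cache in which each matrix crosses the memory bus exactly once; real kernels move somewhat more data.\n",
    "\n",
    "$$I_{\\text{matmul}} = \\frac{2N^3\\ \\text{FLOPs}}{3N^2 \\times 4\\ \\text{bytes}} = \\frac{N}{6}\\ \\text{FLOPs/byte}$$\n",
    "\n",
    "$$I_{\\text{add}} = \\frac{N^2\\ \\text{FLOPs}}{3N^2 \\times 4\\ \\text{bytes}} = \\frac{1}{12}\\ \\text{FLOPs/byte}$$\n",
    "\n",
    "For $N = 1024$ this gives $I_{\\text{matmul}} \\approx 171$ FLOPs/byte, roughly 2000 times the intensity of an element-wise add on matrices of the same size. The intensity of matmul grows with $N$, while that of the add stays constant, which is why large matrix multiplications tend to be compute bound and element-wise operations memory bound."
   ]
  },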
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01b0e1a7",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "vectorized-matmul",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def vectorized_matmul(a: Tensor, b: Tensor) -> Tensor:\n",
    "    \"\"\"\n",
    "    High-performance matrix multiplication using vectorized operations.\n",
    "\n",
    "    This implementation leverages optimized BLAS libraries that use:\n",
    "    - SIMD instructions for parallel computation\n",
    "    - Cache-blocking for memory efficiency\n",
    "    - Multi-threading for CPU parallelization\n",
    "\n",
    "    TODO: Implement production-grade matrix multiplication\n",
    "\n",
    "    APPROACH:\n",
    "    1. Validate shapes are compatible for matrix multiplication\n",
    "    2. Use NumPy's optimized np.matmul (dispatches to BLAS GEMM for float arrays)\n",
    "    3. Return result wrapped in Tensor\n",
    "\n",
    "    EXAMPLE:\n",
    "    Matrix multiplication visualization:\n",
    "    >>> a = Tensor([[1, 2], [3, 4]])  # 2×2\n",
    "    >>> b = Tensor([[5, 6], [7, 8]])  # 2×2\n",
    "    >>> result = vectorized_matmul(a, b)\n",
    "    >>> print(result.data)\n",
    "    [[19. 22.]   # [1×5+2×7, 1×6+2×8] = [19, 22]\n",
    "     [43. 50.]]  # [3×5+4×7, 3×6+4×8] = [43, 50]\n",
    "\n",
    "    PERFORMANCE CHARACTERISTICS:\n",
    "    - Time Complexity: O(N³) but highly optimized\n",
    "    - Space Complexity: O(N²) for result\n",
    "    - Arithmetic Intensity: 2N³ FLOPs / (3N² elements × 4 bytes) = N/6 FLOPs/byte\n",
    "      for float32 (grows with N, so large matrices are compute bound)\n",
    "\n",
    "    HINTS:\n",
    "    - Check a.shape[-1] == b.shape[-2] for inner dimension match\n",
    "    - Use np.matmul() for batch support and optimization\n",
    "    - Trust BLAS to handle the vectorization magic\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Input validation for matrix multiplication\n",
    "    if len(a.shape) < 2 or len(b.shape) < 2:\n",
    "        raise ValueError(\n",
    "            f\"Matrix multiplication requires 2D+ tensors, got shapes {a.shape} and {b.shape}. \"\n",
    "            f\"💡 HINT: Use reshape() to add dimensions if needed.\"\n",
    "        )\n",
    "\n",
    "    if a.shape[-1] != b.shape[-2]:\n",
    "        raise ValueError(\n",
    "            f\"Matrix multiplication shape mismatch: {a.shape} @ {b.shape}. \"\n",
    "            f\"Inner dimensions must match: a.shape[-1]={a.shape[-1]} != b.shape[-2]={b.shape[-2]}. \"\n",
    "            f\"💡 HINT: For A@B, A's columns must equal B's rows.\"\n",
    "        )\n",
    "\n",
    "    # Use NumPy's highly optimized matrix multiplication\n",
    "    # This calls BLAS GEMM (General Matrix Multiply), which uses:\n",
    "    # - SIMD vectorization for parallel arithmetic\n",
    "    # - Cache blocking for memory efficiency\n",
    "    # - Multi-threading on multi-core systems\n",
    "    result_data = np.matmul(a.data, b.data)\n",
    "\n",
    "    return Tensor(result_data)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae44b17e",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-vectorized-matmul",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_vectorized_matmul():\n",
    "    \"\"\"🔬 Test vectorized matrix multiplication implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Vectorized Matrix Multiplication...\")\n",
    "\n",
    "    # Test basic 2D multiplication\n",
    "    a = Tensor([[1, 2], [3, 4]])\n",
    "    b = Tensor([[5, 6], [7, 8]])\n",
    "    result = vectorized_matmul(a, b)\n",
    "\n",
    "    expected = np.array([[19, 22], [43, 50]])\n",
    "    assert np.allclose(result.data, expected), f\"Basic matmul failed: expected {expected}, got {result.data}\"\n",
    "\n",
    "    # Test batch multiplication (3D tensors)\n",
    "    batch_size, m, k, n = 2, 3, 4, 5\n",
    "    a_batch = Tensor(np.random.randn(batch_size, m, k))\n",
    "    b_batch = Tensor(np.random.randn(batch_size, k, n))\n",
    "    result_batch = vectorized_matmul(a_batch, b_batch)\n",
    "\n",
    "    assert result_batch.shape == (batch_size, m, n), f\"Wrong batch shape: {result_batch.shape}\"\n",
    "\n",
    "    # Test broadcasting (different batch dimensions)\n",
    "    a_single = Tensor(np.random.randn(m, k))\n",
    "    b_batch = Tensor(np.random.randn(batch_size, k, n))\n",
    "    result_broadcast = vectorized_matmul(a_single, b_batch)\n",
    "\n",
    "    assert result_broadcast.shape == (batch_size, m, n), f\"Broadcasting failed: {result_broadcast.shape}\"\n",
    "\n",
    "    # Test error cases\n",
    "    try:\n",
    "        vectorized_matmul(Tensor([1, 2, 3]), Tensor([4, 5]))  # 1D tensors\n",
    "        assert False, \"Should reject 1D tensors\"\n",
    "    except ValueError as e:\n",
    "        assert \"2D+\" in str(e)\n",
    "\n",
    "    try:\n",
    "        vectorized_matmul(Tensor([[1, 2]]), Tensor([[1], [2], [3]]))  # Shape mismatch\n",
    "        assert False, \"Should reject incompatible shapes\"\n",
    "    except ValueError as e:\n",
    "        assert \"shape mismatch\" in str(e).lower()\n",
    "\n",
    "    print(\"✅ vectorized_matmul works correctly!\")\n",
    "\n",
    "test_unit_vectorized_matmul()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85cd07f9",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 3. Implementation - Kernel Fusion: Eliminating Memory Bottlenecks\n",
    "\n",
    "### The Memory Bandwidth Crisis\n",
    "\n",
    "Consider this innocent-looking computation: `y = gelu(x * weight + bias)`, where `x` and `weight` are 4GB tensors and `bias` is small enough to ignore.\n",
    "\n",
    "**Naive Implementation (Memory Intensive):**\n",
    "```\n",
    "Step 1: temp1 = x * weight    → Read 8GB, Write 4GB\n",
    "Step 2: temp2 = temp1 + bias  → Read 4GB, Write 4GB\n",
    "Step 3: y = gelu(temp2)       → Read 4GB, Write 4GB\n",
    "                                Total: 28GB memory traffic!\n",
    "```\n",
    "\n",
    "**Fused Implementation (Memory Efficient):**\n",
    "```\n",
    "Single Step: y = gelu(x * weight + bias)  → Read 8GB, Write 4GB\n",
    "                                            Total: 12GB memory traffic!\n",
    "                                            ~57% less memory traffic!\n",
    "```\n",
    "\n",
    "### Understanding GELU: The Smooth Activation\n",
    "\n",
    "GELU (Gaussian Error Linear Unit) is used in transformers because it's **smooth** (differentiable everywhere):\n",
    "\n",
    "```\n",
    "Activation Functions Compared:\n",
    "\n",
    "ReLU:                  GELU:                  Sigmoid:\n",
    "      │    ╱                 │    ╱           1 ┌─────\n",
    "      │   ╱                  │   ╱             ╱│\n",
    "      │  ╱                   │  ╱             ╱ │\n",
    "──────┘ ╱              ────╱─│─╱          ───╱  │\n",
    "Discontinuous          Smooth curve       Smooth but saturates\n",
    "gradient at 0          everywhere\n",
    "```\n",
    "\n",
    "**GELU Formula**: `GELU(x) = x * Φ(x)` where Φ is the standard normal CDF\n",
    "\n",
    "**Fast Approximation**: `GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))`\n",
    "\n",
    "### Kernel Fusion Strategy\n",
    "\n",
    "```\n",
    "Unfused Operations:                 Fused Operation:\n",
    "┌─────────────────┐                 ┌─────────────────┐\n",
    "│ x³ computation  │ → temp1         │                 │\n",
    "└─────────────────┘                 │                 │\n",
    "┌─────────────────┐                 │                 │\n",
    "│ polynomial part │ → temp2         │  All operations │\n",
    "└─────────────────┘                 │  combined in    │\n",
    "┌─────────────────┐                 │  single kernel  │\n",
    "│ tanh computation│ → temp3         │                 │\n",
    "└─────────────────┘                 │                 │\n",
    "┌─────────────────┐                 │                 │\n",
    "│ final multiply  │ → result        │                 │\n",
    "└─────────────────┘                 └─────────────────┘\n",
    "\n",
    "4 memory round-trips                1 memory round-trip\n",
    "```"
   ]
  },
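  {
   "cell_type": "markdown",
   "id": "fusion-traffic-derivation",
   "metadata": {},
   "source": [
    "The memory-traffic comparison for `y = gelu(x * weight + bias)` can be written out as a small worked equation. This is a sketch under stated assumptions: `x` and `weight` each occupy $S$ bytes, `bias` is small enough to ignore, and no intermediate stays in cache between steps. Every read and write is counted, including the input reads in the first step.\n",
    "\n",
    "$$T_{\\text{unfused}} = \\underbrace{(2S + S)}_{\\text{multiply}} + \\underbrace{(S + S)}_{\\text{add bias}} + \\underbrace{(S + S)}_{\\text{GELU}} = 7S$$\n",
    "\n",
    "$$T_{\\text{fused}} = 2S + S = 3S, \\qquad 1 - \\frac{T_{\\text{fused}}}{T_{\\text{unfused}}} = \\frac{4}{7} \\approx 57\\%$$\n",
    "\n",
    "With $S = 4$ GB this is 28 GB of traffic unfused versus 12 GB fused. Each extra element-wise step in the unfused chain adds another $2S$, so longer chains benefit more from fusion."
   ]
  },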
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "085b3c2b",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "fused-gelu",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def fused_gelu(x: Tensor) -> Tensor:\n",
    "    \"\"\"\n",
    "    Fused GELU activation that combines all operations in a single expression.\n",
    "\n",
    "    GELU combines the benefits of ReLU and sigmoid:\n",
    "    - Smooth everywhere (unlike ReLU's kink at 0)\n",
    "    - Non-saturating for positive values (unlike sigmoid)\n",
    "    - Probabilistic interpretation: x * P(X ≤ x) where X ~ N(0,1)\n",
    "\n",
    "    Mathematical Definition:\n",
    "    GELU(x) = x * Φ(x) where Φ(x) is the standard normal CDF\n",
    "\n",
    "    Fast Approximation (used here):\n",
    "    GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))\n",
    "\n",
    "    TODO: Implement fused GELU to minimize memory bandwidth\n",
    "\n",
    "    APPROACH:\n",
    "    1. Compute all intermediate values in a single expression\n",
    "    2. Avoid creating named temporary arrays or Tensor wrappers\n",
    "    3. Let NumPy's broadcasting handle vectorization\n",
    "\n",
    "    EXAMPLE:\n",
    "    >>> x = Tensor([-2, -1, 0, 1, 2])\n",
    "    >>> result = fused_gelu(x)\n",
    "    >>> print(result.data)\n",
    "    # approximately [-0.0454, -0.1588, 0.0, 0.8412, 1.9546] (tanh approximation)\n",
    "    # Notice: smooth transition through 0, small negative outputs for negative inputs\n",
    "\n",
    "    MEMORY EFFICIENCY:\n",
    "    - Unfused: 7 named intermediate Tensors, each a full-size array\n",
    "    - Single expression: NumPy still evaluates one operation at a time, but its\n",
    "      temporaries are unnamed and released as soon as the next operation consumes them\n",
    "    - A true fused kernel (one pass over memory) needs a compiler such as Numba\n",
    "      or numexpr; this function models the idea in plain NumPy\n",
    "\n",
    "    HINTS:\n",
    "    - Use np.sqrt(2.0 / np.pi) for the constant\n",
    "    - Keep the entire expression in one statement so no intermediate is wrapped in a Tensor\n",
    "    - NumPy does not fuse the operations itself; it only shortens the lifetime of temporaries\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Mathematical constant for GELU approximation.\n",
    "    # float() keeps it a Python scalar so float32 inputs stay float32\n",
    "    # (an np.float64 scalar would upcast the result under NumPy 2).\n",
    "    sqrt_2_over_pi = float(np.sqrt(2.0 / np.pi))\n",
    "\n",
    "    # Fused GELU computation - all operations in a single expression.\n",
    "    # No intermediate is named or wrapped in a Tensor, so each temporary\n",
    "    # lives only until the next operation has consumed it.\n",
    "    result_data = 0.5 * x.data * (\n",
    "        1.0 + np.tanh(sqrt_2_over_pi * (x.data + 0.044715 * x.data**3))\n",
    "    )\n",
    "\n",
    "    return Tensor(result_data)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b205cb72",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-fused-gelu",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_fused_gelu():\n",
    "    \"\"\"🔬 Test fused GELU activation implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Fused GELU...\")\n",
    "\n",
    "    # Test basic properties\n",
    "    x = Tensor([-3, -1, 0, 1, 3])\n",
    "    result = fused_gelu(x)\n",
    "\n",
    "    # GELU(0) = 0 (exact property)\n",
    "    assert abs(result.data[2]) < 1e-6, f\"GELU(0) should be 0, got {result.data[2]}\"\n",
    "\n",
    "    # GELU is increasing for x ≥ 0\n",
    "    assert result.data[4] > result.data[3] > result.data[2], \"GELU should be increasing\"\n",
    "\n",
    "    # Spot-check known values: GELU(1) ≈ 0.84, GELU(-1) ≈ -0.16\n",
    "    assert result.data[3] > 0.8, \"GELU(1) should be about 0.84\"\n",
    "    assert -0.2 < result.data[1] < 0, \"GELU(-1) should be slightly negative\"\n",
    "\n",
    "    # Test numerical stability with extreme values\n",
    "    x_extreme = Tensor([-10, -5, 0, 5, 10])\n",
    "    result_extreme = fused_gelu(x_extreme)\n",
    "\n",
    "    assert not np.any(np.isnan(result_extreme.data)), \"No NaN values allowed\"\n",
    "    assert not np.any(np.isinf(result_extreme.data)), \"No infinite values allowed\"\n",
    "\n",
    "    # Test large tensor processing\n",
    "    x_large = Tensor(np.random.randn(1000, 1000).astype(np.float32))\n",
    "    result_large = fused_gelu(x_large)\n",
    "\n",
    "    assert result_large.shape == x_large.shape, \"Shape preservation failed\"\n",
    "    assert result_large.data.dtype == np.float32, \"Data type preservation failed\"\n",
    "\n",
    "    # Test that positive inputs are mostly preserved (GELU ≈ x for large positive x)\n",
    "    x_positive = Tensor([5.0])\n",
    "    result_positive = fused_gelu(x_positive)\n",
    "    assert result_positive.data[0] > 4.9, \"Large positive values should be nearly preserved\"\n",
    "\n",
    "    print(\"✅ fused_gelu works correctly!\")\n",
    "\n",
    "test_unit_fused_gelu()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb075d6f",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Performance Analysis: Measuring Fusion Benefits\n",
    "\n",
    "Let's quantify the impact of kernel fusion by comparing fused vs unfused implementations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89558452",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "unfused-gelu",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def unfused_gelu(x: Tensor) -> Tensor:\n",
    "    \"\"\"\n",
    "    Deliberately unfused GELU implementation for performance comparison.\n",
    "\n",
    "    This version creates multiple intermediate tensors to simulate\n",
    "    the memory bandwidth overhead of unfused operations.\n",
    "\n",
    "    TODO: Implement GELU with explicit intermediate steps\n",
    "\n",
    "    APPROACH:\n",
    "    1. Break computation into individual steps\n",
    "    2. Create temporary Tensor objects for each step\n",
    "    3. This simulates real memory allocation overhead\n",
    "\n",
    "    PERFORMANCE IMPACT:\n",
    "    - Creates 7 temporary arrays\n",
    "    - Each array allocation/deallocation has overhead\n",
    "    - More memory bandwidth usage\n",
    "    - Potential cache misses between operations\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Unfused version - creates many intermediate arrays\n",
    "    # float() keeps the constant a Python scalar so float32 inputs stay float32\n",
    "    sqrt_2_over_pi = float(np.sqrt(2.0 / np.pi))\n",
    "\n",
    "    # Each operation creates a temporary array (simulating kernel launches)\n",
    "    temp1 = Tensor(x.data**3)                    # x³\n",
    "    temp2 = Tensor(0.044715 * temp1.data)        # 0.044715 * x³\n",
    "    temp3 = Tensor(x.data + temp2.data)          # x + 0.044715 * x³\n",
    "    temp4 = Tensor(sqrt_2_over_pi * temp3.data)  # √(2/π) * (...)\n",
    "    temp5 = Tensor(np.tanh(temp4.data))          # tanh(...)\n",
    "    temp6 = Tensor(1.0 + temp5.data)             # 1 + tanh(...)\n",
    "    temp7 = Tensor(x.data * temp6.data)          # x * (1 + tanh(...))\n",
    "    result = Tensor(0.5 * temp7.data)            # 0.5 * x * (...)\n",
    "\n",
    "    return result\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a50536a",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-fusion-speedup",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_fusion_speedup():\n",
    "    \"\"\"🔬 Measure the performance impact of kernel fusion.\"\"\"\n",
    "    print(\"🔬 Unit Test: Kernel Fusion Performance Impact...\")\n",
    "\n",
    "    # Create moderately large tensor for meaningful timing\n",
    "    size = 2000\n",
    "    x = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "    warmup_iterations = 2\n",
    "    timing_iterations = 5\n",
    "\n",
    "    # Warmup both implementations\n",
    "    for _ in range(warmup_iterations):\n",
    "        _ = unfused_gelu(x)\n",
    "        _ = fused_gelu(x)\n",
    "\n",
    "    # Time unfused version (perf_counter is a monotonic, high-resolution clock)\n",
    "    start = time.perf_counter()\n",
    "    for _ in range(timing_iterations):\n",
    "        result_unfused = unfused_gelu(x)\n",
    "    unfused_time = time.perf_counter() - start\n",
    "\n",
    "    # Time fused version\n",
    "    start = time.perf_counter()\n",
    "    for _ in range(timing_iterations):\n",
    "        result_fused = fused_gelu(x)\n",
    "    fused_time = time.perf_counter() - start\n",
    "\n",
    "    # Verify numerical correctness\n",
    "    assert np.allclose(result_unfused.data, result_fused.data, atol=1e-6), \\\n",
    "        \"Fused and unfused implementations must be numerically equivalent\"\n",
    "\n",
    "    # Calculate performance metrics\n",
    "    speedup = unfused_time / fused_time if fused_time > 0 else 1.0\n",
    "    unfused_per_elem = (unfused_time / timing_iterations) / (size * size) * 1e9  # ns per element\n",
    "    fused_per_elem = (fused_time / timing_iterations) / (size * size) * 1e9\n",
    "\n",
    "    print(f\"📊 Kernel Fusion Performance Analysis:\")\n",
    "    print(f\"   Tensor size: {size}×{size} = {size*size:,} elements\")\n",
    "    print(f\"   Unfused time: {unfused_time/timing_iterations*1000:.2f} ms\")\n",
    "    print(f\"   Fused time:   {fused_time/timing_iterations*1000:.2f} ms\")\n",
    "    print(f\"   Speedup: {speedup:.2f}× faster\")\n",
    "    print(f\"   Per-element: {unfused_per_elem:.1f} ns → {fused_per_elem:.1f} ns\")\n",
    "\n",
    "    # Memory bandwidth estimate (an idealized model, not a hardware measurement)\n",
    "    bytes_per_elem = 4  # float32\n",
    "    unfused_memory_ops = 7  # 7 intermediate arrays\n",
    "    fused_memory_ops = 2  # read input, write output (what a true fused kernel would need)\n",
    "\n",
    "    unfused_bandwidth = (unfused_memory_ops * size * size * bytes_per_elem) / (unfused_time / timing_iterations) / 1e9\n",
    "    fused_bandwidth = (fused_memory_ops * size * size * bytes_per_elem) / (fused_time / timing_iterations) / 1e9\n",
    "\n",
    "    print(f\"   Memory efficiency: {unfused_memory_ops}→{fused_memory_ops} memory ops\")\n",
    "    print(f\"   Effective bandwidth: {unfused_bandwidth:.1f}→{fused_bandwidth:.1f} GB/s\")\n",
    "\n",
    "    # Interpret results\n",
    "    if speedup > 1.5:\n",
    "        print(\"🚀 Excellent! Kernel fusion providing significant speedup\")\n",
    "    elif speedup > 1.1:\n",
    "        print(\"✅ Good! Kernel fusion providing measurable benefit\")\n",
    "    else:\n",
    "        print(\"⚠️ Limited speedup - may be compute-bound or small tensor size\")\n",
    "\n",
    "    print(\"✅ Fusion performance analysis completed!\")\n",
    "\n",
    "test_unit_fusion_speedup()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adb97e5a",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 4. Integration - Mixed Precision Training: Memory and Speed\n",
    "\n",
    "### The Mixed Precision Revolution\n",
    "\n",
    "Modern GPUs (like V100, A100) have specialized **Tensor Cores** that can perform FP16 operations much faster than FP32:\n",
    "\n",
    "```\n",
    "Performance Comparison (Theoretical Peak):\n",
    "┌─────────────────┬────────────────┬────────────────┐\n",
    "│ Precision       │ V100 TFLOPS    │ A100 TFLOPS    │\n",
    "├─────────────────┼────────────────┼────────────────┤\n",
    "│ FP32 (float)    │ 15.7           │ 19.5           │\n",
    "│ FP16 (half)     │ 125.0          │ 312.0          │\n",
    "│ Speedup         │ 8×             │ 16×            │\n",
    "└─────────────────┴────────────────┴────────────────┘\n",
    "```\n",
    "\n",
    "### The Challenge: FP16 Precision Limitations\n",
    "\n",
    "FP16 has a much smaller range than FP32:\n",
    "\n",
    "```\n",
    "FP32 (32-bit):                        FP16 (16-bit):\n",
    "┌──────────────────────────┐          ┌─────────────────┐\n",
    "│ Sign │ 8-bit │  23-bit   │          │Sign│5-bit│10-bit│\n",
    "│ bit  │  Exp  │ Mantissa  │          │bit │ Exp │Mant. │\n",
    "└──────────────────────────┘          └─────────────────┘\n",
    "Range: ±3.4 × 10³⁸                    Range: ±6.5 × 10⁴\n",
    "Precision: ~7 decimal digits          Precision: ~3 decimal digits\n",
    "\n",
    "Problem: Gradients below ~6e-5 lose precision in FP16 (subnormal range),\n",
    "         and gradients below ~6e-8 become ZERO!\n",
    "```\n",
    "\n",
    "### The Solution: Automatic Loss Scaling\n",
    "\n",
    "```\n",
    "Training Step Without Scaling:        Training Step With Scaling:\n",
    "\n",
    "Loss = 0.0001                         Loss = 0.0001\n",
    "     ↓                                     ↓\n",
    "Gradients = 1e-8                      Scale × 1024\n",
    "     ↓                                     ↓\n",
    "Convert to FP16                       Loss = 0.1024\n",
    "     ↓                                     ↓\n",
    "Gradients = 0.0 (UNDERFLOW!)          Gradients = 1.024e-5\n",
    "     ↓                                     ↓\n",
    "No learning!                          Convert to FP16: ≈1.02e-5 ✓\n",
    "                                           ↓\n",
    "                                      Unscale in FP32: 1.024e-5 / 1024 = 1e-8\n",
    "                                           ↓\n",
    "                                      Successful learning!\n",
    "```\n",
    "\n",
    "### Mixed Precision Memory Benefits\n",
    "\n",
    "```\n",
    "Model Component Breakdown:\n",
    "┌─────────────────┬─────────────┬─────────────┬─────────────┐\n",
    "│ Component       │ FP32 Memory │ FP16 Memory │ Savings     │\n",
    "├─────────────────┼─────────────┼─────────────┼─────────────┤\n",
    "│ Parameters      │ 4N          │ 4N          │ 0%          │\n",
    "│ Gradients       │ 4N          │ 2N          │ 50%         │\n",
    "│ Activations     │ 4A          │ 2A          │ 50%         │\n",
    "│ Optimizer State │ 8N          │ 8N          │ 0%          │\n",
    "├─────────────────┼─────────────┼─────────────┼─────────────┤\n",
    "│ Total Typical   │ ~20N        │ ~16N        │ 20%         │\n",
    "│ Activation-Heavy│ ~40N        │ ~26N        │ 35%         │\n",
    "└─────────────────┴─────────────┴─────────────┴─────────────┘\n",
    "\n",
    "N = parameter count, A = number of stored activation values (memory in bytes)\n",
    "Totals assume A = N (typical) and A = 6N (activation-heavy)\n",
    "```"
   ]
  },
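  {
   "cell_type": "markdown",
   "id": "loss-scaling-worked-example",
   "metadata": {},
   "source": [
    "The underflow problem and its fix can be checked with a short worked example. The FP16 limits below follow from the IEEE 754 half-precision layout (5 exponent bits, 10 mantissa bits); the gradient value $g$ and the scale $s$ are illustrative assumptions chosen so that $g$ is small enough to underflow.\n",
    "\n",
    "$$x_{\\max} = 65504, \\qquad x_{\\min}^{\\text{normal}} = 2^{-14} \\approx 6.1 \\times 10^{-5}, \\qquad x_{\\min}^{\\text{subnormal}} = 2^{-24} \\approx 6.0 \\times 10^{-8}$$\n",
    "\n",
    "$$g = 10^{-8} < 2^{-25} \\;\\Rightarrow\\; \\text{fp16}(g) = 0$$\n",
    "\n",
    "$$s \\cdot g = 1024 \\times 10^{-8} = 1.024 \\times 10^{-5} \\geq 2^{-24} \\;\\Rightarrow\\; \\text{fp16}(s \\cdot g) \\neq 0, \\qquad \\frac{\\text{fp16}(s \\cdot g)}{s} \\approx 10^{-8}\\ \\text{(computed in FP32)}$$\n",
    "\n",
    "The scale also has an upper limit: the largest scaled gradient must stay below $x_{\\max} = 65504$, otherwise it overflows to infinity. This is why the scale is lowered whenever an overflow is detected."
   ]
  },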
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a19b2a6",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "mixed-precision-trainer",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "class MixedPrecisionTrainer:\n",
    "    \"\"\"\n",
    "    Mixed precision trainer with automatic loss scaling.\n",
    "\n",
    "    Implements the same pattern as PyTorch's Automatic Mixed Precision (AMP):\n",
    "    1. Forward pass in FP16 for speed and memory efficiency\n",
    "    2. Loss scaling to prevent gradient underflow\n",
    "    3. Gradient computation and unscaling\n",
    "    4. Parameter updates in FP32 for numerical stability\n",
    "\n",
    "    The key insight: keep different parts of training in optimal precision.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, model, optimizer, loss_scale: float = 1024.0, max_loss_scale: float = 65536.0):\n",
    "        \"\"\"\n",
    "        Initialize mixed precision training infrastructure.\n",
    "\n",
    "        TODO: Set up automatic loss scaling and overflow detection\n",
    "\n",
    "        APPROACH:\n",
    "        1. Store model and optimizer references\n",
    "        2. Initialize dynamic loss scaling parameters\n",
    "        3. Set up overflow detection and scale adjustment logic\n",
    "\n",
    "        Args:\n",
    "            model: Neural network model\n",
    "            optimizer: Parameter optimizer (SGD, Adam, etc.)\n",
    "            loss_scale: Initial scaling factor for gradients\n",
    "            max_loss_scale: Maximum allowed loss scale\n",
    "\n",
    "        LOSS SCALING STRATEGY:\n",
    "        - Start with reasonable scale (1024)\n",
    "        - Increase gradually if no overflow (better precision)\n",
    "        - Decrease immediately on overflow (stability)\n",
    "        - This balances numerical precision with training stability\n",
    "\n",
    "        HINTS:\n",
    "        - Track consecutive successful steps for scale increases\n",
    "        - Use exponential backoff on overflow detection\n",
    "        - Keep scale within reasonable bounds [1, 65536]\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.model = model\n",
    "        self.optimizer = optimizer\n",
    "\n",
    "        # Loss scaling parameters\n",
    "        self.loss_scale = loss_scale\n",
    "        self.max_loss_scale = max_loss_scale\n",
    "        self.min_loss_scale = 1.0\n",
    "\n",
    "        # Dynamic scaling parameters\n",
    "        self.scale_growth_factor = 2.0   # Multiply by 2 when increasing\n",
    "        self.scale_backoff_factor = 0.5  # Divide by 2 when decreasing\n",
    "        self.growth_interval = 2000      # Steps between scale increases\n",
    "        self.steps_since_last_scale_update = 0\n",
    "\n",
    "        # Overflow tracking\n",
    "        self.overflow_detected = False\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def scale_loss(self, loss: Tensor) -> Tensor:\n",
    "        \"\"\"\n",
    "        Scale loss to prevent gradient underflow in FP16.\n",
    "\n",
    "        The fundamental challenge: FP16 keeps full precision only for magnitudes ≥ ~6e-5.\n",
    "        Smaller gradients (common in deep networks) lose precision and, below ~6e-8,\n",
    "        become zero without scaling.\n",
    "\n",
    "        TODO: Apply loss scaling for mixed precision stability\n",
    "\n",
    "        APPROACH:\n",
    "        1. Multiply loss by current scale factor\n",
    "        2. This amplifies gradients proportionally\n",
    "        3. Return scaled loss for backward pass\n",
    "\n",
    "        MATHEMATICAL INSIGHT:\n",
    "        If loss = 1e-6 and scale = 1024:\n",
    "            scaled_loss = 1e-6 × 1024 = 1.024e-3\n",
    "\n",
    "        After backward pass:\n",
    "            scaled_gradients = d(scaled_loss)/dparam = 1024 × dloss/dparam\n",
    "\n",
    "        These larger gradients survive FP16 conversion!\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> trainer = MixedPrecisionTrainer(model, optimizer)\n",
    "        >>> loss = Tensor([0.0001])  # Small loss\n",
    "        >>> scaled = trainer.scale_loss(loss)\n",
    "        >>> print(scaled.data)  # [0.1024] (0.0001 × 1024)\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Scale the loss to amplify gradients\n",
    "        # This prevents gradient underflow in FP16 arithmetic\n",
    "        # (simulation: the scaling is applied to the raw loss data)\n",
    "        scaled_data = loss.data * self.loss_scale\n",
    "        return Tensor(scaled_data)\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def unscale_gradients(self, parameters: List[Tensor]) -> bool:\n",
    "        \"\"\"\n",
    "        Unscale gradients and detect overflow from FP16 conversion.\n",
    "\n",
    "        After backward pass on scaled loss, gradients are scaled too.\n",
    "        We must unscale them AND check for overflow (inf/nan values).\n",
    "\n",
    "        TODO: Implement gradient unscaling with overflow detection\n",
    "\n",
    "        APPROACH:\n",
    "        1. Divide all gradients by loss scale (restore original magnitude)\n",
    "        2. Check for inf/nan values (indicates FP16 overflow)\n",
    "        3. Return True if gradients are valid, False if overflow detected\n",
    "\n",
    "        OVERFLOW DETECTION:\n",
    "        inf/nan in gradients indicates:\n",
    "        - Gradient magnitude too large for FP16\n",
    "        - Numerical instability in computation\n",
    "        - Loss scale too aggressive\n",
    "\n",
    "        When overflow occurs:\n",
    "        - Skip parameter update (unstable gradients)\n",
    "        - Reduce loss scale for next iteration\n",
    "        - Continue training with lower scale\n",
    "\n",
    "        HINTS:\n",
    "        - Use np.isfinite() to detect inf/nan efficiently\n",
    "        - Process all parameters even if overflow found\n",
    "        - Set self.overflow_detected flag for scale adjustment\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.overflow_detected = False\n",
    "\n",
    "        # Unscale all gradients and check for overflow\n",
    "        for param in parameters:\n",
    "            if param.grad is not None:\n",
    "                # Unscale gradients to original magnitude\n",
    "                param.grad.data = param.grad.data / self.loss_scale\n",
    "\n",
    "                # Check for overflow (inf/nan values)\n",
    "                if not np.all(np.isfinite(param.grad.data)):\n",
    "                    self.overflow_detected = True\n",
    "                    # Continue processing to unscale all gradients\n",
    "\n",
    "        return not self.overflow_detected\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def update_loss_scale(self):\n",
    "        \"\"\"\n",
    "        Dynamically adjust loss scale based on training stability.\n",
    "\n",
    "        Implements the \"Goldilocks\" principle for loss scaling:\n",
    "        - Too low: precision loss from small gradients\n",
    "        - Too high: overflow and instability\n",
    "        - Just right: maximum precision without overflow\n",
    "\n",
    "        TODO: Implement adaptive loss scale adjustment\n",
    "\n",
    "        APPROACH:\n",
    "        1. If overflow detected: reduce scale immediately (stability)\n",
    "        2. If no overflow for many steps: increase scale (precision)\n",
    "        3. Keep scale within reasonable bounds\n",
    "\n",
    "        SCALING STRATEGY:\n",
    "        - Aggressive reduction on overflow (×0.5)\n",
    "        - Conservative growth during stability (×2 every 2000 steps)\n",
    "        - This favors stability over maximum precision\n",
    "\n",
    "        WHY THIS WORKS:\n",
    "        - Most training is stable (gradual scale increase)\n",
    "        - Occasional instability (rapid scale decrease)\n",
    "        - Converges to optimal scale for current training phase\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if self.overflow_detected:\n",
    "            # Immediately reduce scale on overflow\n",
    "            self.loss_scale = max(\n",
    "                self.min_loss_scale,\n",
    "                self.loss_scale * self.scale_backoff_factor\n",
    "            )\n",
    "            self.steps_since_last_scale_update = 0\n",
    "        else:\n",
    "            # Gradually increase scale if stable\n",
    "            self.steps_since_last_scale_update += 1\n",
    "            if self.steps_since_last_scale_update >= self.growth_interval:\n",
    "                self.loss_scale = min(\n",
    "                    self.max_loss_scale,\n",
    "                    self.loss_scale * self.scale_growth_factor\n",
    "                )\n",
    "                self.steps_since_last_scale_update = 0\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def train_step(self, batch: Tuple[Tensor, Tensor]) -> Dict[str, float]:\n",
    "        \"\"\"\n",
    "        Execute complete mixed precision training step.\n",
    "\n",
    "        Orchestrates the entire mixed precision training process:\n",
    "        1. Forward pass (FP16 in real implementation)\n",
    "        2. Loss computation and scaling\n",
    "        3. Backward pass on scaled loss\n",
    "        4. Gradient unscaling and overflow detection\n",
    "        5. Conditional parameter update\n",
    "        6. Loss scale adjustment\n",
    "\n",
    "        TODO: Implement end-to-end mixed precision training step\n",
    "\n",
    "        APPROACH:\n",
    "        1. Clear gradients from previous step\n",
    "        2. Forward pass through model\n",
    "        3. Compute and scale loss\n",
    "        4. Backward pass to compute scaled gradients\n",
    "        5. Unscale gradients and check for overflow\n",
    "        6. Update parameters only if no overflow\n",
    "        7. Adjust loss scale based on stability\n",
    "\n",
    "        CRITICAL INSIGHT:\n",
    "        Skip parameter updates on overflow! Gradients containing inf/nan\n",
    "        would corrupt the parameters instead of improving them.\n",
    "\n",
    "        RETURN FORMAT:\n",
    "        Dictionary with training metrics:\n",
    "        - loss: unscaled loss value\n",
    "        - loss_scale: current scaling factor\n",
    "        - overflow: whether overflow occurred\n",
    "        - gradients_valid: whether update was applied\n",
    "\n",
    "        HINTS:\n",
    "        - Use self.optimizer.zero_grad() to clear gradients\n",
    "        - Get parameters with gradients for unscaling\n",
    "        - Only call optimizer.step() if gradients are valid\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        inputs, targets = batch\n",
    "\n",
    "        # Clear gradients from previous step\n",
    "        self.optimizer.zero_grad()\n",
    "\n",
    "        # Forward pass (would use FP16 autocast in real implementation)\n",
    "        # For simulation, we work in FP32 but apply scaling principles\n",
    "        outputs = self.model(inputs)\n",
    "\n",
    "        # Compute loss (unscaled)\n",
    "        loss = self._compute_loss(outputs, targets)\n",
    "\n",
    "        # Scale loss for mixed precision\n",
    "        scaled_loss = self.scale_loss(loss)\n",
    "\n",
    "        # Backward pass on scaled loss\n",
    "        scaled_loss.backward()\n",
    "\n",
    "        # Get all parameters with gradients\n",
    "        parameters = [p for p in self.model.parameters() if p.grad is not None]\n",
    "\n",
    "        # Unscale gradients and detect overflow\n",
    "        gradients_valid = self.unscale_gradients(parameters)\n",
    "\n",
    "        # Update parameters only if no overflow\n",
    "        if gradients_valid:\n",
    "            self.optimizer.step()\n",
    "\n",
    "        # Adjust loss scale based on stability\n",
    "        self.update_loss_scale()\n",
    "\n",
    "        # Return training metrics\n",
    "        return {\n",
    "            'loss': loss.data.item() if hasattr(loss.data, 'item') else float(loss.data),\n",
    "            'loss_scale': self.loss_scale,\n",
    "            'overflow': self.overflow_detected,\n",
    "            'gradients_valid': gradients_valid\n",
    "        }\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def _compute_loss(self, outputs: Tensor, targets: Tensor) -> Tensor:\n",
    "        \"\"\"Simple MSE loss for demonstration purposes.\"\"\"\n",
    "        diff = Tensor(outputs.data - targets.data)\n",
    "        return Tensor(np.mean(diff.data**2))"
   ]
  },
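The dynamic scaling policy in `update_loss_scale` can be studied in isolation. Below is a minimal standalone sketch of the same halve-on-overflow, double-after-a-stable-interval rule; the `DynamicLossScaler` name and its `update` method are hypothetical helpers introduced for illustration and are not part of TinyTorch.

```python
from dataclasses import dataclass


@dataclass
class DynamicLossScaler:
    """Hypothetical helper mirroring the trainer's scale-adjustment rule."""
    scale: float = 1024.0
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 2000
    min_scale: float = 1.0
    max_scale: float = 65536.0
    stable_steps: int = 0

    def update(self, overflow: bool) -> float:
        if overflow:
            # Back off immediately and restart the stability counter
            self.scale = max(self.min_scale, self.scale * self.backoff_factor)
            self.stable_steps = 0
        else:
            # Grow only after a full interval of overflow-free steps
            self.stable_steps += 1
            if self.stable_steps >= self.growth_interval:
                self.scale = min(self.max_scale, self.scale * self.growth_factor)
                self.stable_steps = 0
        return self.scale


# A short growth interval (assumed value) makes the behaviour visible quickly
scaler = DynamicLossScaler(growth_interval=3)
history = [scaler.update(flag) for flag in [True, False, False, False, True]]
print(history)  # [512.0, 512.0, 512.0, 1024.0, 512.0]
```

Keeping the policy in a small object like this makes it easy to unit test the schedule without running a model.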
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "650bf77c",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-mixed-precision",
     "locked": true,
     "points": 15
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_mixed_precision():\n",
    "    \"\"\"🔬 Test mixed precision training components comprehensively.\"\"\"\n",
    "    print(\"🔬 Unit Test: Mixed Precision Training...\")\n",
    "\n",
    "    # Create mock model and optimizer for testing\n",
    "    class MockModel:\n",
    "        def __init__(self):\n",
    "            self.weight = Tensor(np.random.randn(10, 5).astype(np.float32))\n",
    "            self.weight.grad = None\n",
    "\n",
    "        def __call__(self, x):\n",
    "            return x.matmul(self.weight)\n",
    "\n",
    "        def parameters(self):\n",
    "            return [self.weight]\n",
    "\n",
    "    class MockOptimizer:\n",
    "        def __init__(self, params):\n",
    "            self.params = params\n",
    "            self.updates_applied = 0\n",
    "\n",
    "        def zero_grad(self):\n",
    "            for p in self.params:\n",
    "                p.grad = None\n",
    "\n",
    "        def step(self):\n",
    "            for p in self.params:\n",
    "                if p.grad is not None:\n",
    "                    p.data = p.data - 0.01 * p.grad.data\n",
    "            self.updates_applied += 1\n",
    "\n",
    "    # Initialize mixed precision trainer\n",
    "    model = MockModel()\n",
    "    optimizer = MockOptimizer(model.parameters())\n",
    "    trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=1024.0)\n",
    "\n",
    "    # Test 1: Loss scaling\n",
    "    print(\"   Testing loss scaling...\")\n",
    "    loss = Tensor([0.001])\n",
    "    scaled_loss = trainer.scale_loss(loss)\n",
    "    expected_scaled = 0.001 * 1024.0\n",
    "    assert np.isclose(scaled_loss.data[0], expected_scaled), \\\n",
    "        f\"Loss scaling failed: expected {expected_scaled}, got {scaled_loss.data[0]}\"\n",
    "\n",
    "    # Test 2: Gradient unscaling (normal case)\n",
    "    print(\"   Testing gradient unscaling...\")\n",
    "    model.weight.grad = Tensor(np.full((10, 5), 1024.0))  # Simulate scaled gradients\n",
    "    valid = trainer.unscale_gradients([model.weight])\n",
    "    assert valid, \"Should detect valid gradients\"\n",
    "    assert np.allclose(model.weight.grad.data, 1.0), \"Gradient unscaling failed\"\n",
    "\n",
    "    # Test 3: Overflow detection\n",
    "    print(\"   Testing overflow detection...\")\n",
    "    model.weight.grad = Tensor(np.full((10, 5), np.inf))  # Simulate overflow\n",
    "    valid = trainer.unscale_gradients([model.weight])\n",
    "    assert not valid, \"Should detect overflow\"\n",
    "    assert trainer.overflow_detected, \"Overflow flag not set\"\n",
    "\n",
    "    # Test 4: Loss scale adjustment after overflow\n",
    "    print(\"   Testing loss scale adjustment...\")\n",
    "    initial_scale = trainer.loss_scale\n",
    "    trainer.update_loss_scale()  # Should reduce scale due to overflow\n",
    "    assert trainer.loss_scale < initial_scale, \\\n",
    "        f\"Scale should decrease after overflow: {initial_scale} → {trainer.loss_scale}\"\n",
    "\n",
    "    # Test 5: Loss scale increase during stability\n",
    "    print(\"   Testing loss scale increase...\")\n",
    "    trainer.overflow_detected = False\n",
    "    trainer.steps_since_last_scale_update = 2000  # Simulate stable training\n",
    "    scale_before = trainer.loss_scale\n",
    "    trainer.update_loss_scale()\n",
    "    assert trainer.loss_scale > scale_before, \"Scale should increase during stability\"\n",
    "\n",
    "    # Test 6: End-to-end training step\n",
    "    print(\"   Testing complete training step...\")\n",
    "    inputs = Tensor(np.random.randn(8, 10).astype(np.float32))\n",
    "    targets = Tensor(np.random.randn(8, 5).astype(np.float32))\n",
    "\n",
    "    initial_updates = optimizer.updates_applied\n",
    "    metrics = trainer.train_step((inputs, targets))\n",
    "\n",
    "    # Verify metrics structure\n",
    "    required_keys = ['loss', 'loss_scale', 'overflow', 'gradients_valid']\n",
    "    for key in required_keys:\n",
    "        assert key in metrics, f\"Missing metric: {key}\"\n",
    "\n",
    "    # Verify loss is reasonable\n",
    "    assert isinstance(metrics['loss'], (int, float)), \"Loss should be numeric\"\n",
    "    assert metrics['loss'] >= 0, \"Loss should be non-negative\"\n",
    "\n",
    "    # Verify loss scale is positive\n",
    "    assert metrics['loss_scale'] > 0, \"Loss scale should be positive\"\n",
    "\n",
    "    print(\"✅ Mixed precision training works correctly!\")\n",
    "\n",
    "test_unit_mixed_precision()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de9e4b44",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 5. Systems Analysis - Performance Scaling Patterns\n",
    "\n",
    "Let's analyze how our acceleration techniques perform across different scenarios and understand their scaling characteristics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f7edfee",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "analyze-vectorization",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def analyze_vectorization_scaling():\n",
    "    \"\"\"📊 Analyze vectorization performance across different tensor sizes.\"\"\"\n",
    "    print(\"📊 Analyzing vectorization scaling behavior...\")\n",
    "\n",
    "    # Test sizes spanning different cache regimes\n",
    "    sizes = [64, 128, 256, 512, 1024, 2048]\n",
    "\n",
    "    print(\"\\n🔍 Vectorization Scaling Analysis:\")\n",
    "    print(\"┌─────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n",
    "    print(\"│ Size    │ Time (ms)   │ GFLOPS      │ Bandwidth   │ Efficiency  │\")\n",
    "    print(\"│         │             │             │ (GB/s)      │ (% of peak) │\")\n",
    "    print(\"├─────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n",
    "\n",
    "    for size in sizes:\n",
    "        # Create test matrices\n",
    "        a = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "        b = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "\n",
    "        # Warm up\n",
    "        for _ in range(2):\n",
    "            _ = vectorized_matmul(a, b)\n",
    "\n",
    "        # Time vectorized implementation (perf_counter is a monotonic, high-resolution clock)\n",
    "        iterations = max(1, 100 // (size // 64))  # Fewer iterations for larger sizes\n",
    "        start = time.perf_counter()\n",
    "        for _ in range(iterations):\n",
    "            _ = vectorized_matmul(a, b)\n",
    "        elapsed = (time.perf_counter() - start) / iterations\n",
    "\n",
    "        # Calculate performance metrics\n",
    "        flops = 2 * size**3  # 2N³ FLOPs for matrix multiplication\n",
    "        gflops = flops / (elapsed * 1e9)\n",
    "\n",
    "        bytes_accessed = 3 * size * size * 4  # 3 matrices × size² × 4 bytes\n",
    "        bandwidth = bytes_accessed / (elapsed * 1e9)\n",
    "\n",
    "        # Estimate efficiency (rough baseline: modern CPU ~100-500 GFLOPS peak)\n",
    "        estimated_peak_gflops = 200  # Conservative estimate\n",
    "        efficiency = min(100, gflops / estimated_peak_gflops * 100)\n",
    "\n",
    "        print(f\"│ {size:6d}  │ {elapsed*1000:9.2f}   │ {gflops:9.1f}   │ {bandwidth:9.1f}   │ {efficiency:9.1f}   │\")\n",
    "\n",
    "    print(\"└─────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n",
    "\n",
    "    print(\"\\n💡 Vectorization insights:\")\n",
    "    print(\"   • Small matrices: Limited by overhead and cache effects\")\n",
    "    print(\"   • Medium matrices: Sweet spot for cache reuse\")\n",
    "    print(\"   • Large matrices: Memory bandwidth becomes limiting factor\")\n",
    "    print(\"   • BLAS libraries automatically optimize for each size regime\")\n",
    "    print(\"🚀 Vectorization effectiveness depends on problem size and hardware\")\n",
    "\n",
    "analyze_vectorization_scaling()"
   ]
  },
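The timing pattern used above (warm up, time several iterations, divide FLOPs by elapsed seconds) can be reproduced with plain NumPy. This is a minimal sketch that assumes `np.matmul` as a stand-in for `vectorized_matmul`; the helper names `matmul_flops` and `measure_gflops` are hypothetical and introduced only for illustration.

```python
import time

import numpy as np


def matmul_flops(n: int) -> int:
    """FLOPs for an n×n by n×n multiply: n³ multiplications plus n³ additions."""
    return 2 * n**3


def measure_gflops(n: int, iterations: int = 5) -> float:
    """Time np.matmul on random float32 matrices and return achieved GFLOPS."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)).astype(np.float32)
    b = rng.standard_normal((n, n)).astype(np.float32)

    np.matmul(a, b)  # Warm-up run so one-time setup costs are not timed

    start = time.perf_counter()
    for _ in range(iterations):
        np.matmul(a, b)
    # Guard against a zero reading on very coarse timers
    elapsed = max((time.perf_counter() - start) / iterations, 1e-12)

    return matmul_flops(n) / (elapsed * 1e9)


for n in (64, 256):
    print(f"n={n}: {measure_gflops(n):.1f} GFLOPS")
```

The printed numbers depend entirely on the machine and the BLAS library NumPy is linked against, so they are best read as relative trends across sizes.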
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5972a039",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "analyze-arithmetic-intensity",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def analyze_arithmetic_intensity():\n",
    "    \"\"\"📊 Demonstrate the roofline model with different operations.\"\"\"\n",
    "    print(\"📊 Analyzing arithmetic intensity patterns...\")\n",
    "\n",
    "    size = 1024\n",
    "    iterations = 10\n",
    "\n",
    "    operations = []\n",
    "\n",
    "    # Create test data\n",
    "    x = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "    y = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "\n",
    "    print(\"\\n🎯 Arithmetic Intensity Analysis:\")\n",
    "    print(\"┌─────────────────────┬─────────┬─────────────┬─────────────┬─────────────┐\")\n",
    "    print(\"│ Operation           │ AI      │ Time (ms)   │ GFLOPS      │ GB/s        │\")\n",
    "    print(\"│                     │(FLOPs/B)│             │             │             │\")\n",
    "    print(\"├─────────────────────┼─────────┼─────────────┼─────────────┼─────────────┤\")\n",
    "\n",
    "    # 1. Element-wise addition (very low arithmetic intensity)\n",
    "    start = time.perf_counter()\n",
    "    for _ in range(iterations):\n",
    "        _ = Tensor(x.data + y.data)\n",
    "    add_time = (time.perf_counter() - start) / iterations\n",
    "\n",
    "    add_flops = size * size  # One addition per element\n",
    "    add_bytes = 3 * size * size * 4  # Read x, read y, write result\n",
    "    add_ai = add_flops / add_bytes\n",
    "    add_gflops = add_flops / (add_time * 1e9)\n",
    "    add_bandwidth = add_bytes / (add_time * 1e9)\n",
    "\n",
    "    print(f\"│ Element-wise Add    │ {add_ai:6.3f}  │ {add_time*1000:9.2f}   │ {add_gflops:9.1f}   │ {add_bandwidth:9.1f}   │\")\n",
    "\n",
    "    # 2. Element-wise multiply (same arithmetic intensity as addition)\n",
    "    start = time.perf_counter()\n",
    "    for _ in range(iterations):\n",
    "        _ = Tensor(x.data * y.data)\n",
    "    mul_time = (time.perf_counter() - start) / iterations\n",
    "\n",
    "    mul_flops = size * size\n",
    "    mul_bytes = 3 * size * size * 4\n",
    "    mul_ai = mul_flops / mul_bytes\n",
    "    mul_gflops = mul_flops / (mul_time * 1e9)\n",
    "    mul_bandwidth = mul_bytes / (mul_time * 1e9)\n",
    "\n",
    "    print(f\"│ Element-wise Mult   │ {mul_ai:6.3f}  │ {mul_time*1000:9.2f}   │ {mul_gflops:9.1f}   │ {mul_bandwidth:9.1f}   │\")\n",
    "\n",
    "    # 3. GELU (medium arithmetic intensity)\n",
    "    start = time.perf_counter()\n",
    "    for _ in range(iterations):\n",
    "        _ = fused_gelu(x)\n",
    "    gelu_time = (time.perf_counter() - start) / iterations\n",
    "\n",
    "    gelu_flops = size * size * 8  # Approximate: x³, add, mul, tanh, etc.\n",
    "    gelu_bytes = 2 * size * size * 4  # Read x, write result\n",
    "    gelu_ai = gelu_flops / gelu_bytes\n",
    "    gelu_gflops = gelu_flops / (gelu_time * 1e9)\n",
    "    gelu_bandwidth = gelu_bytes / (gelu_time * 1e9)\n",
    "\n",
    "    print(f\"│ Fused GELU          │ {gelu_ai:6.3f}  │ {gelu_time*1000:9.2f}   │ {gelu_gflops:9.1f}   │ {gelu_bandwidth:9.1f}   │\")\n",
    "\n",
    "    # 4. Matrix multiplication (high arithmetic intensity)\n",
    "    start = time.perf_counter()\n",
    "    for _ in range(iterations):\n",
    "        _ = vectorized_matmul(x, y)\n",
    "    matmul_time = (time.perf_counter() - start) / iterations\n",
    "\n",
    "    matmul_flops = 2 * size**3  # 2N³ FLOPs\n",
    "    matmul_bytes = 3 * size * size * 4  # 3 matrices\n",
    "    matmul_ai = matmul_flops / matmul_bytes\n",
    "    matmul_gflops = matmul_flops / (matmul_time * 1e9)\n",
    "    matmul_bandwidth = matmul_bytes / (matmul_time * 1e9)\n",
    "\n",
    "    print(f\"│ Matrix Multiply     │ {matmul_ai:6.3f}  │ {matmul_time*1000:9.2f}   │ {matmul_gflops:9.1f}   │ {matmul_bandwidth:9.1f}   │\")\n",
    "\n",
    "    print(\"└─────────────────────┴─────────┴─────────────┴─────────────┴─────────────┘\")\n",
    "\n",
    "    print(\"\\n💡 Roofline Model Insights:\")\n",
    "    print(\"   📊 Low AI (< 1): Memory bound - limited by bandwidth\")\n",
    "    print(\"   📊 Med AI (1-10): Transitional - depends on implementation\")\n",
    "    print(\"   📊 High AI (> 10): Compute bound - limited by ALU throughput\")\n",
    "    print(f\"   🎯 Matrix multiplication ({matmul_ai:.1f} AI) is ideal for GPUs/TPUs\")\n",
    "    print(f\"   ⚡ Element-wise ops ({add_ai:.3f} AI) need memory optimization\")\n",
    "    print(\"🚀 Design algorithms with high arithmetic intensity for performance\")\n",
    "\n",
    "analyze_arithmetic_intensity()"
   ]
  },
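The arithmetic intensity figures reported by the analysis follow from simple counting, so they can be derived without timing anything. The sketch below reuses the FLOP and byte counts from the cell above and adds a basic roofline bound; the function names and the 50 GB/s bandwidth figure are assumptions for illustration, while the 200 GFLOPS peak matches the rough estimate used earlier in this notebook.

```python
def elementwise_ai(n: int, bytes_per_elem: int = 4) -> float:
    """One FLOP per element; read two n×n inputs and write one n×n output."""
    flops = n * n
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved


def matmul_ai(n: int, bytes_per_elem: int = 4) -> float:
    """2n³ FLOPs over three n×n matrices (minimum possible memory traffic)."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved


def attainable_gflops(ai: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: limited by compute peak or by bandwidth × intensity."""
    return min(peak_gflops, ai * bandwidth_gbs)


n = 1024
add_ai = elementwise_ai(n)   # 1/12 FLOPs per byte
mm_ai = matmul_ai(n)         # n/6 FLOPs per byte

# Assumed hardware numbers for illustration only
peak, bandwidth = 200.0, 50.0
print(f"add:    AI={add_ai:.3f}, bound={attainable_gflops(add_ai, peak, bandwidth):.1f} GFLOPS")
print(f"matmul: AI={mm_ai:.1f}, bound={attainable_gflops(mm_ai, peak, bandwidth):.1f} GFLOPS")
```

Under these assumed numbers the element-wise add is capped by memory bandwidth at a few GFLOPS, while the matrix multiply can reach the compute peak.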
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a539cd5",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "analyze-mixed-precision-benefits",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def analyze_mixed_precision_benefits():\n",
    "    \"\"\"📊 Quantify mixed precision memory and performance benefits.\"\"\"\n",
    "    print(\"📊 Analyzing mixed precision benefits across model sizes...\")\n",
    "\n",
    "    # Define representative model configurations\n",
    "    model_configs = [\n",
    "        (\"Tiny CNN\", {\"params\": 50_000, \"activations\": 100_000}),\n",
    "        (\"Small BERT\", {\"params\": 10_000_000, \"activations\": 5_000_000}),\n",
    "        (\"Medium GPT\", {\"params\": 100_000_000, \"activations\": 50_000_000}),\n",
    "        (\"Large Transformer\", {\"params\": 1_000_000_000, \"activations\": 500_000_000}),\n",
    "    ]\n",
    "\n",
    "    print(\"\\n🧮 Mixed Precision Memory Analysis:\")\n",
    "    print(\"┌─────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n",
    "    print(\"│ Model Type      │ Parameters  │ FP32 Memory │ FP16 Memory │ Savings     │\")\n",
    "    print(\"│                 │             │ (GB)        │ (GB)        │ (%)         │\")\n",
    "    print(\"├─────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n",
    "\n",
    "    for name, config in model_configs:\n",
    "        param_count = config[\"params\"]\n",
    "        activation_count = config[\"activations\"]\n",
    "\n",
    "        # Memory calculation (bytes)\n",
    "        # Parameters: always FP32 for stability\n",
    "        param_memory = param_count * 4\n",
    "\n",
    "        # FP32 training memory\n",
    "        fp32_activations = activation_count * 4\n",
    "        fp32_gradients = param_count * 4\n",
    "        fp32_optimizer = param_count * 8  # Adam: momentum + velocity\n",
    "        fp32_total = param_memory + fp32_activations + fp32_gradients + fp32_optimizer\n",
    "\n",
    "        # Mixed precision memory\n",
    "        fp16_activations = activation_count * 2  # FP16 activations\n",
    "        fp16_gradients = param_count * 2  # FP16 gradients during backward\n",
    "        mixed_total = param_memory + fp16_activations + fp16_gradients + fp32_optimizer\n",
    "\n",
    "        # Calculate savings\n",
    "        savings_gb = (fp32_total - mixed_total) / 1e9\n",
    "        savings_pct = (fp32_total - mixed_total) / fp32_total * 100\n",
    "\n",
    "        print(f\"│ {name:14s}  │ {param_count:10,d}  │ {fp32_total/1e9:9.1f}   │ {mixed_total/1e9:9.1f}   │ {savings_pct:9.1f}   │\")\n",
    "\n",
    "    print(\"└─────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n",
    "\n",
    "    # Performance simulation\n",
    "    print(\"\\n⚡ Mixed Precision Performance Simulation:\")\n",
    "\n",
    "    # Simulate different batch sizes to show memory pressure\n",
    "    batch_sizes = [8, 16, 32, 64]\n",
    "    hidden_size = 1024\n",
    "    seq_length = 512\n",
    "\n",
    "    print(\"┌─────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n",
    "    print(\"│ Batch Size  │ FP32 Mem    │ FP16 Mem    │ Throughput  │ Efficiency  │\")\n",
    "    print(\"│             │ (GB)        │ (GB)        │ Gain        │ Gain        │\")\n",
    "    print(\"├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n",
    "\n",
    "    for batch_size in batch_sizes:\n",
    "        # Memory for activations (dominant for large models)\n",
    "        elements = batch_size * seq_length * hidden_size\n",
    "\n",
    "        fp32_mem = elements * 4 / 1e9  # 4 bytes per FP32\n",
    "        fp16_mem = elements * 2 / 1e9  # 2 bytes per FP16\n",
    "\n",
    "        # Simulate throughput gains (based on Tensor Core speedups)\n",
    "        # Real speedups depend on hardware and operation mix\n",
    "        throughput_gain = 1.4  # Conservative estimate for mixed workloads\n",
    "\n",
    "        # Memory efficiency enables larger batch sizes\n",
    "        max_fp32_batch = 32  # Assume memory limit\n",
    "        max_fp16_batch = 64  # Double capacity with FP16\n",
    "\n",
    "        efficiency_gain = max_fp16_batch / max_fp32_batch if batch_size <= max_fp32_batch else \"OOM\"\n",
    "        efficiency_str = f\"{efficiency_gain:.1f}×\" if isinstance(efficiency_gain, float) else efficiency_gain\n",
    "\n",
    "        print(f\"│ {batch_size:10d}  │ {fp32_mem:9.2f}   │ {fp16_mem:9.2f}   │ {throughput_gain:9.1f}×  │ {efficiency_str:9s}   │\")\n",
    "\n",
    "    print(\"└─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n",
    "\n",
    "    print(\"\\n💡 Mixed Precision Key Benefits:\")\n",
    "    print(\"   🎯 Memory: 20-40% reduction enables larger models/batches\")\n",
    "    print(\"   ⚡ Speed: 1.3-2× throughput on modern hardware (V100+)\")\n",
    "    print(\"   📈 Scale: Essential for billion-parameter models\")\n",
    "    print(\"   ⚠️ Complexity: Requires careful loss scaling and overflow handling\")\n",
    "    print(\"🚀 Mixed precision is crucial for competitive ML training\")\n",
    "\n",
    "analyze_mixed_precision_benefits()"
   ]
  },
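The memory accounting in the analysis above reduces to a small formula: FP32 master weights and Adam state are always kept, while gradients and activations drop from 4 to 2 bytes per value. Here is a minimal sketch of that formula; `training_memory_bytes` is a hypothetical helper name, and the example counts are taken from the "Small BERT" row of the configuration list.

```python
def training_memory_bytes(params: int, activations: int, mixed_precision: bool = False) -> int:
    """Estimate training memory under the same assumptions as the analysis cell."""
    master_weights = params * 4    # FP32 parameters are kept in both modes
    optimizer_state = params * 8   # Adam: two FP32 moment buffers per parameter
    bytes_per_value = 2 if mixed_precision else 4
    gradients = params * bytes_per_value
    activation_memory = activations * bytes_per_value
    return master_weights + optimizer_state + gradients + activation_memory


# "Small BERT" configuration from the analysis: 10M parameters, 5M activations
fp32_bytes = training_memory_bytes(10_000_000, 5_000_000)
mixed_bytes = training_memory_bytes(10_000_000, 5_000_000, mixed_precision=True)
savings_pct = (fp32_bytes - mixed_bytes) / fp32_bytes * 100

print(f"FP32:  {fp32_bytes / 1e6:.0f} MB")
print(f"Mixed: {mixed_bytes / 1e6:.0f} MB")
print(f"Savings: {savings_pct:.1f}%")
```

Because the FP32 weights and optimizer state are unchanged, the relative savings grow only as activations take up a larger share of total memory.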
  {
   "cell_type": "markdown",
   "id": "d42aa6ff",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 6. Optimization Insights - Production Acceleration Strategy\n",
    "\n",
    "Understanding when and how to apply different acceleration techniques in real-world scenarios."
   ]
  },
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "133b1f71",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "acceleration-decision-framework",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def analyze_acceleration_decision_framework():\n",
|
||
" \"\"\"📊 Decision framework for choosing acceleration techniques.\"\"\"\n",
|
||
" print(\"📊 Acceleration Technique Decision Framework...\")\n",
|
||
"\n",
|
||
" # Define workload characteristics\n",
|
||
" workloads = [\n",
|
||
" (\"Research Training\", {\n",
|
||
" \"memory_pressure\": \"medium\",\n",
|
||
" \"latency_sensitive\": False,\n",
|
||
" \"stability_critical\": False,\n",
|
||
" \"development_speed\": \"high\",\n",
|
||
" \"hardware_variety\": \"high\"\n",
|
||
" }),\n",
|
||
" (\"Production Training\", {\n",
|
||
" \"memory_pressure\": \"high\",\n",
"            \"latency_sensitive\": False,\n",
"            \"stability_critical\": True,\n",
"            \"development_speed\": \"medium\",\n",
"            \"hardware_variety\": \"low\"\n",
"        }),\n",
"        (\"Real-time Inference\", {\n",
"            \"memory_pressure\": \"medium\",\n",
"            \"latency_sensitive\": True,\n",
"            \"stability_critical\": True,\n",
"            \"development_speed\": \"low\",\n",
"            \"hardware_variety\": \"medium\"\n",
"        }),\n",
"        (\"Edge Deployment\", {\n",
"            \"memory_pressure\": \"very_high\",\n",
"            \"latency_sensitive\": True,\n",
"            \"stability_critical\": True,\n",
"            \"development_speed\": \"low\",\n",
"            \"hardware_variety\": \"very_high\"\n",
"        }),\n",
"        (\"Batch Inference\", {\n",
"            \"memory_pressure\": \"low\",\n",
"            \"latency_sensitive\": False,\n",
"            \"stability_critical\": True,\n",
"            \"development_speed\": \"medium\",\n",
"            \"hardware_variety\": \"low\"\n",
"        })\n",
"    ]\n",
"\n",
"    # Define technique characteristics\n",
"    techniques = {\n",
"        \"Vectorization\": {\n",
"            \"implementation_cost\": \"low\",\n",
"            \"memory_benefit\": \"none\",\n",
"            \"latency_benefit\": \"high\",\n",
"            \"stability_risk\": \"none\",\n",
"            \"hardware_dependency\": \"low\"\n",
"        },\n",
"        \"Kernel Fusion\": {\n",
"            \"implementation_cost\": \"medium\",\n",
"            \"memory_benefit\": \"medium\",\n",
"            \"latency_benefit\": \"medium\",\n",
"            \"stability_risk\": \"low\",\n",
"            \"hardware_dependency\": \"medium\"\n",
"        },\n",
"        \"Mixed Precision\": {\n",
"            \"implementation_cost\": \"high\",\n",
"            \"memory_benefit\": \"high\",\n",
"            \"latency_benefit\": \"high\",\n",
"            \"stability_risk\": \"medium\",\n",
"            \"hardware_dependency\": \"high\"\n",
"        },\n",
"        \"Graph Optimization\": {\n",
"            \"implementation_cost\": \"very_high\",\n",
"            \"memory_benefit\": \"medium\",\n",
"            \"latency_benefit\": \"very_high\",\n",
"            \"stability_risk\": \"low\",\n",
"            \"hardware_dependency\": \"very_high\"\n",
"        }\n",
"    }\n",
"\n",
"    print(\"\\n🎯 Acceleration Technique Recommendations:\")\n",
"    print(\"┌─────────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n",
"    print(\"│ Workload            │ Vectorize   │ Fuse Kernels│ Mixed Prec  │ Graph Opt   │\")\n",
"    print(\"├─────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n",
"\n",
"    for workload_name, workload_chars in workloads:\n",
"        recommendations = []\n",
"\n",
"        for technique_name in [\"Vectorization\", \"Kernel Fusion\", \"Mixed Precision\", \"Graph Optimization\"]:\n",
"            tech_chars = techniques[technique_name]\n",
"            score = 0\n",
"\n",
"            # Benefit vs requirement matching\n",
"            if workload_chars[\"memory_pressure\"] in [\"high\", \"very_high\"]:\n",
"                if tech_chars[\"memory_benefit\"] in [\"medium\", \"high\"]:\n",
"                    score += 2\n",
"\n",
"            if workload_chars[\"latency_sensitive\"]:\n",
"                if tech_chars[\"latency_benefit\"] in [\"medium\", \"high\", \"very_high\"]:\n",
"                    score += 2\n",
"\n",
"            # Risk vs tolerance matching\n",
"            if workload_chars[\"stability_critical\"]:\n",
"                if tech_chars[\"stability_risk\"] in [\"none\", \"low\"]:\n",
"                    score += 1\n",
"                elif tech_chars[\"stability_risk\"] == \"medium\":\n",
"                    score -= 1\n",
"\n",
"            # Implementation cost vs development speed\n",
"            if workload_chars[\"development_speed\"] == \"high\":\n",
"                if tech_chars[\"implementation_cost\"] in [\"low\", \"medium\"]:\n",
"                    score += 1\n",
"                elif tech_chars[\"implementation_cost\"] in [\"high\", \"very_high\"]:\n",
"                    score -= 1\n",
"\n",
"            # Hardware dependency vs variety\n",
"            if workload_chars[\"hardware_variety\"] in [\"high\", \"very_high\"]:\n",
"                if tech_chars[\"hardware_dependency\"] in [\"low\", \"medium\"]:\n",
"                    score += 1\n",
"                elif tech_chars[\"hardware_dependency\"] in [\"high\", \"very_high\"]:\n",
"                    score -= 2\n",
"\n",
"            # Convert score to recommendation\n",
"            if score >= 3:\n",
"                rec = \"✅ High\"\n",
"            elif score >= 1:\n",
"                rec = \"⚡ Medium\"\n",
"            elif score >= 0:\n",
"                rec = \"⚠️ Low\"\n",
"            else:\n",
"                rec = \"❌ Skip\"\n",
"\n",
"            recommendations.append(rec)\n",
"\n",
"        rec_line = \" │ \".join(f\"{rec:10s}\" for rec in recommendations)\n",
"        print(f\"│ {workload_name:18s} │ {rec_line} │\")\n",
"\n",
"    print(\"└─────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n",
"\n",
"    # Implementation priority framework\n",
"    print(f\"\\n🛠️ Implementation Priority Framework:\")\n",
"    print(f\" 📊 Phase 1 (Always): Vectorization\")\n",
"    print(f\" • Low risk, high reward\")\n",
"    print(f\" • Works on any hardware\")\n",
"    print(f\" • Foundation for other optimizations\")\n",
"    print(f\" \")\n",
"    print(f\" 📊 Phase 2 (Memory constrained): Kernel Fusion\")\n",
"    print(f\" • Targets memory-bound operations\")\n",
"    print(f\" • Moderate complexity\")\n",
"    print(f\" • Significant wins on element-wise ops\")\n",
"    print(f\" \")\n",
"    print(f\" 📊 Phase 3 (Large models): Mixed Precision\")\n",
"    print(f\" • Essential for large model training\")\n",
"    print(f\" • Requires careful validation\")\n",
"    print(f\" • Hardware-dependent benefits\")\n",
"    print(f\" \")\n",
"    print(f\" 📊 Phase 4 (Production): Graph Optimization\")\n",
"    print(f\" • Maximum performance extraction\")\n",
"    print(f\" • High implementation cost\")\n",
"    print(f\" • Deployment-specific tuning\")\n",
"\n",
"    print(f\"\\n💡 Key Decision Factors:\")\n",
"    print(f\" 🎯 Start simple: Vectorization first, always\")\n",
"    print(f\" 📈 Scale up: Add complexity only when needed\")\n",
"    print(f\" ⚡ Measure impact: Profile before and after each optimization\")\n",
"    print(f\" 🔄 Iterate: Optimization is an ongoing process, not one-time\")\n",
"    print(\"🚀 Systematic acceleration beats random optimization\")\n",
"\n",
"analyze_acceleration_decision_framework()"
]
},
{
"cell_type": "markdown",
"id": "541be4f4",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 7. Module Integration Test\n",
"\n",
"Final validation that all acceleration components work together correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05244210",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-module",
"locked": true,
"points": 20
}
},
"outputs": [],
"source": [
"def test_module():\n",
"    \"\"\"\n",
"    Comprehensive test of entire acceleration module functionality.\n",
"\n",
"    This final test ensures:\n",
"    - All acceleration techniques work correctly\n",
"    - Performance improvements are measurable\n",
"    - Mixed precision training is stable\n",
"    - Components integrate seamlessly\n",
"    - Module is ready for production use\n",
"    \"\"\"\n",
"    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
"    print(\"=\" * 50)\n",
"\n",
"    # Run all unit tests\n",
"    print(\"Running unit tests...\")\n",
"    test_unit_vectorized_matmul()\n",
"    test_unit_fused_gelu()\n",
"    test_unit_fusion_speedup()\n",
"    test_unit_mixed_precision()\n",
"\n",
"    print(\"\\nRunning integration scenarios...\")\n",
"\n",
"    # Test realistic acceleration pipeline\n",
"    print(\"🔬 Integration Test: Complete acceleration pipeline...\")\n",
"\n",
"    # Create realistic model scenario\n",
"    batch_size, seq_len, hidden_dim = 16, 64, 256\n",
"    print(f\" Model config: batch={batch_size}, seq_len={seq_len}, hidden={hidden_dim}\")\n",
"\n",
"    # Test data\n",
"    x = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))\n",
"    weight = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))\n",
"    print(f\" Input tensor: {x.shape}, Weight tensor: {weight.shape}\")\n",
"\n",
"    # Test complete pipeline: reshape → matmul → activation → mixed precision\n",
"    print(\" Testing vectorized operations...\")\n",
"\n",
"    # Reshape for matrix multiplication (flatten batch and sequence)\n",
"    x_reshaped = Tensor(x.data.reshape(-1, hidden_dim))\n",
"    assert x_reshaped.shape == (batch_size * seq_len, hidden_dim)\n",
"\n",
"    # Vectorized matrix multiplication\n",
"    linear_output = vectorized_matmul(x_reshaped, weight)\n",
"    assert linear_output.shape == (batch_size * seq_len, hidden_dim)\n",
"    print(f\" ✅ Matrix multiplication: {x_reshaped.shape} @ {weight.shape} → {linear_output.shape}\")\n",
"\n",
"    # Fused activation\n",
"    activated = fused_gelu(linear_output)\n",
"    assert activated.shape == linear_output.shape\n",
"    print(f\" ✅ Fused GELU activation: {linear_output.shape} → {activated.shape}\")\n",
"\n",
"    # Reshape back to original structure\n",
"    final_output = Tensor(activated.data.reshape(batch_size, seq_len, hidden_dim))\n",
"    assert final_output.shape == x.shape\n",
"    print(f\" ✅ Output reshape: {activated.shape} → {final_output.shape}\")\n",
"\n",
"    print(\" Testing mixed precision training integration...\")\n",
"\n",
"    # Create complete model for mixed precision testing\n",
"    class TransformerBlock:\n",
"        def __init__(self, hidden_dim):\n",
"            self.hidden_dim = hidden_dim\n",
"            self.weight1 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))\n",
"            self.weight2 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))\n",
"            self.weight1.grad = None\n",
"            self.weight2.grad = None\n",
"\n",
"        def __call__(self, x):\n",
"            # Simulate transformer block: linear → activation → linear\n",
"            batch_size, seq_len, hidden_dim = x.shape\n",
"            x_flat = Tensor(x.data.reshape(-1, hidden_dim))\n",
"\n",
"            # First linear layer\n",
"            h1 = vectorized_matmul(x_flat, self.weight1)\n",
"            h1_activated = fused_gelu(h1)\n",
"\n",
"            # Second linear layer\n",
"            h2 = vectorized_matmul(h1_activated, self.weight2)\n",
"\n",
"            # Reshape back\n",
"            output = Tensor(h2.data.reshape(batch_size, seq_len, hidden_dim))\n",
"            return output\n",
"\n",
"        def parameters(self):\n",
"            return [self.weight1, self.weight2]\n",
"\n",
"    class SimpleOptimizer:\n",
"        def __init__(self, params):\n",
"            self.params = params\n",
"\n",
"        def zero_grad(self):\n",
"            for p in self.params:\n",
"                p.grad = None\n",
"\n",
"        def step(self):\n",
"            for p in self.params:\n",
"                if p.grad is not None:\n",
"                    p.data = p.data - 0.001 * p.grad.data\n",
"\n",
"    # Initialize model and optimizer\n",
"    model = TransformerBlock(hidden_dim)\n",
"    optimizer = SimpleOptimizer(model.parameters())\n",
"    trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=512.0)\n",
"\n",
"    print(f\" Model parameters: {len(model.parameters())}\")\n",
"    print(f\" Initial loss scale: {trainer.loss_scale}\")\n",
"\n",
"    # Simulate training steps\n",
"    print(\" Running training steps...\")\n",
"    targets = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))\n",
"\n",
"    training_metrics = []\n",
"    for step in range(5):\n",
"        metrics = trainer.train_step((x, targets))\n",
"        training_metrics.append(metrics)\n",
"\n",
"        # Verify metrics are reasonable\n",
"        assert isinstance(metrics['loss'], (int, float))\n",
"        assert metrics['loss'] >= 0\n",
"        assert metrics['loss_scale'] > 0\n",
"        assert isinstance(metrics['overflow'], bool)\n",
"        assert isinstance(metrics['gradients_valid'], bool)\n",
"\n",
"    print(f\" ✅ Completed {len(training_metrics)} training steps\")\n",
"\n",
"    # Analyze training stability\n",
"    losses = [m['loss'] for m in training_metrics]\n",
"    overflows = [m['overflow'] for m in training_metrics]\n",
"\n",
"    print(f\" Loss range: {min(losses):.6f} - {max(losses):.6f}\")\n",
"    print(f\" Overflow rate: {sum(overflows)}/{len(overflows)} steps\")\n",
"\n",
"    print(\" Testing performance characteristics...\")\n",
"\n",
"    # Sanity-check that the accelerated ops finish within loose time bounds\n",
"    test_sizes = [128, 256]\n",
"    for size in test_sizes:\n",
"        test_x = Tensor(np.random.randn(size, size).astype(np.float32))\n",
"        test_y = Tensor(np.random.randn(size, size).astype(np.float32))\n",
"\n",
"        # Time each operation with a monotonic, high-resolution clock\n",
"        start = time.perf_counter()\n",
"        _ = vectorized_matmul(test_x, test_y)\n",
"        matmul_time = time.perf_counter() - start\n",
"\n",
"        start = time.perf_counter()\n",
"        _ = fused_gelu(test_x)\n",
"        gelu_time = time.perf_counter() - start\n",
"\n",
"        # Verify operations complete in reasonable time\n",
"        assert matmul_time < 1.0, f\"Matrix multiplication too slow: {matmul_time:.3f}s\"\n",
"        assert gelu_time < 0.1, f\"GELU activation too slow: {gelu_time:.3f}s\"\n",
"\n",
"        print(f\" ✅ Size {size}: matmul={matmul_time*1000:.1f}ms, gelu={gelu_time*1000:.1f}ms\")\n",
"\n",
"    print(\" Testing memory efficiency...\")\n",
"\n",
"    # Verify mixed precision reduces memory usage conceptually\n",
"    param_count = sum(p.data.size for p in model.parameters())\n",
"    activation_count = batch_size * seq_len * hidden_dim\n",
"\n",
"    fp32_memory = (param_count + activation_count) * 4  # 4 bytes per FP32\n",
"    mixed_memory = param_count * 4 + activation_count * 2  # FP32 params + FP16 activations\n",
"    memory_savings = (fp32_memory - mixed_memory) / fp32_memory * 100\n",
"\n",
"    print(f\" Memory analysis: {memory_savings:.1f}% savings from mixed precision\")\n",
"    assert memory_savings > 0, \"Mixed precision should reduce memory usage\"\n",
"\n",
"    print(\"✅ End-to-end acceleration pipeline works!\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 50)\n",
"    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
"    print(\"Run: tito module complete 16\")\n",
"\n",
"# Run the module test when executed directly (not on import)\n",
"if __name__ == \"__main__\":\n",
"    test_module()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6531eb00",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "main-execution",
"solution": false
}
},
"outputs": [],
"source": [
"# Main execution block\n",
"if __name__ == \"__main__\":\n",
"    print(\"🚀 Running Acceleration module...\")\n",
"    test_module()\n",
"    print(\"✅ Module validation complete!\")"
]
},
{
"cell_type": "markdown",
"id": "e1054af9",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Thinking: Acceleration and Performance\n",
"\n",
"### Question 1: Arithmetic Intensity Analysis\n",
"You implemented vectorized matrix multiplication and fused GELU.\n",
"- Matrix multiplication (1024×1024): Performs ~2.1 billion FLOPs, moves ~12 MB of FP32 data (two inputs and one output, 4 MB each)\n",
"- Arithmetic intensity: _____ FLOPs/byte\n",
"- Compared to element-wise addition (~0.08 FLOPs/byte in FP32: 1 FLOP per 12 bytes moved): _____× higher intensity\n",
"- Why does this make matrix multiplication ideal for GPUs? _____\n",
"\n",
"### Question 2: Kernel Fusion Memory Benefits\n",
"Your fused_gelu combines 7 operations into a single expression.\n",
"- Unfused version memory accesses: 7 reads + 7 writes = _____ per element\n",
"- Fused version memory accesses: 1 read + 1 write = _____ per element\n",
"- Memory bandwidth reduction: _____%\n",
"- Why is this critical for transformer inference? _____\n",
"\n",
"### Question 3: Mixed Precision Memory Calculation\n",
"Your MixedPrecisionTrainer uses FP16 activations, FP32 parameters.\n",
"For a 100M parameter model with 50M activation elements:\n",
"- FP32 memory: (100M + 50M) × 4 bytes = _____ MB\n",
"- Mixed precision memory: 100M × 4 + 50M × 2 = _____ MB\n",
"- Memory reduction: _____%\n",
"\n",
"### Question 4: Loss Scaling Strategy\n",
"Your trainer starts with loss_scale=1024, grows by 2×, shrinks by 0.5×.\n",
"- Smallest normal FP16 value: ~6e-5 (subnormals reach ~6e-8 with reduced precision)\n",
"- Without scaling, gradients < _____ lose precision or underflow to zero\n",
"- With 1024× scaling, gradients down to _____ are preserved\n",
"- Why increase scale gradually but decrease immediately? _____\n",
"\n",
"### Question 5: Production Optimization Strategy\n",
"Based on your decision framework analysis:\n",
"For edge deployment (memory critical, stability required, hardware diverse):\n",
"- Priority 1 technique: _____ (low risk, universal)\n",
"- Priority 2 technique: _____ (memory benefits)\n",
"- Skip technique: _____ (why: _____)\n",
"- What's the primary constraint: memory, compute, or power? _____"
]
},
{
"cell_type": "markdown",
"id": "2fcecfae",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🎯 MODULE SUMMARY: Acceleration\n",
"\n",
"Congratulations! You've mastered the fundamental techniques for accelerating neural networks!\n",
"\n",
"### Key Accomplishments\n",
"- Built **vectorized operations** leveraging SIMD and optimized BLAS for 2-5× speedups\n",
"- Implemented **kernel fusion** reducing memory bandwidth by 60-80% for element-wise operations\n",
"- Created **mixed precision training** with automatic loss scaling for 20-40% memory savings\n",
"- Analyzed **arithmetic intensity patterns** and their impact on the roofline model\n",
"- Developed **production decision framework** for systematic optimization\n",
"- All tests pass ✅ (validated by `test_module()`)\n",
"\n",
"### Systems Insights Discovered\n",
"- **Roofline Model**: Operations with high arithmetic intensity (FLOPs/byte) scale better\n",
"- **Memory Bandwidth**: Often the limiting factor for modern accelerators\n",
"- **Kernel Fusion**: Critical for memory-bound workloads, reduces intermediate storage overhead\n",
"- **Mixed Precision**: Essential for large model training, requires careful gradient scaling\n",
"- **Optimization Strategy**: Start simple (vectorization), add complexity as needed\n",
"\n",
"### Production Impact\n",
"Your acceleration techniques enable:\n",
"- **Training larger models** within memory constraints\n",
"- **Faster iteration cycles** during research and development\n",
"- **Better hardware utilization** across different deployment targets\n",
"- **Cost reduction** through improved efficiency\n",
"\n",
"### Ready for Next Steps\n",
"Your acceleration implementations provide the foundation for quantization techniques in Module 17.\n",
"The performance analysis skills transfer directly to production optimization workflows.\n",
"\n",
"Export with: `tito module complete 16`\n",
"\n",
"**Next**: Module 17 will add quantization to further reduce memory and increase throughput while maintaining accuracy!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}