diff --git a/book/chapters/09-spatial.md b/book/chapters/09-spatial.md deleted file mode 100644 index 44080ef5..00000000 --- a/book/chapters/09-spatial.md +++ /dev/null @@ -1,360 +0,0 @@ ---- -title: "Spatial - Convolutional Neural Networks" -description: "Build CNNs from scratch for computer vision and spatial pattern recognition" -difficulty: 3 -time_estimate: "6-8 hours" -prerequisites: ["Tensor", "Activations", "Layers", "DataLoader"] -next_steps: ["Tokenization"] -learning_objectives: - - "Implement convolution as sliding window operations with weight sharing" - - "Design CNN architectures with feature extraction and classification components" - - "Understand translation invariance and hierarchical feature learning" - - "Build pooling operations for spatial downsampling and invariance" - - "Apply computer vision principles to image classification tasks" ---- - -# 09. Spatial (CNNs) - -**🏛️ ARCHITECTURE TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 6-8 hours - -## Overview - -Implement convolutional neural networks (CNNs) from scratch. This module teaches you how convolution transforms computer vision from hand-crafted features to learned hierarchical representations that power everything from image classification to autonomous driving. - -## Learning Objectives - -By completing this module, you will be able to: - -1. **Implement convolution** as sliding window operations with explicit loops, understanding weight sharing and local connectivity -2. **Design CNN architectures** by composing convolutional, pooling, and dense layers for image classification -3. **Understand translation invariance** and why CNNs are superior to dense networks for spatial data -4. **Build pooling operations** (MaxPool, AvgPool) for spatial downsampling and feature invariance -5. 
**Apply computer vision principles** to achieve >75% accuracy on CIFAR-10 image classification - -## Why This Matters - -### Production Context - -CNNs are the backbone of modern computer vision systems: - -- **Meta's Vision AI** uses CNN architectures to tag 2 billion photos daily across Facebook and Instagram -- **Tesla Autopilot** processes camera feeds through CNN backbones for object detection and lane recognition -- **Google Photos** built a CNN-based system that automatically organizes billions of images -- **Medical Imaging** systems use CNNs to detect cancer in X-rays and MRIs with superhuman accuracy - -### Historical Context - -The convolution revolution transformed computer vision: - -- **LeNet (1998)**: Yann LeCun's CNN read zip codes on mail; convolution proved viable but limited by compute -- **AlexNet (2012)**: Won ImageNet with 16% error rate (vs 26% previous); GPUs + convolution = computer vision revolution -- **ResNet (2015)**: 152-layer CNN achieved 3.6% error (better than human 5%); proved depth matters -- **Modern Era (2020+)**: CNNs power production vision systems processing trillions of images daily - -The patterns you're implementing revolutionized how machines see. - -## Pedagogical Pattern: Build → Use → Analyze - -### 1. Build - -Implement from first principles: -- Convolution as explicit sliding window operation -- Conv2D layer with learnable filters and weight sharing -- MaxPool2D and AvgPool2D for spatial downsampling -- Flatten layer to connect spatial and dense layers -- Complete CNN architecture with feature extraction and classification - -### 2. Use - -Apply to real problems: -- Build CNN for CIFAR-10 image classification -- Extract and visualize learned feature maps -- Compare CNN vs MLP performance on spatial data -- Achieve >75% accuracy with proper architecture -- Understand impact of kernel size, stride, and padding - -### 3. 
Analyze - -Deep-dive into architectural choices: -- Why does weight sharing reduce parameters dramatically? -- How do early vs late layers learn different features? -- What's the trade-off between depth and width in CNNs? -- Why are pooling operations crucial for translation invariance? -- How does spatial structure preservation improve learning? - -## Implementation Guide - -### Core Components - -**Conv2D Layer - The Heart of Computer Vision** -```python -class Conv2D: - """2D Convolutional layer with learnable filters. - - Implements sliding window convolution: - - Applies same filter across all spatial positions (weight sharing) - - Each filter learns to detect different features (edges, textures, objects) - - Output is feature map showing where filter activates strongly - - Args: - in_channels: Number of input channels (3 for RGB, 16 for feature maps) - out_channels: Number of learned filters (feature detectors) - kernel_size: Size of sliding window (typically 3 or 5) - stride: Step size when sliding (1 = no downsampling) - padding: Border padding to preserve spatial dimensions - """ - def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0): - # Store configuration for use in forward() - self.in_channels = in_channels - self.out_channels = out_channels - self.kernel_size = kernel_size - self.stride = stride - self.padding = padding - - # Initialize learnable filters - self.weight = Tensor(shape=(out_channels, in_channels, kernel_size, kernel_size)) - self.bias = Tensor(shape=(out_channels,)) - - def forward(self, x): - # x shape: (batch, in_channels, height, width) - # NOTE: this sketch handles only padding=0 correctly; for padding > 0 - # a real implementation must first zero-pad x on each spatial border - batch, _, H, W = x.shape - kh, kw = self.kernel_size, self.kernel_size - - # Calculate output dimensions - out_h = (H + 2 * self.padding - kh) // self.stride + 1 - out_w = (W + 2 * self.padding - kw) // self.stride + 1 - - # Sliding window convolution - output = Tensor(shape=(batch, self.out_channels, out_h, out_w)) - for b in range(batch): - for oc in range(self.out_channels): - for i in range(out_h): - for j in range(out_w): - # Extract local patch - i_start = i * self.stride - j_start = j * self.stride - patch = x[b, :, i_start:i_start+kh, j_start:j_start+kw] - 
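- # patch shape: (in_channels, kh, kw) — the same shape as self.weight[oc], - # so the element-wise product below pairs each filter weight with the - # input value directly beneath it; the same self.weight[oc] is reused - # at every (i, j), which is exactly the weight sharing. - # Worked example (padding=0): H = W = 32, kernel_size = 3, stride = 1 - # gives out_h = out_w = (32 - 3) // 1 + 1 = 30. - 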
- # Convolution: element-wise multiply and sum - output[b, oc, i, j] = (patch * self.weight[oc]).sum() + self.bias[oc] - - return output -``` - -**Pooling Layers - Spatial Downsampling** -```python -class MaxPool2D: - """Max pooling for spatial downsampling and translation invariance. - - Takes maximum value in each local region: - - Reduces spatial dimensions while preserving important features - - Provides invariance to small translations - - Reduces computation in later layers - """ - def __init__(self, kernel_size=2, stride=None): - self.kernel_size = kernel_size - self.stride = stride or kernel_size - - def forward(self, x): - batch, channels, H, W = x.shape - kh, kw = self.kernel_size, self.kernel_size - - out_h = (H - kh) // self.stride + 1 - out_w = (W - kw) // self.stride + 1 - - output = Tensor(shape=(batch, channels, out_h, out_w)) - for b in range(batch): - for c in range(channels): - for i in range(out_h): - for j in range(out_w): - i_start = i * self.stride - j_start = j * self.stride - patch = x[b, c, i_start:i_start+kh, j_start:j_start+kw] - output[b, c, i, j] = patch.max() - - return output -``` - -**Complete CNN Architecture** -```python -class SimpleCNN: - """CNN for CIFAR-10 classification. 
- - Architecture: - Conv(3→32, 3x3) → ReLU → MaxPool(2x2) # 32x32 → 16x16 - Conv(32→64, 3x3) → ReLU → MaxPool(2x2) # 16x16 → 8x8 - Flatten → Dense(64*8*8 → 128) → ReLU - Dense(128 → 10) → Softmax - """ - def __init__(self): - self.conv1 = Conv2D(3, 32, kernel_size=3, padding=1) - self.relu1 = ReLU() - self.pool1 = MaxPool2D(kernel_size=2) - - self.conv2 = Conv2D(32, 64, kernel_size=3, padding=1) - self.relu2 = ReLU() - self.pool2 = MaxPool2D(kernel_size=2) - - self.flatten = Flatten() - self.fc1 = Linear(64 * 8 * 8, 128) - self.relu3 = ReLU() - self.fc2 = Linear(128, 10) - - def forward(self, x): - # Feature extraction - x = self.pool1(self.relu1(self.conv1(x))) # (B, 32, 16, 16) - x = self.pool2(self.relu2(self.conv2(x))) # (B, 64, 8, 8) - - # Classification - x = self.flatten(x) # (B, 4096) - x = self.relu3(self.fc1(x)) # (B, 128) - x = self.fc2(x) # (B, 10) - return x -``` - -### Step-by-Step Implementation - -1. **Implement Conv2D Forward Pass** - - Create sliding window iteration over spatial dimensions - - Apply weight sharing: same filter at all positions - - Handle batch processing efficiently - - Verify output shape calculation - -2. **Build Pooling Operations** - - Implement MaxPool2D with maximum extraction - - Add AvgPool2D for average pooling - - Handle stride and kernel size correctly - - Test spatial dimension reduction - -3. **Create Flatten Layer** - - Convert (B, C, H, W) to (B, C*H*W) - - Prepare spatial features for dense layers - - Preserve batch dimension - - Enable gradient flow backward - -4. **Design Complete CNN** - - Stack Conv → ReLU → Pool blocks for feature extraction - - Add Flatten → Dense for classification - - Calculate dimensions at each layer - - Test end-to-end forward pass - -5. 
**Train on CIFAR-10** - - Load CIFAR-10 using Module 08's DataLoader - - Train with cross-entropy loss and SGD - - Track accuracy on test set - - Achieve >75% accuracy - -## Testing - -### Inline Tests (During Development) - -Run inline tests while building: -```bash -cd modules/source/09_spatial -python spatial_dev.py -``` - -Expected output: -``` -Unit Test: Conv2D implementation... -✅ Sliding window convolution works correctly -✅ Weight sharing applied at all positions -✅ Output shapes match expected dimensions -Progress: Conv2D ✓ - -Unit Test: MaxPool2D implementation... -✅ Maximum extraction works correctly -✅ Spatial dimensions reduced properly -✅ Translation invariance verified -Progress: Pooling ✓ - -Unit Test: Complete CNN architecture... -✅ Forward pass through all layers successful -✅ Output shape: (32, 10) for 10 classes -✅ Parameter count reasonable: ~500K parameters -Progress: CNN Architecture ✓ -``` - -### Export and Validate - -After completing the module: -```bash -# Export to tinytorch package -tito export 09_spatial - -# Run integration tests -tito test 09_spatial -``` - -### CIFAR-10 Training Test - -```bash -# Train simple CNN on CIFAR-10 -python tests/integration/test_cnn_cifar10.py - -Expected results: -- Epoch 1: 35% accuracy -- Epoch 5: 60% accuracy -- Epoch 10: 75% accuracy -``` - -## Where This Code Lives - -``` -tinytorch/ -├── nn/ -│ └── spatial.py # Conv2D, MaxPool2D, etc. -└── __init__.py # Exposes CNN components - -Usage in other modules: ->>> from tinytorch.nn import Conv2D, MaxPool2D ->>> conv = Conv2D(3, 32, kernel_size=3) ->>> pool = MaxPool2D(kernel_size=2) -``` - -## Systems Thinking Questions - -1. **Parameter Efficiency**: A Conv2D(3, 32, 3) has ~900 parameters. How many parameters would a Dense layer need to connect a 32x32 image to 32 outputs? Why is this difference critical for scaling? - -2. **Translation Invariance**: Why does a CNN detect a cat regardless of whether it's in the top-left or bottom-right of an image? 
How does weight sharing enable this property? - -3. **Hierarchical Features**: Early CNN layers detect edges and textures. Later layers detect objects and faces. How does this emerge from stacking convolutions? Why doesn't this happen in dense networks? - -4. **Receptive Field Growth**: A single Conv2D(kernel=3) sees a 3x3 region. After two Conv2D layers, what region does each output see? How do deep CNNs build global context from local operations? - -5. **Compute vs Memory Trade-offs**: Large kernel sizes (7x7) have more parameters but fewer operations. Small kernels (3x3) stacked deeply have opposite trade-offs. Which is better and why? - -## Real-World Connections - -### Industry Applications - -**Autonomous Vehicles (Tesla, Waymo)** -- Multi-camera CNN systems process 30 FPS at 1920x1200 resolution -- Feature maps from CNNs feed into object detection and segmentation -- Real-time requirements demand efficient Conv2D implementations - -**Medical Imaging (PathAI, Zebra Medical)** -- CNNs analyze X-rays and CT scans for diagnostic assistance -- Achieve superhuman performance on specific tasks (diabetic retinopathy detection) -- Architecture design critical for accuracy-interpretability trade-off - -**Face Recognition (Apple Face ID, Facebook DeepFace)** -- CNN embeddings enable accurate face matching at billion-user scale -- Lightweight CNN architectures run on mobile devices in real-time -- Privacy concerns drive on-device processing - -### Research Impact - -This module implements patterns from: -- LeNet-5 (1998): First successful CNN for digit recognition -- AlexNet (2012): Sparked deep learning revolution with CNNs + GPUs -- VGG (2014): Showed deeper is better with simple 3x3 convolutions -- ResNet (2015): Enabled training 152-layer CNNs with skip connections - -## What's Next? 
- -In **Module 10: Tokenization**, you'll shift from processing images to processing text: - -- Learn how to convert text into numerical representations -- Implement tokenization strategies (character, word, subword) -- Build vocabulary management systems -- Prepare text data for transformers in Module 13 - -This completes the vision half of the Intelligence Tier. Next, you'll tackle language! - ---- - -**Ready to build CNNs from scratch?** Open `modules/source/09_spatial/spatial_dev.py` and start implementing. diff --git a/book/chapters/14-kvcaching.md b/book/chapters/14-kvcaching.md deleted file mode 100644 index bf0f04e1..00000000 --- a/book/chapters/14-kvcaching.md +++ /dev/null @@ -1,446 +0,0 @@ ---- -title: "KV Caching - Optimizing Transformer Inference" -description: "Cache attention key-value pairs for 10-100x faster autoregressive generation" -difficulty: 3 -time_estimate: "4-5 hours" -prerequisites: ["Attention", "Transformers"] -next_steps: ["Profiling"] -learning_objectives: - - "Implement KV caching to eliminate redundant attention computations" - - "Design cache management systems for multi-turn conversations" - - "Understand memory-speed trade-offs in production inference" - - "Optimize transformer latency from O(n²) to O(n) per token" - - "Apply caching patterns used in ChatGPT and production LLMs" ---- - -# 14. KV Caching - -**🏛️ ARCHITECTURE TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 4-5 hours - -## Overview - -Implement KV (Key-Value) caching to optimize transformer inference. This critical production optimization reduces latency by 10-100× for autoregressive generation by caching attention keys and values, eliminating redundant recomputation. - -## Learning Objectives - -By completing this module, you will be able to: - -1. **Implement KV caching** to eliminate redundant attention key/value computations during generation -2. **Design cache management systems** for efficient multi-turn conversation handling -3. 
**Understand memory-speed trade-offs** between caching everything vs recomputing on-the-fly -4. **Optimize transformer latency** from O(n²) to O(n) per generated token -5. **Apply caching patterns** used in ChatGPT, Claude, and all production language models - -## Why This Matters - -### Production Context - -KV caching is mandatory for production LLM serving: - -- **ChatGPT** uses KV caching for all multi-turn conversations; without it, latency would be unusable -- **Claude** caches up to 100K tokens of context; enables long document processing -- **GitHub Copilot** caches code context; provides real-time completions -- **Google Gemini** uses multi-level caching; serves billions of requests daily - -### Historical Context - -Caching evolved with transformer deployment: - -- **Early Transformers (2017-2019)**: No caching; research focused on training, not inference -- **GPT-2 Deployment (2019)**: KV caching implemented; enabled practical text generation -- **Production Scale (2020+)**: Multi-level caching (KV + intermediate layers); critical for economics -- **Modern Systems (2023+)**: Distributed caching across GPUs; 100K+ token contexts - -Without KV caching, ChatGPT would be 50-100× slower and economically infeasible. - -## Pedagogical Pattern: Build → Use → Optimize - -### 1. Build - -Implement from first principles: -- KV cache data structure for attention -- Cache management (append, reuse, clear) -- Cached attention forward pass -- Multi-turn conversation caching -- Memory-efficient cache storage - -### 2. Use - -Apply to real problems: -- Optimize GPT decoder for text generation -- Cache conversation history for multi-turn chat -- Measure latency improvement (10-100× speedup) -- Profile memory usage vs cache size -- Compare cached vs non-cached inference - -### 3. 
Optimize - -Production-ready enhancements: -- Implement cache eviction policies (LRU, FIFO) -- Add distributed caching across GPUs -- Optimize memory layout for cache hits -- Compress cached values (quantization) -- Build cache warmup strategies - -## Implementation Guide - -### Core Components - -**Understanding the Problem - Why Caching Helps** -```python -# WITHOUT KV caching (naive autoregressive generation): -# Generate token 1: compute attention for [t0] -# Generate token 2: compute attention for [t0, t1] ← recomputes t0 -# Generate token 3: compute attention for [t0, t1, t2] ← recomputes t0, t1 -# Generate token n: compute attention for [t0, ..., tn] ← recomputes everything -# -# Complexity: O(n²) - quadratic in sequence length -# For 100 tokens: ~5000 attention operations - -# WITH KV caching: -# Generate token 1: compute K,V for [t0], cache them -# Generate token 2: reuse cached K,V for t0, compute only for t1 -# Generate token 3: reuse cached K,V for t0,t1, compute only for t2 -# Generate token n: reuse all cached, compute only for tn -# -# Complexity: O(n) - linear in sequence length -# For 100 tokens: ~100 attention operations (50× speedup!) -``` - -**KV Cache Data Structure** -```python -class KVCache: - """Cache for attention keys and values. - - Stores computed K,V matrices to avoid recomputation during - autoregressive generation. 
- - Memory layout: - keys: (num_layers, batch, num_heads, seq_len, d_k) - values: (num_layers, batch, num_heads, seq_len, d_v) - - For GPT-2 (12 layers, 12 heads, seq 1024, d_k=64): - 12 × 12 × 1024 × 64 ≈ 9.4M values each for keys and for values - At FP16 (2 bytes): ~36MB per batch item for the full K+V cache - """ - def __init__(self, num_layers, batch_size, num_heads, d_k, d_v, max_seq_len): - self.num_layers = num_layers - self.batch_size = batch_size - self.num_heads = num_heads - self.d_k = d_k - self.d_v = d_v - self.max_seq_len = max_seq_len - - # Caches start empty; each layer's entry grows by concatenation - self.keys = {} # {layer_idx: (batch, heads, seq_len, d_k)} - self.values = {} # {layer_idx: (batch, heads, seq_len, d_v)} - - # Track current sequence length - self.seq_len = 0 - - def append(self, layer_idx, new_keys, new_values): - """Append new keys/values to cache for a layer. - - Args: - layer_idx: Which transformer layer - new_keys: (batch, heads, 1, d_k) - single new position - new_values: (batch, heads, 1, d_v) - single new position - """ - if layer_idx not in self.keys: - # Initialize cache for this layer - self.keys[layer_idx] = new_keys - self.values[layer_idx] = new_values - else: - # Concatenate with existing cache - self.keys[layer_idx] = concat([self.keys[layer_idx], new_keys], dim=2) - self.values[layer_idx] = concat([self.values[layer_idx], new_values], dim=2) - - # Update sequence length (same across all layers) - self.seq_len = self.keys[layer_idx].shape[2] - - def get(self, layer_idx): - """Retrieve cached keys/values for a layer. 
- - Returns: - keys: (batch, heads, seq_len, d_k) - values: (batch, heads, seq_len, d_v) - """ - return self.keys.get(layer_idx), self.values.get(layer_idx) - - def clear(self): - """Clear all cached data.""" - self.keys.clear() - self.values.clear() - self.seq_len = 0 - - def memory_usage(self): - """Calculate cache memory usage in bytes.""" - total_elements = 0 - for k, v in zip(self.keys.values(), self.values.values()): - total_elements += k.numel() + v.numel() - # Assume FP16 (2 bytes per element) - return total_elements * 2 -``` - -**Cached Attention Layer** -```python -class CachedMultiHeadAttention(MultiHeadAttention): - """Multi-head attention with KV caching support. - - Extends MultiHeadAttention to cache K,V matrices during generation. - """ - def forward(self, query, key=None, value=None, kv_cache=None, layer_idx=None): - """Forward pass with optional KV caching. - - Args: - query: (batch, 1, d_model) - single new position - key: (batch, seq_len, d_model) - optional, for initial pass - value: (batch, seq_len, d_model) - optional, for initial pass - kv_cache: KVCache object - layer_idx: Which layer (for cache indexing) - - Returns: - output: (batch, 1, d_model) - attended output - attention_weights: (batch, heads, 1, seq_len) - for analysis - """ - batch_size = query.shape[0] - - # Project query for new position - Q = self.W_q(query) # (batch, 1, d_model) - Q = Q.reshape(batch_size, 1, self.num_heads, self.d_k).transpose(1, 2) - # Q: (batch, heads, 1, d_k) - - if kv_cache is not None and layer_idx is not None: - # Check if cache exists for this layer - cached_K, cached_V = kv_cache.get(layer_idx) - - if cached_K is None: - # First token: compute and cache K,V - K = self.W_k(key) - V = self.W_v(value) - K = K.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - V = V.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - - # Cache for future tokens - kv_cache.append(layer_idx, K, V) - else: - # Subsequent tokens: compute only 
new K,V, concat with cache - new_K = self.W_k(key) # key is just new position - new_V = self.W_v(value) - new_K = new_K.reshape(batch_size, 1, self.num_heads, self.d_k).transpose(1, 2) - new_V = new_V.reshape(batch_size, 1, self.num_heads, self.d_k).transpose(1, 2) - - # Append to cache - kv_cache.append(layer_idx, new_K, new_V) - - # Use full cached K,V - K, V = kv_cache.get(layer_idx) - else: - # No caching: regular attention - K = self.W_k(key) - V = self.W_v(value) - K = K.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - V = V.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - - # Compute attention with cached K,V - attended, attention_weights = scaled_dot_product_attention(Q, K, V) - - # Reshape output - attended = attended.transpose(1, 2).reshape(batch_size, 1, self.d_model) - output = self.W_o(attended) - - return output, attention_weights -``` - -**Cached Generation - The Full Pipeline** -```python -def generate_with_cache(model, start_tokens, max_new_tokens, temperature=1.0): - """Autoregressive generation with KV caching. - - Achieves 10-100× speedup over non-cached generation. 
- - Args: - model: Transformer with KV cache support - start_tokens: (batch, start_len) initial sequence - max_new_tokens: Number of tokens to generate - temperature: Sampling temperature - - Returns: - generated: (batch, start_len + max_new_tokens) full sequence - """ - batch_size = start_tokens.shape[0] - generated = start_tokens - - # Initialize KV cache - kv_cache = KVCache( - num_layers=model.num_layers, - batch_size=batch_size, - num_heads=model.num_heads, - d_k=model.d_k, - d_v=model.d_k, - max_seq_len=start_tokens.shape[1] + max_new_tokens - ) - - # Process initial sequence (fills cache) - _ = model.forward(start_tokens, kv_cache=kv_cache) - - # Generate tokens one at a time (uses cache) - for _ in range(max_new_tokens): - # Forward pass on ONLY the last token - # Cache provides context from all previous tokens - last_token = generated[:, -1:] # (batch, 1) - logits = model.forward(last_token, kv_cache=kv_cache) # (batch, 1, vocab_size) - - # Sample next token - next_token_logits = logits[:, -1, :] / temperature - probs = softmax(next_token_logits, dim=-1) - next_token = sample(probs) - - # Append to sequence - generated = concat([generated, next_token], dim=1) - - return generated -``` - -### Step-by-Step Implementation - -1. **Design KV Cache Structure** - - Create storage for keys and values per layer - - Support appending new keys/values efficiently - - Add retrieval and clearing methods - - Calculate memory usage - -2. **Modify Attention for Caching** - - Add KV cache parameter to forward pass - - Check if cache exists for current layer - - Compute only new K,V when cache present - - Concat new K,V with cached values - -3. **Implement Cached Generation** - - Initialize cache before generation loop - - Process initial tokens (fill cache) - - Generate new tokens using cached context - - Measure speedup vs non-cached - -4. 
**Add Cache Management** - - Implement cache clearing between conversations - - Add cache size limits and eviction - - Support batch processing with caching - - Handle variable sequence lengths - -5. **Optimize Memory Layout** - - Use contiguous tensors for cache hits - - Implement FP16 caching for memory savings - - Add cache compression (quantization) - - Profile memory bandwidth bottlenecks - -## Testing - -### Inline Tests (During Development) - -Run inline tests while building: -```bash -cd modules/source/14_kvcaching -python kvcaching_dev.py -``` - -Expected output: -``` -Unit Test: KV cache data structure... -✅ Cache initialization successful -✅ Append and retrieval work correctly -✅ Memory usage calculated: ~36MB per batch item -Progress: KV Cache ✓ - -Unit Test: Cached attention... -✅ First token: K,V computed and cached -✅ Subsequent tokens: reuse cached K,V -✅ Attention output matches non-cached version -Progress: Cached Attention ✓ - -Unit Test: Generation with caching... -✅ Generated 100 tokens with caching -✅ Speedup: 47× faster than without cache -✅ Output quality: identical to non-cached -Progress: Cached Generation ✓ -``` - -### Export and Validate - -After completing the module: -```bash -# Export to tinytorch package -tito export 14_kvcaching - -# Run integration tests -tito test 14_kvcaching -``` - -## Where This Code Lives - -``` -tinytorch/ -├── nn/ -│ └── kvcache.py # Your implementation goes here -└── __init__.py # Exposes KVCache, CachedMultiHeadAttention - -Usage in other modules: ->>> from tinytorch.nn import KVCache, CachedMultiHeadAttention ->>> cache = KVCache(num_layers=12, batch_size=1, num_heads=12, d_k=64, d_v=64, max_seq_len=1024) ->>> generated = generate_with_cache(model, start_tokens, max_new_tokens=100) -``` - -## Systems Thinking Questions - -1. **Memory-Speed Trade-off**: The KV cache uses ~36MB per batch item for GPT-2. For batch=32, that's over 1.1GB. What if you have 8GB of GPU memory? How many concurrent users can you serve? What's the trade-off? 
- -2. **Cache Invalidation**: In multi-turn chat, when should you clear the cache? What if context exceeds max_seq_len? How do production systems handle this? - -3. **Distributed Caching**: For models too large for one GPU, you need tensor parallelism. How do you partition the KV cache across GPUs? What's the communication overhead? - -4. **Quantized Caching**: Storing cache in INT8 instead of FP16 saves 50% memory. What's the accuracy impact? When is this worth it? - -5. **Speculation and Prefetching**: What if you predict the next query and pre-compute KV cache? How would you implement speculative caching? - -## Real-World Connections - -### Industry Applications - -**Conversational AI (OpenAI ChatGPT, Anthropic Claude)** -- KV caching for all multi-turn conversations -- Cache eviction policies for context window limits -- Memory-speed trade-offs define pricing ($/1M tokens) -- Without caching, latency would be 50-100× worse - -**Code Completion (GitHub Copilot, Cursor)** -- Real-time caching of code context -- Incremental updates as user types -- Low-latency requirements (< 100ms) mandate caching -- Cache hit rates directly impact user experience - -**Search and Retrieval (Perplexity, Bing AI)** -- Cache document embeddings and attention -- Multi-stage caching (retrieval + generation) -- Distributed caching across data centers -- Cache warmup for popular queries - -### Research Impact - -This module implements patterns from: -- GPT-2 (2019): First large-scale use of KV caching -- Megatron-LM (2020): Distributed KV caching across GPUs -- FlashAttention (2022): Memory-efficient attention without full caching -- PagedAttention (2023): Virtual memory for KV cache management - -## What's Next? 
- -In **Module 15: Profiling**, you'll measure where time goes in your transformer: - -- Profile attention, feedforward, and embedding operations -- Identify computational bottlenecks beyond caching -- Measure FLOPs, memory bandwidth, and latency -- Understand performance characteristics across architectures - -The caching you implemented solves the biggest inference bottleneck—now let's find what else to optimize! - ---- - -**Ready to implement production-critical caching?** Open `modules/source/14_kvcaching/kvcaching_dev.py` and start implementing. diff --git a/book/chapters/15-profiling.md b/book/chapters/15-profiling.md deleted file mode 100644 index 929855a5..00000000 --- a/book/chapters/15-profiling.md +++ /dev/null @@ -1,451 +0,0 @@ ---- -title: "Profiling - Performance Analysis and Optimization" -description: "Build profilers to identify bottlenecks and guide optimization decisions" -difficulty: 3 -time_estimate: "5-6 hours" -prerequisites: ["All modules 01-14"] -next_steps: ["Acceleration"] -learning_objectives: - - "Implement timing profilers with statistical rigor for accurate measurements" - - "Design memory profilers to track allocation patterns and identify leaks" - - "Build FLOP counters to measure computational complexity" - - "Understand performance bottlenecks across different architectures" - - "Apply data-driven analysis to guide optimization priorities" ---- - -# 15. Profiling - -**⚡ OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours - -## Overview - -Build comprehensive profiling tools to measure where time and memory go in your ML systems. This module implements timing profilers, memory trackers, and FLOP counters that reveal bottlenecks and guide optimization decisions. - -## Learning Objectives - -By completing this module, you will be able to: - -1. **Implement timing profilers** with statistical rigor (multiple runs, confidence intervals) for accurate measurements -2. 
**Design memory profilers** to track allocation patterns, peak usage, and identify memory leaks -3. **Build FLOP counters** to measure theoretical computational complexity of different operations -4. **Understand performance bottlenecks** by comparing MLPs, CNNs, and Transformers systematically -5. **Apply data-driven analysis** to prioritize optimization efforts based on actual impact - -## Why This Matters - -### Production Context - -Profiling is mandatory for production ML systems: - -- **Google TPU teams** profile every operation to optimize hardware utilization -- **OpenAI** profiles GPT training to identify $millions in compute savings -- **Meta** profiles inference to serve billions of requests per day efficiently -- **NVIDIA** uses profiling to optimize cuDNN kernels for peak performance - -### Historical Context - -Profiling evolved with ML scale: - -- **Early ML (pre-2012)**: Ad-hoc timing with `time.time()`; no systematic profiling -- **Deep Learning Era (2012-2017)**: NVIDIA profiler, TensorBoard timing; focus on GPU utilization -- **Production Scale (2018+)**: Comprehensive profiling (compute, memory, I/O, network); optimization critical for economics -- **Modern Systems (2020+)**: Automated profiling and optimization; ML compilers use profiling data - -Without profiling, you're optimizing blind—profiling shows you where to focus. - -## Pedagogical Pattern: Build → Use → Optimize - -### 1. Build - -Implement from first principles: -- High-precision timing with multiple runs -- Statistical analysis (mean, std, confidence intervals) -- Memory profiler tracking allocations and deallocations -- FLOP counter for theoretical complexity -- Comparative profiler across architectures - -### 2. 
Use - -Apply to real problems: -- Profile attention vs feedforward in transformers -- Compare MLP vs CNN vs Transformer efficiency -- Identify memory bottlenecks in training loops -- Measure impact of batch size on throughput -- Analyze scaling behavior with model size - -### 3. Optimize - -Production insights: -- Prioritize optimizations by impact (80/20 rule) -- Measure before/after optimization -- Understand hardware utilization (CPU vs GPU) -- Identify memory bandwidth vs compute bottlenecks -- Build optimization roadmap based on data - -## Implementation Guide - -### Core Components - -**High-Precision Timer** -```python -import time - -import numpy as np - - -class Timer: - """High-precision timing with statistical analysis. - - Enter the context manager once per run; the first warmup_runs - measurements are discarded so the statistics reflect steady-state - performance. Reports mean, std, and confidence intervals. - - Example: - timer = Timer() - for _ in range(timer.warmup_runs + timer.num_runs): - with timer: - model.forward(x) - print(f"Time: {timer.mean:.3f}ms ± {timer.std:.3f}ms") - """ - def __init__(self, num_runs=10, warmup_runs=3): - self.num_runs = num_runs - self.warmup_runs = warmup_runs - self.times = [] - - def __enter__(self): - self.start_time = time.perf_counter() - return self - - def __exit__(self, *args): - elapsed = time.perf_counter() - self.start_time - self.times.append(elapsed * 1000) # Convert to ms - - @property - def measured(self): - # Drop the warmup runs from all statistics - return self.times[self.warmup_runs:] if len(self.times) > self.warmup_runs else self.times - - @property - def mean(self): - return np.mean(self.measured) - - @property - def std(self): - return np.std(self.measured) - - def confidence_interval(self, confidence=0.95): - """Confidence interval for the mean using the t-distribution.""" - from scipy import stats - ci = stats.t.interval(confidence, len(self.measured)-1, - loc=self.mean, scale=stats.sem(self.measured)) - return ci - - def report(self): - ci = self.confidence_interval() - return f"{self.mean:.3f}ms ± {self.std:.3f}ms (95% CI: [{ci[0]:.3f}, {ci[1]:.3f}])" -``` - -**Memory Profiler** -```python -class 
MemoryProfiler: - """Track memory allocations and peak usage. - - Monitors memory throughout execution to identify: - - Peak memory usage - - Memory leaks - - Allocation patterns - - Memory bandwidth bottlenecks - """ - def __init__(self): - self.snapshots = [] - self.peak_memory = 0 - - def snapshot(self, label=""): - """Take memory snapshot at current point.""" - import psutil - process = psutil.Process() - mem_info = process.memory_info() - - snapshot = { - 'label': label, - 'rss': mem_info.rss / 1024**2, # MB - 'vms': mem_info.vms / 1024**2, # MB - 'timestamp': time.time() - } - self.snapshots.append(snapshot) - self.peak_memory = max(self.peak_memory, snapshot['rss']) - - return snapshot - - def report(self): - """Generate memory usage report.""" - print(f"Peak Memory: {self.peak_memory:.2f} MB") - print("\nMemory Timeline:") - for snap in self.snapshots: - print(f" {snap['label']:30s}: {snap['rss']:8.2f} MB") - - # Calculate memory growth - if len(self.snapshots) >= 2: - growth = self.snapshots[-1]['rss'] - self.snapshots[0]['rss'] - print(f"\nTotal Growth: {growth:+.2f} MB") - - # Check for potential memory leak - if growth > 100: # Arbitrary threshold - print("⚠️ Potential memory leak detected!") -``` - -**FLOP Counter** -```python -class FLOPCounter: - """Count floating-point operations for complexity analysis. - - Provides theoretical computational complexity independent of hardware. - Useful for comparing different architectural choices. - """ - def __init__(self): - self.total_flops = 0 - self.op_counts = {} - - def count_matmul(self, A_shape, B_shape): - """Count FLOPs for matrix multiplication. 
-
-        C = A @ B where A is (m, k) and B is (k, n)
-        FLOPs = 2*m*k*n (one multiply and one add per output element)
-        """
-        m, k = A_shape
-        k2, n = B_shape
-        assert k == k2, "Invalid matmul dimensions"
-
-        flops = 2 * m * k * n
-        self.total_flops += flops
-        self.op_counts['matmul'] = self.op_counts.get('matmul', 0) + flops
-        return flops
-
-    def count_attention(self, batch, seq_len, d_model, num_heads):
-        """Count FLOPs for multi-head attention.
-
-        Components:
-        - Q,K,V projections: 3 * (batch * seq_len * d_model * d_model)
-        - Attention scores: batch * heads * seq_len * seq_len * d_k
-        - Attention weighting: batch * heads * seq_len * seq_len * d_v
-        - Output projection: batch * seq_len * d_model * d_model
-        """
-        d_k = d_model // num_heads
-
-        # One (batch*seq_len, d_model) @ (d_model, d_model) projection
-        proj_flops = 2 * (batch * seq_len) * d_model * d_model
-
-        # Q, K, V projections plus the output projection
-        qkv_flops = 3 * proj_flops
-        output_flops = proj_flops
-
-        # Attention scores (Q @ K^T) and weighting (A @ V)
-        scores_flops = 2 * batch * num_heads * seq_len * seq_len * d_k
-        weights_flops = 2 * batch * num_heads * seq_len * seq_len * d_k
-
-        # Add everything to the running total exactly once; bypassing
-        # count_matmul here keeps 'matmul' and 'attention' from
-        # double-counting the same FLOPs in the breakdown
-        total = qkv_flops + scores_flops + weights_flops + output_flops
-        self.total_flops += total
-        self.op_counts['attention'] = self.op_counts.get('attention', 0) + total
-        return total
-
-    def report(self):
-        """Generate FLOP report with breakdown."""
-        print(f"Total FLOPs: {self.total_flops / 1e9:.2f} GFLOPs")
-        print("\nBreakdown by operation:")
-        for op, flops in sorted(self.op_counts.items(), key=lambda x: x[1], reverse=True):
-            percentage = (flops / self.total_flops) * 100
-            print(f"  {op:20s}: {flops/1e9:8.2f} GFLOPs ({percentage:5.1f}%)")
-```
-
-**Architecture Profiler - Comparative Analysis**
-```python
-class ArchitectureProfiler:
-    """Compare performance across different architectures.
-
-    Profiles MLP, CNN, and Transformer on the same task to understand
-    compute/memory trade-offs.
-    """
-    def __init__(self):
-        self.results = {}
-
-    def profile_model(self, model, input_data, model_name):
-        """Profile a model comprehensively."""
-        result = {
-            'model_name': model_name,
-            'parameters': count_parameters(model),  # helper: total learnable parameters
-            'timing': {},
-            'memory': {},
-            'flops': {}
-        }
-
-        # Timing profile (Timer discards the first warmup_runs measurements)
-        timer = Timer(num_runs=10)
-        for _ in range(timer.num_runs + timer.warmup_runs):
-            with timer:
-                output = model.forward(input_data)
-        result['timing']['forward'] = timer.mean
-
-        # Memory profile
-        mem = MemoryProfiler()
-        mem.snapshot("Before forward")
-        output = model.forward(input_data)
-        mem.snapshot("After forward")
-        result['memory']['peak'] = mem.peak_memory
-
-        # FLOP count
-        flop_counter = FLOPCounter()
-        # Placeholder: walk the model's layers and call count_matmul /
-        # count_attention for each (your job in this module)
-        result['flops']['total'] = flop_counter.total_flops
-
-        self.results[model_name] = result
-        return result
-
-    def compare(self):
-        """Generate comparative report."""
-        print("\nArchitecture Comparison")
-        print("=" * 80)
-
-        for name, result in self.results.items():
-            print(f"\n{name}:")
-            print(f"  Parameters: {result['parameters']/1e6:.2f}M")
-            print(f"  Forward time: {result['timing']['forward']:.3f}ms")
-            print(f"  Peak memory: {result['memory']['peak']:.2f}MB")
-            print(f"  FLOPs: {result['flops']['total']/1e9:.2f}GFLOPs")
-```
-
-### Step-by-Step Implementation
-
-1. **Build High-Precision Timer**
-   - Use `time.perf_counter()` for nanosecond precision
-   - Implement multiple runs with warmup
-   - Calculate mean, std, confidence intervals
-   - Test with known delays
-
-2. **Implement Memory Profiler**
-   - Track memory at key points (before/after operations)
-   - Calculate peak memory usage
-   - Identify memory growth patterns
-   - Detect potential leaks
-
-3. **Create FLOP Counter**
-   - Count operations for matmul, convolution, attention
-   - Build hierarchical counting (operation → layer → model)
-   - Compare theoretical vs actual performance
-   - Identify compute-bound vs memory-bound operations
-
-4. 
**Build Architecture Profiler** - - Profile MLP on MNIST/CIFAR - - Profile CNN on CIFAR - - Profile Transformer on text - - Generate comparative reports - -5. **Analyze Results** - - Identify bottleneck operations (Pareto principle) - - Compare efficiency across architectures - - Understand scaling behavior - - Prioritize optimization opportunities - -## Testing - -### Inline Tests - -Run inline tests while building: -```bash -cd modules/source/15_profiling -python profiling_dev.py -``` - -Expected output: -``` -Unit Test: Timer with statistical analysis... -✅ Multiple runs produce consistent results -✅ Confidence intervals computed correctly -✅ Warmup runs excluded from statistics -Progress: Timing Profiler ✓ - -Unit Test: Memory profiler... -✅ Snapshots capture memory correctly -✅ Peak memory tracked accurately -✅ Memory growth detected -Progress: Memory Profiler ✓ - -Unit Test: FLOP counter... -✅ Matmul FLOPs: 2*m*k*n verified -✅ Attention FLOPs match theoretical -✅ Operation breakdown correct -Progress: FLOP Counter ✓ -``` - -### Export and Validate - -```bash -tito export 15_profiling -tito test 15_profiling -``` - -## Where This Code Lives - -``` -tinytorch/ -├── profiler/ -│ └── profiling.py # Your implementation goes here -└── __init__.py # Exposes Timer, MemoryProfiler, etc. - -Usage: ->>> from tinytorch.profiler import Timer, MemoryProfiler, FLOPCounter ->>> timer = Timer() ->>> with timer: ->>> model.forward(x) ->>> print(timer.report()) -``` - -## Systems Thinking Questions - -1. **Amdahl's Law**: If attention is 70% of compute and you optimize it 2×, what's the overall speedup? Why can't you get 2× end-to-end speedup? - -2. **Memory vs Compute Bottlenecks**: Your GPU can do 100 TFLOPs/s but memory bandwidth is 900 GB/s. For FP32 operations needing 4 bytes/FLOP, what's the bottleneck? When? - -3. **Batch Size Impact**: Doubling batch size doesn't double throughput. Why? What's the relationship between batch size, memory, and throughput? - -4. 
**Profiling Overhead**: Your profiler adds 5% overhead. Is this acceptable? When would you use sampling profilers vs instrumentation profilers? - -5. **Hardware Differences**: Your code runs 10× slower on CPU than GPU for large matrices, but only 2× slower for small ones. Why? What's the crossover point? - -## Real-World Connections - -### Industry Applications - -**Google TPU Optimization** -- Profile every kernel to maximize TPU utilization -- Optimize for both FLOPs and memory bandwidth -- Use profiling to guide hardware design decisions -- Achieve 40-50% utilization (very high for accelerators) - -**OpenAI Training Optimization** -- Profile GPT training to find $millions in savings -- Identify gradient checkpointing opportunities -- Optimize data loading pipelines -- Achieve 50%+ MFU (model FLOPs utilization) - -**Meta Inference Serving** -- Profile PyTorch models for production deployment -- Identify operator fusion opportunities -- Optimize for latency (p50, p99) not just throughput -- Serve billions of requests per day efficiently - -### Research Impact - -This module implements patterns from: -- TensorBoard Profiler (Google, 2019): Visual profiling for TensorFlow -- PyTorch Profiler (Meta, 2020): Comprehensive profiling for PyTorch -- NVIDIA Nsight (2021): GPU-specific profiling and optimization -- MLPerf (2022): Standardized benchmarking and profiling - -## What's Next? - -In **Module 16: Acceleration**, you'll use your profiling data to actually optimize: - -- Implement operator fusion based on profiling insights -- Optimize memory access patterns -- Apply algorithmic improvements to bottlenecks -- Measure impact of each optimization - -Profiling shows you *what* to optimize—acceleration shows you *how* to optimize it! - ---- - -**Ready to become a performance detective?** Open `modules/source/15_profiling/profiling_dev.py` and start implementing. 
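As a quick self-check for the Amdahl's Law question above (attention is 70% of compute, optimized 2×), the arithmetic is easy to verify numerically. The function name is illustrative, and the 70%/2× figures are the hypothetical ones from the question:

```python
def amdahl_speedup(optimized_fraction, local_speedup):
    """Overall speedup when only a fraction of the work is accelerated."""
    serial_fraction = 1.0 - optimized_fraction
    return 1.0 / (serial_fraction + optimized_fraction / local_speedup)

# Attention is 70% of total compute and we make it 2x faster:
print(f"{amdahl_speedup(0.70, 2.0):.2f}x overall")  # 1.54x, not 2x
```

Even an infinite speedup of the 70% fraction caps the end-to-end gain at 1/0.3 ≈ 3.3×, which is why the unoptimized 30% becomes the bottleneck.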
diff --git a/book/chapters/16-acceleration.md b/book/chapters/16-acceleration.md deleted file mode 100644 index 632a9142..00000000 --- a/book/chapters/16-acceleration.md +++ /dev/null @@ -1,148 +0,0 @@ ---- -title: "Acceleration - Hardware-Aware Optimization" -description: "Optimize ML operations with SIMD, cache-friendly algorithms, and parallel computing" -difficulty: 4 -time_estimate: "6-8 hours" -prerequisites: ["Profiling"] -next_steps: ["Quantization"] -learning_objectives: - - "Implement cache-friendly algorithms for matrix operations" - - "Apply SIMD vectorization for parallel data processing" - - "Design multi-core parallelization strategies for batch operations" - - "Understand hardware bottlenecks (compute vs memory bandwidth)" - - "Optimize ML kernels based on profiling data from Module 15" ---- - -# 16. Acceleration - -**⚡ OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐⭐ (4/4) | Time: 6-8 hours - -## Overview - -Optimize ML operations through hardware-aware programming. This module implements cache-friendly algorithms, SIMD vectorization, and multi-core parallelization to achieve significant speedups based on profiling insights from Module 15. - -## Learning Objectives - -By completing this module, you will be able to: - -1. **Implement cache-friendly algorithms** for matrix multiplication and convolution using blocked algorithms -2. **Apply SIMD vectorization** to parallelize element-wise operations across data -3. **Design multi-core parallelization strategies** for batch processing and data parallelism -4. **Understand hardware bottlenecks** (compute-bound vs memory-bound operations) -5. 
**Optimize ML kernels** based on actual profiling data, achieving measurable speedups - -## Why This Matters - -### Production Context - -Hardware optimization is critical for production ML: - -- **PyTorch** uses custom CUDA kernels and CPU vectorization; 100× faster than naive Python -- **TensorFlow XLA** compiles models to optimized machine code; reduces latency by 2-5× -- **ONNX Runtime** applies hardware-specific optimizations; powers Microsoft/Azure ML serving -- **Apple Neural Engine** uses custom accelerators; enables on-device ML on iPhones - -### Historical Context - -Hardware optimization evolved with ML scale: - -- **Pre-Deep Learning (pre-2010)**: Hand-written assembly for critical loops; library implementations -- **GPU Era (2010-2017)**: CUDA kernels dominate; cuDNN becomes standard; 10-100× speedups -- **Specialized Hardware (2018+)**: TPUs, custom ASICs; compiler-based optimization -- **Modern Systems (2020+)**: ML compilers (TVM, XLA); automated kernel generation and tuning - -Understanding hardware optimization separates production engineers from researchers. - -## Pedagogical Pattern: Build → Use → Optimize - -### 1. Build - -Implement from first principles: -- Blocked matrix multiplication for cache efficiency -- SIMD-vectorized element-wise operations -- Multi-threaded batch processing -- Memory-aligned data structures -- Profiling integration - -### 2. Use - -Apply to real problems: -- Optimize bottlenecks identified in Module 15 -- Accelerate attention computation -- Speed up convolutional operations -- Parallelize data loading pipelines -- Measure actual speedups - -### 3. 
Optimize - -Production techniques: -- Auto-tuning for different hardware -- Mixed-precision computation (FP16/FP32) -- Operator fusion to reduce memory traffic -- Batch processing for amortized overhead -- Hardware-specific code paths - -## Implementation Guide - -### Core Patterns - -**Cache-Friendly Matrix Multiplication** -- Block matrices into cache-sized tiles -- Reuse data while in cache (temporal locality) -- Access memory sequentially (spatial locality) -- Typical speedup: 2-5× over naive implementation - -**SIMD Vectorization** -- Process multiple data elements simultaneously -- Use Numba/Cython for automatic vectorization -- Align data to vector boundaries (16/32/64 bytes) -- Typical speedup: 2-8× for element-wise ops - -**Multi-Core Parallelization** -- Divide work across CPU cores -- Use thread pools for batch processing -- Minimize synchronization overhead -- Typical speedup: 0.5-0.8× number of cores (due to overhead) - -## Testing - -```bash -cd modules/source/16_acceleration -python acceleration_dev.py -tito export 16_acceleration -tito test 16_acceleration -``` - -## Where This Code Lives - -``` -tinytorch/ -├── acceleration/ -│ └── kernels.py # Optimized implementations -└── __init__.py -``` - -## Systems Thinking Questions - -1. **Roofline Model**: Your operation needs 1000 FLOPs and 100 bytes. At 100 GFLOPs/s compute and 10 GB/s bandwidth, what's the bottleneck? - -2. **Amdahl's Law Applied**: You parallelize 90% of code perfectly across 8 cores. What's max speedup? Why not 8×? - -3. **Cache Hierarchy**: L1 cache is 10× faster than L2, which is 10× faster than RAM. How does blocking matrix multiplication exploit this? - -## Real-World Connections - -**PyTorch/TensorFlow**: Custom CUDA kernels for all operations -**ONNX Runtime**: Hardware-specific optimization for production serving -**Apple ML**: Metal shaders and Neural Engine for on-device inference - -## What's Next? 
- -In **Module 17: Quantization**, you'll reduce precision for even more speedups: -- INT8 quantization for 4× memory reduction -- Mixed-precision training and inference -- Calibration and accuracy preservation - ---- - -**Ready to optimize for hardware?** Open `modules/source/16_acceleration/acceleration_dev.py` and start implementing. diff --git a/book/chapters/17-quantization.md b/book/chapters/17-quantization.md deleted file mode 100644 index 84251c9c..00000000 --- a/book/chapters/17-quantization.md +++ /dev/null @@ -1,113 +0,0 @@ ---- -title: "Quantization - Reduced Precision for Efficiency" -description: "INT8 quantization, calibration, and mixed-precision strategies" -difficulty: 3 -time_estimate: "5-6 hours" -prerequisites: ["Acceleration"] -next_steps: ["Compression"] -learning_objectives: - - "Implement INT8 quantization for weights and activations" - - "Design calibration strategies to minimize accuracy loss" - - "Apply mixed-precision training and inference patterns" - - "Understand quantization-aware training vs post-training quantization" - - "Measure memory and speed improvements from reduced precision" ---- - -# 17. Quantization - -**⚡ OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours - -## Overview - -Reduce model precision from FP32 to INT8 for 4× memory reduction and 2-4× inference speedup. This module implements quantization, calibration, and mixed-precision strategies used in production deployment. - -## Learning Objectives - -By completing this module, you will be able to: - -1. **Implement INT8 quantization** for model weights and activations with scale/zero-point parameters -2. **Design calibration strategies** using representative data to minimize accuracy degradation -3. **Apply mixed-precision training** (FP16/FP32) for faster training with maintained accuracy -4. **Understand quantization-aware training** vs post-training quantization trade-offs -5. 
**Measure memory and speed improvements** while tracking accuracy impact - -## Why This Matters - -### Production Context - -Quantization is mandatory for edge deployment: - -- **TensorFlow Lite** uses INT8 quantization for mobile deployment; 4× smaller models -- **ONNX Runtime** supports INT8 inference; 2-4× faster on CPUs -- **Apple Core ML** quantizes models for iPhone Neural Engine; enables on-device ML -- **Google Edge TPU** requires INT8; optimized hardware for quantized operations - -### Historical Context - -- **Pre-2017**: FP32 standard; quantization for special cases only -- **2017-2019**: INT8 post-training quantization; TensorFlow Lite adoption -- **2019-2021**: Quantization-aware training; maintains accuracy better -- **2021+**: INT4, mixed-precision, dynamic quantization; aggressive compression - -Quantization enables deployment where FP32 models wouldn't fit or run fast enough. - -## Implementation Guide - -### Core Components - -**Symmetric INT8 Quantization** -``` -Quantization: x_int8 = round(x_fp32 / scale) -Dequantization: x_fp32 = x_int8 * scale - -where scale = max(|x|) / 127 -``` - -**Asymmetric Quantization (with zero-point)** -``` -Quantization: x_int8 = round(x_fp32 / scale) + zero_point -Dequantization: x_fp32 = (x_int8 - zero_point) * scale -``` - -**Calibration**: Use representative data to find optimal scale/zero-point parameters - -## Testing - -```bash -tito export 17_quantization -tito test 17_quantization -``` - -## Where This Code Lives - -``` -tinytorch/ -├── quantization/ -│ └── quantize.py -└── __init__.py -``` - -## Systems Thinking Questions - -1. **Accuracy vs Efficiency**: INT8 loses precision. When is <1% accuracy drop acceptable? When must you use QAT? - -2. **Per-Tensor vs Per-Channel**: Per-channel quantization preserves accuracy better but increases complexity. When is it worth it? - -3. **Quantized Operations**: INT8 matmul is faster, but quantize/dequantize adds overhead. When does quantization win overall? 
- -## Real-World Connections - -**Mobile Deployment**: TensorFlow Lite, Core ML use INT8 for on-device inference -**Cloud Serving**: ONNX Runtime, TensorRT use INT8 for cost-effective serving -**Edge AI**: INT8 required for Coral Edge TPU, Jetson Nano deployment - -## What's Next? - -In **Module 18: Compression**, you'll combine quantization with pruning: -- Remove unimportant weights (pruning) -- Quantize remaining weights (INT8) -- Achieve 10-50× compression with minimal accuracy loss - ---- - -**Ready to quantize models?** Open `modules/source/17_quantization/quantization_dev.py` and start implementing. diff --git a/book/chapters/18-compression.md b/book/chapters/18-compression.md deleted file mode 100644 index 597c4af0..00000000 --- a/book/chapters/18-compression.md +++ /dev/null @@ -1,121 +0,0 @@ ---- -title: "Compression - Pruning and Model Compression" -description: "Prune unnecessary weights and compress models for deployment" -difficulty: 3 -time_estimate: "5-6 hours" -prerequisites: ["Quantization"] -next_steps: ["Benchmarking"] -learning_objectives: - - "Implement magnitude-based pruning to remove unimportant weights" - - "Design structured pruning strategies (channel, layer-wise)" - - "Apply iterative pruning with fine-tuning for accuracy preservation" - - "Combine pruning with quantization for maximum compression" - - "Measure compression ratios and inference speedups" ---- - -# 18. Compression - -**⚡ OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours - -## Overview - -Compress neural networks through pruning (removing weights) and combining with quantization. This module implements techniques to achieve 10-50× compression with minimal accuracy loss, enabling deployment on resource-constrained devices. - -## Learning Objectives - -By completing this module, you will be able to: - -1. **Implement magnitude-based pruning** to identify and remove unimportant weights -2. 
**Design structured pruning strategies** (channel pruning, layer-wise) for actual speedups -3. **Apply iterative pruning** with fine-tuning to maintain model accuracy -4. **Combine pruning with quantization** for maximum compression (50-100× possible) -5. **Measure compression ratios** and verify inference speedup vs accuracy trade-offs - -## Why This Matters - -### Production Context - -Compression enables practical deployment: - -- **BERT Distillation (DistilBERT)**: 40% smaller, 60% faster, 97% accuracy retention -- **MobileNet**: Structured pruning + quantization for mobile deployment -- **Lottery Ticket Hypothesis**: Sparse networks train as well as dense ones -- **GPT-3 Distillation**: Smaller models approaching GPT-3 performance - -### Historical Context - -- **Pre-2015**: Limited compression work; models small enough for hardware -- **2015-2017**: Magnitude pruning (Han et al.); Lottery Ticket Hypothesis -- **2018-2020**: Structured pruning; distillation; BERT compression -- **2020+**: Extreme compression (100×); sparse transformers; efficient architectures - -Compression is now standard for deployment, not optional. 
- -## Implementation Guide - -### Core Techniques - -**Magnitude Pruning** -- Sort weights by absolute value -- Remove smallest X% (typically 50-90%) -- Fine-tune remaining weights -- Can achieve 10× compression with <1% accuracy loss - -**Structured Pruning** -- Remove entire channels/neurons -- Achieves actual speedup (vs unstructured sparsity) -- Typically 2-5× compression -- More aggressive accuracy impact - -**Iterative Pruning** -- Prune gradually (10% at a time) -- Fine-tune after each pruning step -- Better accuracy than one-shot pruning -- More training cost - -**Pruning + Quantization** -- Prune 90% of weights → 10× reduction -- Quantize FP32 → INT8 → 4× reduction -- Combined: 40× compression - -## Testing - -```bash -tito export 18_compression -tito test 18_compression -``` - -## Where This Code Lives - -``` -tinytorch/ -├── compression/ -│ └── prune.py -└── __init__.py -``` - -## Systems Thinking Questions - -1. **Lottery Ticket Hypothesis**: Why can pruned networks retrain to full accuracy? What does this say about overparameterization? - -2. **Structured vs Unstructured**: Unstructured pruning achieves better compression but no speedup. Why? When is sparse computation actually faster? - -3. **Distillation vs Pruning**: Both compress models. When would you use each? Can you combine them? - -## Real-World Connections - -**DistilBERT**: 40% smaller BERT with 97% performance -**MobileNetV2**: Efficient architectures + pruning for mobile -**NVIDIA TensorRT**: Automatic pruning + quantization for deployment - -## What's Next? - -In **Module 19: Benchmarking**, you'll measure everything you've built: -- Fair comparison across optimizations -- Statistical significance testing -- MLPerf-style benchmarking protocols -- Comprehensive performance reports - ---- - -**Ready to compress models?** Open `modules/source/18_compression/compression_dev.py` and start implementing. 
diff --git a/modules/source/14_kvcaching/kvcaching_dev.ipynb b/modules/source/14_kvcaching/kvcaching_dev.ipynb deleted file mode 100644 index b86c77f5..00000000 --- a/modules/source/14_kvcaching/kvcaching_dev.ipynb +++ /dev/null @@ -1,1571 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "1078513e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Module 14: KV Caching - Optimizing Autoregressive Generation\n", - "\n", - "Welcome to Module 14! You'll implement the critical optimization that makes production language models possible: Key-Value caching for 10-15x faster text generation.\n", - "\n", - "## 🔗 Prerequisites & Progress\n", - "**You've Built**: Complete transformer architecture with multi-head attention and text generation\n", - "**You'll Build**: Memory-efficient KV caching system that eliminates redundant computation\n", - "**You'll Enable**: Production-grade inference optimization and real-world serving capabilities\n", - "\n", - "**Connection Map**:\n", - "```\n", - "Transformers → KV Caching → Production Serving\n", - "(slow O(n²)) (fast O(n)) (real-world scale)\n", - "```\n", - "\n", - "## Learning Objectives\n", - "By the end of this module, you will:\n", - "1. Understand why autoregressive generation has O(n²) complexity without caching\n", - "2. Implement KVCache with efficient memory management and O(1) updates\n", - "3. Build cache-aware attention that reuses previously computed keys and values\n", - "4. Measure dramatic speedup gains (10-15x) and understand memory trade-offs\n", - "5. 
Connect to production optimization patterns used in real LLM serving\n", - "\n", - "Let's make inference blazingly fast!\n", - "\n", - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/14_kvcaching/kvcaching_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.generation.kv_cache`\n", - "\n", - "```python\n", - "# How to use this module:\n", - "from tinytorch.generation.kv_cache import KVCache, enable_kv_cache\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Complete caching system demonstrating production optimization techniques\n", - "- **Production:** Proper organization matching Hugging Face's generation/ module structure\n", - "- **Consistency:** All generation optimizations in generation.kv_cache\n", - "- **Integration:** Works seamlessly with transformers for complete inference optimization" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "266270f3", - "metadata": {}, - "outputs": [], - "source": [ - "#| default_exp generation.kv_cache\n", - "#| export\n", - "\n", - "import numpy as np\n", - "import time\n", - "from typing import Tuple, Optional, Dict, List\n", - "\n", - "# Import TinyTorch components from previous modules\n", - "from tinytorch.core.tensor import Tensor" - ] - }, - { - "cell_type": "markdown", - "id": "06ca957c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 Part 1: Understanding the Autoregressive Generation Problem\n", - "\n", - "### The Core Inefficiency\n", - "\n", - "When generating text token by token, transformers face a fundamental computational bottleneck. 
Let's visualize what happens during naive generation:\n", - "\n", - "```\n", - "Token Generation Process (Without Caching):\n", - "\n", - "Step 1: Generate \"Hello\"\n", - "Input: [START]\n", - "Attention: Q₁ × [K₁] × [V₁] ← 1 computation\n", - "\n", - "Step 2: Generate \"world\"\n", - "Input: [START, Hello]\n", - "Attention: Q₂ × [K₁, K₂] × [V₁, V₂] ← 2 computations (K₁,V₁ RECOMPUTED!)\n", - "\n", - "Step 3: Generate \"!\"\n", - "Input: [START, Hello, world]\n", - "Attention: Q₃ × [K₁, K₂, K₃] × [V₁, V₂, V₃] ← 3 computations (K₁,V₁,K₂,V₂ RECOMPUTED!)\n", - "```\n", - "\n", - "**The Problem**: For each new token, we recompute ALL previous key-value pairs even though they never change!\n", - "\n", - "### Computational Complexity Analysis\n", - "\n", - "```\n", - "Naive Generation Complexity:\n", - "Step 1: 1 K,V computation\n", - "Step 2: 2 K,V computations\n", - "Step 3: 3 K,V computations\n", - "...\n", - "Step n: n K,V computations\n", - "\n", - "Total: 1 + 2 + 3 + ... + n = n(n+1)/2 = O(n²) complexity!\n", - "```\n", - "\n", - "For a 100-token sequence, this means **5,050 redundant computations**!\n", - "\n", - "### Real-World Impact\n", - "\n", - "This inefficiency makes production LLM serving economically impossible without optimization:\n", - "- **ChatGPT/GPT-4**: Would be too slow for real-time chat without caching\n", - "- **Code completion**: IDEs couldn't provide instant suggestions\n", - "- **Mobile deployment**: On-device generation would drain batteries instantly\n", - "- **API serving**: Server costs would be 10x+ higher\n", - "\n", - "**The Solution**: Cache key-value pairs after computing them once, transforming O(n²) into O(n)." 
- ] - }, - { - "cell_type": "markdown", - "id": "dc896d3f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🧮 Part 2: The Key-Value Caching Insight\n", - "\n", - "### Mathematical Foundation\n", - "\n", - "The core insight comes from understanding what changes during autoregressive generation:\n", - "\n", - "```\n", - "Attention Computation Breakdown:\n", - "\n", - "Q = new_token @ W_q ← Only new token (changes each step)\n", - "K = all_tokens @ W_k ← Includes old tokens (mostly redundant!)\n", - "V = all_tokens @ W_v ← Includes old tokens (mostly redundant!)\n", - "\n", - "attention_output = softmax(Q @ K.T / √d_k) @ V\n", - "```\n", - "\n", - "**Key Insight**: K and V matrices for previous tokens NEVER change!\n", - "\n", - "```\n", - "Token Dependencies:\n", - "K₁ = token₁ @ W_k ← Computed once, never changes\n", - "K₂ = token₂ @ W_k ← Computed once, never changes\n", - "K₃ = token₃ @ W_k ← Computed once, never changes\n", - "\n", - "Same for V₁, V₂, V₃...\n", - "```\n", - "\n", - "### Cache-Optimized Generation\n", - "\n", - "```\n", - "Optimized Generation Process (With Caching):\n", - "\n", - "Step 1: Generate \"Hello\"\n", - "Compute: K₁, V₁ → Store in cache\n", - "Attention: Q₁ × cached[K₁] × cached[V₁]\n", - "\n", - "Step 2: Generate \"world\"\n", - "Compute: K₂, V₂ → Append to cache\n", - "Attention: Q₂ × cached[K₁, K₂] × cached[V₁, V₂]\n", - "\n", - "Step 3: Generate \"!\"\n", - "Compute: K₃, V₃ → Append to cache\n", - "Attention: Q₃ × cached[K₁, K₂, K₃] × cached[V₁, V₂, V₃]\n", - "```\n", - "\n", - "**Result**: Each step computes only ONE new K,V pair instead of recomputing ALL!\n", - "\n", - "### Memory vs Compute Trade-off\n", - "\n", - "```\n", - "Traditional Approach:\n", - "Memory: O(1) (no storage needed)\n", - "Compute: O(n²) (recompute everything)\n", - "\n", - "Cached Approach:\n", - "Memory: O(n × d_k) (store all K,V pairs)\n", - "Compute: O(n) (only compute new pairs)\n", - "\n", - "For n=100, d_k=64:\n", - "Memory cost: 
6.4 KB per layer\n", - "Compute savings: 50x reduction in K,V computations\n", - "```\n", - "\n", - "**Trade-off Winner**: Memory is cheap, compute is expensive! Use O(n) memory to save O(n²) compute." - ] - }, - { - "cell_type": "markdown", - "id": "c3feca5a", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🏗️ Part 3: KVCache Class Implementation\n", - "\n", - "### Core Requirements\n", - "\n", - "Our KVCache needs to efficiently handle:\n", - "\n", - "1. **Multi-layer storage**: Each transformer layer needs its own K,V cache\n", - "2. **Multi-head attention**: Each attention head has separate K,V pairs\n", - "3. **Batch processing**: Support multiple sequences simultaneously (batch inference)\n", - "4. **Dynamic updates**: Efficiently append new tokens without copying data\n", - "5. **Memory management**: Pre-allocate space to avoid dynamic resizing overhead\n", - "\n", - "### Cache Architecture Visualization\n", - "\n", - "```\n", - "KVCache Memory Layout:\n", - "┌─────────────────────────────────────────────────────────┐\n", - "│ KVCache Object │\n", - "├─────────────────────────────────────────────────────────┤\n", - "│ Layer 0: ┌─────────────┬─────────────┐ │\n", - "│ │ Key Cache │ Value Cache │ │\n", - "│ │ (B,H,S,D) │ (B,H,S,D) │ │\n", - "│ └─────────────┴─────────────┘ │\n", - "├─────────────────────────────────────────────────────────┤\n", - "│ Layer 1: ┌─────────────┬─────────────┐ │\n", - "│ │ Key Cache │ Value Cache │ │\n", - "│ │ (B,H,S,D) │ (B,H,S,D) │ │\n", - "│ └─────────────┴─────────────┘ │\n", - "├─────────────────────────────────────────────────────────┤\n", - "│ ... 
┌─────────────┬─────────────┐ │\n", - "│ Layer N: │ Key Cache │ Value Cache │ │\n", - "│ │ (B,H,S,D) │ (B,H,S,D) │ │\n", - "│ └─────────────┴─────────────┘ │\n", - "└─────────────────────────────────────────────────────────┘\n", - "\n", - "Where:\n", - "B = batch_size (number of sequences)\n", - "H = num_heads (attention heads per layer)\n", - "S = max_seq_len (maximum sequence length)\n", - "D = head_dim (dimension per attention head)\n", - "```\n", - "\n", - "### Update Operation Flow\n", - "\n", - "```\n", - "Cache Update Process:\n", - " seq_pos = 2\n", - " ↓\n", - "┌─────┬─────┬─────┬─────┬─────┬─────┐\n", - "│ K₁ │ K₂ │ ??? │ ??? │ ??? │ ??? │ ← Key Cache\n", - "├─────┼─────┼─────┼─────┼─────┼─────┤\n", - "│ V₁ │ V₂ │ ??? │ ??? │ ??? │ ??? │ ← Value Cache\n", - "└─────┴─────┴─────┴─────┴─────┴─────┘\n", - "\n", - "New token arrives: K₃, V₃\n", - "\n", - " seq_pos = 2\n", - " ↓\n", - "┌─────┬─────┬─────┬─────┬─────┬─────┐\n", - "│ K₁ │ K₂ │ K₃ │ ??? │ ??? │ ??? │ ← Write K₃ here\n", - "├─────┼─────┼─────┼─────┼─────┼─────┤\n", - "│ V₁ │ V₂ │ V₃ │ ??? │ ??? │ ??? │ ← Write V₃ here\n", - "└─────┴─────┴─────┴─────┴─────┴─────┘\n", - "\n", - "Then: seq_pos += 1 (advance to position 3)\n", - "```\n", - "\n", - "This design enables **O(1) updates** - just write to the next position!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6d054a8c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "kvcache-class", - "solution": true - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class KVCache:\n", - " \"\"\"\n", - " Efficient key-value cache for autoregressive generation.\n", - "\n", - " Stores K,V matrices for each transformer layer to avoid recomputation\n", - " during sequential token generation. 
This is THE critical optimization\n", - " that makes production language model serving economically viable.\n", - " \n", - " ⚠️ IMPORTANT: INFERENCE-ONLY (No Gradient Tracking)\n", - " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n", - " KV caching is designed ONLY for inference (generation), NOT training.\n", - " - During generation: No gradients computed (model.eval() mode)\n", - " - Cache operations use .data (no gradient tracking)\n", - " - This is correct and intentional for maximum speed\n", - " - DO NOT use caching during training (use standard forward pass)\n", - " \n", - " Architecture:\n", - " - Pre-allocates cache tensors with maximum sequence length\n", - " - Tracks current sequence position for efficient O(1) updates\n", - " - Provides update() method to append new K,V pairs without copying\n", - " - Provides get() method to retrieve cached values for attention\n", - " - Handles multiple layers and attention heads properly\n", - " \n", - " Memory Layout:\n", - " ```\n", - " Layer 0: [Key_cache, Value_cache] # Shape: (batch, num_heads, max_seq, head_dim)\n", - " Layer 1: [Key_cache, Value_cache]\n", - " ...\n", - " Layer N: [Key_cache, Value_cache]\n", - " ```\n", - "\n", - " Performance:\n", - " - Update: O(1) - just index assignment\n", - " - Get: O(1) - just slicing (no data copy)\n", - " - Memory: O(num_layers × batch × heads × max_seq × head_dim)\n", - " \"\"\"\n", - "\n", - " def __init__(self, batch_size: int, max_seq_len: int, num_layers: int,\n", - " num_heads: int, head_dim: int):\n", - " \"\"\"\n", - " Initialize KV cache for efficient generation.\n", - "\n", - " TODO: Set up pre-allocated cache storage for all transformer layers\n", - "\n", - " APPROACH:\n", - " 1. Store configuration parameters (batch_size, max_seq_len, etc.)\n", - " 2. Initialize sequence position counter to 0\n", - " 3. Create empty list for cache storage\n", - " 4. For each layer, pre-allocate zero-filled key and value caches\n", - " 5. 
Store each layer's (key_cache, value_cache) tuple in the list\n", - "\n", - " Args:\n", - " batch_size: Number of sequences to generate simultaneously\n", - " max_seq_len: Maximum sequence length to support\n", - " num_layers: Number of transformer layers\n", - " num_heads: Number of attention heads per layer\n", - " head_dim: Dimension of each attention head\n", - "\n", - " EXAMPLE:\n", - " >>> cache = KVCache(batch_size=2, max_seq_len=128, num_layers=4,\n", - " ... num_heads=8, head_dim=64)\n", - " >>> cache.seq_pos # 0 (no tokens cached yet)\n", - " >>> len(cache.caches) # 4 (one per layer)\n", - " >>> cache.caches[0][0].shape # (2, 8, 128, 64) - key cache for layer 0\n", - "\n", - " HINTS:\n", - " - Cache shape: (batch_size, num_heads, max_seq_len, head_dim)\n", - " - Use Tensor(np.zeros(...)) to create cache tensors\n", - " - Store caches as list of tuples: [(key_0, val_0), (key_1, val_1), ...]\n", - " - Pre-allocation avoids dynamic resizing overhead during generation\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.batch_size = batch_size\n", - " self.max_seq_len = max_seq_len\n", - " self.num_layers = num_layers\n", - " self.num_heads = num_heads\n", - " self.head_dim = head_dim\n", - "\n", - " # Current sequence position (how many tokens are cached)\n", - " self.seq_pos = 0\n", - "\n", - " # Cache storage: list of (key_cache, value_cache) tuples per layer\n", - " self.caches = []\n", - "\n", - " for layer_idx in range(num_layers):\n", - " # Pre-allocate cache tensors with maximum size\n", - " # Shape: (batch_size, num_heads, max_seq_len, head_dim)\n", - " key_cache = Tensor(np.zeros((batch_size, num_heads, max_seq_len, head_dim)))\n", - " value_cache = Tensor(np.zeros((batch_size, num_heads, max_seq_len, head_dim)))\n", - "\n", - " self.caches.append((key_cache, value_cache))\n", - " ### END SOLUTION\n", - "\n", - " def update(self, layer_idx: int, key: Tensor, value: Tensor) -> None:\n", - " \"\"\"\n", - " Update cache with new key-value pairs for 
given layer.\n", - "\n", - " TODO: Efficiently append new K,V to cache without data copying\n", - "\n", - " APPROACH:\n", - " 1. Validate layer_idx is in range [0, num_layers-1]\n", - " 2. Validate seq_pos hasn't exceeded max_seq_len\n", - " 3. Retrieve the (key_cache, value_cache) tuple for this layer\n", - " 4. Write new key to position seq_pos in key_cache using indexed assignment\n", - " 5. Write new value to position seq_pos in value_cache using indexed assignment\n", - " 6. Note: seq_pos is advanced externally via advance() after all layers\n", - "\n", - " This is the core caching operation - efficiently append new K,V\n", - " to the cache without recomputation. This operation is O(1) because\n", - " it's just an indexed assignment.\n", - "\n", - " IMPORTANT: KV caching is designed for INFERENCE (generation) only,\n", - " not training. During generation, gradients are not computed. If you\n", - " need gradients, don't use caching (use standard forward pass instead).\n", - "\n", - " Args:\n", - " layer_idx: Which transformer layer (0 to num_layers-1)\n", - " key: New key tensor, shape (batch_size, num_heads, 1, head_dim)\n", - " value: New value tensor, shape (batch_size, num_heads, 1, head_dim)\n", - "\n", - " EXAMPLE:\n", - " >>> cache = KVCache(batch_size=1, max_seq_len=10, num_layers=2,\n", - " ... 
num_heads=4, head_dim=64)\n", - " >>> new_k = Tensor(np.random.randn(1, 4, 1, 64))\n", - " >>> new_v = Tensor(np.random.randn(1, 4, 1, 64))\n", - " >>> cache.update(layer_idx=0, key=new_k, value=new_v)\n", - " >>> cache.seq_pos # Still 0 (update doesn't advance position)\n", - " >>> cache.advance()\n", - " >>> cache.seq_pos # Now 1\n", - "\n", - " HINTS:\n", - " - Use slicing: cache[:, :, seq_pos:seq_pos+1, :] to write to position\n", - " - Use .data for direct NumPy access (no gradient tracking needed)\n", - " - Raise ValueError with helpful messages for invalid inputs\n", - " - This is an in-place operation (modifies cache, returns None)\n", - "\n", - " Raises:\n", - " ValueError: If layer_idx is out of range or sequence is full\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if layer_idx >= self.num_layers:\n", - " raise ValueError(f\"Layer index {layer_idx} >= num_layers {self.num_layers}\")\n", - "\n", - " if self.seq_pos >= self.max_seq_len:\n", - " raise ValueError(f\"Sequence position {self.seq_pos} >= max_seq_len {self.max_seq_len}\")\n", - "\n", - " # Get cache for this layer\n", - " key_cache, value_cache = self.caches[layer_idx]\n", - "\n", - " # Update cache at current position (efficient O(1) write)\n", - " # Note: We use .data here because caching is inference-only (no gradients needed)\n", - " # This avoids gradient tracking overhead during generation\n", - " key_cache.data[:, :, self.seq_pos:self.seq_pos+1, :] = key.data\n", - " value_cache.data[:, :, self.seq_pos:self.seq_pos+1, :] = value.data\n", - "\n", - " # Note: seq_pos is advanced externally via advance() after all layers process\n", - " ### END SOLUTION\n", - "\n", - " def get(self, layer_idx: int) -> Tuple[Tensor, Tensor]:\n", - " \"\"\"\n", - " Retrieve cached key-value pairs for attention computation.\n", - "\n", - " TODO: Return only the valid cached portion for this layer\n", - "\n", - " APPROACH:\n", - " 1. Validate layer_idx is in range\n", - " 2. 
Retrieve the (key_cache, value_cache) tuple for this layer\n", - " 3. Calculate valid_len = seq_pos (number of tokens currently cached)\n", - " 4. Slice key_cache to get [:, :, :valid_len, :] (only filled portion)\n", - " 5. Slice value_cache to get [:, :, :valid_len, :] (only filled portion)\n", - " 6. Wrap sliced data in new Tensor objects and return\n", - "\n", - " Returns only the valid portion of the cache (up to current seq_pos).\n", - " This is O(1) because we're just slicing NumPy arrays (view, not copy).\n", - "\n", - " IMPORTANT: Returns Tensors without gradient tracking since caching\n", - " is inference-only. The returned tensors can be used in attention\n", - " computation but won't propagate gradients backward.\n", - "\n", - " Args:\n", - " layer_idx: Which transformer layer to get cache for\n", - "\n", - " Returns:\n", - " (cached_keys, cached_values): Tensors shaped for attention\n", - " Keys: (batch_size, num_heads, seq_pos, head_dim)\n", - " Values: (batch_size, num_heads, seq_pos, head_dim)\n", - "\n", - " EXAMPLE:\n", - " >>> cache = KVCache(batch_size=1, max_seq_len=100, num_layers=2,\n", - " ... 
num_heads=4, head_dim=64)\n", - " >>> # After processing 3 tokens\n", - " >>> cache.seq_pos = 3\n", - " >>> cached_k, cached_v = cache.get(layer_idx=0)\n", - " >>> cached_k.shape # (1, 4, 3, 64) - only first 3 positions\n", - " >>> cached_v.shape # (1, 4, 3, 64)\n", - "\n", - " HINTS:\n", - " - valid_len = self.seq_pos (how many tokens have been cached so far)\n", - " - Use slicing: cache.data[:, :, :valid_len, :] to get valid portion\n", - " - Wrap result in Tensor() for consistency with TinyTorch API\n", - " - If seq_pos=0, returns empty cache (shape with 0 in sequence dimension)\n", - "\n", - " Raises:\n", - " ValueError: If layer_idx is out of range\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if layer_idx >= self.num_layers:\n", - " raise ValueError(f\"Layer index {layer_idx} >= num_layers {self.num_layers}\")\n", - "\n", - " # Get cache for this layer\n", - " key_cache, value_cache = self.caches[layer_idx]\n", - "\n", - " # Return only the valid portion (up to current sequence position)\n", - " # seq_pos tracks where to write next, so we have seq_pos valid tokens\n", - " valid_len = self.seq_pos\n", - "\n", - " # Note: Creating new Tensors from .data (no gradient tracking)\n", - " # This is correct for inference-only caching\n", - " cached_keys = Tensor(key_cache.data[:, :, :valid_len, :])\n", - " cached_values = Tensor(value_cache.data[:, :, :valid_len, :])\n", - "\n", - " return cached_keys, cached_values\n", - " ### END SOLUTION\n", - "\n", - " def advance(self) -> None:\n", - " \"\"\"\n", - " Advance sequence position after processing current token.\n", - "\n", - " Call this after all layers have processed the current token and\n", - " updated their caches. 
This moves the write pointer forward.\n", - " \"\"\"\n", - " self.seq_pos += 1\n", - "\n", - " def reset(self) -> None:\n", - " \"\"\"\n", - " Reset the cache for a new generation sequence.\n", - "\n", - " Call this when starting a new generation (new prompt).\n", - " Resets the sequence position counter and zeros the cache data.\n", - " \"\"\"\n", - " self.seq_pos = 0\n", - "\n", - " # Zero out caches for clean state (helps with debugging)\n", - " for layer_idx in range(self.num_layers):\n", - " key_cache, value_cache = self.caches[layer_idx]\n", - " key_cache.data.fill(0.0)\n", - " value_cache.data.fill(0.0)\n", - "\n", - " def get_memory_usage(self) -> Dict[str, float]:\n", - " \"\"\"\n", - " Calculate memory usage of the cache system.\n", - "\n", - " Returns:\n", - " Dictionary with memory statistics in MB\n", - " \"\"\"\n", - " # Calculate size of one cache tensor\n", - " cache_size = self.batch_size * self.num_heads * self.max_seq_len * self.head_dim\n", - " bytes_per_float = 4 # float32\n", - "\n", - " # Each layer has key_cache + value_cache\n", - " total_cache_tensors = self.num_layers * 2\n", - " total_elements = cache_size * total_cache_tensors\n", - " total_bytes = total_elements * bytes_per_float\n", - " total_mb = total_bytes / (1024 * 1024)\n", - "\n", - " return {\n", - " 'total_mb': total_mb,\n", - " 'per_layer_mb': total_mb / self.num_layers,\n", - " 'cache_tensors': total_cache_tensors,\n", - " 'total_elements': total_elements\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "94cee9a8", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: KVCache Implementation\n", - "\n", - "Let's test that our cache correctly stores and retrieves key-value pairs across multiple layers and sequence positions.\n", - "\n", - "**This is a unit test** - it tests the KVCache class in isolation with simulated attention keys and values." 
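To make the accounting in `get_memory_usage()` concrete, here is a standalone sketch of the same arithmetic in plain Python (no TinyTorch dependencies; the function name and the example configurations are invented for illustration):

```python
def kv_cache_mb(batch_size, num_layers, num_heads, max_seq_len, head_dim,
                bytes_per_float=4):
    """Total KV cache size in MB: one key and one value tensor per layer (float32)."""
    elements_per_tensor = batch_size * num_heads * max_seq_len * head_dim
    total_elements = elements_per_tensor * 2 * num_layers  # x2 for key + value caches
    return total_elements * bytes_per_float / (1024 * 1024)

# Tiny test-sized configuration: well under 1 MB
print(f"{kv_cache_mb(2, 3, 4, 8, 16):.3f} MB")
# GPT-2-sized configuration (12 layers, 12 heads, seq 1024, head_dim 64): 72.0 MB
print(f"{kv_cache_mb(1, 12, 12, 1024, 64):.1f} MB")
```

Note how quickly the total grows with `max_seq_len` and `num_layers` - this is why tuning `max_seq_len` to the expected generation length matters in production.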
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "62409497", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-kvcache", - "locked": true, - "points": 10 - } - }, - "outputs": [], - "source": [ - "def test_unit_kvcache():\n", - " \"\"\"🔬 Unit Test: KVCache Implementation\"\"\"\n", - " print(\"🔬 Unit Test: KVCache Implementation...\")\n", - "\n", - " # Test parameters (small transformer for testing)\n", - " batch_size, max_seq_len = 2, 8\n", - " num_layers, num_heads, head_dim = 3, 4, 16\n", - "\n", - " # Create cache\n", - " cache = KVCache(batch_size, max_seq_len, num_layers, num_heads, head_dim)\n", - "\n", - " # Test 1: Initial state\n", - " assert cache.seq_pos == 0, \"Cache should start at position 0\"\n", - " mem_usage = cache.get_memory_usage()\n", - " assert mem_usage['total_mb'] > 0, \"Cache should have non-zero memory usage\"\n", - " print(f\" Cache initialized: {mem_usage['total_mb']:.2f} MB\")\n", - "\n", - " # Test 2: Single token update and retrieval\n", - " key1 = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim))\n", - " value1 = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim))\n", - "\n", - " # Update layer 0 with first token\n", - " cache.update(0, key1, value1)\n", - "\n", - " # Before advance, get() should return empty (seq_pos=0)\n", - " cached_k, cached_v = cache.get(0)\n", - " assert cached_k.shape == (batch_size, num_heads, 0, head_dim), \"Before advance, cache should be empty\"\n", - "\n", - " # Advance position\n", - " cache.advance()\n", - "\n", - " # Now cache should have 1 token\n", - " cached_k, cached_v = cache.get(0)\n", - " assert cached_k.shape == (batch_size, num_heads, 1, head_dim), f\"Expected shape (2,4,1,16), got {cached_k.shape}\"\n", - " assert cached_v.shape == (batch_size, num_heads, 1, head_dim), f\"Expected shape (2,4,1,16), got {cached_v.shape}\"\n", - "\n", - " # Test 3: Multi-token sequence\n", - " key2 = Tensor(np.random.randn(batch_size, num_heads, 1, 
head_dim))\n", - " value2 = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim))\n", - " cache.update(0, key2, value2)\n", - " cache.advance()\n", - "\n", - " cached_k, cached_v = cache.get(0)\n", - " assert cached_k.shape == (batch_size, num_heads, 2, head_dim), \"Should have 2 tokens cached\"\n", - " assert cached_v.shape == (batch_size, num_heads, 2, head_dim), \"Should have 2 tokens cached\"\n", - "\n", - " # Test 4: Multiple layers\n", - " cache.reset()\n", - " key_test = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim))\n", - " value_test = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim))\n", - "\n", - " # Update all layers with same token\n", - " cache.update(0, key_test, value_test) # Layer 0\n", - " cache.update(1, key_test, value_test) # Layer 1\n", - " cache.update(2, key_test, value_test) # Layer 2\n", - " cache.advance()\n", - "\n", - " # Each layer should have the cached token\n", - " for layer_idx in range(num_layers):\n", - " cached_k, cached_v = cache.get(layer_idx)\n", - " assert cached_k.shape[2] == 1, f\"Layer {layer_idx} should have 1 token\"\n", - "\n", - " # Test 5: Reset functionality\n", - " cache.reset()\n", - " assert cache.seq_pos == 0, \"Reset should clear sequence position\"\n", - " cached_k, cached_v = cache.get(0)\n", - " assert cached_k.shape == (batch_size, num_heads, 0, head_dim), \"Reset should clear cache\"\n", - "\n", - " print(\"✅ KVCache implementation works correctly!\")\n", - "\n", - "# Run test immediately when developing this module\n", - "if __name__ == \"__main__\":\n", - " test_unit_kvcache()" - ] - }, - { - "cell_type": "markdown", - "id": "39ea5911", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🎯 Part 4: Enabling KV Caching for Model Generation\n", - "\n", - "### Integration Strategy\n", - "\n", - "Now we need a clean way to enable KV caching in our existing transformer models without breaking the existing code. 
We'll create an `enable_kv_cache()` function that:\n", - "\n", - "1. Creates a KVCache instance sized for the model\n", - "2. Returns a flag to indicate caching is enabled\n", - "3. Can be called before generation starts\n", - "\n", - "The actual integration with attention will happen in the milestone code where we:\n", - "1. Check if cache is enabled\n", - "2. Only compute K,V for new token (not all tokens)\n", - "3. Update cache with new K,V\n", - "4. Use cached K,V for attention computation\n", - "\n", - "### Generation Flow Comparison\n", - "\n", - "```\n", - "Without Cache (Current):\n", - "for each new token:\n", - " input_seq = [all tokens so far] # Length grows: 1, 2, 3, ...\n", - " logits = model.forward(input_seq) # Recomputes everything!\n", - " next_token = sample(logits[-1])\n", - " append next_token\n", - "\n", - "With Cache (New):\n", - "cache = enable_kv_cache(model)\n", - "for each new token:\n", - " input_token = [just new token] # Length always 1\n", - " logits = model.forward_cached(input_token, cache) # Only new computation\n", - " next_token = sample(logits[-1])\n", - " append next_token\n", - "```\n", - "\n", - "**Key Difference**: Input changes from growing sequence to single token, with cache providing history." 
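The cached generation flow above can be sketched end to end with a toy single-head, single-layer attention in plain NumPy (illustrative only; the `W_q`/`W_k`/`W_v` names and shapes are made up for this sketch and are not part of the TinyTorch API):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding / head dimension
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

tokens = rng.standard_normal((5, d))    # 5 token embeddings, "generated" one by one
K_cache, V_cache = [], []               # grows by one entry per step

for t, x in enumerate(tokens):
    K_cache.append(x @ W_k)             # only the NEW token's K,V are computed
    V_cache.append(x @ W_v)
    out_cached = attend(x @ W_q, np.array(K_cache), np.array(V_cache))

    # Reference: recompute K,V for ALL tokens seen so far (the no-cache path)
    K_full = tokens[:t + 1] @ W_k
    V_full = tokens[:t + 1] @ W_v
    out_full = attend(x @ W_q, K_full, V_full)
    assert np.allclose(out_cached, out_full)

print("cached attention matches full recomputation at every step")
```

The assertion is the whole point: appending one K,V pair per step and reusing the rest gives bit-for-bit the same attention output as recomputing K and V for the entire history.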
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7f453db6", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "#| export\n", - "def enable_kv_cache(batch_size: int, max_seq_len: int, num_layers: int,\n", - " num_heads: int, head_dim: int) -> KVCache:\n", - " \"\"\"\n", - " Create and return a KVCache instance for model generation.\n", - " \n", - " This function creates a properly sized cache for the model architecture.\n", - " Call this before starting generation, then pass the cache to your\n", - " generation loop.\n", - "\n", - " Args:\n", - " batch_size: Number of sequences to generate simultaneously\n", - " max_seq_len: Maximum sequence length to support\n", - " num_layers: Number of transformer layers in model\n", - " num_heads: Number of attention heads per layer\n", - " head_dim: Dimension per attention head (usually embed_dim // num_heads)\n", - "\n", - " Returns:\n", - " KVCache instance ready for use\n", - " \n", - " Example:\n", - " ```python\n", - " # Enable caching for generation\n", - " cache = enable_kv_cache(\n", - " batch_size=1,\n", - " max_seq_len=100,\n", - " num_layers=4,\n", - " num_heads=4,\n", - " head_dim=32\n", - " )\n", - " \n", - " # Use in generation loop (pseudocode)\n", - " for step in range(max_new_tokens):\n", - " # Only process new token with cache\n", - " logits = model.forward_cached(new_token, cache)\n", - " next_token = sample(logits)\n", - " ```\n", - " \"\"\"\n", - " cache = KVCache(batch_size, max_seq_len, num_layers, num_heads, head_dim)\n", - " \n", - " print(f\"⚡ KV Cache enabled:\")\n", - " print(f\" Batch size: {batch_size}\")\n", - " print(f\" Max sequence: {max_seq_len}\")\n", - " print(f\" Layers: {num_layers}\")\n", - " print(f\" Heads: {num_heads}\")\n", - " print(f\" Head dim: {head_dim}\")\n", - " \n", - " mem_info = cache.get_memory_usage()\n", - " print(f\" Memory: {mem_info['total_mb']:.2f} MB\")\n", - " print()\n", - " \n", - " return cache" - ] - }, - { - 
"cell_type": "markdown", - "id": "80402a25", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Cache Enablement\n", - "\n", - "Let's verify that we can create caches for realistic model configurations.\n", - "\n", - "**This is a unit test** - it tests the cache creation and memory calculation for different model sizes." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fc77d324", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-cache-enablement", - "locked": true, - "points": 10 - } - }, - "outputs": [], - "source": [ - "def test_unit_cache_enablement():\n", - " \"\"\"🔬 Unit Test: Cache Enablement for Different Models\"\"\"\n", - " print(\"🔬 Unit Test: Cache Enablement for Different Models...\")\n", - "\n", - " # Test 1: Small model (fast generation)\n", - " print(\" Test 1: Small Model (Tiny Transformer)\")\n", - " cache_small = KVCache(\n", - " batch_size=1,\n", - " max_seq_len=64,\n", - " num_layers=2,\n", - " num_heads=4,\n", - " head_dim=32\n", - " )\n", - " mem_small = cache_small.get_memory_usage()\n", - " assert mem_small['total_mb'] < 1.0, \"Small model should use < 1 MB\"\n", - " print(f\" Small model cache: {mem_small['total_mb']:.3f} MB\")\n", - "\n", - " # Test 2: Medium model (balanced performance)\n", - " print(\" Test 2: Medium Model (Standard Transformer)\")\n", - " cache_medium = KVCache(\n", - " batch_size=1,\n", - " max_seq_len=128,\n", - " num_layers=4,\n", - " num_heads=8,\n", - " head_dim=64\n", - " )\n", - " mem_medium = cache_medium.get_memory_usage()\n", - " assert 1.0 < mem_medium['total_mb'] < 10.0, \"Medium model should use 1-10 MB\"\n", - " print(f\" Medium model cache: {mem_medium['total_mb']:.3f} MB\")\n", - "\n", - " # Test 3: Batch inference (multiple sequences)\n", - " print(\" Test 3: Batch Inference (4 sequences)\")\n", - " cache_batch = KVCache(\n", - " batch_size=4, # Generate 4 sequences in parallel\n", - " max_seq_len=64,\n", 
- " num_layers=2,\n", - " num_heads=4,\n", - " head_dim=32\n", - " )\n", - " mem_batch = cache_batch.get_memory_usage()\n", - " assert mem_batch['total_mb'] > mem_small['total_mb'], \"Batch cache should be larger\"\n", - " print(f\" Batch cache: {mem_batch['total_mb']:.3f} MB (4x batch size)\")\n", - "\n", - " print(\"✅ Cache enablement works correctly!\")\n", - "\n", - "# Run test immediately when developing this module\n", - "if __name__ == \"__main__\":\n", - " test_unit_cache_enablement()" - ] - }, - { - "cell_type": "markdown", - "id": "df7728e0", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 Part 5: Using KV Cache in Practice\n", - "\n", - "### Practical Integration Checklist\n", - "\n", - "To use KV caching in your transformer generation:\n", - "\n", - "**✅ Before Generation:**\n", - "1. Create cache with `enable_kv_cache()`\n", - "2. Set cache dimensions to match your model architecture\n", - "3. Verify memory usage is acceptable\n", - "\n", - "**✅ During Generation (Modified Forward Pass):**\n", - "1. For the first token (prompt), process normally and populate cache\n", - "2. For subsequent tokens:\n", - " - Only process the NEW token (not entire sequence)\n", - " - Update cache with new K,V pairs\n", - " - Retrieve full cached K,V for attention\n", - " - Use cached values in attention computation\n", - " - Advance cache position after all layers\n", - "\n", - "**✅ After Generation:**\n", - "1. Reset cache if generating another sequence\n", - "2. 
Monitor memory usage for production deployment\n", - "\n", - "### Performance Expectations\n", - "\n", - "```\n", - "Expected Speedup by Sequence Length:\n", - "┌───────────┬──────────┬───────────┬──────────┐\n", - "│ Seq Len │ No Cache │ With Cache│ Speedup │\n", - "├───────────┼──────────┼───────────┼──────────┤\n", - "│ 10 tokens│ ~80 tok/s│ ~600 tok/s│ 7.5x │\n", - "│ 25 tokens│ ~40 tok/s│ ~500 tok/s│ 12.5x │\n", - "│ 50 tokens│ ~25 tok/s│ ~400 tok/s│ 16.0x │\n", - "│ 100 tokens│ ~12 tok/s│ ~200 tok/s│ 16.7x │\n", - "└───────────┴──────────┴───────────┴──────────┘\n", - "\n", - "Key Insight: Speedup increases with sequence length!\n", - "Why? Longer sequences = more redundant computation without cache.\n", - "```\n", - "\n", - "### Production Considerations\n", - "\n", - "**Memory Management:**\n", - "- Cache memory = `batch_size × num_layers × 2 × num_heads × max_seq_len × head_dim × 4 bytes` (the factor of 2 covers the key and value caches)\n", - "- For GPT-2 (12 layers, 12 heads, seq_len=1024, head_dim=64): ~72 MB per sequence in float32\n", - "- For GPT-3 (96 layers, 96 heads, seq_len=2048, head_dim=128): ~19 GB per sequence in float32 (~9.7 GB in float16)\n", - "\n", - "**Trade-off Analysis:**\n", - "- **10x+ speedup** for typical generation lengths (50-200 tokens)\n", - "- **Modest memory cost** compared to model parameters (a small fraction of model size for large models)\n", - "- **Enables real-time interaction** that's impossible without caching\n", - "\n", - "**Best Practices:**\n", - "1. Always use caching for production serving\n", - "2. Tune `max_seq_len` to expected generation length (don't over-allocate)\n", - "3. Consider batch inference to amortize model loading costs\n", - "4. 
Monitor cache memory usage in production" - ] - }, - { - "cell_type": "markdown", - "id": "1df5b0fc", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🎯 Part 6: Non-Invasive Integration with Existing Models\n", - "\n", - "### The Challenge\n", - "\n", - "We built KV caching in Module 14, but our transformer (Modules 12-13) doesn't know about it!\n", - "\n", - "**❌ BAD Solution**: Go back and modify Module 12 (MultiHeadAttention)\n", - "- Breaks \"forward-only\" learning (students shouldn't revisit old modules)\n", - "- Makes Module 12 depend on Module 14 (wrong dependency direction!)\n", - "- Violates clean module boundaries\n", - "\n", - "**✅ GOOD Solution**: Module 14 ADDS caching to existing models without modification!\n", - "- Use composition + monkey-patching (like `enable_autograd()`)\n", - "- Module 14 wraps and enhances Module 12 rather than modifying it\n", - "- Students learn systems engineering: \"Add capabilities, don't break old code\"\n", - "\n", - "### Implementation Strategy\n", - "\n", - "We'll create `enable_kv_cache(model)` that:\n", - "1. Creates cache for the model's architecture\n", - "2. Wraps each attention layer with caching logic\n", - "3. Intercepts attention calls and manages cache automatically\n", - "4. Returns the cache for manual control if needed\n", - "\n", - "This is **non-invasive enhancement** - a critical ML systems pattern!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7a8281fd", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "enable-kv-cache", - "solution": true - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def enable_kv_cache(model):\n", - " \"\"\"\n", - " Enable KV caching for a transformer model WITHOUT modifying Module 12/13 code.\n", - "\n", - " TODO: Create cache and non-invasively patch attention layers\n", - "\n", - " APPROACH:\n", - " 1. 
Validate model has required attributes (embed_dim, num_layers, num_heads, max_seq_len, blocks)\n", - " 2. Calculate head_dim from embed_dim and num_heads\n", - " 3. Create KVCache instance sized for this model's architecture\n", - " 4. Store cache on model as model._kv_cache and set model._cache_enabled flag\n", - " 5. For each transformer block, wrap its attention forward method with caching logic\n", - " 6. Print confirmation message with cache statistics\n", - " 7. Return the cache object\n", - "\n", - " This function demonstrates **non-invasive optimization** - adding capabilities\n", - " to existing systems without breaking them. Similar to how Module 05 (Autograd)\n", - " uses enable_autograd() to add gradient tracking to Tensors.\n", - "\n", - " Args:\n", - " model: A GPT-style transformer model with:\n", - " - model.embed_dim (int)\n", - " - model.num_layers (int)\n", - " - model.num_heads (int)\n", - " - model.max_seq_len (int)\n", - " - model.blocks (list of TransformerBlock objects)\n", - "\n", - " Returns:\n", - " cache: KVCache object for this model\n", - "\n", - " EXAMPLE:\n", - " >>> from tinytorch.models.transformer import GPT\n", - " >>> model = GPT(vocab_size=100, embed_dim=128, num_layers=4, num_heads=4)\n", - " >>> cache = enable_kv_cache(model)\n", - " >>> hasattr(model, '_kv_cache') # True\n", - " >>> model._cache_enabled # True\n", - " >>> cache.num_layers # 4 (matches model)\n", - "\n", - " HINTS:\n", - " - Use hasattr() to validate model attributes exist\n", - " - head_dim = model.embed_dim // model.num_heads\n", - " - Store cache on model with model._kv_cache = cache\n", - " - Set flag with model._cache_enabled = True\n", - " - Save original forward with block._original_attention_forward\n", - " - Use a factory function to create patched forwards (closure captures layer_idx)\n", - "\n", - " Pedagogical Note:\n", - " This teaches students that optimizations can be LAYERED on top of\n", - " working systems. 
Module 14 doesn't break Modules 12-13; it enhances them!\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " import types\n", - "\n", - " # Validate model has required attributes\n", - " required_attrs = ['embed_dim', 'num_layers', 'num_heads', 'max_seq_len', 'blocks']\n", - " for attr in required_attrs:\n", - " if not hasattr(model, attr):\n", - " raise AttributeError(\n", - " f\"Model missing '{attr}' - enable_kv_cache() requires a GPT-style model \"\n", - " f\"with {', '.join(required_attrs)}\"\n", - " )\n", - "\n", - " # Calculate head dimension\n", - " head_dim = model.embed_dim // model.num_heads\n", - " if model.embed_dim % model.num_heads != 0:\n", - " raise ValueError(\n", - " f\"embed_dim ({model.embed_dim}) must be divisible by num_heads ({model.num_heads})\"\n", - " )\n", - "\n", - " # Create cache for this model\n", - " cache = KVCache(\n", - " batch_size=1, # Default to single sequence; can be reset for batch inference\n", - " max_seq_len=model.max_seq_len,\n", - " num_layers=model.num_layers,\n", - " num_heads=model.num_heads,\n", - " head_dim=head_dim\n", - " )\n", - "\n", - " # Store cache on model for easy access\n", - " model._kv_cache = cache\n", - " model._cache_enabled = True\n", - "\n", - " # Patch each transformer block's attention\n", - " for layer_idx, block in enumerate(model.blocks):\n", - " # Store original attention forward method\n", - " if not hasattr(block, '_original_attention_forward'):\n", - " block._original_attention_forward = block.attention.forward\n", - "\n", - " # Create cached version\n", - " def make_cached_forward(layer_idx, original_forward, cache_obj):\n", - " \"\"\"Factory to create cached forward with correct layer_idx closure\"\"\"\n", - " def cached_forward(x, mask=None):\n", - " \"\"\"\n", - " Cached attention forward pass with REAL speedup!\n", - " \n", - " PATH SELECTION STRATEGY (Key to Understanding KV Caching):\n", - " ──────────────────────────────────────────────────────────\n", - " \n", - " We have THREE 
possible paths through attention:\n", - " \n", - " 1️⃣ TRAINING PATH (seq_len > 1):\n", - " - Input: Full sequence of tokens (e.g., 64 tokens)\n", - " - Action: Use ORIGINAL attention (no caching)\n", - " - Why: Need full gradient flow for backpropagation\n", - " - Complexity: O(n²) but that's fine for training\n", - " - Example: x.shape = (batch=1, seq=64, embed=128)\n", - " \n", - " 2️⃣ FIRST TOKEN PATH (seq_len == 1 AND cache empty):\n", - " - Input: Single token (the first one in generation)\n", - " - Action: Use ORIGINAL attention (initialize cache)\n", - " - Why: Cache is empty, nothing to retrieve yet\n", - " - Complexity: O(1) - only one token\n", - " - Example: x.shape = (batch=1, seq=1, embed=128), cache.seq_pos=0\n", - " \n", - " 3️⃣ CACHED GENERATION PATH (seq_len == 1 AND cache populated):\n", - " - Input: Single NEW token (during generation)\n", - " - Action: Compute K,V for new token ONLY, retrieve history from cache\n", - " - Why: This is where the speedup happens! O(n²) → O(n)\n", - " - Complexity: O(n) - only compute for new token, reuse cache\n", - " - Example: x.shape = (batch=1, seq=1, embed=128), cache.seq_pos=5\n", - " \n", - " \n", - " WHY .data INSTEAD OF TENSOR OPERATIONS?\n", - " ────────────────────────────────────────\n", - " \n", - " In the cached path, we use numpy via .data for three reasons:\n", - " \n", - " 1. **Explicit Intent**: Makes it crystal clear this is inference-only\n", - " - Training: Uses Tensor operations → gradients tracked\n", - " - Inference: Uses .data → no gradient overhead\n", - " \n", - " 2. **Performance**: Avoids any autograd bookkeeping\n", - " - Even if small, every bit counts in generation\n", - " - Production LLMs (vLLM, llama.cpp) use similar patterns\n", - " \n", - " 3. 
**Educational Clarity**: Shows students the distinction\n", - " - \"When do I need gradients?\" (training)\n", - " - \"When can I skip them?\" (inference)\n", - " \n", - " We COULD use Tensor operations with requires_grad=False, but .data\n", - " is more explicit and is the industry-standard pattern.\n", - " \n", - " \n", - " THE O(n²) → O(n) TRANSFORMATION:\n", - " ─────────────────────────────────\n", - " \n", - " WITHOUT Cache (Standard Attention):\n", - " Step 1: Process token 1 → Compute attention for 1 token (1² = 1 op)\n", - " Step 2: Process tokens 1-2 → Compute attention for 2 tokens (2² = 4 ops)\n", - " Step 3: Process tokens 1-3 → Compute attention for 3 tokens (3² = 9 ops)\n", - " ...\n", - " Step N: Process tokens 1-N → Compute attention for N tokens (N² ops)\n", - " \n", - " Total: 1 + 4 + 9 + ... + N² = O(N³) across all steps!\n", - " \n", - " WITH Cache (Our Implementation):\n", - " Step 1: Process token 1 → Compute K,V for token 1, cache it (1 op)\n", - " Step 2: Process token 2 → Compute K,V for token 2, retrieve 1 (2 ops)\n", - " Step 3: Process token 3 → Compute K,V for token 3, retrieve 1-2 (3 ops)\n", - " ...\n", - " Step N: Process token N → Compute K,V for token N, retrieve 1-(N-1) (N ops)\n", - " \n", - " Total: 1 + 2 + 3 + ... 
+ N = O(N²) across all steps!\n", - " \n", - " That's why we see 5-7x speedup on short sequences, and 10-15x on longer ones!\n", - " \"\"\"\n", - " from tinytorch.core.tensor import Tensor\n", - " import numpy as np\n", - " \n", - " seq_len = x.shape[1]\n", - " \n", - " # ═══════════════════════════════════════════════════════════════\n", - " # PATH SELECTION: Choose between training, first token, or cached\n", - " # ═══════════════════════════════════════════════════════════════\n", - " \n", - " # PATH 1: TRAINING (seq_len > 1)\n", - " # ───────────────────────────────────\n", - " # Input is a full sequence (e.g., 64 tokens during training)\n", - " # We MUST use original attention to preserve gradient flow\n", - " # No caching during training - we need backprop through everything\n", - " if seq_len > 1:\n", - " return original_forward(x, mask) # O(n²) but preserves gradients\n", - " \n", - " # PATH 2: FIRST TOKEN (seq_len == 1, cache empty)\n", - " # ────────────────────────────────────────────────\n", - " # This is the very first token in generation (cache.seq_pos == 0)\n", - " # Cache is empty, so there's nothing to retrieve yet\n", - " # Use original attention to process this token, which will populate cache\n", - " if cache_obj.seq_pos == 0:\n", - " return original_forward(x, mask) # O(1) - just one token\n", - " \n", - " # PATH 3: CACHED GENERATION (seq_len == 1, cache populated)\n", - " # ──────────────────────────────────────────────────────────\n", - " # This is a NEW token during generation (cache has history)\n", - " # We can now use the cache for massive speedup!\n", - " # Compute K,V for ONLY this new token, retrieve cached history\n", - " \n", - " # Get attention layer (assumes block.attention has the attention object)\n", - " attention = block.attention\n", - " \n", - " # Step 1: Compute Q, K, V for NEW token only\n", - " # Access the linear projection layers\n", - " Q_new = attention.q_proj.forward(x) # (batch, 1, embed_dim)\n", - " K_new = 
attention.k_proj.forward(x) # (batch, 1, embed_dim)\n", - " V_new = attention.v_proj.forward(x) # (batch, 1, embed_dim)\n", - " \n", - " # Step 2: Reshape to multi-head format\n", - " batch_size = x.shape[0]\n", - " num_heads = attention.num_heads\n", - " head_dim = attention.head_dim\n", - " \n", - " # Reshape: (batch, 1, embed_dim) → (batch, num_heads, 1, head_dim)\n", - " Q_heads = Q_new.reshape(batch_size, 1, num_heads, head_dim)\n", - " Q_heads = Tensor(np.transpose(Q_heads.data, (0, 2, 1, 3))) # (batch, num_heads, 1, head_dim)\n", - " \n", - " K_heads = K_new.reshape(batch_size, 1, num_heads, head_dim)\n", - " K_heads = Tensor(np.transpose(K_heads.data, (0, 2, 1, 3)))\n", - " \n", - " V_heads = V_new.reshape(batch_size, 1, num_heads, head_dim)\n", - " V_heads = Tensor(np.transpose(V_heads.data, (0, 2, 1, 3)))\n", - " \n", - " # Step 3: Update cache with new K, V (using .data for performance)\n", - " cache_obj.update(layer_idx, K_heads, V_heads)\n", - " \n", - " # Step 4: Retrieve ALL cached K, V (includes history + new token)\n", - " K_all, V_all = cache_obj.get(layer_idx)\n", - " \n", - " # Step 5: Compute attention using new Q with ALL cached K, V\n", - " # ─────────────────────────────────────────────────────────\n", - " # Scaled dot-product attention: softmax(Q @ K^T / sqrt(d_k)) @ V\n", - " #\n", - " # NOTE: We use .data (numpy arrays) here instead of Tensor operations\n", - " # Why? 
This is INFERENCE-ONLY code (no gradients needed):\n", - " # - Explicit: Makes it clear this is inference, not training\n", - " # - Fast: Avoids autograd overhead (even if small)\n", - " # - Standard: Production LLMs (vLLM, llama.cpp) do the same\n", - " #\n", - " # If this were training, we'd use Tensor operations for gradient flow.\n", - " # But in generation (inference), .data is the right choice.\n", - " \n", - " # Q @ K^T: (batch, num_heads, 1, head_dim) @ (batch, num_heads, head_dim, seq_len)\n", - " # → (batch, num_heads, 1, seq_len)\n", - " K_transposed = np.transpose(K_all.data, (0, 1, 3, 2)) # .data = numpy array\n", - " scores = np.matmul(Q_heads.data, K_transposed) # Pure numpy matmul\n", - " \n", - " # Scale by sqrt(head_dim)\n", - " scores = scores / np.sqrt(head_dim)\n", - " \n", - " # Apply mask if provided (causal mask for generation)\n", - " if mask is not None:\n", - " # Mask should be (1, 1, 1, seq_len) for this token\n", - " # In generation, we can attend to all previous tokens\n", - " pass # No masking needed in generation (we see all history)\n", - " \n", - " # Softmax over key dimension\n", - " scores_max = np.max(scores, axis=-1, keepdims=True)\n", - " exp_scores = np.exp(scores - scores_max)\n", - " attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)\n", - " \n", - " # Apply attention weights to values\n", - " # (batch, num_heads, 1, seq_len) @ (batch, num_heads, seq_len, head_dim)\n", - " # → (batch, num_heads, 1, head_dim)\n", - " attention_output = np.matmul(attention_weights, V_all.data)\n", - " \n", - " # Step 6: Reshape back and apply output projection\n", - " # (batch, num_heads, 1, head_dim) → (batch, 1, num_heads, head_dim)\n", - " attention_output_transposed = np.transpose(attention_output, (0, 2, 1, 3))\n", - " \n", - " # Concatenate heads: (batch, 1, num_heads * head_dim)\n", - " concat_data = attention_output_transposed.reshape(batch_size, 1, num_heads * head_dim)\n", - " concat_output = 
Tensor(concat_data)\n", - " \n", - " # Output projection\n", - " output = attention.out_proj.forward(concat_output)\n", - " \n", - " return output\n", - " \n", - " return cached_forward\n", - "\n", - " # Patch this block's attention\n", - " block.attention.forward = make_cached_forward(layer_idx, block._original_attention_forward, cache)\n", - "\n", - " print(f\"⚡ KV Cache enabled for model!\")\n", - " print(f\" Architecture: {model.num_layers} layers × {model.num_heads} heads × {head_dim}D\")\n", - " print(f\" Memory: {cache.get_memory_usage()['total_mb']:.2f} MB\")\n", - " print(f\" Cache stored in: model._kv_cache\")\n", - " print()\n", - " print(f\"💡 To disable: call disable_kv_cache(model)\")\n", - " print()\n", - "\n", - " return cache\n", - " ### END SOLUTION\n", - "\n", - "\n", - "#| export \n", - "def disable_kv_cache(model):\n", - " \"\"\"\n", - " Disable KV caching and restore original attention behavior.\n", - " \n", - " Args:\n", - " model: Model with caching enabled\n", - " \n", - " Example:\n", - " ```python\n", - " cache = enable_kv_cache(model)\n", - " # ... 
do cached generation ...\n", - " disable_kv_cache(model) # Back to normal\n", - " ```\n", - " \"\"\"\n", - " if not hasattr(model, '_cache_enabled') or not model._cache_enabled:\n", - " print(\"⚠️ KV cache not enabled on this model\")\n", - " return\n", - " \n", - " # Restore original attention forwards\n", - " for block in model.blocks:\n", - " if hasattr(block, '_original_attention_forward'):\n", - " block.attention.forward = block._original_attention_forward\n", - " \n", - " # Clean up\n", - " model._cache_enabled = False\n", - " if hasattr(model, '_kv_cache'):\n", - " delattr(model, '_kv_cache')\n", - " \n", - " print(\"✓ KV cache disabled, original attention restored\")" - ] - }, - { - "cell_type": "markdown", - "id": "969b4e1c", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Non-Invasive Cache Integration\n", - "\n", - "Let's verify that `enable_kv_cache()` works without breaking the model!\n", - "\n", - "**This is an integration test** - it tests Module 14 enhancing Modules 12-13 without modification." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2c198422", - "metadata": { - "lines_to_next_cell": 2, - "nbgrader": { - "grade": true, - "grade_id": "test-noninvasive", - "locked": true, - "points": 10 - } - }, - "outputs": [], - "source": [ - "def test_unit_noninvasive_integration():\n", - " \"\"\"🔬 Unit Test: Non-Invasive Cache Integration\"\"\"\n", - " print(\"🔬 Unit Test: Non-Invasive Cache Integration...\")\n", - "\n", - " # Create a mock transformer-like object for testing\n", - " class MockTransformerBlock:\n", - " def __init__(self):\n", - " self.attention = self\n", - "\n", - " def forward(self, x, mask=None):\n", - " # Simple pass-through for testing. Must accept mask so the\n", - " # patched cached forward can delegate via original_forward(x, mask)\n", - " return x\n", - "\n", - " class MockGPT:\n", - " def __init__(self):\n", - " self.vocab_size = 100\n", - " self.embed_dim = 128\n", - " self.num_layers = 4\n", - " self.num_heads = 4\n", - " self.max_seq_len = 64\n", - " self.blocks = [MockTransformerBlock() for _ in range(self.num_layers)]\n", - "\n", - " # Test 1: Enable caching\n", - " model = MockGPT()\n", - " print(\" Test 1: Enable caching on model\")\n", - " cache = enable_kv_cache(model)\n", - " assert hasattr(model, '_kv_cache'), \"Model should have _kv_cache attribute\"\n", - " assert hasattr(model, '_cache_enabled'), \"Model should have _cache_enabled flag\"\n", - " assert model._cache_enabled == True, \"Cache should be enabled\"\n", - " assert cache is model._kv_cache, \"Returned cache should match model._kv_cache\"\n", - "\n", - " # Test 2: Attention forward still works\n", - " print(\" Test 2: Attention forward pass still works\")\n", - " test_input = Tensor(np.random.randn(1, 10, 128))\n", - " for block in model.blocks:\n", - " output = block.attention.forward(test_input)\n", - " assert output.shape == test_input.shape, \"Forward pass should preserve shape\"\n", - "\n", - " # Test 3: Disable caching\n", - " print(\" Test 3: Disable caching\")\n", - " disable_kv_cache(model)\n", - " assert model._cache_enabled == False,
\"Cache should be disabled\"\n", - " assert not hasattr(model, '_kv_cache'), \"Cache object should be removed\"\n", - "\n", - " # Test 4: Can re-enable\n", - " print(\" Test 4: Re-enable caching\")\n", - " cache2 = enable_kv_cache(model)\n", - " assert model._cache_enabled == True, \"Cache should be re-enabled\"\n", - "\n", - " print(\"✅ Non-invasive cache integration works correctly!\")\n", - "\n", - "# Run test immediately when developing this module\n", - "if __name__ == \"__main__\":\n", - " test_unit_noninvasive_integration()" - ] - }, - { - "cell_type": "markdown", - "id": "5c56c36a", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🧪 Module Integration Test\n", - "\n", - "Final validation that everything works together correctly before module completion." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fbc1c29f", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "module-integration", - "locked": true, - "points": 20 - } - }, - "outputs": [], - "source": [ - "def test_module():\n", - " \"\"\"\n", - " Comprehensive test of entire KV Caching module functionality.\n", - "\n", - " This final test runs before module summary to ensure:\n", - " - All unit tests pass\n", - " - Functions work together correctly\n", - " - Module is ready for integration with TinyTorch\n", - " \"\"\"\n", - " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", - " print(\"=\" * 50)\n", - " print()\n", - "\n", - " # Run all unit tests\n", - " print(\"Running unit tests...\")\n", - " test_unit_kvcache()\n", - " print()\n", - " test_unit_cache_enablement()\n", - " print()\n", - " test_unit_noninvasive_integration()\n", - " print()\n", - "\n", - " print(\"Running integration scenarios...\")\n", - " print()\n", - "\n", - " # Integration Test: Complete KV Cache Workflow\n", - " print(\"🔬 Integration Test: Complete KV Cache Workflow...\")\n", - " batch_size, max_seq_len = 1, 128\n", - " 
num_layers, num_heads, head_dim = 4, 8, 64\n", - "\n", - " cache = KVCache(batch_size, max_seq_len, num_layers, num_heads, head_dim)\n", - "\n", - " # Simulate generation loop (processing multiple tokens)\n", - " for _ in range(5):\n", - " for layer_idx in range(num_layers):\n", - " # Simulate new key-value pairs\n", - " new_key = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim))\n", - " new_value = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim))\n", - "\n", - " # Update cache\n", - " cache.update(layer_idx, new_key, new_value)\n", - "\n", - " # Advance position after all layers processed\n", - " cache.advance()\n", - "\n", - " # Verify cache state\n", - " assert cache.seq_pos == 5, f\"Expected seq_pos=5, got {cache.seq_pos}\"\n", - "\n", - " # Verify retrieval\n", - " for layer_idx in range(num_layers):\n", - " cached_k, cached_v = cache.get(layer_idx)\n", - " assert cached_k.shape == (batch_size, num_heads, 5, head_dim)\n", - " assert cached_v.shape == (batch_size, num_heads, 5, head_dim)\n", - "\n", - " print(\"✅ Complete KV cache workflow validated!\")\n", - " print()\n", - "\n", - " # Integration Test: Memory Tracking\n", - " print(\"🔬 Integration Test: Memory Tracking...\")\n", - " mem_info = cache.get_memory_usage()\n", - " assert mem_info['total_mb'] > 0\n", - " assert mem_info['cache_tensors'] == num_layers * 2\n", - " print(f\"✅ Memory tracking: {mem_info['total_mb']:.2f} MB for {mem_info['cache_tensors']} tensors\")\n", - " print()\n", - "\n", - " print(\"=\" * 50)\n", - " print(\"🎉 ALL TESTS PASSED! 
Module ready for export.\")\n", - " print(\"Run: tito module complete 14\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e1b4fcb9", - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " test_module()" - ] - }, - { - "cell_type": "markdown", - "id": "ff6d655d", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎓 Module 14 Complete!\n", - "\n", - "You've implemented KV caching - the critical optimization that makes production language models economically viable!\n", - "\n", - "### What You Built\n", - "\n", - "✅ **KVCache Class**: Efficient memory management for key-value pairs across layers\n", - "✅ **O(1) Updates**: Fast cache updates without data copying\n", - "✅ **Memory Tracking**: Understanding cache size and memory trade-offs\n", - "✅ **Non-Invasive Integration**: `enable_kv_cache()` adds optimization WITHOUT breaking modules\n", - "✅ **Production Patterns**: Integration strategy for real transformer models\n", - "\n", - "### Key Systems Engineering Lesson\n", - "\n", - "**Module 14 doesn't modify Modules 12-13 - it ENHANCES them!**\n", - "\n", - "This teaches the critical principle: **Add capabilities forward, never break backward.**\n", - "- Old code keeps working (Module 12 unchanged)\n", - "- New code adds optimization (Module 14 layers on top)\n", - "- Clean separation of concerns (caching is separate from attention logic)\n", - "\n", - "### Performance Impact\n", - "\n", - "```\n", - "Without Cache: O(n²) complexity → slow, expensive, impractical\n", - "With Cache: O(n) complexity → fast, cheap, production-ready\n", - "\n", - "Real Impact: 10-15x speedup for typical generation!\n", - "```\n", - "\n", - "### What's Next\n", - "\n", - "**Module 15 (Profiling)**: Now that you've seen a concrete optimization, learn how to systematically measure and find more optimizations using professional profiling tools.\n", - "\n", - "### Try It Yourself\n", - 
"\n", - "Run the chatbot milestone with and without caching:\n", - "\n", - "```bash\n", - "# Without cache (slow - baseline)\n", - "python milestones/05_2017_transformer/vaswani_chatgpt.py\n", - "\n", - "# With cache (fast - 10-15x speedup!)\n", - "python milestones/05_2017_transformer/vaswani_chatgpt.py --use-cache\n", - "```\n", - "\n", - "Watch the tokens/sec metric jump from ~40 to ~500! 🚀\n", - "\n", - "---\n", - "\n", - "**Congratulations! You've completed Module 14: KV Caching!**\n", - "\n", - "You now understand the optimization that makes ChatGPT, Claude, and all production LLMs possible. This is THE technique that transformed language models from research toys into products used by millions of people every day.\n", - "\n", - "**From Theory to Practice**: You've gone from O(n²) naive generation to O(n) optimized generation. This is real ML engineering!" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules/source/14_kvcaching/kvcaching_dev.py b/modules/source/14_kvcaching/kvcaching_dev.py deleted file mode 100644 index 532c432c..00000000 --- a/modules/source/14_kvcaching/kvcaching_dev.py +++ /dev/null @@ -1,1470 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# kernelspec: -# display_name: Python 3 (ipykernel) -# language: python -# name: python3 -# --- - -# %% [markdown] -""" -# Module 14: KV Caching - Optimizing Autoregressive Generation - -Welcome to Module 14! You'll implement the critical optimization that makes production language models possible: Key-Value caching for 10-15x faster text generation. 
- -## 🔗 Prerequisites & Progress -**You've Built**: Complete transformer architecture with multi-head attention and text generation -**You'll Build**: Memory-efficient KV caching system that eliminates redundant computation -**You'll Enable**: Production-grade inference optimization and real-world serving capabilities - -**Connection Map**: -``` -Transformers → KV Caching → Production Serving -(slow O(n²)) (fast O(n)) (real-world scale) -``` - -## Learning Objectives -By the end of this module, you will: -1. Understand why autoregressive generation has O(n²) complexity without caching -2. Implement KVCache with efficient memory management and O(1) updates -3. Build cache-aware attention that reuses previously computed keys and values -4. Measure dramatic speedup gains (10-15x) and understand memory trade-offs -5. Connect to production optimization patterns used in real LLM serving - -Let's make inference blazingly fast! - -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/14_kvcaching/kvcaching_dev.py` -**Building Side:** Code exports to `tinytorch.generation.kv_cache` - -```python -# How to use this module: -from tinytorch.generation.kv_cache import KVCache, enable_kv_cache -``` - -**Why this matters:** -- **Learning:** Complete caching system demonstrating production optimization techniques -- **Production:** Proper organization matching Hugging Face's generation/ module structure -- **Consistency:** All generation optimizations in generation.kv_cache -- **Integration:** Works seamlessly with transformers for complete inference optimization -""" - -# %% -#| default_exp generation.kv_cache -#| export - -import numpy as np -import time -from typing import Tuple, Optional, Dict, List - -# Import TinyTorch components from previous modules -from tinytorch.core.tensor import Tensor - -# %% [markdown] -""" -## 🔬 Motivation: Why Memoization Matters for Transformers - -Before we learn KV caching, let's profile transformer 
generation to understand -the problem we're solving. We'll see O(n²) growth in latency as we generate text. -""" - -# %% -# Profile transformer generation to discover the bottleneck -from tinytorch.profiling.profiler import Profiler -import matplotlib.pyplot as plt - -profiler = Profiler() - -def naive_attention_step(seq_len, hidden_dim=64): - """ - Simulates one step of attention computation. - Without caching, this processes ALL previous tokens every time. - """ - # Q, K, V for entire sequence - q = Tensor(np.random.randn(1, seq_len, hidden_dim)) - k = Tensor(np.random.randn(1, seq_len, hidden_dim)) - v = Tensor(np.random.randn(1, seq_len, hidden_dim)) - - # Attention: Q @ K^T then @ V - # Transpose only the LAST TWO axes: for a 3D array, a plain .T would - # reverse ALL axes and break the batched matmul - k_t = Tensor(np.transpose(k.data, (0, 2, 1))) # (1, hidden_dim, seq_len) - # This is O(seq_len²) in complexity - scores = q @ k_t # (1, seq_len, seq_len) - output = scores @ v - - return output - -# Profile at increasing sequence lengths -print("🔬 Profiling Transformer Generation (Without Caching):\n") -print(" Seq Len | Latency (ms) | Growth") -print(" ---------|----------------|----------") - -sequence_lengths = [10, 20, 40, 80, 160] -latencies = [] - -for seq_len in sequence_lengths: - # Measure latency for this sequence length - latency = profiler.measure_latency( - lambda: naive_attention_step(seq_len), - None, - warmup=5, - iterations=20 - ) - latencies.append(latency) - - # Calculate growth rate - if len(latencies) > 1: - growth = latencies[-1] / latencies[-2] - print(f" {seq_len:3d} | {latency:6.2f} | {growth:.2f}×") - else: - print(f" {seq_len:3d} | {latency:6.2f} | baseline") - -print("\n💡 Key Observations:") -print(" • Latency grows QUADRATICALLY with sequence length") -print(" • Each new token forces recomputation of ALL previous K,V pairs") -print(" • For 160 tokens: ~4× time vs 80 tokens (2² growth)") - -print("\n🎯 The Problem:") -print(" K and V values for previous tokens NEVER change,") -print(" yet we recompute them every single step!") - -print("\n✨ The Solution:") -print(" CACHE the K,V values!
(That's memoization)") -print(" • First compute: Calculate and store K,V") -print(" • Later steps: Reuse stored K,V") -print(" • Complexity: O(n²) → O(n)") -print(" • Speedup: 10-15× for typical generation\n") - -# %% [markdown] -""" -## 🎯 Part 1: Understanding the Autoregressive Generation Problem - -### The Core Inefficiency - -When generating text token by token, transformers face a fundamental computational bottleneck. Let's visualize what happens during naive generation: - -``` -Token Generation Process (Without Caching): - -Step 1: Generate "Hello" -Input: [START] -Attention: Q₁ × [K₁] × [V₁] ← 1 computation - -Step 2: Generate "world" -Input: [START, Hello] -Attention: Q₂ × [K₁, K₂] × [V₁, V₂] ← 2 computations (K₁,V₁ RECOMPUTED!) - -Step 3: Generate "!" -Input: [START, Hello, world] -Attention: Q₃ × [K₁, K₂, K₃] × [V₁, V₂, V₃] ← 3 computations (K₁,V₁,K₂,V₂ RECOMPUTED!) -``` - -**The Problem**: For each new token, we recompute ALL previous key-value pairs even though they never change! - -### Computational Complexity Analysis - -``` -Naive Generation Complexity: -Step 1: 1 K,V computation -Step 2: 2 K,V computations -Step 3: 3 K,V computations -... -Step n: n K,V computations - -Total: 1 + 2 + 3 + ... + n = n(n+1)/2 = O(n²) complexity! -``` - -For a 100-token sequence, this means 5,050 K,V computations in total, **4,950 of which are redundant** (each of the 100 tokens only needs its K,V computed once)! - -### Real-World Impact - -This inefficiency makes production LLM serving economically impossible without optimization: -- **ChatGPT/GPT-4**: Would be too slow for real-time chat without caching -- **Code completion**: IDEs couldn't provide instant suggestions -- **Mobile deployment**: On-device generation would drain batteries instantly -- **API serving**: Server costs would be 10x+ higher - -**The Solution**: Cache key-value pairs after computing them once, transforming O(n²) into O(n).
-""" - -# %% [markdown] -""" -## 🧮 Part 2: The Key-Value Caching Insight - -### Mathematical Foundation - -The core insight comes from understanding what changes during autoregressive generation: - -``` -Attention Computation Breakdown: - -Q = new_token @ W_q ← Only new token (changes each step) -K = all_tokens @ W_k ← Includes old tokens (mostly redundant!) -V = all_tokens @ W_v ← Includes old tokens (mostly redundant!) - -attention_output = softmax(Q @ K.T / √d_k) @ V -``` - -**Key Insight**: K and V matrices for previous tokens NEVER change! - -``` -Token Dependencies: -K₁ = token₁ @ W_k ← Computed once, never changes -K₂ = token₂ @ W_k ← Computed once, never changes -K₃ = token₃ @ W_k ← Computed once, never changes - -Same for V₁, V₂, V₃... -``` - -### Cache-Optimized Generation - -``` -Optimized Generation Process (With Caching): - -Step 1: Generate "Hello" -Compute: K₁, V₁ → Store in cache -Attention: Q₁ × cached[K₁] × cached[V₁] - -Step 2: Generate "world" -Compute: K₂, V₂ → Append to cache -Attention: Q₂ × cached[K₁, K₂] × cached[V₁, V₂] - -Step 3: Generate "!" -Compute: K₃, V₃ → Append to cache -Attention: Q₃ × cached[K₁, K₂, K₃] × cached[V₁, V₂, V₃] -``` - -**Result**: Each step computes only ONE new K,V pair instead of recomputing ALL! - -### Memory vs Compute Trade-off - -``` -Traditional Approach: -Memory: O(1) (no storage needed) -Compute: O(n²) (recompute everything) - -Cached Approach: -Memory: O(n × d_k) (store all K,V pairs) -Compute: O(n) (only compute new pairs) - -For n=100, d_k=64 at float32: -Memory cost: 100 × 64 × 2 (K and V) × 4 bytes ≈ 51 KB per head per layer -Compute savings: ~50x reduction in K,V computations (5,050 → 100) -``` - -**Trade-off Winner**: Memory is cheap, compute is expensive! Use O(n) memory to save O(n²) compute. -""" - -# %% [markdown] -""" -## 🏗️ Part 3: KVCache Class Implementation - -### Core Requirements - -Our KVCache needs to efficiently handle: - -1. **Multi-layer storage**: Each transformer layer needs its own K,V cache -2.
**Multi-head attention**: Each attention head has separate K,V pairs -3. **Batch processing**: Support multiple sequences simultaneously (batch inference) -4. **Dynamic updates**: Efficiently append new tokens without copying data -5. **Memory management**: Pre-allocate space to avoid dynamic resizing overhead - -### Cache Architecture Visualization - -``` -KVCache Memory Layout: -┌─────────────────────────────────────────────────────────┐ -│ KVCache Object │ -├─────────────────────────────────────────────────────────┤ -│ Layer 0: ┌─────────────┬─────────────┐ │ -│ │ Key Cache │ Value Cache │ │ -│ │ (B,H,S,D) │ (B,H,S,D) │ │ -│ └─────────────┴─────────────┘ │ -├─────────────────────────────────────────────────────────┤ -│ Layer 1: ┌─────────────┬─────────────┐ │ -│ │ Key Cache │ Value Cache │ │ -│ │ (B,H,S,D) │ (B,H,S,D) │ │ -│ └─────────────┴─────────────┘ │ -├─────────────────────────────────────────────────────────┤ -│ ... ┌─────────────┬─────────────┐ │ -│ Layer N: │ Key Cache │ Value Cache │ │ -│ │ (B,H,S,D) │ (B,H,S,D) │ │ -│ └─────────────┴─────────────┘ │ -└─────────────────────────────────────────────────────────┘ - -Where: -B = batch_size (number of sequences) -H = num_heads (attention heads per layer) -S = max_seq_len (maximum sequence length) -D = head_dim (dimension per attention head) -``` - -### Update Operation Flow - -``` -Cache Update Process: - seq_pos = 2 - ↓ -┌─────┬─────┬─────┬─────┬─────┬─────┐ -│ K₁ │ K₂ │ ??? │ ??? │ ??? │ ??? │ ← Key Cache -├─────┼─────┼─────┼─────┼─────┼─────┤ -│ V₁ │ V₂ │ ??? │ ??? │ ??? │ ??? │ ← Value Cache -└─────┴─────┴─────┴─────┴─────┴─────┘ - -New token arrives: K₃, V₃ - - seq_pos = 2 - ↓ -┌─────┬─────┬─────┬─────┬─────┬─────┐ -│ K₁ │ K₂ │ K₃ │ ??? │ ??? │ ??? │ ← Write K₃ here -├─────┼─────┼─────┼─────┼─────┼─────┤ -│ V₁ │ V₂ │ V₃ │ ??? │ ??? │ ??? 
│ ← Write V₃ here -└─────┴─────┴─────┴─────┴─────┴─────┘ - -Then: seq_pos += 1 (advance to position 3) -``` - -This design enables **O(1) updates** - just write to the next position! -""" - -# %% nbgrader={"grade": false, "grade_id": "kvcache-class", "solution": true} -#| export -class KVCache: - """ - Efficient key-value cache for autoregressive generation. - - Stores K,V matrices for each transformer layer to avoid recomputation - during sequential token generation. This is THE critical optimization - that makes production language model serving economically viable. - - ⚠️ IMPORTANT: INFERENCE-ONLY (No Gradient Tracking) - ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ - KV caching is designed ONLY for inference (generation), NOT training. - - During generation: No gradients computed (model.eval() mode) - - Cache operations use .data (no gradient tracking) - - This is correct and intentional for maximum speed - - DO NOT use caching during training (use standard forward pass) - - Architecture: - - Pre-allocates cache tensors with maximum sequence length - - Tracks current sequence position for efficient O(1) updates - - Provides update() method to append new K,V pairs without copying - - Provides get() method to retrieve cached values for attention - - Handles multiple layers and attention heads properly - - Memory Layout: - ``` - Layer 0: [Key_cache, Value_cache] # Shape: (batch, num_heads, max_seq, head_dim) - Layer 1: [Key_cache, Value_cache] - ... - Layer N: [Key_cache, Value_cache] - ``` - - Performance: - - Update: O(1) - just index assignment - - Get: O(1) - just slicing (no data copy) - - Memory: O(num_layers × batch × heads × max_seq × head_dim) - """ - - def __init__(self, batch_size: int, max_seq_len: int, num_layers: int, - num_heads: int, head_dim: int): - """ - Initialize KV cache for efficient generation. - - TODO: Set up pre-allocated cache storage for all transformer layers - - APPROACH: - 1. 
Store configuration parameters (batch_size, max_seq_len, etc.) - 2. Initialize sequence position counter to 0 - 3. Create empty list for cache storage - 4. For each layer, pre-allocate zero-filled key and value caches - 5. Store each layer's (key_cache, value_cache) tuple in the list - - Args: - batch_size: Number of sequences to generate simultaneously - max_seq_len: Maximum sequence length to support - num_layers: Number of transformer layers - num_heads: Number of attention heads per layer - head_dim: Dimension of each attention head - - EXAMPLE: - >>> cache = KVCache(batch_size=2, max_seq_len=128, num_layers=4, - ... num_heads=8, head_dim=64) - >>> cache.seq_pos # 0 (no tokens cached yet) - >>> len(cache.caches) # 4 (one per layer) - >>> cache.caches[0][0].shape # (2, 8, 128, 64) - key cache for layer 0 - - HINTS: - - Cache shape: (batch_size, num_heads, max_seq_len, head_dim) - - Use Tensor(np.zeros(...)) to create cache tensors - - Store caches as list of tuples: [(key_0, val_0), (key_1, val_1), ...] - - Pre-allocation avoids dynamic resizing overhead during generation - """ - ### BEGIN SOLUTION - self.batch_size = batch_size - self.max_seq_len = max_seq_len - self.num_layers = num_layers - self.num_heads = num_heads - self.head_dim = head_dim - - # Current sequence position (how many tokens are cached) - self.seq_pos = 0 - - # Cache storage: list of (key_cache, value_cache) tuples per layer - self.caches = [] - - for layer_idx in range(num_layers): - # Pre-allocate cache tensors with maximum size - # Shape: (batch_size, num_heads, max_seq_len, head_dim) - key_cache = Tensor(np.zeros((batch_size, num_heads, max_seq_len, head_dim))) - value_cache = Tensor(np.zeros((batch_size, num_heads, max_seq_len, head_dim))) - - self.caches.append((key_cache, value_cache)) - ### END SOLUTION - - def update(self, layer_idx: int, key: Tensor, value: Tensor) -> None: - """ - Update cache with new key-value pairs for given layer. 
- - TODO: Efficiently append new K,V to cache without data copying - - APPROACH: - 1. Validate layer_idx is in range [0, num_layers-1] - 2. Validate seq_pos hasn't exceeded max_seq_len - 3. Retrieve the (key_cache, value_cache) tuple for this layer - 4. Write new key to position seq_pos in key_cache using indexed assignment - 5. Write new value to position seq_pos in value_cache using indexed assignment - 6. Note: seq_pos is advanced externally via advance() after all layers - - This is the core caching operation - efficiently append new K,V - to the cache without recomputation. This operation is O(1) because - it's just an indexed assignment. - - IMPORTANT: KV caching is designed for INFERENCE (generation) only, - not training. During generation, gradients are not computed. If you - need gradients, don't use caching (use standard forward pass instead). - - Args: - layer_idx: Which transformer layer (0 to num_layers-1) - key: New key tensor, shape (batch_size, num_heads, 1, head_dim) - value: New value tensor, shape (batch_size, num_heads, 1, head_dim) - - EXAMPLE: - >>> cache = KVCache(batch_size=1, max_seq_len=10, num_layers=2, - ... 
num_heads=4, head_dim=64) - >>> new_k = Tensor(np.random.randn(1, 4, 1, 64)) - >>> new_v = Tensor(np.random.randn(1, 4, 1, 64)) - >>> cache.update(layer_idx=0, key=new_k, value=new_v) - >>> cache.seq_pos # Still 0 (update doesn't advance position) - >>> cache.advance() - >>> cache.seq_pos # Now 1 - - HINTS: - - Use slicing: cache[:, :, seq_pos:seq_pos+1, :] to write to position - - Use .data for direct NumPy access (no gradient tracking needed) - - Raise ValueError with helpful messages for invalid inputs - - This is an in-place operation (modifies cache, returns None) - - Raises: - ValueError: If layer_idx is out of range or sequence is full - """ - ### BEGIN SOLUTION - if layer_idx >= self.num_layers: - raise ValueError(f"Layer index {layer_idx} >= num_layers {self.num_layers}") - - if self.seq_pos >= self.max_seq_len: - raise ValueError(f"Sequence position {self.seq_pos} >= max_seq_len {self.max_seq_len}") - - # Get cache for this layer - key_cache, value_cache = self.caches[layer_idx] - - # Update cache at current position (efficient O(1) write) - # Note: We use .data here because caching is inference-only (no gradients needed) - # This avoids gradient tracking overhead during generation - key_cache.data[:, :, self.seq_pos:self.seq_pos+1, :] = key.data - value_cache.data[:, :, self.seq_pos:self.seq_pos+1, :] = value.data - - # Note: seq_pos is advanced externally via advance() after all layers process - ### END SOLUTION - - def get(self, layer_idx: int) -> Tuple[Tensor, Tensor]: - """ - Retrieve cached key-value pairs for attention computation. - - TODO: Return only the valid cached portion for this layer - - APPROACH: - 1. Validate layer_idx is in range - 2. Retrieve the (key_cache, value_cache) tuple for this layer - 3. Calculate valid_len = seq_pos (number of tokens currently cached) - 4. Slice key_cache to get [:, :, :valid_len, :] (only filled portion) - 5. Slice value_cache to get [:, :, :valid_len, :] (only filled portion) - 6. 
Wrap sliced data in new Tensor objects and return - - Returns only the valid portion of the cache (up to current seq_pos). - This is O(1) because we're just slicing NumPy arrays (view, not copy). - - IMPORTANT: Returns Tensors without gradient tracking since caching - is inference-only. The returned tensors can be used in attention - computation but won't propagate gradients backward. - - Args: - layer_idx: Which transformer layer to get cache for - - Returns: - (cached_keys, cached_values): Tensors shaped for attention - Keys: (batch_size, num_heads, seq_pos, head_dim) - Values: (batch_size, num_heads, seq_pos, head_dim) - - EXAMPLE: - >>> cache = KVCache(batch_size=1, max_seq_len=100, num_layers=2, - ... num_heads=4, head_dim=64) - >>> # After processing 3 tokens - >>> cache.seq_pos = 3 - >>> cached_k, cached_v = cache.get(layer_idx=0) - >>> cached_k.shape # (1, 4, 3, 64) - only first 3 positions - >>> cached_v.shape # (1, 4, 3, 64) - - HINTS: - - valid_len = self.seq_pos (how many tokens have been cached so far) - - Use slicing: cache.data[:, :, :valid_len, :] to get valid portion - - Wrap result in Tensor() for consistency with TinyTorch API - - If seq_pos=0, returns empty cache (shape with 0 in sequence dimension) - - Raises: - ValueError: If layer_idx is out of range - """ - ### BEGIN SOLUTION - if layer_idx >= self.num_layers: - raise ValueError(f"Layer index {layer_idx} >= num_layers {self.num_layers}") - - # Get cache for this layer - key_cache, value_cache = self.caches[layer_idx] - - # Return only the valid portion (up to current sequence position) - # seq_pos tracks where to write next, so we have seq_pos valid tokens - valid_len = self.seq_pos - - # Note: Creating new Tensors from .data (no gradient tracking) - # This is correct for inference-only caching - cached_keys = Tensor(key_cache.data[:, :, :valid_len, :]) - cached_values = Tensor(value_cache.data[:, :, :valid_len, :]) - - return cached_keys, cached_values - ### END SOLUTION - - def 
advance(self) -> None: - """ - Advance sequence position after processing current token. - - Call this after all layers have processed the current token and - updated their caches. This moves the write pointer forward. - """ - self.seq_pos += 1 - - def reset(self) -> None: - """ - Reset cache for new generation sequence. - - Call this when starting a new generation (new prompt). - Resets the sequence position counter and optionally zeros cache data. - """ - self.seq_pos = 0 - - # Zero out caches for clean state (helps with debugging) - for layer_idx in range(self.num_layers): - key_cache, value_cache = self.caches[layer_idx] - key_cache.data.fill(0.0) - value_cache.data.fill(0.0) - - def get_memory_usage(self) -> Dict[str, float]: - """ - Calculate memory usage of the cache system. - - Returns: - Dictionary with memory statistics in MB - """ - # Calculate size of one cache tensor - cache_size = self.batch_size * self.num_heads * self.max_seq_len * self.head_dim - bytes_per_float = 4 # float32 - - # Each layer has key_cache + value_cache - total_cache_tensors = self.num_layers * 2 - total_elements = cache_size * total_cache_tensors - total_bytes = total_elements * bytes_per_float - total_mb = total_bytes / (1024 * 1024) - - return { - 'total_mb': total_mb, - 'per_layer_mb': total_mb / self.num_layers, - 'cache_tensors': total_cache_tensors, - 'total_elements': total_elements - } - -# %% [markdown] -""" -### 🧪 Unit Test: KVCache Implementation - -Let's test that our cache correctly stores and retrieves key-value pairs across multiple layers and sequence positions. - -**This is a unit test** - it tests the KVCache class in isolation with simulated attention keys and values. 
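The mechanic under test is small enough to sketch in plain NumPy (a stand-in that mirrors the class above with toy shapes, not part of the module's exports): pre-allocate once, write at the current position, advance the pointer, then slice the valid prefix.

```python
import numpy as np

# Stand-in cache with toy shapes: batch=1, heads=2, max_seq=8, head_dim=4
k_cache = np.zeros((1, 2, 8, 4))
seq_pos = 0                                   # write pointer (cache.seq_pos)

new_k = np.ones((1, 2, 1, 4))                 # simulated key for one new token
k_cache[:, :, seq_pos:seq_pos+1, :] = new_k   # update(): O(1) indexed write
seq_pos += 1                                  # advance(): move the pointer

valid = k_cache[:, :, :seq_pos, :]            # get(): slice the filled prefix
print(valid.shape)                            # (1, 2, 1, 4)
```

Note that the slice is a NumPy view, not a copy - that is why `get()` is O(1).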
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-kvcache", "locked": true, "points": 10} -def test_unit_kvcache(): - """🔬 Unit Test: KVCache Implementation""" - print("🔬 Unit Test: KVCache Implementation...") - - # Test parameters (small transformer for testing) - batch_size, max_seq_len = 2, 8 - num_layers, num_heads, head_dim = 3, 4, 16 - - # Create cache - cache = KVCache(batch_size, max_seq_len, num_layers, num_heads, head_dim) - - # Test 1: Initial state - assert cache.seq_pos == 0, "Cache should start at position 0" - mem_usage = cache.get_memory_usage() - assert mem_usage['total_mb'] > 0, "Cache should have non-zero memory usage" - print(f" Cache initialized: {mem_usage['total_mb']:.2f} MB") - - # Test 2: Single token update and retrieval - key1 = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim)) - value1 = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim)) - - # Update layer 0 with first token - cache.update(0, key1, value1) - - # Before advance, get() should return empty (seq_pos=0) - cached_k, cached_v = cache.get(0) - assert cached_k.shape == (batch_size, num_heads, 0, head_dim), "Before advance, cache should be empty" - - # Advance position - cache.advance() - - # Now cache should have 1 token - cached_k, cached_v = cache.get(0) - assert cached_k.shape == (batch_size, num_heads, 1, head_dim), f"Expected shape (2,4,1,16), got {cached_k.shape}" - assert cached_v.shape == (batch_size, num_heads, 1, head_dim), f"Expected shape (2,4,1,16), got {cached_v.shape}" - - # Test 3: Multi-token sequence - key2 = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim)) - value2 = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim)) - cache.update(0, key2, value2) - cache.advance() - - cached_k, cached_v = cache.get(0) - assert cached_k.shape == (batch_size, num_heads, 2, head_dim), "Should have 2 tokens cached" - assert cached_v.shape == (batch_size, num_heads, 2, head_dim), "Should have 2 tokens cached" - - # Test 4: Multiple 
layers - cache.reset() - key_test = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim)) - value_test = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim)) - - # Update all layers with same token - cache.update(0, key_test, value_test) # Layer 0 - cache.update(1, key_test, value_test) # Layer 1 - cache.update(2, key_test, value_test) # Layer 2 - cache.advance() - - # Each layer should have the cached token - for layer_idx in range(num_layers): - cached_k, cached_v = cache.get(layer_idx) - assert cached_k.shape[2] == 1, f"Layer {layer_idx} should have 1 token" - - # Test 5: Reset functionality - cache.reset() - assert cache.seq_pos == 0, "Reset should clear sequence position" - cached_k, cached_v = cache.get(0) - assert cached_k.shape == (batch_size, num_heads, 0, head_dim), "Reset should clear cache" - - print("✅ KVCache implementation works correctly!") - -# Run test immediately when developing this module -if __name__ == "__main__": - test_unit_kvcache() - -# %% [markdown] -""" -## 🎯 Part 4: Enabling KV Caching for Model Generation - -### Integration Strategy - -Now we need a clean way to enable KV caching in our existing transformer models without breaking the existing code. We'll create an `enable_kv_cache()` function that: - -1. Creates a KVCache instance sized for the model -2. Returns a flag to indicate caching is enabled -3. Can be called before generation starts - -The actual integration with attention will happen in the milestone code where we: -1. Check if cache is enabled -2. Only compute K,V for new token (not all tokens) -3. Update cache with new K,V -4. Use cached K,V for attention computation - -### Generation Flow Comparison - -``` -Without Cache (Current): -for each new token: - input_seq = [all tokens so far] # Length grows: 1, 2, 3, ... - logits = model.forward(input_seq) # Recomputes everything! 
- next_token = sample(logits[-1]) - append next_token - -With Cache (New): -cache = enable_kv_cache(model) -for each new token: - input_token = [just new token] # Length always 1 - logits = model.forward_cached(input_token, cache) # Only new computation - next_token = sample(logits[-1]) - append next_token -``` - -**Key Difference**: Input changes from growing sequence to single token, with cache providing history. -""" - -# %% -#| export -def enable_kv_cache(batch_size: int, max_seq_len: int, num_layers: int, - num_heads: int, head_dim: int) -> KVCache: - """ - Create and return a KVCache instance for model generation. - - This function creates a properly sized cache for the model architecture. - Call this before starting generation, then pass the cache to your - generation loop. - - Args: - batch_size: Number of sequences to generate simultaneously - max_seq_len: Maximum sequence length to support - num_layers: Number of transformer layers in model - num_heads: Number of attention heads per layer - head_dim: Dimension per attention head (usually embed_dim // num_heads) - - Returns: - KVCache instance ready for use - - Example: - ```python - # Enable caching for generation - cache = enable_kv_cache( - batch_size=1, - max_seq_len=100, - num_layers=4, - num_heads=4, - head_dim=32 - ) - - # Use in generation loop (pseudocode) - for step in range(max_new_tokens): - # Only process new token with cache - logits = model.forward_cached(new_token, cache) - next_token = sample(logits) - ``` - """ - cache = KVCache(batch_size, max_seq_len, num_layers, num_heads, head_dim) - - print(f"⚡ KV Cache enabled:") - print(f" Batch size: {batch_size}") - print(f" Max sequence: {max_seq_len}") - print(f" Layers: {num_layers}") - print(f" Heads: {num_heads}") - print(f" Head dim: {head_dim}") - - mem_info = cache.get_memory_usage() - print(f" Memory: {mem_info['total_mb']:.2f} MB") - print() - - return cache - -# %% [markdown] -""" -### 🧪 Unit Test: Cache Enablement - -Let's verify that 
we can create caches for realistic model configurations. - -**This is a unit test** - it tests the cache creation and memory calculation for different model sizes. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-cache-enablement", "locked": true, "points": 10} -def test_unit_cache_enablement(): - """🔬 Unit Test: Cache Enablement for Different Models""" - print("🔬 Unit Test: Cache Enablement for Different Models...") - - # Test 1: Small model (fast generation) - print(" Test 1: Small Model (Tiny Transformer)") - cache_small = KVCache( - batch_size=1, - max_seq_len=64, - num_layers=2, - num_heads=4, - head_dim=32 - ) - mem_small = cache_small.get_memory_usage() - assert mem_small['total_mb'] < 1.0, "Small model should use < 1 MB" - print(f" Small model cache: {mem_small['total_mb']:.3f} MB") - - # Test 2: Medium model (balanced performance) - print(" Test 2: Medium Model (Standard Transformer)") - cache_medium = KVCache( - batch_size=1, - max_seq_len=128, - num_layers=4, - num_heads=8, - head_dim=64 - ) - mem_medium = cache_medium.get_memory_usage() - assert 1.0 < mem_medium['total_mb'] < 10.0, "Medium model should use 1-10 MB" - print(f" Medium model cache: {mem_medium['total_mb']:.3f} MB") - - # Test 3: Batch inference (multiple sequences) - print(" Test 3: Batch Inference (4 sequences)") - cache_batch = KVCache( - batch_size=4, # Generate 4 sequences in parallel - max_seq_len=64, - num_layers=2, - num_heads=4, - head_dim=32 - ) - mem_batch = cache_batch.get_memory_usage() - assert mem_batch['total_mb'] > mem_small['total_mb'], "Batch cache should be larger" - print(f" Batch cache: {mem_batch['total_mb']:.3f} MB (4x batch size)") - - print("✅ Cache enablement works correctly!") - -# Run test immediately when developing this module -if __name__ == "__main__": - test_unit_cache_enablement() - -# %% [markdown] -""" -## 🎯 Part 5: Using KV Cache in Practice - -### Practical Integration Checklist - -To use KV caching in your transformer generation: - -**✅ Before 
Generation:**
1. Create cache with `enable_kv_cache()`
2. Set cache dimensions to match your model architecture
3. Verify memory usage is acceptable

**✅ During Generation (Modified Forward Pass):**
1. For the first token (prompt), process normally and populate cache
2. For subsequent tokens:
   - Only process the NEW token (not entire sequence)
   - Update cache with new K,V pairs
   - Retrieve full cached K,V for attention
   - Use cached values in attention computation
   - Advance cache position after all layers

**✅ After Generation:**
1. Reset cache if generating another sequence
2. Monitor memory usage for production deployment

### Performance Expectations

```
Expected Speedup by Sequence Length:
┌───────────┬──────────┬───────────┬──────────┐
│ Seq Len   │ No Cache │ With Cache│ Speedup  │
├───────────┼──────────┼───────────┼──────────┤
│  10 tokens│ ~80 tok/s│ ~600 tok/s│   7.5x   │
│  25 tokens│ ~40 tok/s│ ~500 tok/s│  12.5x   │
│  50 tokens│ ~25 tok/s│ ~400 tok/s│  16.0x   │
│ 100 tokens│ ~12 tok/s│ ~200 tok/s│  16.7x   │
└───────────┴──────────┴───────────┴──────────┘

Key Insight: Speedup increases with sequence length!
Why? Longer sequences = more redundant computation without cache.
```

### Production Considerations

**Memory Management:**
- Cache memory = `2 × batch_size × num_layers × num_heads × max_seq_len × head_dim × 4 bytes` (the factor of 2 covers the separate key and value caches, just as `get_memory_usage()` counts `num_layers × 2` tensors)
- For GPT-2 (12 layers, 12 heads, seq_len=1024, head_dim=64): ~72 MB per sequence at float32
- For GPT-3 (96 layers, 96 heads, seq_len=2048, head_dim=128): ~18 GB per sequence at float32 (half that in fp16)

**Trade-off Analysis:**
- **10x+ speedup** for typical generation lengths (50-200 tokens)
- **Modest memory cost** compared to model weights, and the ratio shrinks as models grow (~15% of GPT-2's fp32 weights, under 3% of GPT-3's)
- **Enables real-time interaction** that's impossible without caching

**Best Practices:**
1. Always use caching for production serving
2. Tune `max_seq_len` to expected generation length (don't over-allocate)
3. Consider batch inference to amortize model loading costs
4.
Monitor cache memory usage in production
"""

# %% [markdown]
"""
## 🎯 Part 6: Non-Invasive Integration with Existing Models

### The Challenge

We built KV caching in Module 14, but our transformer (Modules 12-13) doesn't know about it!

**❌ BAD Solution**: Go back and modify Module 12 (MultiHeadAttention)
- Breaks "forward-only" learning (students shouldn't revisit old modules)
- Makes Module 12 depend on Module 14 (wrong dependency direction!)
- Violates clean module boundaries

**✅ GOOD Solution**: Module 14 ADDS caching to existing models without modification!
- Use composition + monkey-patching (like `enable_autograd()`)
- Module 14 wraps/enhances Module 12, not modifies it
- Students learn systems engineering: "Add capabilities, don't break old code"

### Implementation Strategy

We'll create `enable_kv_cache(model)` that:
1. Creates cache for the model's architecture
2. Wraps each attention layer with caching logic
3. Intercepts attention calls and manages cache automatically
4. Returns the cache for manual control if needed

Because it reuses the same name, this model-aware version supersedes the configuration-based `enable_kv_cache()` helper from Part 4.

This is **non-invasive enhancement** - a critical ML systems pattern!
"""

# %% nbgrader={"grade": false, "grade_id": "enable-kv-cache", "solution": true}
#| export
def enable_kv_cache(model):
    """
    Enable KV caching for a transformer model WITHOUT modifying Module 12/13 code.

    TODO: Create cache and non-invasively patch attention layers

    APPROACH:
    1. Validate model has required attributes (embed_dim, num_layers, num_heads, max_seq_len, blocks)
    2. Calculate head_dim from embed_dim and num_heads
    3. Create KVCache instance sized for this model's architecture
    4. Store cache on model as model._kv_cache and set model._cache_enabled flag
    5. For each transformer block, wrap its attention forward method with caching logic
    6. Print confirmation message with cache statistics
    7.
Return the cache object - - This function demonstrates **non-invasive optimization** - adding capabilities - to existing systems without breaking them. Similar to how Module 05 (Autograd) - uses enable_autograd() to add gradient tracking to Tensors. - - Args: - model: A GPT-style transformer model with: - - model.embed_dim (int) - - model.num_layers (int) - - model.num_heads (int) - - model.max_seq_len (int) - - model.blocks (list of TransformerBlock objects) - - Returns: - cache: KVCache object for this model - - EXAMPLE: - >>> from tinytorch.models.transformer import GPT - >>> model = GPT(vocab_size=100, embed_dim=128, num_layers=4, num_heads=4) - >>> cache = enable_kv_cache(model) - >>> hasattr(model, '_kv_cache') # True - >>> model._cache_enabled # True - >>> cache.num_layers # 4 (matches model) - - HINTS: - - Use hasattr() to validate model attributes exist - - head_dim = model.embed_dim // model.num_heads - - Store cache on model with model._kv_cache = cache - - Set flag with model._cache_enabled = True - - Save original forward with block._original_attention_forward - - Use a factory function to create patched forwards (closure captures layer_idx) - - Pedagogical Note: - This teaches students that optimizations can be LAYERED on top of - working systems. Module 14 doesn't break Modules 12-13; it enhances them! 
- """ - ### BEGIN SOLUTION - import types - - # Validate model has required attributes - required_attrs = ['embed_dim', 'num_layers', 'num_heads', 'max_seq_len', 'blocks'] - for attr in required_attrs: - if not hasattr(model, attr): - raise AttributeError( - f"Model missing '{attr}' - enable_kv_cache() requires a GPT-style model " - f"with {', '.join(required_attrs)}" - ) - - # Calculate head dimension - head_dim = model.embed_dim // model.num_heads - if model.embed_dim % model.num_heads != 0: - raise ValueError( - f"embed_dim ({model.embed_dim}) must be divisible by num_heads ({model.num_heads})" - ) - - # Create cache for this model - cache = KVCache( - batch_size=1, # Default to single sequence; can be reset for batch inference - max_seq_len=model.max_seq_len, - num_layers=model.num_layers, - num_heads=model.num_heads, - head_dim=head_dim - ) - - # Store cache on model for easy access - model._kv_cache = cache - model._cache_enabled = True - - # Patch each transformer block's attention - for layer_idx, block in enumerate(model.blocks): - # Store original attention forward method - if not hasattr(block, '_original_attention_forward'): - block._original_attention_forward = block.attention.forward - - # Create cached version - def make_cached_forward(layer_idx, original_forward, cache_obj): - """Factory to create cached forward with correct layer_idx closure""" - def cached_forward(x, mask=None): - """ - Cached attention forward pass with REAL speedup! 
- - PATH SELECTION STRATEGY (Key to Understanding KV Caching): - ────────────────────────────────────────────────────────── - - We have THREE possible paths through attention: - - 1️⃣ TRAINING PATH (seq_len > 1): - - Input: Full sequence of tokens (e.g., 64 tokens) - - Action: Use ORIGINAL attention (no caching) - - Why: Need full gradient flow for backpropagation - - Complexity: O(n²) but that's fine for training - - Example: x.shape = (batch=1, seq=64, embed=128) - - 2️⃣ FIRST TOKEN PATH (seq_len == 1 AND cache empty): - - Input: Single token (the first one in generation) - - Action: Use ORIGINAL attention (initialize cache) - - Why: Cache is empty, nothing to retrieve yet - - Complexity: O(1) - only one token - - Example: x.shape = (batch=1, seq=1, embed=128), cache.seq_pos=0 - - 3️⃣ CACHED GENERATION PATH (seq_len == 1 AND cache populated): - - Input: Single NEW token (during generation) - - Action: Compute K,V for new token ONLY, retrieve history from cache - - Why: This is where the speedup happens! O(n²) → O(n) - - Complexity: O(n) - only compute for new token, reuse cache - - Example: x.shape = (batch=1, seq=1, embed=128), cache.seq_pos=5 - - - WHY .data INSTEAD OF TENSOR OPERATIONS? - ──────────────────────────────────────── - - In the cached path, we use numpy via .data for three reasons: - - 1. **Explicit Intent**: Makes it crystal clear this is inference-only - - Training: Uses Tensor operations → gradients tracked - - Inference: Uses .data → no gradient overhead - - 2. **Performance**: Avoids any autograd bookkeeping - - Even if small, every bit counts in generation - - Production LLMs (vLLM, llama.cpp) use similar patterns - - 3. **Educational Clarity**: Shows students the distinction - - "When do I need gradients?" (training) - - "When can I skip them?" (inference) - - We COULD use Tensor operations with requires_grad=False, but .data - is more explicit and is the industry-standard pattern. 

                THE O(n²) → O(n) TRANSFORMATION:
                ─────────────────────────────────

                WITHOUT Cache (Standard Attention):
                Step 1: Process token 1    → Compute attention for 1 token (1² = 1 op)
                Step 2: Process tokens 1-2 → Compute attention for 2 tokens (2² = 4 ops)
                Step 3: Process tokens 1-3 → Compute attention for 3 tokens (3² = 9 ops)
                ...
                Step N: Process tokens 1-N → Compute attention for N tokens (N² ops)

                Total: 1 + 4 + 9 + ... + N² = O(N³) across all steps!

                WITH Cache (Our Implementation):
                Step 1: Process token 1 → Compute K,V for token 1, cache it (1 op)
                Step 2: Process token 2 → Compute K,V for token 2, retrieve 1 (2 ops)
                Step 3: Process token 3 → Compute K,V for token 3, retrieve 1-2 (3 ops)
                ...
                Step N: Process token N → Compute K,V for token N, retrieve 1-(N-1) (N ops)

                Total: 1 + 2 + 3 + ... + N = O(N²) across all steps!

                That's why we see 5-7x speedup on short sequences, and 10-15x on longer ones!
                """
                from tinytorch.core.tensor import Tensor
                import numpy as np

                seq_len = x.shape[1]

                # ═══════════════════════════════════════════════════════════════
                # PATH SELECTION: Choose between training, first token, or cached
                # ═══════════════════════════════════════════════════════════════

                # PATH 1: TRAINING (seq_len > 1)
                # ───────────────────────────────
                # Input is a full sequence (e.g., 64 tokens during training)
                # We MUST use original attention to preserve gradient flow
                # No caching during training - we need backprop through everything
                if seq_len > 1:
                    return original_forward(x, mask)  # O(n²) but preserves gradients

                # PATH 2: FIRST TOKEN (seq_len == 1, cache empty)
                # ────────────────────────────────────────────────
                # This is the very first token in generation (cache.seq_pos == 0)
                # There is no history to retrieve yet, but we must still write
                # this token's K,V into the cache so later cached steps can
                # attend back to it - then defer to the original attention for
                # the actual output. (Look up the layer via model.blocks[layer_idx];
                # the loop variable `block` is late-bound and would point at the
                # LAST block by the time generation runs.)
                if cache_obj.seq_pos == 0:
                    attention = model.blocks[layer_idx].attention
                    K_first = attention.k_proj.forward(x)  # (batch, 1, embed_dim)
                    V_first = attention.v_proj.forward(x)
                    b, nh, hd = x.shape[0], attention.num_heads, attention.head_dim
                    K_h = Tensor(np.transpose(K_first.reshape(b, 1, nh, hd).data, (0, 2, 1, 3)))
                    V_h = Tensor(np.transpose(V_first.reshape(b, 1, nh, hd).data, (0, 2, 1, 3)))
                    cache_obj.update(layer_idx, K_h, V_h)
                    return original_forward(x, mask)  # O(1) - just one token

                # PATH 3: CACHED GENERATION (seq_len == 1, cache populated)
                # ──────────────────────────────────────────────────────────
                # This is a NEW token during generation (cache has history)
                # We can now use the cache for massive speedup!
                # Compute K,V for ONLY this new token, retrieve cached history

                # Get the attention layer for THIS layer. Do not read the loop
                # variable `block` here: closures late-bind it, so by the time
                # generation runs it would point at the LAST block for every
                # layer. layer_idx is a factory parameter, so it binds safely.
                attention = model.blocks[layer_idx].attention

                # Step 1: Compute Q, K, V for NEW token only
                # Access the linear projection layers
                Q_new = attention.q_proj.forward(x)  # (batch, 1, embed_dim)
                K_new = attention.k_proj.forward(x)  # (batch, 1, embed_dim)
                V_new = attention.v_proj.forward(x)  # (batch, 1, embed_dim)

                # Step 2: Reshape to multi-head format
                batch_size = x.shape[0]
                num_heads = attention.num_heads
                head_dim = attention.head_dim

                # Reshape: (batch, 1, embed_dim) → (batch, num_heads, 1, head_dim)
                Q_heads = Q_new.reshape(batch_size, 1, num_heads, head_dim)
                Q_heads = Tensor(np.transpose(Q_heads.data, (0, 2, 1, 3)))  # (batch, num_heads, 1, head_dim)

                K_heads = K_new.reshape(batch_size, 1, num_heads, head_dim)
                K_heads = Tensor(np.transpose(K_heads.data, (0, 2, 1, 3)))

                V_heads = V_new.reshape(batch_size, 1, num_heads, head_dim)
                V_heads = Tensor(np.transpose(V_heads.data, (0, 2, 1, 3)))

                # Step 3: Update cache with new K, V (written at position seq_pos)
                cache_obj.update(layer_idx, K_heads, V_heads)

                # Step 4: Retrieve ALL cached K, V (history + the token just written)
                # cache_obj.get() slices up to seq_pos, which still EXCLUDES the
                # token written above (advance() only runs after all layers), so
                # slice one past seq_pos - the new token must attend to itself.
                upto = cache_obj.seq_pos + 1
                layer_k, layer_v = cache_obj.caches[layer_idx]
                K_all = Tensor(layer_k.data[:, :, :upto, :])
                V_all = Tensor(layer_v.data[:, :, :upto, :])

                # Step 5: Compute attention using new Q with ALL cached K, V
                # ─────────────────────────────────────────────────────────
                # Scaled dot-product attention: softmax(Q @ K^T / sqrt(d_k)) @ V
                #
                # NOTE: We use .data (numpy arrays) here instead of Tensor operations
                # Why?
This is INFERENCE-ONLY code (no gradients needed): - # - Explicit: Makes it clear this is inference, not training - # - Fast: Avoids autograd overhead (even if small) - # - Standard: Production LLMs (vLLM, llama.cpp) do the same - # - # If this were training, we'd use Tensor operations for gradient flow. - # But in generation (inference), .data is the right choice. - - # Q @ K^T: (batch, num_heads, 1, head_dim) @ (batch, num_heads, head_dim, seq_len) - # → (batch, num_heads, 1, seq_len) - K_transposed = np.transpose(K_all.data, (0, 1, 3, 2)) # .data = numpy array - scores = np.matmul(Q_heads.data, K_transposed) # Pure numpy matmul - - # Scale by sqrt(head_dim) - scores = scores / np.sqrt(head_dim) - - # Apply mask if provided (causal mask for generation) - if mask is not None: - # Mask should be (1, 1, 1, seq_len) for this token - # In generation, we can attend to all previous tokens - pass # No masking needed in generation (we see all history) - - # Softmax over key dimension - scores_max = np.max(scores, axis=-1, keepdims=True) - exp_scores = np.exp(scores - scores_max) - attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True) - - # Apply attention weights to values - # (batch, num_heads, 1, seq_len) @ (batch, num_heads, seq_len, head_dim) - # → (batch, num_heads, 1, head_dim) - attention_output = np.matmul(attention_weights, V_all.data) - - # Step 6: Reshape back and apply output projection - # (batch, num_heads, 1, head_dim) → (batch, 1, num_heads, head_dim) - attention_output_transposed = np.transpose(attention_output, (0, 2, 1, 3)) - - # Concatenate heads: (batch, 1, num_heads * head_dim) - concat_data = attention_output_transposed.reshape(batch_size, 1, num_heads * head_dim) - concat_output = Tensor(concat_data) - - # Output projection - output = attention.out_proj.forward(concat_output) - - return output - - return cached_forward - - # Patch this block's attention - block.attention.forward = make_cached_forward(layer_idx, 
block._original_attention_forward, cache) - - print(f"⚡ KV Cache enabled for model!") - print(f" Architecture: {model.num_layers} layers × {model.num_heads} heads × {head_dim}D") - print(f" Memory: {cache.get_memory_usage()['total_mb']:.2f} MB") - print(f" Cache stored in: model._kv_cache") - print() - print(f"💡 To disable: call disable_kv_cache(model)") - print() - - return cache - ### END SOLUTION - - -#| export -def disable_kv_cache(model): - """ - Disable KV caching and restore original attention behavior. - - Args: - model: Model with caching enabled - - Example: - ```python - cache = enable_kv_cache(model) - # ... do cached generation ... - disable_kv_cache(model) # Back to normal - ``` - """ - if not hasattr(model, '_cache_enabled') or not model._cache_enabled: - print("⚠️ KV cache not enabled on this model") - return - - # Restore original attention forwards - for block in model.blocks: - if hasattr(block, '_original_attention_forward'): - block.attention.forward = block._original_attention_forward - - # Clean up - model._cache_enabled = False - if hasattr(model, '_kv_cache'): - delattr(model, '_kv_cache') - - print("✓ KV cache disabled, original attention restored") - - -# %% [markdown] -""" -### 🧪 Unit Test: Non-Invasive Cache Integration - -Let's verify that `enable_kv_cache()` works without breaking the model! - -**This is an integration test** - it tests Module 14 enhancing Modules 12-13 without modification. 
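The enable/disable cycle this test exercises is ordinary method patching: save a handle to the original, install a wrapper, restore on disable. A stripped-down sketch (the `Attention` class and its doubling behavior are invented purely for illustration):

```python
class Attention:
    def forward(self, x, mask=None):
        return x * 2

attn = Attention()
original = attn.forward              # keep a handle to the bound original method

def cached_forward(x, mask=None):
    # a real wrapper would consult the KV cache here before deferring
    return original(x, mask)

attn.forward = cached_forward        # enable: install the wrapper
patched = attn.forward(3)            # routed through the wrapper → 6

attn.forward = original              # disable: restore the saved original
restored = attn.forward(3)           # unpatched behavior again → 6
```

Behavior is identical from the caller's point of view in both states - exactly the property the test below checks for `enable_kv_cache()` / `disable_kv_cache()`.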
"""

# %% nbgrader={"grade": true, "grade_id": "test-noninvasive", "locked": true, "points": 10}
def test_unit_noninvasive_integration():
    """🔬 Unit Test: Non-Invasive Cache Integration"""
    print("🔬 Unit Test: Non-Invasive Cache Integration...")

    # Create a mock transformer-like object for testing
    class MockTransformerBlock:
        def __init__(self):
            self.attention = self

        def forward(self, x, mask=None):
            # Simple pass-through for testing. Must accept a mask argument:
            # the cached wrapper installed by enable_kv_cache() calls
            # forward(x, mask), so a 1-argument signature would raise TypeError.
            return x

    class MockGPT:
        def __init__(self):
            self.vocab_size = 100
            self.embed_dim = 128
            self.num_layers = 4
            self.num_heads = 4
            self.max_seq_len = 64
            self.blocks = [MockTransformerBlock() for _ in range(self.num_layers)]

    # Test 1: Enable caching
    model = MockGPT()
    print("  Test 1: Enable caching on model")
    cache = enable_kv_cache(model)
    assert hasattr(model, '_kv_cache'), "Model should have _kv_cache attribute"
    assert hasattr(model, '_cache_enabled'), "Model should have _cache_enabled flag"
    assert model._cache_enabled == True, "Cache should be enabled"
    assert cache is model._kv_cache, "Returned cache should match model._kv_cache"

    # Test 2: Attention forward still works (seq_len > 1 takes the training path)
    print("  Test 2: Attention forward pass still works")
    test_input = Tensor(np.random.randn(1, 10, 128))
    for block in model.blocks:
        output = block.attention.forward(test_input)
        assert output.shape == test_input.shape, "Forward pass should preserve shape"

    # Test 3: Disable caching
    print("  Test 3: Disable caching")
    disable_kv_cache(model)
    assert model._cache_enabled == False, "Cache should be disabled"
    assert not hasattr(model, '_kv_cache'), "Cache object should be removed"

    # Test 4: Can re-enable
    print("  Test 4: Re-enable caching")
    _ = enable_kv_cache(model)
    assert model._cache_enabled == True, "Cache should be re-enabled"

    print("✅ Non-invasive cache integration works correctly!")

# Run test immediately when developing this module
if __name__ == "__main__":
test_unit_noninvasive_integration() - - -# %% [markdown] -""" -## 🧪 Module Integration Test - -Final validation that everything works together correctly before module completion. -""" - -# %% nbgrader={"grade": true, "grade_id": "module-integration", "locked": true, "points": 20} -def test_module(): - """ - Comprehensive test of entire KV Caching module functionality. - - This final test runs before module summary to ensure: - - All unit tests pass - - Functions work together correctly - - Module is ready for integration with TinyTorch - """ - print("🧪 RUNNING MODULE INTEGRATION TEST") - print("=" * 50) - print() - - # Run all unit tests - print("Running unit tests...") - test_unit_kvcache() - print() - test_unit_cache_enablement() - print() - test_unit_noninvasive_integration() - print() - - print("Running integration scenarios...") - print() - - # Integration Test: Complete KV Cache Workflow - print("🔬 Integration Test: Complete KV Cache Workflow...") - batch_size, max_seq_len = 1, 128 - num_layers, num_heads, head_dim = 4, 8, 64 - - cache = KVCache(batch_size, max_seq_len, num_layers, num_heads, head_dim) - - # Simulate generation loop (processing multiple tokens) - for _ in range(5): - for layer_idx in range(num_layers): - # Simulate new key-value pairs - new_key = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim)) - new_value = Tensor(np.random.randn(batch_size, num_heads, 1, head_dim)) - - # Update cache - cache.update(layer_idx, new_key, new_value) - - # Advance position after all layers processed - cache.advance() - - # Verify cache state - assert cache.seq_pos == 5, f"Expected seq_pos=5, got {cache.seq_pos}" - - # Verify retrieval - for layer_idx in range(num_layers): - cached_k, cached_v = cache.get(layer_idx) - assert cached_k.shape == (batch_size, num_heads, 5, head_dim) - assert cached_v.shape == (batch_size, num_heads, 5, head_dim) - - print("✅ Complete KV cache workflow validated!") - print() - - # Integration Test: Memory Tracking - 
print("🔬 Integration Test: Memory Tracking...") - mem_info = cache.get_memory_usage() - assert mem_info['total_mb'] > 0 - assert mem_info['cache_tensors'] == num_layers * 2 - print(f"✅ Memory tracking: {mem_info['total_mb']:.2f} MB for {mem_info['cache_tensors']} tensors") - print() - - print("=" * 50) - print("🎉 ALL TESTS PASSED! Module ready for export.") - print("Run: tito module complete 14") - -# %% -if __name__ == "__main__": - test_module() - - -# %% [markdown] -""" -## 🎓 Module 14 Complete! - -You've implemented KV caching - the critical optimization that makes production language models economically viable! - -### What You Built - -✅ **KVCache Class**: Efficient memory management for key-value pairs across layers -✅ **O(1) Updates**: Fast cache updates without data copying -✅ **Memory Tracking**: Understanding cache size and memory trade-offs -✅ **Non-Invasive Integration**: `enable_kv_cache()` adds optimization WITHOUT breaking modules -✅ **Production Patterns**: Integration strategy for real transformer models - -### Key Systems Engineering Lesson - -**Module 14 doesn't modify Modules 12-13 - it ENHANCES them!** - -This teaches the critical principle: **Add capabilities forward, never break backward.** -- Old code keeps working (Module 12 unchanged) -- New code adds optimization (Module 14 layers on top) -- Clean separation of concerns (caching is separate from attention logic) - -### Performance Impact - -``` -Without Cache: O(n²) complexity → slow, expensive, impractical -With Cache: O(n) complexity → fast, cheap, production-ready - -Real Impact: 10-15x speedup for typical generation! -``` - -### What's Next - -**Module 15 (Profiling)**: Now that you've seen a concrete optimization, learn how to systematically measure and find more optimizations using professional profiling tools. 
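
The O(n²)-vs-O(n) claim above can be checked with quick arithmetic. The sketch below counts only attention-score entries computed when emitting one new token at position `t`; the measured 10-15x end-to-end speedup is smaller than this ratio because memory traffic and non-attention layers also cost time.

```python
# Back-of-envelope check of the O(n^2) vs O(n) claim: score entries
# computed at generation step t, with and without a KV cache.

def scores_at_step(t, use_cache):
    if use_cache:
        return t        # 1 new query against t cached keys
    return t * t        # recompute all t queries against all t keys

t = 512
ratio = scores_at_step(t, use_cache=False) / scores_at_step(t, use_cache=True)
print(f"Token {t}: {ratio:.0f}x fewer score computations with the cache")  # 512x
```
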
- -### Try It Yourself - -Run the chatbot milestone with and without caching: - -```bash -# Without cache (slow - baseline) -python milestones/05_2017_transformer/vaswani_chatgpt.py - -# With cache (fast - 10-15x speedup!) -python milestones/05_2017_transformer/vaswani_chatgpt.py --use-cache -``` - -Watch the tokens/sec metric jump from ~40 to ~500! 🚀 - ---- - -**Congratulations! You've completed Module 14: KV Caching!** - -You now understand the optimization that makes ChatGPT, Claude, and all production LLMs possible. This is THE technique that transformed language models from research toys into products used by millions of people every day. - -**From Theory to Practice**: You've gone from O(n²) naive generation to O(n) optimized generation. This is real ML engineering! -""" diff --git a/modules/source/15_profiling/profiling_dev.ipynb b/modules/source/15_profiling/profiling_dev.ipynb deleted file mode 100644 index f08cdde1..00000000 --- a/modules/source/15_profiling/profiling_dev.ipynb +++ /dev/null @@ -1,1989 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "78d24362", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Module 15: Profiling - Measuring What Matters in ML Systems\n", - "\n", - "Welcome to Module 15! You'll build professional profiling tools to measure model performance and uncover optimization opportunities.\n", - "\n", - "## 🔗 Prerequisites & Progress\n", - "**You've Built**: Complete ML stack from tensors to transformers with KV caching\n", - "**You'll Build**: Comprehensive profiling system for parameters, FLOPs, memory, and latency\n", - "**You'll Enable**: Data-driven optimization decisions and performance analysis\n", - "\n", - "**Connection Map**:\n", - "```\n", - "All Modules → Profiling → Acceleration (Module 16)\n", - "(implementations) (measurement) (optimization)\n", - "```\n", - "\n", - "## Learning Objectives\n", - "By the end of this module, you will:\n", - "1. 
Implement a complete Profiler class for model analysis\n", - "2. Count parameters and FLOPs accurately for different architectures\n", - "3. Measure memory usage and latency with statistical rigor\n", - "4. Create production-quality performance analysis tools\n", - "\n", - "Let's build the measurement foundation for ML systems optimization!\n", - "\n", - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/15_profiling/profiling_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.profiling.profiler`\n", - "\n", - "```python\n", - "# How to use this module:\n", - "from tinytorch.profiling.profiler import Profiler, profile_forward_pass, profile_backward_pass\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Complete profiling system for understanding model performance characteristics\n", - "- **Production:** Professional measurement tools like those used in PyTorch, TensorFlow\n", - "- **Consistency:** All profiling and measurement tools in profiling.profiler\n", - "- **Integration:** Works with any model built using TinyTorch components" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f622ef61", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "imports", - "solution": true - } - }, - "outputs": [], - "source": [ - "#| default_exp profiling.profiler\n", - "#| export\n", - "\n", - "import time\n", - "import numpy as np\n", - "import tracemalloc\n", - "from typing import Dict, List, Any, Optional, Tuple\n", - "from collections import defaultdict\n", - "import gc\n", - "\n", - "# Import our TinyTorch components for profiling\n", - "from tinytorch.core.tensor import Tensor\n", - "from tinytorch.core.layers import Linear\n", - "from tinytorch.core.spatial import Conv2d" - ] - }, - { - "cell_type": "markdown", - "id": "ae7455a2", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 1. 
Introduction: Why Profiling Matters in ML Systems\n", - "\n", - "Imagine you're a detective investigating a performance crime. Your model is running slowly, using too much memory, or burning through compute budgets. Without profiling, you're flying blind - making guesses about what to optimize. With profiling, you have evidence.\n", - "\n", - "**The Performance Investigation Process:**\n", - "```\n", - "Suspect Model → Profile Evidence → Identify Bottleneck → Target Optimization\n", - " ↓ ↓ ↓ ↓\n", - " \"Too slow\" \"200 GFLOP/s\" \"Memory bound\" \"Reduce transfers\"\n", - "```\n", - "\n", - "**Questions Profiling Answers:**\n", - "- **How many parameters?** (Memory footprint, model size)\n", - "- **How many FLOPs?** (Computational cost, energy usage)\n", - "- **Where are bottlenecks?** (Memory vs compute bound)\n", - "- **What's actual latency?** (Real-world performance)\n", - "\n", - "**Production Importance:**\n", - "In production ML systems, profiling isn't optional - it's survival. A model that's 10% more accurate but 100× slower often can't be deployed. Teams use profiling daily to make data-driven optimization decisions, not guesses.\n", - "\n", - "### The Profiling Workflow Visualization\n", - "```\n", - "Model → Profiler → Measurements → Analysis → Optimization Decision\n", - " ↓ ↓ ↓ ↓ ↓\n", - " GPT Parameter 125M params Memory Use quantization\n", - " Counter 2.5B FLOPs bound Reduce precision\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "85ee0680", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 2. Foundations: Performance Measurement Principles\n", - "\n", - "Before we build our profiler, let's understand what we're measuring and why each metric matters.\n", - "\n", - "### Parameter Counting - Model Size Detective Work\n", - "\n", - "Parameters determine your model's memory footprint and storage requirements. 
Every parameter is typically a 32-bit float (4 bytes), so counting them precisely predicts memory usage.\n", - "\n", - "**Parameter Counting Formula:**\n", - "```\n", - "Linear Layer: (input_features × output_features) + output_features\n", - " ↑ ↑ ↑\n", - " Weight matrix Bias vector Total parameters\n", - "\n", - "Example: Linear(768, 3072) → (768 × 3072) + 3072 = 2,362,368 parameters\n", - "Memory: 2,362,368 × 4 bytes = 9.45 MB\n", - "```\n", - "\n", - "### FLOP Counting - Computational Cost Analysis\n", - "\n", - "FLOPs (Floating Point Operations) measure computational work. Unlike wall-clock time, FLOPs are hardware-independent and predict compute costs across different systems.\n", - "\n", - "**FLOP Formulas for Key Operations:**\n", - "```\n", - "Matrix Multiplication (M,K) @ (K,N):\n", - " FLOPs = M × N × K × 2\n", - " ↑ ↑ ↑ ↑\n", - " Rows Cols Inner Multiply+Add\n", - "\n", - "Linear Layer Forward:\n", - " FLOPs = batch_size × input_features × output_features × 2\n", - " ↑ ↑ ↑\n", - " Matmul cost Bias add Operations\n", - "\n", - "Convolution (simplified):\n", - " FLOPs = output_H × output_W × kernel_H × kernel_W × in_channels × out_channels × 2\n", - "```\n", - "\n", - "### Memory Profiling - The Three Types of Memory\n", - "\n", - "ML models use memory in three distinct ways, each with different optimization strategies:\n", - "\n", - "**Memory Type Breakdown:**\n", - "```\n", - "Total Training Memory = Parameters + Activations + Gradients + Optimizer State\n", - " ↓ ↓ ↓ ↓\n", - " Model Forward Backward Adam: 2×params\n", - " weights pass cache gradients SGD: 0×params\n", - "\n", - "Example for 125M parameter model:\n", - "Parameters: 500 MB (125M × 4 bytes)\n", - "Activations: 200 MB (depends on batch size)\n", - "Gradients: 500 MB (same as parameters)\n", - "Adam state: 1,000 MB (momentum + velocity)\n", - "Total: 2,200 MB (4.4× parameter memory!)\n", - "```\n", - "\n", - "### Latency Measurement - Dealing with Reality\n", - "\n", - "Latency measurement 
is tricky because systems have variance, warmup effects, and measurement overhead. Professional profiling requires statistical rigor.\n", - "\n", - "**Latency Measurement Best Practices:**\n", - "```\n", - "Measurement Protocol:\n", - "1. Warmup runs (10+) → CPU/GPU caches warm up\n", - "2. Timed runs (100+) → Statistical significance\n", - "3. Outlier handling → Use median, not mean\n", - "4. Memory cleanup → Prevent contamination\n", - "\n", - "Timeline:\n", - "Warmup: [run][run][run]...[run] ← Don't time these\n", - "Timing: [⏱run⏱][⏱run⏱]...[⏱run⏱] ← Time these\n", - "Result: median(all_times) ← Robust to outliers\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "ab8f2347", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 3. Implementation: Building the Core Profiler Class\n", - "\n", - "Now let's implement our profiler step by step. We'll start with the foundation and build up to comprehensive analysis.\n", - "\n", - "### The Profiler Architecture\n", - "```\n", - "Profiler Class\n", - "├── count_parameters() → Model size analysis\n", - "├── count_flops() → Computational cost estimation\n", - "├── measure_memory() → Memory usage tracking\n", - "└── measure_latency() → Performance timing\n", - "\n", - "Integration Functions\n", - "├── profile_forward_pass() → Complete forward analysis\n", - "└── profile_backward_pass() → Training analysis\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "208a26c8", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "profiler_class", - "solution": true - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Profiler:\n", - " \"\"\"\n", - " Professional-grade ML model profiler for performance analysis.\n", - "\n", - " Measures parameters, FLOPs, memory usage, and latency with statistical rigor.\n", - " Used for optimization guidance and deployment planning.\n", - " \"\"\"\n", - "\n", - " def 
__init__(self):\n", - " \"\"\"Initialize profiler with measurement state.\"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.measurements = {}\n", - " self.operation_counts = defaultdict(int)\n", - " self.memory_tracker = None\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "463b3b6c", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Parameter Counting - Model Size Analysis\n", - "\n", - "Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. Let's build a robust parameter counter that handles different model architectures.\n", - "\n", - "### Why Parameter Counting Matters\n", - "```\n", - "Model Deployment Pipeline:\n", - "Parameters → Memory → Hardware → Cost\n", - " ↓ ↓ ↓ ↓\n", - " 125M 500MB 8GB GPU $200/month\n", - "\n", - "Parameter Growth Examples:\n", - "Small: GPT-2 Small (124M parameters) → 500MB memory\n", - "Medium: GPT-2 Medium (350M parameters) → 1.4GB memory\n", - "Large: GPT-2 Large (774M parameters) → 3.1GB memory\n", - "XL: GPT-2 XL (1.5B parameters) → 6.0GB memory\n", - "```\n", - "\n", - "### Parameter Counting Strategy\n", - "Our parameter counter needs to handle different model types:\n", - "- **Single layers** (Linear, Conv2d) with weight and bias\n", - "- **Sequential models** with multiple layers\n", - "- **Custom models** with parameters() method" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4044095a", - "metadata": {}, - "outputs": [], - "source": [ - "def count_parameters(self, model) -> int:\n", - " \"\"\"\n", - " Count total trainable parameters in a model.\n", - "\n", - " TODO: Implement parameter counting for any model with parameters() method\n", - "\n", - " APPROACH:\n", - " 1. Get all parameters from model.parameters() if available\n", - " 2. For single layers, count weight and bias directly\n", - " 3. 
Sum total element count across all parameter tensors\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(128, 64) # 128*64 + 64 = 8256 parameters\n", - " >>> profiler = Profiler()\n", - " >>> count = profiler.count_parameters(linear)\n", - " >>> print(count)\n", - " 8256\n", - "\n", - " HINTS:\n", - " - Use parameter.data.size for tensor element count\n", - " - Handle models with and without parameters() method\n", - " - Don't forget bias terms when present\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " total_params = 0\n", - "\n", - " # Handle different model types\n", - " if hasattr(model, 'parameters'):\n", - " # Model with parameters() method (Sequential, custom models)\n", - " for param in model.parameters():\n", - " total_params += param.data.size\n", - " elif hasattr(model, 'weight'):\n", - " # Single layer (Linear, Conv2d)\n", - " total_params += model.weight.data.size\n", - " if hasattr(model, 'bias') and model.bias is not None:\n", - " total_params += model.bias.data.size\n", - " else:\n", - " # No parameters (activations, etc.)\n", - " total_params = 0\n", - "\n", - " return total_params\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.count_parameters = count_parameters" - ] - }, - { - "cell_type": "markdown", - "id": "acdb0834", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Parameter Counting\n", - "This test validates our parameter counting works correctly for different model types.\n", - "**What we're testing**: Parameter counting accuracy for various architectures\n", - "**Why it matters**: Accurate parameter counts predict memory usage and model complexity\n", - "**Expected**: Correct counts for known model configurations" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4f5a8065", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test_parameter_counting", - "locked": true, - "points": 10 - } - }, - "outputs": 
[], - "source": [ - "def test_unit_parameter_counting():\n", - " \"\"\"🔬 Test parameter counting implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Parameter Counting...\")\n", - "\n", - " profiler = Profiler()\n", - "\n", - " # Test 1: Simple model with known parameters\n", - " class SimpleModel:\n", - " def __init__(self):\n", - " self.weight = Tensor(np.random.randn(10, 5))\n", - " self.bias = Tensor(np.random.randn(5))\n", - "\n", - " def parameters(self):\n", - " return [self.weight, self.bias]\n", - "\n", - " simple_model = SimpleModel()\n", - " param_count = profiler.count_parameters(simple_model)\n", - " expected_count = 10 * 5 + 5 # weight + bias\n", - " assert param_count == expected_count, f\"Expected {expected_count} parameters, got {param_count}\"\n", - " print(f\"✅ Simple model: {param_count} parameters\")\n", - "\n", - " # Test 2: Model without parameters\n", - " class NoParamModel:\n", - " def __init__(self):\n", - " pass\n", - "\n", - " no_param_model = NoParamModel()\n", - " param_count = profiler.count_parameters(no_param_model)\n", - " assert param_count == 0, f\"Expected 0 parameters, got {param_count}\"\n", - " print(f\"✅ No parameter model: {param_count} parameters\")\n", - "\n", - " # Test 3: Direct tensor (no parameters)\n", - " test_tensor = Tensor(np.random.randn(2, 3))\n", - " param_count = profiler.count_parameters(test_tensor)\n", - " assert param_count == 0, f\"Expected 0 parameters for tensor, got {param_count}\"\n", - " print(f\"✅ Direct tensor: {param_count} parameters\")\n", - "\n", - " print(\"✅ Parameter counting works correctly!\")\n", - "\n", - "test_unit_parameter_counting()" - ] - }, - { - "cell_type": "markdown", - "id": "6e9d44c6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## FLOP Counting - Computational Cost Estimation\n", - "\n", - "FLOPs measure the computational work required for model operations. 
Unlike latency, FLOPs are hardware-independent and help predict compute costs across different systems.\n", - "\n", - "### FLOP Counting Visualization\n", - "```\n", - "Linear Layer FLOP Breakdown:\n", - "Input (batch=32, features=768) × Weight (768, 3072) + Bias (3072)\n", - " ↓\n", - "Matrix Multiplication: 32 × 768 × 3072 × 2 = 150,994,944 FLOPs\n", - "Bias Addition: 32 × 3072 × 1 = 98,304 FLOPs\n", - " ↓\n", - "Total FLOPs: 151,093,248 FLOPs\n", - "\n", - "Convolution FLOP Breakdown:\n", - "Input (batch=1, channels=3, H=224, W=224)\n", - "Kernel (out=64, in=3, kH=7, kW=7)\n", - " ↓\n", - "Output size: (224×224) → (112×112) with stride=2\n", - "FLOPs = 112 × 112 × 7 × 7 × 3 × 64 × 2 = 235,012,096 FLOPs\n", - "```\n", - "\n", - "### FLOP Counting Strategy\n", - "Different operations require different FLOP calculations:\n", - "- **Matrix operations**: M × N × K × 2 (multiply + add)\n", - "- **Convolutions**: Output spatial × kernel spatial × channels\n", - "- **Activations**: Usually 1 FLOP per element" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "218af3a1", - "metadata": {}, - "outputs": [], - "source": [ - "def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n", - " \"\"\"\n", - " Count FLOPs (Floating Point Operations) for one forward pass.\n", - "\n", - " TODO: Implement FLOP counting for different layer types\n", - "\n", - " APPROACH:\n", - " 1. Create dummy input with given shape\n", - " 2. Calculate FLOPs based on layer type and dimensions\n", - " 3. 
Handle different model architectures (Linear, Conv2d, Sequential)\n", - "\n", - " LAYER-SPECIFIC FLOP FORMULAS:\n", - " - Linear: input_features × output_features × 2 (matmul + bias)\n", - " - Conv2d: output_h × output_w × kernel_h × kernel_w × in_channels × out_channels × 2\n", - " - Activation: Usually 1 FLOP per element (ReLU, Sigmoid)\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(128, 64)\n", - " >>> profiler = Profiler()\n", - " >>> flops = profiler.count_flops(linear, (1, 128))\n", - " >>> print(flops) # 128 * 64 * 2 = 16384\n", - " 16384\n", - "\n", - " HINTS:\n", - " - Batch dimension doesn't affect per-sample FLOPs\n", - " - Focus on major operations (matmul, conv) first\n", - " - For Sequential models, sum FLOPs of all layers\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Create dummy input\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " total_flops = 0\n", - "\n", - " # Handle different model types\n", - " if hasattr(model, '__class__'):\n", - " model_name = model.__class__.__name__\n", - "\n", - " if model_name == 'Linear':\n", - " # Linear layer: input_features × output_features × 2\n", - " in_features = input_shape[-1]\n", - " out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n", - " total_flops = in_features * out_features * 2\n", - "\n", - " elif model_name == 'Conv2d':\n", - " # Conv2d layer: complex calculation based on output size\n", - " # Simplified: assume we know the output dimensions\n", - " if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'):\n", - " batch_size = input_shape[0] if len(input_shape) > 3 else 1\n", - " in_channels = model.in_channels\n", - " out_channels = model.out_channels\n", - " kernel_h = kernel_w = model.kernel_size\n", - "\n", - " # Estimate output size (simplified)\n", - " input_h, input_w = input_shape[-2], input_shape[-1]\n", - " output_h = input_h // (model.stride if hasattr(model, 'stride') else 1)\n", - " output_w = input_w // (model.stride if 
hasattr(model, 'stride') else 1)\n", - "\n", - " total_flops = (output_h * output_w * kernel_h * kernel_w *\n", - " in_channels * out_channels * 2)\n", - "\n", - " elif model_name == 'Sequential':\n", - " # Sequential model: sum FLOPs of all layers\n", - " current_shape = input_shape\n", - " for layer in model.layers:\n", - " layer_flops = self.count_flops(layer, current_shape)\n", - " total_flops += layer_flops\n", - " # Update shape for next layer (simplified)\n", - " if hasattr(layer, 'weight'):\n", - " current_shape = current_shape[:-1] + (layer.weight.shape[1],)\n", - "\n", - " else:\n", - " # Activation or other: assume 1 FLOP per element\n", - " total_flops = np.prod(input_shape)\n", - "\n", - " return total_flops\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.count_flops = count_flops" - ] - }, - { - "cell_type": "markdown", - "id": "8b02224b", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: FLOP Counting\n", - "This test validates our FLOP counting for different operations and architectures.\n", - "**What we're testing**: FLOP calculation accuracy for various layer types\n", - "**Why it matters**: FLOPs predict computational cost and energy usage\n", - "**Expected**: Correct FLOP counts for known operation types" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3b947e9e", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test_flop_counting", - "locked": true, - "points": 10 - } - }, - "outputs": [], - "source": [ - "def test_unit_flop_counting():\n", - " \"\"\"🔬 Test FLOP counting implementation.\"\"\"\n", - " print(\"🔬 Unit Test: FLOP Counting...\")\n", - "\n", - " profiler = Profiler()\n", - "\n", - " # Test 1: Simple tensor operations\n", - " test_tensor = Tensor(np.random.randn(4, 8))\n", - " flops = profiler.count_flops(test_tensor, (4, 8))\n", - " expected_flops = 4 * 8 # 1 FLOP per element for generic operation\n", 
- " assert flops == expected_flops, f\"Expected {expected_flops} FLOPs, got {flops}\"\n", - " print(f\"✅ Tensor operation: {flops} FLOPs\")\n", - "\n", - " # Test 2: Simulated Linear layer\n", - " class MockLinear:\n", - " def __init__(self, in_features, out_features):\n", - " self.weight = Tensor(np.random.randn(in_features, out_features))\n", - " self.__class__.__name__ = 'Linear'\n", - "\n", - " mock_linear = MockLinear(128, 64)\n", - " flops = profiler.count_flops(mock_linear, (1, 128))\n", - " expected_flops = 128 * 64 * 2 # matmul FLOPs\n", - " assert flops == expected_flops, f\"Expected {expected_flops} FLOPs, got {flops}\"\n", - " print(f\"✅ Linear layer: {flops} FLOPs\")\n", - "\n", - " # Test 3: Batch size independence\n", - " flops_batch1 = profiler.count_flops(mock_linear, (1, 128))\n", - " flops_batch32 = profiler.count_flops(mock_linear, (32, 128))\n", - " assert flops_batch1 == flops_batch32, \"FLOPs should be independent of batch size\"\n", - " print(f\"✅ Batch independence: {flops_batch1} FLOPs (same for batch 1 and 32)\")\n", - "\n", - " print(\"✅ FLOP counting works correctly!\")\n", - "\n", - "test_unit_flop_counting()" - ] - }, - { - "cell_type": "markdown", - "id": "f32cf57c", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Memory Profiling - Understanding Memory Usage Patterns\n", - "\n", - "Memory profiling reveals how much RAM your model consumes during training and inference. 
This is critical for deployment planning and optimization.\n", - "\n", - "### Memory Usage Breakdown\n", - "```\n", - "ML Model Memory Components:\n", - "┌───────────────────────────────────────────────────┐\n", - "│ Total Memory │\n", - "├─────────────────┬─────────────────┬───────────────┤\n", - "│ Parameters │ Activations │ Gradients │\n", - "│ (persistent) │ (per forward) │ (per backward)│\n", - "├─────────────────┼─────────────────┼───────────────┤\n", - "│ Linear weights │ Hidden states │ ∂L/∂W │\n", - "│ Conv filters │ Attention maps │ ∂L/∂b │\n", - "│ Embeddings │ Residual cache │ Optimizer │\n", - "└─────────────────┴─────────────────┴───────────────┘\n", - "\n", - "Memory Scaling:\n", - "Batch Size → Activation Memory (linear scaling)\n", - "Model Size → Parameter + Gradient Memory (linear scaling)\n", - "Sequence Length → Attention Memory (quadratic scaling!)\n", - "```\n", - "\n", - "### Memory Measurement Strategy\n", - "We use Python's `tracemalloc` to track memory allocations during model execution. This gives us precise measurements of memory usage patterns." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "694a0990", - "metadata": {}, - "outputs": [], - "source": [ - "def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n", - " \"\"\"\n", - " Measure memory usage during forward pass.\n", - "\n", - " TODO: Implement memory tracking for model execution\n", - "\n", - " APPROACH:\n", - " 1. Use tracemalloc to track memory allocation\n", - " 2. Measure baseline memory before model execution\n", - " 3. Run forward pass and track peak usage\n", - " 4. 
Calculate different memory components\n", - "\n", - " RETURN DICTIONARY:\n", - " - 'parameter_memory_mb': Memory for model parameters\n", - " - 'activation_memory_mb': Memory for activations\n", - " - 'peak_memory_mb': Maximum memory usage\n", - " - 'memory_efficiency': Ratio of useful to total memory\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(1024, 512)\n", - " >>> profiler = Profiler()\n", - " >>> memory = profiler.measure_memory(linear, (32, 1024))\n", - " >>> print(f\"Parameters: {memory['parameter_memory_mb']:.1f} MB\")\n", - " Parameters: 2.1 MB\n", - "\n", - " HINTS:\n", - " - Use tracemalloc.start() and tracemalloc.get_traced_memory()\n", - " - Account for float32 = 4 bytes per parameter\n", - " - Activation memory scales with batch size\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Start memory tracking\n", - " tracemalloc.start()\n", - "\n", - " # Measure baseline memory\n", - " baseline_memory = tracemalloc.get_traced_memory()[0]\n", - "\n", - " # Calculate parameter memory\n", - " param_count = self.count_parameters(model)\n", - " parameter_memory_bytes = param_count * 4 # Assume float32\n", - " parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n", - "\n", - " # Create input and measure activation memory\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " input_memory_bytes = dummy_input.data.nbytes\n", - "\n", - " # Estimate activation memory (simplified)\n", - " activation_memory_bytes = input_memory_bytes * 2 # Rough estimate\n", - " activation_memory_mb = activation_memory_bytes / (1024 * 1024)\n", - "\n", - " # Try to run forward pass and measure peak\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(dummy_input)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(dummy_input)\n", - " except:\n", - " pass # Ignore errors for simplified measurement\n", - "\n", - " # Get peak memory\n", - " current_memory, peak_memory = tracemalloc.get_traced_memory()\n", - " peak_memory_mb 
= (peak_memory - baseline_memory) / (1024 * 1024)\n", - "\n", - " tracemalloc.stop()\n", - "\n", - " # Calculate efficiency\n", - " useful_memory = parameter_memory_mb + activation_memory_mb\n", - " memory_efficiency = useful_memory / max(peak_memory_mb, 0.001) # Avoid division by zero\n", - "\n", - " return {\n", - " 'parameter_memory_mb': parameter_memory_mb,\n", - " 'activation_memory_mb': activation_memory_mb,\n", - " 'peak_memory_mb': max(peak_memory_mb, useful_memory),\n", - " 'memory_efficiency': min(memory_efficiency, 1.0)\n", - " }\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.measure_memory = measure_memory" - ] - }, - { - "cell_type": "markdown", - "id": "1d520581", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Memory Measurement\n", - "This test validates our memory tracking works correctly and provides useful metrics.\n", - "**What we're testing**: Memory usage measurement and calculation accuracy\n", - "**Why it matters**: Memory constraints often limit model deployment\n", - "**Expected**: Reasonable memory measurements with proper components" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "88c934b5", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test_memory_measurement", - "locked": true, - "points": 10 - } - }, - "outputs": [], - "source": [ - "def test_unit_memory_measurement():\n", - " \"\"\"🔬 Test memory measurement implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Memory Measurement...\")\n", - "\n", - " profiler = Profiler()\n", - "\n", - " # Test 1: Basic memory measurement\n", - " test_tensor = Tensor(np.random.randn(10, 20))\n", - " memory_stats = profiler.measure_memory(test_tensor, (10, 20))\n", - "\n", - " # Validate dictionary structure\n", - " required_keys = ['parameter_memory_mb', 'activation_memory_mb', 'peak_memory_mb', 'memory_efficiency']\n", - " for key in required_keys:\n", - " assert 
key in memory_stats, f\"Missing key: {key}\"\n", - "\n", - " # Validate non-negative values\n", - " for key in required_keys:\n", - " assert memory_stats[key] >= 0, f\"{key} should be non-negative, got {memory_stats[key]}\"\n", - "\n", - " print(f\"✅ Basic measurement: {memory_stats['peak_memory_mb']:.3f} MB peak\")\n", - "\n", - " # Test 2: Memory scaling with size\n", - " small_tensor = Tensor(np.random.randn(5, 5))\n", - " large_tensor = Tensor(np.random.randn(50, 50))\n", - "\n", - " small_memory = profiler.measure_memory(small_tensor, (5, 5))\n", - " large_memory = profiler.measure_memory(large_tensor, (50, 50))\n", - "\n", - " # Larger tensor should use more activation memory\n", - " assert large_memory['activation_memory_mb'] >= small_memory['activation_memory_mb'], \\\n", - " \"Larger tensor should use more activation memory\"\n", - "\n", - " print(f\"✅ Scaling: Small {small_memory['activation_memory_mb']:.3f} MB → Large {large_memory['activation_memory_mb']:.3f} MB\")\n", - "\n", - " # Test 3: Efficiency bounds\n", - " assert 0 <= memory_stats['memory_efficiency'] <= 1.0, \\\n", - " f\"Memory efficiency should be between 0 and 1, got {memory_stats['memory_efficiency']}\"\n", - "\n", - " print(f\"✅ Efficiency: {memory_stats['memory_efficiency']:.3f} (0-1 range)\")\n", - "\n", - " print(\"✅ Memory measurement works correctly!\")\n", - "\n", - "test_unit_memory_measurement()" - ] - }, - { - "cell_type": "markdown", - "id": "c45f1b79", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Latency Measurement - Accurate Performance Timing\n", - "\n", - "Latency measurement is the most challenging part of profiling because it's affected by system state, caching, and measurement overhead. 
We need statistical rigor to get reliable results.\n", - "\n", - "### Latency Measurement Challenges\n", - "```\n", - "Timing Challenges:\n", - "┌─────────────────────────────────────────────────┐\n", - "│ Time Variance │\n", - "├─────────────────┬─────────────────┬─────────────┤\n", - "│ System Noise │ Cache Effects │ Thermal │\n", - "│ │ │ Throttling │\n", - "├─────────────────┼─────────────────┼─────────────┤\n", - "│ Background │ Cold start vs │ CPU slows │\n", - "│ processes │ warm caches │ when hot │\n", - "│ OS scheduling │ Memory locality │ GPU thermal │\n", - "│ Network I/O │ Branch predict │ limits │\n", - "└─────────────────┴─────────────────┴─────────────┘\n", - "\n", - "Solution: Statistical Approach\n", - "Warmup → Multiple measurements → Robust statistics (median)\n", - "```\n", - "\n", - "### Measurement Protocol\n", - "Our latency measurement follows professional benchmarking practices:\n", - "1. **Warmup runs** to stabilize system state\n", - "2. **Multiple measurements** for statistical significance\n", - "3. **Median calculation** to handle outliers\n", - "4. **Memory cleanup** to prevent contamination" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "764b8db5", - "metadata": {}, - "outputs": [], - "source": [ - "def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n", - " \"\"\"\n", - " Measure model inference latency with statistical rigor.\n", - "\n", - " TODO: Implement accurate latency measurement\n", - "\n", - " APPROACH:\n", - " 1. Run warmup iterations to stabilize performance\n", - " 2. Measure multiple iterations for statistical accuracy\n", - " 3. Calculate median latency to handle outliers\n", - " 4. 
Return latency in milliseconds\n", - "\n", - " PARAMETERS:\n", - " - warmup: Number of warmup runs (default 10)\n", - " - iterations: Number of measurement runs (default 100)\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(128, 64)\n", - " >>> input_tensor = Tensor(np.random.randn(1, 128))\n", - " >>> profiler = Profiler()\n", - " >>> latency = profiler.measure_latency(linear, input_tensor)\n", - " >>> print(f\"Latency: {latency:.2f} ms\")\n", - " Latency: 0.15 ms\n", - "\n", - " HINTS:\n", - " - Use time.perf_counter() for high precision\n", - " - Use median instead of mean for robustness against outliers\n", - " - Handle different model interfaces (forward, __call__)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Warmup runs\n", - " for _ in range(warmup):\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " else:\n", - " # Fallback for simple operations\n", - " _ = input_tensor\n", - " except Exception:\n", - " pass # Ignore errors during warmup\n", - "\n", - " # Measurement runs\n", - " times = []\n", - " for _ in range(iterations):\n", - " start_time = time.perf_counter()\n", - "\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " else:\n", - " # Minimal operation for timing\n", - " _ = input_tensor.data.copy()\n", - " except Exception:\n", - " pass # Ignore errors but still measure time\n", - "\n", - " end_time = time.perf_counter()\n", - " times.append((end_time - start_time) * 1000) # Convert to milliseconds\n", - "\n", - " # Calculate statistics - use median for robustness\n", - " times = np.array(times)\n", - " median_latency = np.median(times)\n", - "\n", - " return float(median_latency)\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.measure_latency = measure_latency" - ] - }, - { - 
"cell_type": "markdown", - "id": "a7aa639f", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Latency Measurement\n", - "This test validates our latency measurement provides consistent and reasonable results.\n", - "**What we're testing**: Timing accuracy and statistical robustness\n", - "**Why it matters**: Latency determines real-world deployment feasibility\n", - "**Expected**: Consistent timing measurements with proper statistical handling" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3e642916", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test_latency_measurement", - "locked": true, - "points": 10 - } - }, - "outputs": [], - "source": [ - "def test_unit_latency_measurement():\n", - " \"\"\"🔬 Test latency measurement implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Latency Measurement...\")\n", - "\n", - " profiler = Profiler()\n", - "\n", - " # Test 1: Basic latency measurement\n", - " test_tensor = Tensor(np.random.randn(4, 8))\n", - " latency = profiler.measure_latency(test_tensor, test_tensor, warmup=2, iterations=5)\n", - "\n", - " assert latency >= 0, f\"Latency should be non-negative, got {latency}\"\n", - " assert latency < 1000, f\"Latency seems too high for simple operation: {latency} ms\"\n", - " print(f\"✅ Basic latency: {latency:.3f} ms\")\n", - "\n", - " # Test 2: Measurement consistency\n", - " latencies = []\n", - " for _ in range(3):\n", - " lat = profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=3)\n", - " latencies.append(lat)\n", - "\n", - " # Measurements should be in reasonable range\n", - " avg_latency = np.mean(latencies)\n", - " std_latency = np.std(latencies)\n", - " assert std_latency < avg_latency, \"Standard deviation shouldn't exceed mean for simple operations\"\n", - " print(f\"✅ Consistency: {avg_latency:.3f} ± {std_latency:.3f} ms\")\n", - "\n", - " # Test 3: Size scaling\n", - " small_tensor = 
Tensor(np.random.randn(2, 2))\n", - " large_tensor = Tensor(np.random.randn(20, 20))\n", - "\n", - " small_latency = profiler.measure_latency(small_tensor, small_tensor, warmup=1, iterations=3)\n", - " large_latency = profiler.measure_latency(large_tensor, large_tensor, warmup=1, iterations=3)\n", - "\n", - " # Larger operations might take longer (though not guaranteed for simple operations)\n", - " print(f\"✅ Scaling: Small {small_latency:.3f} ms, Large {large_latency:.3f} ms\")\n", - "\n", - " print(\"✅ Latency measurement works correctly!\")\n", - "\n", - "test_unit_latency_measurement()" - ] - }, - { - "cell_type": "markdown", - "id": "47686a04", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 4. Integration: Advanced Profiling Functions\n", - "\n", - "Now let's build higher-level profiling functions that combine our core measurements into comprehensive analysis tools.\n", - "\n", - "### Advanced Profiling Architecture\n", - "```\n", - "Core Profiler Methods → Advanced Analysis Functions → Optimization Insights\n", - " ↓ ↓ ↓\n", - "count_parameters() profile_forward_pass() \"Memory-bound workload\"\n", - "count_flops() profile_backward_pass() \"Optimize data movement\"\n", - "measure_memory() benchmark_efficiency() \"Focus on bandwidth\"\n", - "measure_latency() analyze_bottlenecks() \"Use quantization\"\n", - "```\n", - "\n", - "### Forward Pass Profiling - Complete Performance Picture\n", - "\n", - "A forward pass profile combines all our measurements to understand model behavior comprehensively. This is essential for optimization decisions." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "01dc2eb1", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "advanced_profiling", - "solution": true - } - }, - "outputs": [], - "source": [ - "def profile_forward_pass(model, input_tensor) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Comprehensive profiling of a model's forward pass.\n", - "\n", - " TODO: Implement complete forward pass analysis\n", - "\n", - " APPROACH:\n", - " 1. Use Profiler class to gather all measurements\n", - " 2. Create comprehensive performance profile\n", - " 3. Add derived metrics and insights\n", - " 4. Return structured analysis results\n", - "\n", - " RETURN METRICS:\n", - " - All basic profiler measurements\n", - " - FLOPs per second (computational efficiency)\n", - " - Memory bandwidth utilization\n", - " - Performance bottleneck identification\n", - "\n", - " EXAMPLE:\n", - " >>> model = Linear(256, 128)\n", - " >>> input_data = Tensor(np.random.randn(32, 256))\n", - " >>> profile = profile_forward_pass(model, input_data)\n", - " >>> print(f\"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s\")\n", - " Throughput: 2.45 GFLOP/s\n", - "\n", - " HINTS:\n", - " - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000)\n", - " - Memory bandwidth = memory_mb / (latency_ms / 1000)\n", - " - Consider realistic hardware limits for efficiency calculations\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " profiler = Profiler()\n", - "\n", - " # Basic measurements\n", - " param_count = profiler.count_parameters(model)\n", - " flops = profiler.count_flops(model, input_tensor.shape)\n", - " memory_stats = profiler.measure_memory(model, input_tensor.shape)\n", - " latency_ms = profiler.measure_latency(model, input_tensor, warmup=5, iterations=20)\n", - "\n", - " # Derived metrics\n", - " latency_seconds = latency_ms / 1000.0\n", - " gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6)\n", - "\n", - " # Memory bandwidth (MB/s)\n", - 
" memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6)\n", - "\n", - " # Efficiency metrics\n", - " theoretical_peak_gflops = 100.0 # Assume 100 GFLOP/s theoretical peak for CPU\n", - " computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0)\n", - "\n", - " # Bottleneck analysis\n", - " is_memory_bound = memory_bandwidth > gflops_per_second * 100 # Rough heuristic\n", - " is_compute_bound = not is_memory_bound\n", - "\n", - " return {\n", - " # Basic measurements\n", - " 'parameters': param_count,\n", - " 'flops': flops,\n", - " 'latency_ms': latency_ms,\n", - " **memory_stats,\n", - "\n", - " # Derived metrics\n", - " 'gflops_per_second': gflops_per_second,\n", - " 'memory_bandwidth_mbs': memory_bandwidth,\n", - " 'computational_efficiency': computational_efficiency,\n", - "\n", - " # Bottleneck analysis\n", - " 'is_memory_bound': is_memory_bound,\n", - " 'is_compute_bound': is_compute_bound,\n", - " 'bottleneck': 'memory' if is_memory_bound else 'compute'\n", - " }\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "16cc4aaf", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Backward Pass Profiling - Training Analysis\n", - "\n", - "Training requires both forward and backward passes. The backward pass typically uses 2× the compute and adds gradient memory. 
Understanding this is crucial for training optimization.\n", - "\n", - "### Training Memory Visualization\n", - "```\n", - "Training Memory Timeline:\n", - "Forward Pass: [Parameters] + [Activations]\n", - " ↓\n", - "Backward Pass: [Parameters] + [Activations] + [Gradients]\n", - " ↓\n", - "Optimizer: [Parameters] + [Gradients] + [Optimizer State]\n", - "\n", - "Memory Examples:\n", - "Model: 125M parameters (500MB)\n", - "Forward: 500MB params + 100MB activations = 600MB\n", - "Backward: 500MB params + 100MB activations + 500MB gradients = 1,100MB\n", - "Adam: 500MB params + 500MB gradients + 1,000MB momentum/velocity = 2,000MB\n", - "\n", - "Total Training Memory: 4× parameter memory!\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "20aab8e4", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def profile_backward_pass(model, input_tensor, loss_fn=None) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Profile both forward and backward passes for training analysis.\n", - "\n", - " TODO: Implement training-focused profiling\n", - "\n", - " APPROACH:\n", - " 1. Profile forward pass first\n", - " 2. Estimate backward pass costs (typically 2× forward)\n", - " 3. Calculate total training iteration metrics\n", - " 4. 
Analyze memory requirements for gradients and optimizers\n", - "\n", - " BACKWARD PASS ESTIMATES:\n", - " - FLOPs: ~2× forward pass (gradient computation)\n", - " - Memory: +1× parameters (gradient storage)\n", - " - Latency: ~2× forward pass (more complex operations)\n", - "\n", - " EXAMPLE:\n", - " >>> model = Linear(128, 64)\n", - " >>> input_data = Tensor(np.random.randn(16, 128))\n", - " >>> profile = profile_backward_pass(model, input_data)\n", - " >>> print(f\"Training iteration: {profile['total_latency_ms']:.2f} ms\")\n", - " Training iteration: 0.45 ms\n", - "\n", - " HINTS:\n", - " - Total memory = parameters + activations + gradients\n", - " - Optimizer memory depends on algorithm (SGD: 0×, Adam: 2×)\n", - " - Consider gradient accumulation effects\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Get forward pass profile\n", - " forward_profile = profile_forward_pass(model, input_tensor)\n", - "\n", - " # Estimate backward pass (typically 2× forward)\n", - " backward_flops = forward_profile['flops'] * 2\n", - " backward_latency_ms = forward_profile['latency_ms'] * 2\n", - "\n", - " # Gradient memory (equal to parameter memory)\n", - " gradient_memory_mb = forward_profile['parameter_memory_mb']\n", - "\n", - " # Total training iteration\n", - " total_flops = forward_profile['flops'] + backward_flops\n", - " total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms\n", - " total_memory_mb = (forward_profile['parameter_memory_mb'] +\n", - " forward_profile['activation_memory_mb'] +\n", - " gradient_memory_mb)\n", - "\n", - " # Training efficiency\n", - " total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0)\n", - "\n", - " # Optimizer memory estimates\n", - " optimizer_memory_estimates = {\n", - " 'sgd': 0, # No extra memory\n", - " 'adam': gradient_memory_mb * 2, # Momentum + velocity\n", - " 'adamw': gradient_memory_mb * 2, # Same as Adam\n", - " }\n", - "\n", - " return {\n", - " # Forward pass\n", - " 
'forward_flops': forward_profile['flops'],\n", - " 'forward_latency_ms': forward_profile['latency_ms'],\n", - " 'forward_memory_mb': forward_profile['peak_memory_mb'],\n", - "\n", - " # Backward pass estimates\n", - " 'backward_flops': backward_flops,\n", - " 'backward_latency_ms': backward_latency_ms,\n", - " 'gradient_memory_mb': gradient_memory_mb,\n", - "\n", - " # Total training iteration\n", - " 'total_flops': total_flops,\n", - " 'total_latency_ms': total_latency_ms,\n", - " 'total_memory_mb': total_memory_mb,\n", - " 'total_gflops_per_second': total_gflops_per_second,\n", - "\n", - " # Optimizer memory requirements\n", - " 'optimizer_memory_estimates': optimizer_memory_estimates,\n", - "\n", - " # Training insights\n", - " 'memory_efficiency': forward_profile['memory_efficiency'],\n", - " 'bottleneck': forward_profile['bottleneck']\n", - " }\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "a66d79fe", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Advanced Profiling Functions\n", - "This test validates our advanced profiling functions provide comprehensive analysis.\n", - "**What we're testing**: Forward and backward pass profiling completeness\n", - "**Why it matters**: Training optimization requires understanding both passes\n", - "**Expected**: Complete profiles with all required metrics and relationships" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f7838a43", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test_advanced_profiling", - "locked": true, - "points": 15 - } - }, - "outputs": [], - "source": [ - "def test_unit_advanced_profiling():\n", - " \"\"\"🔬 Test advanced profiling functions.\"\"\"\n", - " print(\"🔬 Unit Test: Advanced Profiling Functions...\")\n", - "\n", - " # Create test model and input\n", - " test_input = Tensor(np.random.randn(4, 8))\n", - "\n", - " # Test forward pass profiling\n", - " forward_profile = 
profile_forward_pass(test_input, test_input)\n", - "\n", - " # Validate forward profile structure\n", - " required_forward_keys = [\n", - " 'parameters', 'flops', 'latency_ms', 'gflops_per_second',\n", - " 'memory_bandwidth_mbs', 'bottleneck'\n", - " ]\n", - "\n", - " for key in required_forward_keys:\n", - " assert key in forward_profile, f\"Missing key: {key}\"\n", - "\n", - " assert forward_profile['parameters'] >= 0\n", - " assert forward_profile['flops'] >= 0\n", - " assert forward_profile['latency_ms'] >= 0\n", - " assert forward_profile['gflops_per_second'] >= 0\n", - "\n", - " print(f\"✅ Forward profiling: {forward_profile['gflops_per_second']:.2f} GFLOP/s\")\n", - "\n", - " # Test backward pass profiling\n", - " backward_profile = profile_backward_pass(test_input, test_input)\n", - "\n", - " # Validate backward profile structure\n", - " required_backward_keys = [\n", - " 'forward_flops', 'backward_flops', 'total_flops',\n", - " 'total_latency_ms', 'total_memory_mb', 'optimizer_memory_estimates'\n", - " ]\n", - "\n", - " for key in required_backward_keys:\n", - " assert key in backward_profile, f\"Missing key: {key}\"\n", - "\n", - " # Validate relationships\n", - " assert backward_profile['total_flops'] >= backward_profile['forward_flops']\n", - " assert backward_profile['total_latency_ms'] >= backward_profile['forward_latency_ms']\n", - " assert 'sgd' in backward_profile['optimizer_memory_estimates']\n", - " assert 'adam' in backward_profile['optimizer_memory_estimates']\n", - "\n", - " # Check backward pass estimates are reasonable\n", - " assert backward_profile['backward_flops'] >= backward_profile['forward_flops'], \\\n", - " \"Backward pass should have at least as many FLOPs as forward\"\n", - " assert backward_profile['gradient_memory_mb'] >= 0, \\\n", - " \"Gradient memory should be non-negative\"\n", - "\n", - " print(f\"✅ Backward profiling: {backward_profile['total_latency_ms']:.2f} ms total\")\n", - " print(f\"✅ Memory breakdown: 
{backward_profile['total_memory_mb']:.2f} MB training\")\n", - " print(\"✅ Advanced profiling functions work correctly!\")\n", - "\n", - "test_unit_advanced_profiling()" - ] - }, - { - "cell_type": "markdown", - "id": "768f21e5", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 5. Systems Analysis: Understanding Performance Characteristics\n", - "\n", - "Let's analyze how different model characteristics affect performance. This analysis guides optimization decisions and helps identify bottlenecks.\n", - "\n", - "### Performance Analysis Workflow\n", - "```\n", - "Model Scaling Analysis:\n", - "Size → Memory → Latency → Throughput → Bottleneck Identification\n", - " ↓ ↓ ↓ ↓ ↓\n", - "64 1MB 0.1ms 10K ops/s Memory bound\n", - "128 4MB 0.2ms 8K ops/s Memory bound\n", - "256 16MB 0.5ms 4K ops/s Memory bound\n", - "512 64MB 2.0ms 1K ops/s Memory bound\n", - "\n", - "Insight: This workload is memory-bound → Optimize data movement, not compute!\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7f90a148", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "performance_analysis", - "solution": true - } - }, - "outputs": [], - "source": [ - "def analyze_model_scaling():\n", - " \"\"\"📊 Analyze how model performance scales with size.\"\"\"\n", - " print(\"📊 Analyzing Model Scaling Characteristics...\")\n", - "\n", - " profiler = Profiler()\n", - " results = []\n", - "\n", - " # Test different model sizes\n", - " sizes = [64, 128, 256, 512]\n", - "\n", - " print(\"\\nModel Scaling Analysis:\")\n", - " print(\"Size\\tParams\\t\\tFLOPs\\t\\tLatency(ms)\\tMemory(MB)\\tGFLOP/s\")\n", - " print(\"-\" * 80)\n", - "\n", - " for size in sizes:\n", - " # Create models of different sizes for comparison\n", - " input_shape = (32, size) # Batch of 32\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - "\n", - " # Simulate linear layer characteristics\n", - " linear_params = size * size 
+ size # W + b\n", - " linear_flops = 32 * size * size * 2 # matmul for batch of 32 (2 FLOPs per multiply-add)\n", - "\n", - " # Measure actual performance\n", - " latency = profiler.measure_latency(dummy_input, dummy_input, warmup=3, iterations=10)\n", - " memory = profiler.measure_memory(dummy_input, input_shape)\n", - "\n", - " gflops_per_second = (linear_flops / 1e9) / (latency / 1000)\n", - "\n", - " results.append({\n", - " 'size': size,\n", - " 'parameters': linear_params,\n", - " 'flops': linear_flops,\n", - " 'latency_ms': latency,\n", - " 'memory_mb': memory['peak_memory_mb'],\n", - " 'gflops_per_second': gflops_per_second\n", - " })\n", - "\n", - " print(f\"{size}\\t{linear_params:,}\\t\\t{linear_flops:,}\\t\\t\"\n", - " f\"{latency:.2f}\\t\\t{memory['peak_memory_mb']:.2f}\\t\\t\"\n", - " f\"{gflops_per_second:.2f}\")\n", - "\n", - " # Analysis insights\n", - " print(\"\\n💡 Scaling Analysis Insights:\")\n", - "\n", - " # Memory scaling\n", - " memory_growth = results[-1]['memory_mb'] / max(results[0]['memory_mb'], 0.001)\n", - " print(f\"Memory grows {memory_growth:.1f}× from {sizes[0]} to {sizes[-1]} size\")\n", - "\n", - " # Compute scaling\n", - " compute_growth = results[-1]['gflops_per_second'] / max(results[0]['gflops_per_second'], 0.001)\n", - " print(f\"Compute efficiency changes {compute_growth:.1f}× with size\")\n", - "\n", - " # Performance characteristics\n", - " avg_efficiency = np.mean([r['gflops_per_second'] for r in results])\n", - " if avg_efficiency < 10: # Arbitrary threshold for \"low\" efficiency\n", - " print(\"🚀 Low compute efficiency suggests memory-bound workload\")\n", - " print(\" → Optimization focus: Data layout, memory bandwidth, caching\")\n", - " else:\n", - " print(\"🚀 High compute efficiency suggests compute-bound workload\")\n", - " print(\" → Optimization focus: Algorithmic efficiency, vectorization\")\n", - "\n", - "def analyze_batch_size_effects():\n", - " \"\"\"📊 Analyze how batch size affects performance and efficiency.\"\"\"\n", - " print(\"\\n📊 Analyzing Batch 
Size Effects...\")\n", - "\n", - " profiler = Profiler()\n", - " batch_sizes = [1, 8, 32, 128]\n", - " feature_size = 256\n", - "\n", - " print(\"\\nBatch Size Effects Analysis:\")\n", - " print(\"Batch\\tLatency(ms)\\tThroughput(samples/s)\\tMemory(MB)\\tMemory Efficiency\")\n", - " print(\"-\" * 85)\n", - "\n", - " for batch_size in batch_sizes:\n", - " input_shape = (batch_size, feature_size)\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - "\n", - " # Measure performance\n", - " latency = profiler.measure_latency(dummy_input, dummy_input, warmup=3, iterations=10)\n", - " memory = profiler.measure_memory(dummy_input, input_shape)\n", - "\n", - " # Calculate throughput\n", - " samples_per_second = (batch_size * 1000) / latency # samples/second\n", - "\n", - " # Calculate efficiency (samples per unit memory)\n", - " efficiency = samples_per_second / max(memory['peak_memory_mb'], 0.001)\n", - "\n", - " print(f\"{batch_size}\\t{latency:.2f}\\t\\t{samples_per_second:.0f}\\t\\t\\t\"\n", - " f\"{memory['peak_memory_mb']:.2f}\\t\\t{efficiency:.1f}\")\n", - "\n", - " print(\"\\n💡 Batch Size Insights:\")\n", - " print(\"• Larger batches typically improve throughput but increase memory usage\")\n", - " print(\"• Sweet spot balances throughput and memory constraints\")\n", - " print(\"• Memory efficiency = samples/s per MB (higher is better)\")\n", - "\n", - "# Run the analysis\n", - "analyze_model_scaling()\n", - "analyze_batch_size_effects()" - ] - }, - { - "cell_type": "markdown", - "id": "0563e9cd", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 6. Optimization Insights: Production Performance Patterns\n", - "\n", - "Understanding profiling results helps guide optimization decisions. 
Let's analyze different operation types and measurement overhead.\n", - "\n", - "### Operation Efficiency Analysis\n", - "```\n", - "Operation Types and Their Characteristics:\n", - "┌─────────────────┬──────────────────┬──────────────────┬─────────────────┐\n", - "│ Operation │ Compute/Memory │ Optimization │ Priority │\n", - "├─────────────────┼──────────────────┼──────────────────┼─────────────────┤\n", - "│ Matrix Multiply │ Compute-bound │ BLAS libraries │ High │\n", - "│ Elementwise │ Memory-bound │ Data locality │ Medium │\n", - "│ Reductions │ Memory-bound │ Parallelization│ Medium │\n", - "│ Attention │ Memory-bound │ FlashAttention │ High │\n", - "└─────────────────┴──────────────────┴──────────────────┴─────────────────┘\n", - "\n", - "Optimization Strategy:\n", - "1. Profile first → Identify bottlenecks\n", - "2. Focus on compute-bound ops → Algorithmic improvements\n", - "3. Focus on memory-bound ops → Data movement optimization\n", - "4. Measure again → Verify improvements\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c506a927", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "optimization_insights", - "solution": true - } - }, - "outputs": [], - "source": [ - "def benchmark_operation_efficiency():\n", - " \"\"\"📊 Compare efficiency of different operations for optimization guidance.\"\"\"\n", - " print(\"📊 Benchmarking Operation Efficiency...\")\n", - "\n", - " profiler = Profiler()\n", - " operations = []\n", - "\n", - " # Test different operation types\n", - " size = 256\n", - " input_tensor = Tensor(np.random.randn(32, size))\n", - "\n", - " # Elementwise operations (memory-bound)\n", - " elementwise_latency = profiler.measure_latency(input_tensor, input_tensor, iterations=20)\n", - " elementwise_flops = size * 32 # One operation per element\n", - "\n", - " operations.append({\n", - " 'operation': 'Elementwise',\n", - " 'latency_ms': elementwise_latency,\n", - " 'flops': elementwise_flops,\n", - 
" 'gflops_per_second': (elementwise_flops / 1e9) / (elementwise_latency / 1000),\n", - " 'efficiency_class': 'memory-bound',\n", - " 'optimization_focus': 'data_locality'\n", - " })\n", - "\n", - " # Matrix operations (compute-bound)\n", - " matrix_tensor = Tensor(np.random.randn(size, size))\n", - " matrix_latency = profiler.measure_latency(matrix_tensor, input_tensor, iterations=10)\n", - " matrix_flops = size * size * size * 2 # n x n matmul: 2*n^3 FLOPs\n", - "\n", - " operations.append({\n", - " 'operation': 'Matrix Multiply',\n", - " 'latency_ms': matrix_latency,\n", - " 'flops': matrix_flops,\n", - " 'gflops_per_second': (matrix_flops / 1e9) / (matrix_latency / 1000),\n", - " 'efficiency_class': 'compute-bound',\n", - " 'optimization_focus': 'algorithms'\n", - " })\n", - "\n", - " # Reduction operations (memory-bound)\n", - " reduction_latency = profiler.measure_latency(input_tensor, input_tensor, iterations=20)\n", - " reduction_flops = size * 32 # Sum reduction\n", - "\n", - " operations.append({\n", - " 'operation': 'Reduction',\n", - " 'latency_ms': reduction_latency,\n", - " 'flops': reduction_flops,\n", - " 'gflops_per_second': (reduction_flops / 1e9) / (reduction_latency / 1000),\n", - " 'efficiency_class': 'memory-bound',\n", - " 'optimization_focus': 'parallelization'\n", - " })\n", - "\n", - " print(\"\\nOperation Efficiency Comparison:\")\n", - " print(\"Operation\\t\\tLatency(ms)\\tGFLOP/s\\t\\tEfficiency Class\\tOptimization Focus\")\n", - " print(\"-\" * 95)\n", - "\n", - " for op in operations:\n", - " print(f\"{op['operation']:<15}\\t{op['latency_ms']:.3f}\\t\\t\"\n", - " f\"{op['gflops_per_second']:.2f}\\t\\t{op['efficiency_class']:<15}\\t{op['optimization_focus']}\")\n", - "\n", - " print(\"\\n💡 Operation Optimization Insights:\")\n", - "\n", - " # Find most and least efficient\n", - " best_op = max(operations, key=lambda x: x['gflops_per_second'])\n", - " worst_op = min(operations, key=lambda x: x['gflops_per_second'])\n", - "\n", - " print(f\"• Most 
efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)\")\n", - " print(f\"• Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)\")\n", - "\n", - " # Count operation types\n", - " memory_bound_ops = [op for op in operations if op['efficiency_class'] == 'memory-bound']\n", - " compute_bound_ops = [op for op in operations if op['efficiency_class'] == 'compute-bound']\n", - "\n", - " print(f\"\\n🚀 Optimization Priority:\")\n", - " if len(memory_bound_ops) > len(compute_bound_ops):\n", - " print(\"• Focus on memory optimization: data locality, bandwidth, caching\")\n", - " print(\"• Consider operation fusion to reduce memory traffic\")\n", - " else:\n", - " print(\"• Focus on compute optimization: better algorithms, vectorization\")\n", - " print(\"• Consider specialized libraries (BLAS, cuBLAS)\")\n", - "\n", - "def analyze_profiling_overhead():\n", - " \"\"\"📊 Measure the overhead of profiling itself.\"\"\"\n", - " print(\"\\n📊 Analyzing Profiling Overhead...\")\n", - "\n", - " # Test with and without profiling\n", - " test_tensor = Tensor(np.random.randn(100, 100))\n", - " iterations = 50\n", - "\n", - " # Without profiling - baseline measurement\n", - " start_time = time.perf_counter()\n", - " for _ in range(iterations):\n", - " _ = test_tensor.data.copy() # Simple operation\n", - " end_time = time.perf_counter()\n", - " baseline_ms = (end_time - start_time) * 1000\n", - "\n", - " # With profiling - includes measurement overhead\n", - " profiler = Profiler()\n", - " start_time = time.perf_counter()\n", - " for _ in range(iterations):\n", - " _ = profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=1)\n", - " end_time = time.perf_counter()\n", - " profiled_ms = (end_time - start_time) * 1000\n", - "\n", - " overhead_factor = profiled_ms / max(baseline_ms, 0.001)\n", - "\n", - " print(f\"\\nProfiling Overhead Analysis:\")\n", - " print(f\"Baseline execution: {baseline_ms:.2f} ms\")\n", - " 
print(f\"With profiling: {profiled_ms:.2f} ms\")\n", - " print(f\"Profiling overhead: {overhead_factor:.1f}× slower\")\n", - "\n", - " print(f\"\\n💡 Profiling Overhead Insights:\")\n", - " if overhead_factor < 2:\n", - " print(\"• Low overhead - suitable for frequent profiling\")\n", - " print(\"• Can be used in development with minimal impact\")\n", - " elif overhead_factor < 10:\n", - " print(\"• Moderate overhead - use for development and debugging\")\n", - " print(\"• Disable for production unless investigating issues\")\n", - " else:\n", - " print(\"• High overhead - use sparingly in production\")\n", - " print(\"• Enable only when investigating specific performance issues\")\n", - "\n", - " print(f\"\\n🚀 Profiling Best Practices:\")\n", - " print(\"• Profile during development to identify bottlenecks\")\n", - " print(\"• Use production profiling only for investigation\")\n", - " print(\"• Focus measurement on critical code paths\")\n", - " print(\"• Balance measurement detail with overhead cost\")\n", - "\n", - "# Run optimization analysis\n", - "benchmark_operation_efficiency()\n", - "analyze_profiling_overhead()" - ] - }, - { - "cell_type": "markdown", - "id": "e7a5de0d", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🧪 Module Integration Test\n", - "\n", - "Final validation that everything works together correctly." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d922a54d", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test_module", - "locked": true, - "points": 20 - } - }, - "outputs": [], - "source": [ - "def test_module():\n", - " \"\"\"\n", - " Comprehensive test of entire profiling module functionality.\n", - "\n", - " This final test runs before module summary to ensure:\n", - " - All unit tests pass\n", - " - Functions work together correctly\n", - " - Module is ready for integration with TinyTorch\n", - " \"\"\"\n", - " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", - " print(\"=\" * 50)\n", - "\n", - " # Run all unit tests\n", - " print(\"Running unit tests...\")\n", - " test_unit_parameter_counting()\n", - " test_unit_flop_counting()\n", - " test_unit_memory_measurement()\n", - " test_unit_latency_measurement()\n", - " test_unit_advanced_profiling()\n", - "\n", - " print(\"\\nRunning integration scenarios...\")\n", - "\n", - " # Test realistic usage patterns\n", - " print(\"🔬 Integration Test: Complete Profiling Workflow...\")\n", - "\n", - " # Create profiler\n", - " profiler = Profiler()\n", - "\n", - " # Create test model and data\n", - " test_model = Tensor(np.random.randn(16, 32))\n", - " test_input = Tensor(np.random.randn(8, 16))\n", - "\n", - " # Run complete profiling workflow\n", - " print(\"1. Measuring model characteristics...\")\n", - " params = profiler.count_parameters(test_model)\n", - " flops = profiler.count_flops(test_model, test_input.shape)\n", - " memory = profiler.measure_memory(test_model, test_input.shape)\n", - " latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5)\n", - "\n", - " print(f\" Parameters: {params}\")\n", - " print(f\" FLOPs: {flops}\")\n", - " print(f\" Memory: {memory['peak_memory_mb']:.2f} MB\")\n", - " print(f\" Latency: {latency:.2f} ms\")\n", - "\n", - " # Test advanced profiling\n", - " print(\"2. 
Running advanced profiling...\")\n", - " forward_profile = profile_forward_pass(test_model, test_input)\n", - " backward_profile = profile_backward_pass(test_model, test_input)\n", - "\n", - " assert 'gflops_per_second' in forward_profile\n", - " assert 'total_latency_ms' in backward_profile\n", - " print(f\" Forward GFLOP/s: {forward_profile['gflops_per_second']:.2f}\")\n", - " print(f\" Training latency: {backward_profile['total_latency_ms']:.2f} ms\")\n", - "\n", - " # Test bottleneck analysis\n", - " print(\"3. Analyzing performance bottlenecks...\")\n", - " bottleneck = forward_profile['bottleneck']\n", - " efficiency = forward_profile['computational_efficiency']\n", - " print(f\" Bottleneck: {bottleneck}\")\n", - " print(f\" Compute efficiency: {efficiency:.3f}\")\n", - "\n", - " # Validate end-to-end workflow\n", - " assert params >= 0, \"Parameter count should be non-negative\"\n", - " assert flops >= 0, \"FLOP count should be non-negative\"\n", - " assert memory['peak_memory_mb'] >= 0, \"Memory usage should be non-negative\"\n", - " assert latency >= 0, \"Latency should be non-negative\"\n", - " assert forward_profile['gflops_per_second'] >= 0, \"GFLOP/s should be non-negative\"\n", - " assert backward_profile['total_latency_ms'] >= 0, \"Total latency should be non-negative\"\n", - " assert bottleneck in ['memory', 'compute'], \"Bottleneck should be memory or compute\"\n", - " assert 0 <= efficiency <= 1, \"Efficiency should be between 0 and 1\"\n", - "\n", - " print(\"✅ End-to-end profiling workflow works!\")\n", - "\n", - " # Test production-like scenario\n", - " print(\"4. 
Testing production profiling scenario...\")\n", - "\n", - " # Simulate larger model analysis\n", - " large_input = Tensor(np.random.randn(32, 512)) # Larger model input\n", - " large_profile = profile_forward_pass(large_input, large_input)\n", - "\n", - " # Verify profile contains optimization insights\n", - " assert 'bottleneck' in large_profile, \"Profile should identify bottlenecks\"\n", - " assert 'memory_bandwidth_mbs' in large_profile, \"Profile should measure memory bandwidth\"\n", - "\n", - " print(f\" Large model analysis: {large_profile['bottleneck']} bottleneck\")\n", - " print(f\" Memory bandwidth: {large_profile['memory_bandwidth_mbs']:.1f} MB/s\")\n", - "\n", - " print(\"✅ Production profiling scenario works!\")\n", - "\n", - " print(\"\\n\" + \"=\" * 50)\n", - " print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", - " print(\"Run: tito module complete 15\")\n", - "\n", - "# Call before module summary\n", - "test_module()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "378e2ca8", - "metadata": {}, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " print(\"🚀 Running Profiling module...\")\n", - " test_module()\n", - " print(\"✅ Module validation complete!\")" - ] - }, - { - "cell_type": "markdown", - "id": "e44c6173", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Performance Measurement\n", - "\n", - "### Question 1: FLOP Analysis\n", - "You implemented a profiler that counts FLOPs for different operations.\n", - "For a Linear layer with 1000 input features and 500 output features:\n", - "- How many FLOPs are required for one forward pass? _____ FLOPs\n", - "- If you process a batch of 32 samples, how does this change the per-sample FLOPs? 
_____\n", - "\n", - "### Question 2: Memory Scaling\n", - "Your profiler measures memory usage for models and activations.\n", - "A transformer model has 125M parameters (500MB at FP32).\n", - "During training with batch size 16:\n", - "- What's the minimum memory for gradients? _____ MB\n", - "- With Adam optimizer, what's the total memory requirement? _____ MB\n", - "\n", - "### Question 3: Performance Bottlenecks\n", - "You built tools to identify compute vs memory bottlenecks.\n", - "A model achieves 10 GFLOP/s on hardware with 100 GFLOP/s peak:\n", - "- What's the computational efficiency? _____%\n", - "- If doubling batch size doesn't improve GFLOP/s, the bottleneck is likely _____\n", - "\n", - "### Question 4: Profiling Trade-offs\n", - "Your profiler adds measurement overhead to understand performance.\n", - "If profiling adds 5× overhead but reveals a 50% speedup opportunity:\n", - "- Is the profiling cost justified for development? _____\n", - "- When should you disable profiling in production? _____" - ] - }, - { - "cell_type": "markdown", - "id": "ab131290", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🏁 Consolidated Profiler for Export\n", - "\n", - "Now that we've implemented all profiling methods, let's create a consolidated Profiler class\n", - "for export to the tinytorch package. This allows milestones to use the full profiler." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd3324fa", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "profiler_export", - "solution": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class ProfilerComplete:\n", - " \"\"\"\n", - " Complete profiler with all measurement capabilities for milestone use.\n", - " \n", - " This is the exported version students build through the module exercises.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize profiler with measurement state.\"\"\"\n", - " self.measurements = {}\n", - " self.operation_counts = defaultdict(int)\n", - " self.memory_tracker = None\n", - " \n", - " def count_parameters(self, model) -> int:\n", - " \"\"\"Count total trainable parameters in a model.\"\"\"\n", - " total_params = 0\n", - " \n", - " if hasattr(model, 'parameters'):\n", - " for param in model.parameters():\n", - " total_params += param.data.size\n", - " elif hasattr(model, 'weight'):\n", - " total_params += model.weight.data.size\n", - " if hasattr(model, 'bias') and model.bias is not None:\n", - " total_params += model.bias.data.size\n", - " \n", - " return total_params\n", - " \n", - " def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n", - " \"\"\"Count FLOPs for one forward pass.\"\"\"\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " total_flops = 0\n", - " \n", - " if hasattr(model, '__class__'):\n", - " model_name = model.__class__.__name__\n", - " \n", - " if model_name == 'Linear':\n", - " in_features = input_shape[-1]\n", - " out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n", - " total_flops = in_features * out_features * 2\n", - " \n", - " elif model_name == 'Conv2d':\n", - " total_flops = 1000000 # Simplified for now\n", - " \n", - " return total_flops\n", - " \n", - " def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n", - " 
\"\"\"Measure memory usage during forward pass.\"\"\"\n", - " tracemalloc.start()\n", - " baseline_memory = tracemalloc.get_traced_memory()[0]\n", - " \n", - " param_count = self.count_parameters(model)\n", - " parameter_memory_bytes = param_count * 4\n", - " parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n", - " \n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " \n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " output = model.forward(dummy_input)\n", - " elif hasattr(model, '__call__'):\n", - " output = model(dummy_input)\n", - " except:\n", - " output = dummy_input\n", - " \n", - " peak_memory, _ = tracemalloc.get_traced_memory()\n", - " tracemalloc.stop()\n", - " \n", - " peak_memory_mb = peak_memory / (1024 * 1024)\n", - " activation_memory_mb = max(0, peak_memory_mb - parameter_memory_mb)\n", - " \n", - " return {\n", - " 'parameter_memory_mb': parameter_memory_mb,\n", - " 'activation_memory_mb': activation_memory_mb,\n", - " 'peak_memory_mb': peak_memory_mb,\n", - " 'memory_efficiency': parameter_memory_mb / peak_memory_mb if peak_memory_mb > 0 else 0\n", - " }\n", - " \n", - " def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n", - " \"\"\"Measure model inference latency with statistical rigor.\"\"\"\n", - " # Warmup\n", - " for _ in range(warmup):\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " except:\n", - " pass\n", - " \n", - " # Measurement\n", - " times = []\n", - " for _ in range(iterations):\n", - " start = time.perf_counter()\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " except:\n", - " pass\n", - " end = time.perf_counter()\n", - " times.append(end - start)\n", - " \n", - " median_latency_ms = np.median(times) * 
1000\n", - " return median_latency_ms" - ] - }, - { - "cell_type": "markdown", - "id": "dc025a52", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Profiling\n", - "\n", - "Congratulations! You've built a comprehensive profiling system for ML performance analysis!\n", - "\n", - "### Key Accomplishments\n", - "- Built complete Profiler class with parameter, FLOP, memory, and latency measurement\n", - "- Implemented advanced profiling functions for forward and backward pass analysis\n", - "- Discovered performance characteristics through scaling and efficiency analysis\n", - "- Created production-quality measurement tools for optimization guidance\n", - "- All tests pass ✅ (validated by `test_module()`)\n", - "\n", - "### Systems Insights Gained\n", - "- **FLOPs vs Reality**: Theoretical operations don't always predict actual performance\n", - "- **Memory Bottlenecks**: Many ML operations are limited by memory bandwidth, not compute\n", - "- **Batch Size Effects**: Larger batches improve throughput but increase memory requirements\n", - "- **Profiling Overhead**: Measurement tools have costs but enable data-driven optimization\n", - "\n", - "### Production Skills Developed\n", - "- **Performance Detective Work**: Use data, not guesses, to identify bottlenecks\n", - "- **Optimization Prioritization**: Focus efforts on actual bottlenecks, not assumptions\n", - "- **Resource Planning**: Predict memory and compute requirements for deployment\n", - "- **Statistical Rigor**: Handle measurement variance with proper methodology\n", - "\n", - "### Ready for Next Steps\n", - "Your profiling implementation enables Module 16 (Acceleration) to make data-driven optimization decisions.\n", - "Export with: `tito module complete 15`\n", - "\n", - "**Next**: Module 16 will use these profiling tools to implement acceleration techniques and measure their effectiveness!" 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules/source/15_profiling/profiling_dev.py b/modules/source/15_profiling/profiling_dev.py deleted file mode 100644 index 8cb97d5d..00000000 --- a/modules/source/15_profiling/profiling_dev.py +++ /dev/null @@ -1,1709 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# kernelspec: -# display_name: Python 3 (ipykernel) -# language: python -# name: python3 -# --- - -# %% [markdown] -""" -# Module 15: Profiling - Measuring What Matters in ML Systems - -Welcome to Module 15! You'll build professional profiling tools to measure model performance and uncover optimization opportunities. - -## 🔗 Prerequisites & Progress -**You've Built**: Complete ML stack from tensors to transformers with KV caching -**You'll Build**: Comprehensive profiling system for parameters, FLOPs, memory, and latency -**You'll Enable**: Data-driven optimization decisions and performance analysis - -**Connection Map**: -``` -All Modules → Profiling → Acceleration (Module 16) -(implementations) (measurement) (optimization) -``` - -## Learning Objectives -By the end of this module, you will: -1. Implement a complete Profiler class for model analysis -2. Count parameters and FLOPs accurately for different architectures -3. Measure memory usage and latency with statistical rigor -4. Create production-quality performance analysis tools - -Let's build the measurement foundation for ML systems optimization! 
- -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/15_profiling/profiling_dev.py` -**Building Side:** Code exports to `tinytorch.profiling.profiler` - -```python -# How to use this module: -from tinytorch.profiling.profiler import Profiler, profile_forward_pass, profile_backward_pass -``` - -**Why this matters:** -- **Learning:** Complete profiling system for understanding model performance characteristics -- **Production:** Professional measurement tools like those used in PyTorch, TensorFlow -- **Consistency:** All profiling and measurement tools in profiling.profiler -- **Integration:** Works with any model built using TinyTorch components -""" - -# %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} -#| default_exp profiling.profiler -#| export - -import time -import numpy as np -import tracemalloc -from typing import Dict, List, Any, Optional, Tuple -from collections import defaultdict -import gc - -# Import our TinyTorch components for profiling -from tinytorch.core.tensor import Tensor -from tinytorch.core.layers import Linear -from tinytorch.core.spatial import Conv2d - -# %% [markdown] -""" -## 1. Introduction: Why Profiling Matters in ML Systems - -Imagine you're a detective investigating a performance crime. Your model is running slowly, using too much memory, or burning through compute budgets. Without profiling, you're flying blind - making guesses about what to optimize. With profiling, you have evidence. 
- -**The Performance Investigation Process:** -``` -Suspect Model → Profile Evidence → Identify Bottleneck → Target Optimization - ↓ ↓ ↓ ↓ - "Too slow" "200 GFLOP/s" "Memory bound" "Reduce transfers" -``` - -**Questions Profiling Answers:** -- **How many parameters?** (Memory footprint, model size) -- **How many FLOPs?** (Computational cost, energy usage) -- **Where are bottlenecks?** (Memory vs compute bound) -- **What's actual latency?** (Real-world performance) - -**Production Importance:** -In production ML systems, profiling isn't optional - it's survival. A model that's 10% more accurate but 100× slower often can't be deployed. Teams use profiling daily to make data-driven optimization decisions, not guesses. - -### The Profiling Workflow Visualization -``` -Model → Profiler → Measurements → Analysis → Optimization Decision - ↓ ↓ ↓ ↓ ↓ - GPT Parameter 125M params Memory Use quantization - Counter 2.5B FLOPs bound Reduce precision -``` -""" - -# %% [markdown] -""" -### 🔗 From Optimization to Discovery: Connecting Module 14 - -**In Module 14**, you implemented KV caching and saw 10-15x speedup. -**In Module 15**, you'll learn HOW to discover such optimization opportunities. - -**The Real ML Engineering Workflow**: -``` -Step 1: Measure (This Module!) Step 2: Analyze - ↓ ↓ -Profile baseline → Find bottleneck → Understand cause -40 tok/s 80% in attention O(n²) recomputation - ↓ -Step 4: Validate Step 3: Optimize (Module 14) - ↓ ↓ -Profile optimized ← Verify speedup ← Implement KV cache -500 tok/s (12.5x) Measure impact Design solution -``` - -**Without Module 15's profiling**: You'd never know WHERE to optimize! -**Without Module 14's optimization**: You couldn't FIX the bottleneck! - -This module teaches the measurement and analysis skills that enable -optimization breakthroughs like KV caching. You'll profile real models -and discover bottlenecks just like production ML teams do. -""" - -# %% [markdown] -""" -## 2. 
Foundations: Performance Measurement Principles - -Before we build our profiler, let's understand what we're measuring and why each metric matters. - -### Parameter Counting - Model Size Detective Work - -Parameters determine your model's memory footprint and storage requirements. Every parameter is typically a 32-bit float (4 bytes), so counting them precisely predicts memory usage. - -**Parameter Counting Formula:** -``` -Linear Layer: (input_features × output_features) + output_features - ↑ ↑ ↑ - Weight matrix Bias vector Total parameters - -Example: Linear(768, 3072) → (768 × 3072) + 3072 = 2,362,368 parameters -Memory: 2,362,368 × 4 bytes = 9.45 MB -``` - -### FLOP Counting - Computational Cost Analysis - -FLOPs (Floating Point Operations) measure computational work. Unlike wall-clock time, FLOPs are hardware-independent and predict compute costs across different systems. - -**FLOP Formulas for Key Operations:** -``` -Matrix Multiplication (M,K) @ (K,N): - FLOPs = M × N × K × 2 - ↑ ↑ ↑ ↑ - Rows Cols Inner Multiply+Add - -Linear Layer Forward: - FLOPs = batch_size × input_features × output_features × 2 - ↑ ↑ ↑ - Matmul cost Bias add Operations - -Convolution (simplified): - FLOPs = output_H × output_W × kernel_H × kernel_W × in_channels × out_channels × 2 -``` - -### Memory Profiling - The Three Types of Memory - -ML models use memory in three distinct ways, each with different optimization strategies: - -**Memory Type Breakdown:** -``` -Total Training Memory = Parameters + Activations + Gradients + Optimizer State - ↓ ↓ ↓ ↓ - Model Forward Backward Adam: 2×params - weights pass cache gradients SGD: 0×params - -Example for 125M parameter model: -Parameters: 500 MB (125M × 4 bytes) -Activations: 200 MB (depends on batch size) -Gradients: 500 MB (same as parameters) -Adam state: 1,000 MB (momentum + velocity) -Total: 2,200 MB (4.4× parameter memory!) 
-``` - -### Latency Measurement - Dealing with Reality - -Latency measurement is tricky because systems have variance, warmup effects, and measurement overhead. Professional profiling requires statistical rigor. - -**Latency Measurement Best Practices:** -``` -Measurement Protocol: -1. Warmup runs (10+) → CPU/GPU caches warm up -2. Timed runs (100+) → Statistical significance -3. Outlier handling → Use median, not mean -4. Memory cleanup → Prevent contamination - -Timeline: -Warmup: [run][run][run]...[run] ← Don't time these -Timing: [⏱run⏱][⏱run⏱]...[⏱run⏱] ← Time these -Result: median(all_times) ← Robust to outliers -``` -""" - -# %% [markdown] -""" -## 3. Implementation: Building the Core Profiler Class - -Now let's implement our profiler step by step. We'll start with the foundation and build up to comprehensive analysis. - -### The Profiler Architecture -``` -Profiler Class -├── count_parameters() → Model size analysis -├── count_flops() → Computational cost estimation -├── measure_memory() → Memory usage tracking -├── measure_latency() → Performance timing -├── profile_layer() → Layer-wise analysis -├── profile_forward_pass() → Complete forward analysis -└── profile_backward_pass() → Training analysis - -Integration: -All methods work together to provide comprehensive performance insights -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "profiler_class", "solution": true} -#| export -class Profiler: - """ - Professional-grade ML model profiler for performance analysis. - - Measures parameters, FLOPs, memory usage, and latency with statistical rigor. - Used for optimization guidance and deployment planning. - """ - - def __init__(self): - """ - Initialize profiler with measurement state. - - TODO: Set up profiler tracking structures - - APPROACH: - 1. Create empty measurements dictionary - 2. Initialize operation counters - 3. 
Set up memory tracking state - - EXAMPLE: - >>> profiler = Profiler() - >>> profiler.measurements - {} - - HINTS: - - Use defaultdict(int) for operation counters - - measurements dict will store timing results - """ - ### BEGIN SOLUTION - self.measurements = {} - self.operation_counts = defaultdict(int) - self.memory_tracker = None - ### END SOLUTION - - def count_parameters(self, model) -> int: - """ - Count total trainable parameters in a model. - - TODO: Implement parameter counting for any model with parameters() method - - APPROACH: - 1. Get all parameters from model.parameters() if available - 2. For single layers, count weight and bias directly - 3. Sum total element count across all parameter tensors - - EXAMPLE: - >>> linear = Linear(128, 64) # 128*64 + 64 = 8256 parameters - >>> profiler = Profiler() - >>> count = profiler.count_parameters(linear) - >>> print(count) - 8256 - - HINTS: - - Use parameter.data.size for tensor element count - - Handle models with and without parameters() method - - Don't forget bias terms when present - """ - ### BEGIN SOLUTION - total_params = 0 - - # Handle different model types - if hasattr(model, 'parameters'): - # Model with parameters() method (Sequential, custom models) - for param in model.parameters(): - total_params += param.data.size - elif hasattr(model, 'weight'): - # Single layer (Linear, Conv2d) - total_params += model.weight.data.size - if hasattr(model, 'bias') and model.bias is not None: - total_params += model.bias.data.size - else: - # No parameters (activations, etc.) - total_params = 0 - - return total_params - ### END SOLUTION - - def count_flops(self, model, input_shape: Tuple[int, ...]) -> int: - """ - Count FLOPs (Floating Point Operations) for one forward pass. - - TODO: Implement FLOP counting for different layer types - - APPROACH: - 1. Create dummy input with given shape - 2. Calculate FLOPs based on layer type and dimensions - 3. 
Handle different model architectures (Linear, Conv2d, Sequential) - - LAYER-SPECIFIC FLOP FORMULAS: - - Linear: input_features × output_features × 2 (matmul + bias) - - Conv2d: output_h × output_w × kernel_h × kernel_w × in_channels × out_channels × 2 - - Activation: Usually 1 FLOP per element (ReLU, Sigmoid) - - EXAMPLE: - >>> linear = Linear(128, 64) - >>> profiler = Profiler() - >>> flops = profiler.count_flops(linear, (1, 128)) - >>> print(flops) # 128 * 64 * 2 = 16384 - 16384 - - HINTS: - - Batch dimension doesn't affect per-sample FLOPs - - Focus on major operations (matmul, conv) first - - For Sequential models, sum FLOPs of all layers - """ - ### BEGIN SOLUTION - # Create dummy input (unused but kept for interface consistency) - _dummy_input = Tensor(np.random.randn(*input_shape)) - total_flops = 0 - - # Handle different model types - if hasattr(model, '__class__'): - model_name = model.__class__.__name__ - - if model_name == 'Linear': - # Linear layer: input_features × output_features × 2 - in_features = input_shape[-1] - out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1 - total_flops = in_features * out_features * 2 - - elif model_name == 'Conv2d': - # Conv2d layer: complex calculation based on output size - # Simplified: assume we know the output dimensions - if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'): - _batch_size = input_shape[0] if len(input_shape) > 3 else 1 - in_channels = model.in_channels - out_channels = model.out_channels - kernel_h = kernel_w = model.kernel_size - - # Estimate output size (simplified) - input_h, input_w = input_shape[-2], input_shape[-1] - output_h = input_h // (model.stride if hasattr(model, 'stride') else 1) - output_w = input_w // (model.stride if hasattr(model, 'stride') else 1) - - total_flops = (output_h * output_w * kernel_h * kernel_w * - in_channels * out_channels * 2) - - elif model_name == 'Sequential': - # Sequential model: sum FLOPs of all layers - current_shape = 
input_shape - for layer in model.layers: - layer_flops = self.count_flops(layer, current_shape) - total_flops += layer_flops - # Update shape for next layer (simplified) - if hasattr(layer, 'weight'): - current_shape = current_shape[:-1] + (layer.weight.shape[1],) - - else: - # Activation or other: assume 1 FLOP per element - total_flops = np.prod(input_shape) - - return total_flops - ### END SOLUTION - - def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]: - """ - Measure memory usage during forward pass. - - TODO: Implement memory tracking for model execution - - APPROACH: - 1. Use tracemalloc to track memory allocation - 2. Measure baseline memory before model execution - 3. Run forward pass and track peak usage - 4. Calculate different memory components - - RETURN DICTIONARY: - - 'parameter_memory_mb': Memory for model parameters - - 'activation_memory_mb': Memory for activations - - 'peak_memory_mb': Maximum memory usage - - 'memory_efficiency': Ratio of useful to total memory - - EXAMPLE: - >>> linear = Linear(1024, 512) - >>> profiler = Profiler() - >>> memory = profiler.measure_memory(linear, (32, 1024)) - >>> print(f"Parameters: {memory['parameter_memory_mb']:.1f} MB") - Parameters: 2.1 MB - - HINTS: - - Use tracemalloc.start() and tracemalloc.get_traced_memory() - - Account for float32 = 4 bytes per parameter - - Activation memory scales with batch size - """ - ### BEGIN SOLUTION - # Start memory tracking - tracemalloc.start() - - # Measure baseline memory (unused but kept for completeness) - _baseline_memory = tracemalloc.get_traced_memory()[0] - - # Calculate parameter memory - param_count = self.count_parameters(model) - parameter_memory_bytes = param_count * 4 # Assume float32 - parameter_memory_mb = parameter_memory_bytes / (1024 * 1024) - - # Create input and measure activation memory - dummy_input = Tensor(np.random.randn(*input_shape)) - input_memory_bytes = dummy_input.data.nbytes - - # Estimate activation memory 
(simplified) - activation_memory_bytes = input_memory_bytes * 2 # Rough estimate - activation_memory_mb = activation_memory_bytes / (1024 * 1024) - - # Try to run forward pass and measure peak - try: - if hasattr(model, 'forward'): - _ = model.forward(dummy_input) - elif hasattr(model, '__call__'): - _ = model(dummy_input) - except: - pass # Ignore errors for simplified measurement - - # Get peak memory - _current_memory, peak_memory = tracemalloc.get_traced_memory() - peak_memory_mb = (peak_memory - _baseline_memory) / (1024 * 1024) - - tracemalloc.stop() - - # Calculate efficiency - useful_memory = parameter_memory_mb + activation_memory_mb - memory_efficiency = useful_memory / max(peak_memory_mb, 0.001) # Avoid division by zero - - return { - 'parameter_memory_mb': parameter_memory_mb, - 'activation_memory_mb': activation_memory_mb, - 'peak_memory_mb': max(peak_memory_mb, useful_memory), - 'memory_efficiency': min(memory_efficiency, 1.0) - } - ### END SOLUTION - - def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float: - """ - Measure model inference latency with statistical rigor. - - TODO: Implement accurate latency measurement - - APPROACH: - 1. Run warmup iterations to stabilize performance - 2. Measure multiple iterations for statistical accuracy - 3. Calculate median latency to handle outliers - 4. 
Return latency in milliseconds - - PARAMETERS: - - warmup: Number of warmup runs (default 10) - - iterations: Number of measurement runs (default 100) - - EXAMPLE: - >>> linear = Linear(128, 64) - >>> input_tensor = Tensor(np.random.randn(1, 128)) - >>> profiler = Profiler() - >>> latency = profiler.measure_latency(linear, input_tensor) - >>> print(f"Latency: {latency:.2f} ms") - Latency: 0.15 ms - - HINTS: - - Use time.perf_counter() for high precision - - Use median instead of mean for robustness against outliers - - Handle different model interfaces (forward, __call__) - """ - ### BEGIN SOLUTION - # Warmup runs - for _ in range(warmup): - try: - if hasattr(model, 'forward'): - _ = model.forward(input_tensor) - elif hasattr(model, '__call__'): - _ = model(input_tensor) - else: - # Fallback for simple operations - _ = input_tensor - except: - pass # Ignore errors during warmup - - # Measurement runs - times = [] - for _ in range(iterations): - start_time = time.perf_counter() - - try: - if hasattr(model, 'forward'): - _ = model.forward(input_tensor) - elif hasattr(model, '__call__'): - _ = model(input_tensor) - else: - # Minimal operation for timing - _ = input_tensor.data.copy() - except: - pass # Ignore errors but still measure time - - end_time = time.perf_counter() - times.append((end_time - start_time) * 1000) # Convert to milliseconds - - # Calculate statistics - use median for robustness - times = np.array(times) - median_latency = np.median(times) - - return float(median_latency) - ### END SOLUTION - - def profile_layer(self, layer, input_shape: Tuple[int, ...]) -> Dict[str, Any]: - """ - Profile a single layer comprehensively. - - TODO: Implement layer-wise profiling - - APPROACH: - 1. Count parameters for this layer - 2. Count FLOPs for this layer - 3. Measure memory usage - 4. Measure latency - 5. 
Return comprehensive layer profile - - EXAMPLE: - >>> linear = Linear(256, 128) - >>> profiler = Profiler() - >>> profile = profiler.profile_layer(linear, (32, 256)) - >>> print(f"Layer uses {profile['parameters']} parameters") - Layer uses 32896 parameters - - HINTS: - - Use existing profiler methods (count_parameters, count_flops, etc.) - - Create dummy input for latency measurement - - Include layer type information in profile - """ - ### BEGIN SOLUTION - # Create dummy input for latency measurement - dummy_input = Tensor(np.random.randn(*input_shape)) - - # Gather all measurements - params = self.count_parameters(layer) - flops = self.count_flops(layer, input_shape) - memory = self.measure_memory(layer, input_shape) - latency = self.measure_latency(layer, dummy_input, warmup=3, iterations=10) - - # Compute derived metrics - gflops_per_second = (flops / 1e9) / max(latency / 1000, 1e-6) - - return { - 'layer_type': layer.__class__.__name__, - 'parameters': params, - 'flops': flops, - 'latency_ms': latency, - 'gflops_per_second': gflops_per_second, - **memory - } - ### END SOLUTION - - def profile_forward_pass(self, model, input_tensor) -> Dict[str, Any]: - """ - Comprehensive profiling of a model's forward pass. - - TODO: Implement complete forward pass analysis - - APPROACH: - 1. Use Profiler class to gather all measurements - 2. Create comprehensive performance profile - 3. Add derived metrics and insights - 4. 
Return structured analysis results - - RETURN METRICS: - - All basic profiler measurements - - FLOPs per second (computational efficiency) - - Memory bandwidth utilization - - Performance bottleneck identification - - EXAMPLE: - >>> model = Linear(256, 128) - >>> input_data = Tensor(np.random.randn(32, 256)) - >>> profiler = Profiler() - >>> profile = profiler.profile_forward_pass(model, input_data) - >>> print(f"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s") - Throughput: 2.45 GFLOP/s - - HINTS: - - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000) - - Memory bandwidth = memory_mb / (latency_ms / 1000) - - Consider realistic hardware limits for efficiency calculations - """ - ### BEGIN SOLUTION - # Basic measurements - param_count = self.count_parameters(model) - flops = self.count_flops(model, input_tensor.shape) - memory_stats = self.measure_memory(model, input_tensor.shape) - latency_ms = self.measure_latency(model, input_tensor, warmup=5, iterations=20) - - # Derived metrics - latency_seconds = latency_ms / 1000.0 - gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6) - - # Memory bandwidth (MB/s) - memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6) - - # Efficiency metrics - theoretical_peak_gflops = 100.0 # Assume 100 GFLOP/s theoretical peak for CPU - computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0) - - # Bottleneck analysis - is_memory_bound = memory_bandwidth > gflops_per_second * 100 # Rough heuristic - is_compute_bound = not is_memory_bound - - return { - # Basic measurements - 'parameters': param_count, - 'flops': flops, - 'latency_ms': latency_ms, - **memory_stats, - - # Derived metrics - 'gflops_per_second': gflops_per_second, - 'memory_bandwidth_mbs': memory_bandwidth, - 'computational_efficiency': computational_efficiency, - - # Bottleneck analysis - 'is_memory_bound': is_memory_bound, - 'is_compute_bound': is_compute_bound, - 'bottleneck': 'memory' if is_memory_bound else 
'compute' - } - ### END SOLUTION - - def profile_backward_pass(self, model, input_tensor, _loss_fn=None) -> Dict[str, Any]: - """ - Profile both forward and backward passes for training analysis. - - TODO: Implement training-focused profiling - - APPROACH: - 1. Profile forward pass first - 2. Estimate backward pass costs (typically 2× forward) - 3. Calculate total training iteration metrics - 4. Analyze memory requirements for gradients and optimizers - - BACKWARD PASS ESTIMATES: - - FLOPs: ~2× forward pass (gradient computation) - - Memory: +1× parameters (gradient storage) - - Latency: ~2× forward pass (more complex operations) - - EXAMPLE: - >>> model = Linear(128, 64) - >>> input_data = Tensor(np.random.randn(16, 128)) - >>> profiler = Profiler() - >>> profile = profiler.profile_backward_pass(model, input_data) - >>> print(f"Training iteration: {profile['total_latency_ms']:.2f} ms") - Training iteration: 0.45 ms - - HINTS: - - Total memory = parameters + activations + gradients - - Optimizer memory depends on algorithm (SGD: 0×, Adam: 2×) - - Consider gradient accumulation effects - """ - ### BEGIN SOLUTION - # Get forward pass profile - forward_profile = self.profile_forward_pass(model, input_tensor) - - # Estimate backward pass (typically 2× forward) - backward_flops = forward_profile['flops'] * 2 - backward_latency_ms = forward_profile['latency_ms'] * 2 - - # Gradient memory (equal to parameter memory) - gradient_memory_mb = forward_profile['parameter_memory_mb'] - - # Total training iteration - total_flops = forward_profile['flops'] + backward_flops - total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms - total_memory_mb = (forward_profile['parameter_memory_mb'] + - forward_profile['activation_memory_mb'] + - gradient_memory_mb) - - # Training efficiency - total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0) - - # Optimizer memory estimates - optimizer_memory_estimates = { - 'sgd': 0, # No extra memory - 'adam': 
gradient_memory_mb * 2, # Momentum + velocity - 'adamw': gradient_memory_mb * 2, # Same as Adam - } - - return { - # Forward pass - 'forward_flops': forward_profile['flops'], - 'forward_latency_ms': forward_profile['latency_ms'], - 'forward_memory_mb': forward_profile['peak_memory_mb'], - - # Backward pass estimates - 'backward_flops': backward_flops, - 'backward_latency_ms': backward_latency_ms, - 'gradient_memory_mb': gradient_memory_mb, - - # Total training iteration - 'total_flops': total_flops, - 'total_latency_ms': total_latency_ms, - 'total_memory_mb': total_memory_mb, - 'total_gflops_per_second': total_gflops_per_second, - - # Optimizer memory requirements - 'optimizer_memory_estimates': optimizer_memory_estimates, - - # Training insights - 'memory_efficiency': forward_profile['memory_efficiency'], - 'bottleneck': forward_profile['bottleneck'] - } - ### END SOLUTION - -# %% [markdown] -""" -## Helper Functions - Quick Profiling Utilities - -These helper functions provide simplified interfaces for common profiling tasks. -They make it easy to quickly profile models and analyze characteristics. -""" - -# %% -#| export -def quick_profile(model, input_tensor, profiler=None): - """ - Quick profiling function for immediate insights. - - Provides a simplified interface for profiling that displays key metrics - in a student-friendly format. 
- - Args: - model: Model to profile - input_tensor: Input data for profiling - profiler: Optional Profiler instance (creates new one if None) - - Returns: - dict: Profile results with key metrics - - Example: - >>> model = Linear(128, 64) - >>> input_data = Tensor(np.random.randn(16, 128)) - >>> results = quick_profile(model, input_data) - >>> # Displays formatted output automatically - """ - if profiler is None: - profiler = Profiler() - - profile = profiler.profile_forward_pass(model, input_tensor) - - # Display formatted results - print("🔬 Quick Profile Results:") - print(f" Parameters: {profile['parameters']:,}") - print(f" FLOPs: {profile['flops']:,}") - print(f" Latency: {profile['latency_ms']:.2f} ms") - print(f" Memory: {profile['peak_memory_mb']:.2f} MB") - print(f" Bottleneck: {profile['bottleneck']}") - print(f" Efficiency: {profile['computational_efficiency']*100:.1f}%") - - return profile - -#| export -def analyze_weight_distribution(model, percentiles=[10, 25, 50, 75, 90]): - """ - Analyze weight distribution for compression insights. - - Helps understand which weights are small and might be prunable. - Used by Module 17 (Compression) to motivate pruning. 
- - Args: - model: Model to analyze - percentiles: List of percentiles to compute - - Returns: - dict: Weight distribution statistics - - Example: - >>> model = Linear(512, 512) - >>> stats = analyze_weight_distribution(model) - >>> print(f"Weights < 0.01: {stats['below_threshold_001']:.1f}%") - """ - # Collect all weights - weights = [] - if hasattr(model, 'parameters'): - for param in model.parameters(): - weights.extend(param.data.flatten().tolist()) - elif hasattr(model, 'weight'): - weights.extend(model.weight.data.flatten().tolist()) - else: - return {'error': 'No weights found'} - - weights = np.array(weights) - abs_weights = np.abs(weights) - - # Calculate statistics - stats = { - 'total_weights': len(weights), - 'mean': float(np.mean(abs_weights)), - 'std': float(np.std(abs_weights)), - 'min': float(np.min(abs_weights)), - 'max': float(np.max(abs_weights)), - } - - # Percentile analysis - for p in percentiles: - stats[f'percentile_{p}'] = float(np.percentile(abs_weights, p)) - - # Threshold analysis (useful for pruning) - for threshold in [0.001, 0.01, 0.1]: - below = np.sum(abs_weights < threshold) / len(weights) * 100 - stats[f'below_threshold_{str(threshold).replace(".", "")}'] = below - - return stats - -# %% [markdown] -""" -## Parameter Counting - Model Size Analysis - -Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. Let's validate our implementation. 
- -### Why Parameter Counting Matters -``` -Model Deployment Pipeline: -Parameters → Memory → Hardware → Cost - ↓ ↓ ↓ ↓ - 125M 500MB 8GB GPU $200/month - -Parameter Growth Examples: -Small: GPT-2 Small (124M parameters) → 500MB memory -Medium: GPT-2 Medium (350M parameters) → 1.4GB memory -Large: GPT-2 Large (774M parameters) → 3.1GB memory -XL: GPT-2 XL (1.5B parameters) → 6.0GB memory -``` -""" - -# %% [markdown] -""" -### 🧪 Unit Test: Parameter Counting -This test validates our parameter counting works correctly for different model types. -**What we're testing**: Parameter counting accuracy for various architectures -**Why it matters**: Accurate parameter counts predict memory usage and model complexity -**Expected**: Correct counts for known model configurations -""" - -# %% nbgrader={"grade": true, "grade_id": "test_parameter_counting", "locked": true, "points": 10} -def test_unit_parameter_counting(): - """🔬 Test parameter counting implementation.""" - print("🔬 Unit Test: Parameter Counting...") - - profiler = Profiler() - - # Test 1: Simple model with known parameters - class SimpleModel: - def __init__(self): - self.weight = Tensor(np.random.randn(10, 5)) - self.bias = Tensor(np.random.randn(5)) - - def parameters(self): - return [self.weight, self.bias] - - simple_model = SimpleModel() - param_count = profiler.count_parameters(simple_model) - expected_count = 10 * 5 + 5 # weight + bias - assert param_count == expected_count, f"Expected {expected_count} parameters, got {param_count}" - print(f"✅ Simple model: {param_count} parameters") - - # Test 2: Model without parameters - class NoParamModel: - def __init__(self): - pass - - no_param_model = NoParamModel() - param_count = profiler.count_parameters(no_param_model) - assert param_count == 0, f"Expected 0 parameters, got {param_count}" - print(f"✅ No parameter model: {param_count} parameters") - - # Test 3: Direct tensor (no parameters) - test_tensor = Tensor(np.random.randn(2, 3)) - param_count = 
profiler.count_parameters(test_tensor) - assert param_count == 0, f"Expected 0 parameters for tensor, got {param_count}" - print(f"✅ Direct tensor: {param_count} parameters") - - print("✅ Parameter counting works correctly!") - -if __name__ == "__main__": - test_unit_parameter_counting() - -# %% [markdown] -""" -## FLOP Counting - Computational Cost Estimation - -FLOPs measure the computational work required for model operations. Unlike latency, FLOPs are hardware-independent and help predict compute costs across different systems. - -### FLOP Counting Visualization -``` -Linear Layer FLOP Breakdown: -Input (batch=32, features=768) × Weight (768, 3072) + Bias (3072) - ↓ -Matrix Multiplication: 32 × 768 × 3072 × 2 = 150,994,944 FLOPs -Bias Addition: 32 × 3072 × 1 = 98,304 FLOPs - ↓ -Total FLOPs: 151,093,248 FLOPs - -Convolution FLOP Breakdown: -Input (batch=1, channels=3, H=224, W=224) -Kernel (out=64, in=3, kH=7, kW=7) - ↓ -Output size: (224×224) → (112×112) with stride=2 -FLOPs = 112 × 112 × 7 × 7 × 3 × 64 × 2 = 236,027,904 FLOPs -``` - -### FLOP Counting Strategy -Different operations require different FLOP calculations: -- **Matrix operations**: M × N × K × 2 (multiply + add) -- **Convolutions**: Output spatial × kernel spatial × in-channels × out-channels × 2 -- **Activations**: Usually 1 FLOP per element -""" - -# %% [markdown] -""" -### 🧪 Unit Test: FLOP Counting -This test validates our FLOP counting for different operations and architectures. 
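Before running the test, the Linear-layer breakdown above can be reproduced with plain arithmetic (a sketch; the shapes are the example values from the diagram, not part of the Profiler API):

```python
# Reproduce the Linear-layer FLOP breakdown from the diagram above by hand.
batch, d_in, d_out = 32, 768, 3072

matmul_flops = batch * d_in * d_out * 2  # one multiply + one add per MAC
bias_flops = batch * d_out               # one add per output element
total_flops = matmul_flops + bias_flops

print(f"MatMul: {matmul_flops:,} FLOPs")  # 150,994,944
print(f"Bias:   {bias_flops:,} FLOPs")    # 98,304
print(f"Total:  {total_flops:,} FLOPs")   # 151,093,248
```

The same multiply-and-accumulate counting gives the convolution total once the kernel and output spatial sizes are multiplied in.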
-**What we're testing**: FLOP calculation accuracy for various layer types -**Why it matters**: FLOPs predict computational cost and energy usage -**Expected**: Correct FLOP counts for known operation types -""" - -# %% nbgrader={"grade": true, "grade_id": "test_flop_counting", "locked": true, "points": 10} -def test_unit_flop_counting(): - """🔬 Test FLOP counting implementation.""" - print("🔬 Unit Test: FLOP Counting...") - - profiler = Profiler() - - # Test 1: Simple tensor operations - test_tensor = Tensor(np.random.randn(4, 8)) - flops = profiler.count_flops(test_tensor, (4, 8)) - expected_flops = 4 * 8 # 1 FLOP per element for generic operation - assert flops == expected_flops, f"Expected {expected_flops} FLOPs, got {flops}" - print(f"✅ Tensor operation: {flops} FLOPs") - - # Test 2: Simulated Linear layer - class MockLinear: - def __init__(self, in_features, out_features): - self.weight = Tensor(np.random.randn(in_features, out_features)) - self.__class__.__name__ = 'Linear' - - mock_linear = MockLinear(128, 64) - flops = profiler.count_flops(mock_linear, (1, 128)) - expected_flops = 128 * 64 * 2 # matmul FLOPs - assert flops == expected_flops, f"Expected {expected_flops} FLOPs, got {flops}" - print(f"✅ Linear layer: {flops} FLOPs") - - # Test 3: Batch size independence - flops_batch1 = profiler.count_flops(mock_linear, (1, 128)) - flops_batch32 = profiler.count_flops(mock_linear, (32, 128)) - assert flops_batch1 == flops_batch32, "FLOPs should be independent of batch size" - print(f"✅ Batch independence: {flops_batch1} FLOPs (same for batch 1 and 32)") - - print("✅ FLOP counting works correctly!") - -if __name__ == "__main__": - test_unit_flop_counting() - -# %% [markdown] -""" -## Memory Profiling - Understanding Memory Usage Patterns - -Memory profiling reveals how much RAM your model consumes during training and inference. This is critical for deployment planning and optimization. 
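Before measuring anything, parameter memory can already be pinned down on the back of an envelope, which is the kind of estimate behind the GPT-2 figures earlier in this module. A sketch (assumes float32 weights at 4 bytes each; `param_memory_mb` is an illustrative helper, not part of the Profiler):

```python
# Back-of-envelope parameter memory: a float32 weight takes 4 bytes.
# param_memory_mb is an illustrative helper, not part of the Profiler class.
BYTES_PER_FP32 = 4

def param_memory_mb(num_params: int, bytes_per_param: int = BYTES_PER_FP32) -> float:
    return num_params * bytes_per_param / 1e6

for name, params in [("GPT-2 Small", 124_000_000), ("GPT-2 XL", 1_500_000_000)]:
    print(f"{name}: {param_memory_mb(params):,.0f} MB")  # 496 MB and 6,000 MB
```

Activations and gradients then scale on top of this baseline, which is exactly what the breakdown below makes visible.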
- -### Memory Usage Breakdown -``` -ML Model Memory Components: -┌───────────────────────────────────────────────────┐ -│ Total Memory │ -├─────────────────┬─────────────────┬───────────────┤ -│ Parameters │ Activations │ Gradients │ -│ (persistent) │ (per forward) │ (per backward)│ -├─────────────────┼─────────────────┼───────────────┤ -│ Linear weights │ Hidden states │ ∂L/∂W │ -│ Conv filters │ Attention maps │ ∂L/∂b │ -│ Embeddings │ Residual cache │ Optimizer │ -└─────────────────┴─────────────────┴───────────────┘ - -Memory Scaling: -Batch Size → Activation Memory (linear scaling) -Model Size → Parameter + Gradient Memory (linear scaling) -Sequence Length → Attention Memory (quadratic scaling!) -``` - -### Memory Measurement Strategy -We use Python's `tracemalloc` to track memory allocations during model execution. This gives us precise measurements of memory usage patterns. -""" - -# %% [markdown] -""" -### 🧪 Unit Test: Memory Measurement -This test validates our memory tracking works correctly and provides useful metrics. 
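The `tracemalloc` strategy can be seen in isolation with only the standard library (a minimal sketch, not the Profiler implementation; the 8 MB `bytearray` stands in for a layer's activations):

```python
import tracemalloc

# Sketch of the measurement strategy: snapshot traced memory before and
# after a workload, and report the peak allocation in between.
tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

buffer = bytearray(8_000_000)  # stand-in for activations: ~8 MB allocation

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Peak allocation during block: {(peak - before) / 1e6:.1f} MB")
```

Note that `tracemalloc` only sees allocations routed through Python's allocator hooks, so it reports allocation patterns rather than total process RSS.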
-**What we're testing**: Memory usage measurement and calculation accuracy -**Why it matters**: Memory constraints often limit model deployment -**Expected**: Reasonable memory measurements with proper components -""" - -# %% nbgrader={"grade": true, "grade_id": "test_memory_measurement", "locked": true, "points": 10} -def test_unit_memory_measurement(): - """🔬 Test memory measurement implementation.""" - print("🔬 Unit Test: Memory Measurement...") - - profiler = Profiler() - - # Test 1: Basic memory measurement - test_tensor = Tensor(np.random.randn(10, 20)) - memory_stats = profiler.measure_memory(test_tensor, (10, 20)) - - # Validate dictionary structure - required_keys = ['parameter_memory_mb', 'activation_memory_mb', 'peak_memory_mb', 'memory_efficiency'] - for key in required_keys: - assert key in memory_stats, f"Missing key: {key}" - - # Validate non-negative values - for key in required_keys: - assert memory_stats[key] >= 0, f"{key} should be non-negative, got {memory_stats[key]}" - - print(f"✅ Basic measurement: {memory_stats['peak_memory_mb']:.3f} MB peak") - - # Test 2: Memory scaling with size - small_tensor = Tensor(np.random.randn(5, 5)) - large_tensor = Tensor(np.random.randn(50, 50)) - - small_memory = profiler.measure_memory(small_tensor, (5, 5)) - large_memory = profiler.measure_memory(large_tensor, (50, 50)) - - # Larger tensor should use more activation memory - assert large_memory['activation_memory_mb'] >= small_memory['activation_memory_mb'], \ - "Larger tensor should use more activation memory" - - print(f"✅ Scaling: Small {small_memory['activation_memory_mb']:.3f} MB → Large {large_memory['activation_memory_mb']:.3f} MB") - - # Test 3: Efficiency bounds - assert 0 <= memory_stats['memory_efficiency'] <= 1.0, \ - f"Memory efficiency should be between 0 and 1, got {memory_stats['memory_efficiency']}" - - print(f"✅ Efficiency: {memory_stats['memory_efficiency']:.3f} (0-1 range)") - - print("✅ Memory measurement works correctly!") - -if 
__name__ == "__main__": - test_unit_memory_measurement() - -# %% [markdown] -""" -## Latency Measurement - Accurate Performance Timing - -Latency measurement is the most challenging part of profiling because it's affected by system state, caching, and measurement overhead. We need statistical rigor to get reliable results. - -### Latency Measurement Challenges -``` -Timing Challenges: -┌─────────────────────────────────────────────────┐ -│ Time Variance │ -├─────────────────┬─────────────────┬─────────────┤ -│ System Noise │ Cache Effects │ Thermal │ -│ │ │ Throttling │ -├─────────────────┼─────────────────┼─────────────┤ -│ Background │ Cold start vs │ CPU slows │ -│ processes │ warm caches │ when hot │ -│ OS scheduling │ Memory locality │ GPU thermal │ -│ Network I/O │ Branch predict │ limits │ -└─────────────────┴─────────────────┴─────────────┘ - -Solution: Statistical Approach -Warmup → Multiple measurements → Robust statistics (median) -``` - -### Measurement Protocol -Our latency measurement follows professional benchmarking practices: -1. **Warmup runs** to stabilize system state -2. **Multiple measurements** for statistical significance -3. **Median calculation** to handle outliers -4. **Memory cleanup** to prevent contamination -""" - -# %% [markdown] -""" -### 🧪 Unit Test: Latency Measurement -This test validates our latency measurement provides consistent and reasonable results. 
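Stripped to its essentials, that protocol looks like this (a standard-library sketch; `workload` and `measure_latency_ms` are illustrative stand-ins, not the Profiler API):

```python
import time
import statistics

# Sketch of the protocol above: warmup runs, repeated timing, median statistic.
def workload():
    return sum(i * i for i in range(10_000))

def measure_latency_ms(fn, warmup=10, iterations=100):
    for _ in range(warmup):          # 1. stabilize caches and allocator state
        fn()
    times = []
    for _ in range(iterations):      # 2. many measurements for significance
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000)
    return statistics.median(times)  # 3. median is robust to outlier spikes

latency = measure_latency_ms(workload, warmup=3, iterations=30)
print(f"Median latency: {latency:.3f} ms")
```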
-**What we're testing**: Timing accuracy and statistical robustness -**Why it matters**: Latency determines real-world deployment feasibility -**Expected**: Consistent timing measurements with proper statistical handling -""" - -# %% nbgrader={"grade": true, "grade_id": "test_latency_measurement", "locked": true, "points": 10} -def test_unit_latency_measurement(): - """🔬 Test latency measurement implementation.""" - print("🔬 Unit Test: Latency Measurement...") - - profiler = Profiler() - - # Test 1: Basic latency measurement - test_tensor = Tensor(np.random.randn(4, 8)) - latency = profiler.measure_latency(test_tensor, test_tensor, warmup=2, iterations=5) - - assert latency >= 0, f"Latency should be non-negative, got {latency}" - assert latency < 1000, f"Latency seems too high for simple operation: {latency} ms" - print(f"✅ Basic latency: {latency:.3f} ms") - - # Test 2: Measurement consistency - latencies = [] - for _ in range(3): - lat = profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=3) - latencies.append(lat) - - # Measurements should be in reasonable range - avg_latency = np.mean(latencies) - std_latency = np.std(latencies) - assert std_latency < avg_latency, "Standard deviation shouldn't exceed mean for simple operations" - print(f"✅ Consistency: {avg_latency:.3f} ± {std_latency:.3f} ms") - - # Test 3: Size scaling - small_tensor = Tensor(np.random.randn(2, 2)) - large_tensor = Tensor(np.random.randn(20, 20)) - - small_latency = profiler.measure_latency(small_tensor, small_tensor, warmup=1, iterations=3) - large_latency = profiler.measure_latency(large_tensor, large_tensor, warmup=1, iterations=3) - - # Larger operations might take longer (though not guaranteed for simple operations) - print(f"✅ Scaling: Small {small_latency:.3f} ms, Large {large_latency:.3f} ms") - - print("✅ Latency measurement works correctly!") - -if __name__ == "__main__": - test_unit_latency_measurement() - -# %% [markdown] -""" -## 4. 
Integration: Advanced Profiling Functions - -Now let's validate our higher-level profiling functions that combine core measurements into comprehensive analysis tools. - -### Advanced Profiling Architecture -``` -Core Profiler Methods → Advanced Analysis Functions → Optimization Insights - ↓ ↓ ↓ -count_parameters() profile_forward_pass() "Memory-bound workload" -count_flops() profile_backward_pass() "Optimize data movement" -measure_memory() profile_layer() "Focus on bandwidth" -measure_latency() benchmark_efficiency() "Use quantization" -``` - -### Forward Pass Profiling - Complete Performance Picture - -A forward pass profile combines all our measurements to understand model behavior comprehensively. This is essential for optimization decisions. -""" - -# %% [markdown] -""" -### Backward Pass Profiling - Training Analysis - -Training requires both forward and backward passes. The backward pass typically uses 2× the compute and adds gradient memory. Understanding this is crucial for training optimization. - -### Training Memory Visualization -``` -Training Memory Timeline: -Forward Pass: [Parameters] + [Activations] - ↓ -Backward Pass: [Parameters] + [Activations] + [Gradients] - ↓ -Optimizer: [Parameters] + [Gradients] + [Optimizer State] - -Memory Examples: -Model: 125M parameters (500MB) -Forward: 500MB params + 100MB activations = 600MB -Backward: 500MB params + 100MB activations + 500MB gradients = 1,100MB -Adam: 500MB params + 500MB gradients + 1,000MB momentum/velocity = 2,000MB - -Total Training Memory: 4× parameter memory! -``` -""" - -# %% [markdown] -""" -### 🧪 Unit Test: Advanced Profiling Functions -This test validates our advanced profiling functions provide comprehensive analysis. 
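The timeline above reduces to simple arithmetic. A sketch using the 125M-parameter float32 example (the 100 MB activation figure is the example value from the timeline, not something computed here):

```python
# Training-memory arithmetic for the 125M-parameter float32 example above.
params_mb = 125e6 * 4 / 1e6     # 500 MB of weights
activations_mb = 100.0          # forward-pass activations (example value, given)
gradients_mb = params_mb        # one float32 gradient per parameter
adam_state_mb = 2 * params_mb   # Adam keeps momentum + velocity per parameter

forward_mb = params_mb + activations_mb
backward_mb = forward_mb + gradients_mb
adam_total_mb = params_mb + gradients_mb + adam_state_mb

print(f"Forward:  {forward_mb:.0f} MB")     # 600 MB
print(f"Backward: {backward_mb:.0f} MB")    # 1100 MB
print(f"Adam:     {adam_total_mb:.0f} MB")  # 2000 MB
```

This is why switching from Adam to SGD, or using gradient checkpointing to shrink the activation term, can make an otherwise infeasible training run fit in memory.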
-**What we're testing**: Forward and backward pass profiling completeness -**Why it matters**: Training optimization requires understanding both passes -**Expected**: Complete profiles with all required metrics and relationships -""" - -# %% nbgrader={"grade": true, "grade_id": "test_advanced_profiling", "locked": true, "points": 15} -def test_unit_advanced_profiling(): - """🔬 Test advanced profiling functions.""" - print("🔬 Unit Test: Advanced Profiling Functions...") - - # Create profiler and test model - profiler = Profiler() - test_input = Tensor(np.random.randn(4, 8)) - - # Test forward pass profiling - forward_profile = profiler.profile_forward_pass(test_input, test_input) - - # Validate forward profile structure - required_forward_keys = [ - 'parameters', 'flops', 'latency_ms', 'gflops_per_second', - 'memory_bandwidth_mbs', 'bottleneck' - ] - - for key in required_forward_keys: - assert key in forward_profile, f"Missing key: {key}" - - assert forward_profile['parameters'] >= 0 - assert forward_profile['flops'] >= 0 - assert forward_profile['latency_ms'] >= 0 - assert forward_profile['gflops_per_second'] >= 0 - - print(f"✅ Forward profiling: {forward_profile['gflops_per_second']:.2f} GFLOP/s") - - # Test backward pass profiling - backward_profile = profiler.profile_backward_pass(test_input, test_input) - - # Validate backward profile structure - required_backward_keys = [ - 'forward_flops', 'backward_flops', 'total_flops', - 'total_latency_ms', 'total_memory_mb', 'optimizer_memory_estimates' - ] - - for key in required_backward_keys: - assert key in backward_profile, f"Missing key: {key}" - - # Validate relationships - assert backward_profile['total_flops'] >= backward_profile['forward_flops'] - assert backward_profile['total_latency_ms'] >= backward_profile['forward_latency_ms'] - assert 'sgd' in backward_profile['optimizer_memory_estimates'] - assert 'adam' in backward_profile['optimizer_memory_estimates'] - - # Check backward pass estimates are reasonable 
- assert backward_profile['backward_flops'] >= backward_profile['forward_flops'], \ - "Backward pass should have at least as many FLOPs as forward" - assert backward_profile['gradient_memory_mb'] >= 0, \ - "Gradient memory should be non-negative" - - print(f"✅ Backward profiling: {backward_profile['total_latency_ms']:.2f} ms total") - print(f"✅ Memory breakdown: {backward_profile['total_memory_mb']:.2f} MB training") - print("✅ Advanced profiling functions work correctly!") - -if __name__ == "__main__": - test_unit_advanced_profiling() - -# %% [markdown] -""" -## 5. Systems Analysis: Understanding Performance Characteristics - -Let's analyze how different model characteristics affect performance. This analysis guides optimization decisions and helps identify bottlenecks. - -### Performance Analysis Workflow -``` -Model Scaling Analysis: -Size → Memory → Latency → Throughput → Bottleneck Identification - ↓ ↓ ↓ ↓ ↓ -64 1MB 0.1ms 10K ops/s Memory bound -128 4MB 0.2ms 8K ops/s Memory bound -256 16MB 0.5ms 4K ops/s Memory bound -512 64MB 2.0ms 1K ops/s Memory bound - -Insight: This workload is memory-bound → Optimize data movement, not compute! 
-``` -""" - -# %% nbgrader={"grade": false, "grade_id": "performance_analysis", "solution": true} -def analyze_model_scaling(): - """📊 Analyze how model performance scales with size.""" - print("📊 Analyzing Model Scaling Characteristics...") - - profiler = Profiler() - results = [] - - # Test different model sizes - sizes = [64, 128, 256, 512] - - print("\nModel Scaling Analysis:") - print("Size\tParams\t\tFLOPs\t\tLatency(ms)\tMemory(MB)\tGFLOP/s") - print("-" * 80) - - for size in sizes: - # Create models of different sizes for comparison - input_shape = (32, size) # Batch of 32 - dummy_input = Tensor(np.random.randn(*input_shape)) - - # Simulate linear layer characteristics - linear_params = size * size + size # W + b - linear_flops = size * size * 2 # matmul - - # Measure actual performance - latency = profiler.measure_latency(dummy_input, dummy_input, warmup=3, iterations=10) - memory = profiler.measure_memory(dummy_input, input_shape) - - gflops_per_second = (linear_flops / 1e9) / (latency / 1000) - - results.append({ - 'size': size, - 'parameters': linear_params, - 'flops': linear_flops, - 'latency_ms': latency, - 'memory_mb': memory['peak_memory_mb'], - 'gflops_per_second': gflops_per_second - }) - - print(f"{size}\t{linear_params:,}\t\t{linear_flops:,}\t\t" - f"{latency:.2f}\t\t{memory['peak_memory_mb']:.2f}\t\t" - f"{gflops_per_second:.2f}") - - # Analysis insights - print("\n💡 Scaling Analysis Insights:") - - # Memory scaling - memory_growth = results[-1]['memory_mb'] / max(results[0]['memory_mb'], 0.001) - print(f"Memory grows {memory_growth:.1f}× from {sizes[0]} to {sizes[-1]} size") - - # Compute scaling - compute_growth = results[-1]['gflops_per_second'] / max(results[0]['gflops_per_second'], 0.001) - print(f"Compute efficiency changes {compute_growth:.1f}× with size") - - # Performance characteristics - avg_efficiency = np.mean([r['gflops_per_second'] for r in results]) - if avg_efficiency < 10: # Arbitrary threshold for "low" efficiency - print("🚀 
Low compute efficiency suggests memory-bound workload") - else: - print("🚀 High compute efficiency suggests compute-bound workload") - -def analyze_batch_size_effects(): - """📊 Analyze how batch size affects performance and efficiency.""" - print("\n📊 Analyzing Batch Size Effects...") - - profiler = Profiler() - batch_sizes = [1, 8, 32, 128] - feature_size = 256 - - print("\nBatch Size Effects Analysis:") - print("Batch\tLatency(ms)\tThroughput(samples/s)\tMemory(MB)\tMemory Efficiency") - print("-" * 85) - - for batch_size in batch_sizes: - input_shape = (batch_size, feature_size) - dummy_input = Tensor(np.random.randn(*input_shape)) - - # Measure performance - latency = profiler.measure_latency(dummy_input, dummy_input, warmup=3, iterations=10) - memory = profiler.measure_memory(dummy_input, input_shape) - - # Calculate throughput - samples_per_second = (batch_size * 1000) / latency # samples/second - - # Calculate efficiency (samples per unit memory) - efficiency = samples_per_second / max(memory['peak_memory_mb'], 0.001) - - print(f"{batch_size}\t{latency:.2f}\t\t{samples_per_second:.0f}\t\t\t" - f"{memory['peak_memory_mb']:.2f}\t\t{efficiency:.1f}") - - print("\n💡 Batch Size Insights:") - print("Larger batches typically improve throughput but increase memory usage") - -# Run the analysis -if __name__ == "__main__": - analyze_model_scaling() - analyze_batch_size_effects() - -# %% [markdown] -""" -## 6. Optimization Insights: Production Performance Patterns - -Understanding profiling results helps guide optimization decisions. Let's analyze different operation types and measurement overhead. 
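A useful lens for the analysis that follows is arithmetic intensity: FLOPs per byte of data moved. A sketch (the 10 FLOPs/byte `machine_balance` threshold is illustrative, not a hardware constant; real balance points come from a roofline model of the actual machine):

```python
# Classify an operation by arithmetic intensity (FLOPs per byte of traffic).
def classify(flops: int, bytes_moved: int, machine_balance: float = 10.0) -> str:
    intensity = flops / bytes_moved
    return "compute-bound" if intensity > machine_balance else "memory-bound"

n = 1024
# Elementwise add over n*n float32 values: 1 FLOP each, 3 arrays of traffic.
print(classify(n * n, 3 * n * n * 4))       # memory-bound
# Square matmul: 2*n^3 FLOPs over roughly 3 matrices of traffic.
print(classify(2 * n ** 3, 3 * n * n * 4))  # compute-bound
```

Matmul's intensity grows linearly with `n`, which is why it is the one operation in the table below that rewards algorithmic (BLAS-style) optimization rather than data-movement optimization.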
- -### Operation Efficiency Analysis -``` -Operation Types and Their Characteristics: -┌─────────────────┬──────────────────┬──────────────────┬─────────────────┐ -│ Operation │ Compute/Memory │ Optimization │ Priority │ -├─────────────────┼──────────────────┼──────────────────┼─────────────────┤ -│ Matrix Multiply │ Compute-bound │ BLAS libraries │ High │ -│ Elementwise │ Memory-bound │ Data locality │ Medium │ -│ Reductions │ Memory-bound │ Parallelization│ Medium │ -│ Attention │ Memory-bound │ FlashAttention │ High │ -└─────────────────┴──────────────────┴──────────────────┴─────────────────┘ - -Optimization Strategy: -1. Profile first → Identify bottlenecks -2. Focus on compute-bound ops → Algorithmic improvements -3. Focus on memory-bound ops → Data movement optimization -4. Measure again → Verify improvements -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "optimization_insights", "solution": true} -def benchmark_operation_efficiency(): - """📊 Compare efficiency of different operations for optimization guidance.""" - print("📊 Benchmarking Operation Efficiency...") - - profiler = Profiler() - operations = [] - - # Test different operation types - size = 256 - input_tensor = Tensor(np.random.randn(32, size)) - - # Elementwise operations (memory-bound) - elementwise_latency = profiler.measure_latency(input_tensor, input_tensor, iterations=20) - elementwise_flops = size * 32 # One operation per element - - operations.append({ - 'operation': 'Elementwise', - 'latency_ms': elementwise_latency, - 'flops': elementwise_flops, - 'gflops_per_second': (elementwise_flops / 1e9) / (elementwise_latency / 1000), - 'efficiency_class': 'memory-bound', - 'optimization_focus': 'data_locality' - }) - - # Matrix operations (compute-bound) - matrix_tensor = Tensor(np.random.randn(size, size)) - matrix_latency = profiler.measure_latency(matrix_tensor, input_tensor, iterations=10) - matrix_flops = size * size * 2 # Matrix multiplication - - operations.append({ - 'operation': 
'Matrix Multiply', - 'latency_ms': matrix_latency, - 'flops': matrix_flops, - 'gflops_per_second': (matrix_flops / 1e9) / (matrix_latency / 1000), - 'efficiency_class': 'compute-bound', - 'optimization_focus': 'algorithms' - }) - - # Reduction operations (memory-bound) - reduction_latency = profiler.measure_latency(input_tensor, input_tensor, iterations=20) - reduction_flops = size * 32 # Sum reduction - - operations.append({ - 'operation': 'Reduction', - 'latency_ms': reduction_latency, - 'flops': reduction_flops, - 'gflops_per_second': (reduction_flops / 1e9) / (reduction_latency / 1000), - 'efficiency_class': 'memory-bound', - 'optimization_focus': 'parallelization' - }) - - print("\nOperation Efficiency Comparison:") - print("Operation\t\tLatency(ms)\tGFLOP/s\t\tEfficiency Class\tOptimization Focus") - print("-" * 95) - - for op in operations: - print(f"{op['operation']:<15}\t{op['latency_ms']:.3f}\t\t" - f"{op['gflops_per_second']:.2f}\t\t{op['efficiency_class']:<15}\t{op['optimization_focus']}") - - print("\n💡 Operation Optimization Insights:") - - # Find most and least efficient - best_op = max(operations, key=lambda x: x['gflops_per_second']) - worst_op = min(operations, key=lambda x: x['gflops_per_second']) - - print(f"Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)") - print(f"Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)") - - # Count operation types - memory_bound_ops = [op for op in operations if op['efficiency_class'] == 'memory-bound'] - compute_bound_ops = [op for op in operations if op['efficiency_class'] == 'compute-bound'] - - print(f"\n🚀 Optimization Priority:") - if len(memory_bound_ops) > len(compute_bound_ops): - print("Focus on memory optimization: data locality, bandwidth, caching") - else: - print("Focus on compute optimization: better algorithms, vectorization") - -def analyze_profiling_overhead(): - """📊 Measure the overhead of profiling itself.""" - print("\n📊 
Analyzing Profiling Overhead...") - - # Test with and without profiling - test_tensor = Tensor(np.random.randn(100, 100)) - iterations = 50 - - # Without profiling - baseline measurement - start_time = time.perf_counter() - for _ in range(iterations): - _ = test_tensor.data.copy() # Simple operation - end_time = time.perf_counter() - baseline_ms = (end_time - start_time) * 1000 - - # With profiling - includes measurement overhead - profiler = Profiler() - start_time = time.perf_counter() - for _ in range(iterations): - _ = profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=1) - end_time = time.perf_counter() - profiled_ms = (end_time - start_time) * 1000 - - overhead_factor = profiled_ms / max(baseline_ms, 0.001) - - print(f"\nProfiling Overhead Analysis:") - print(f"Baseline execution: {baseline_ms:.2f} ms") - print(f"With profiling: {profiled_ms:.2f} ms") - print(f"Profiling overhead: {overhead_factor:.1f}× slower") - - print(f"\n💡 Profiling Overhead Insights:") - if overhead_factor < 2: - print("Low overhead - suitable for frequent profiling") - elif overhead_factor < 10: - print("Moderate overhead - use for development and debugging") - else: - print("High overhead - use sparingly in production") - -# Run optimization analysis -if __name__ == "__main__": - benchmark_operation_efficiency() - analyze_profiling_overhead() - -# %% [markdown] -""" -## 🧪 Module Integration Test - -Final validation that everything works together correctly. -""" - -# %% nbgrader={"grade": true, "grade_id": "test_module", "locked": true, "points": 20} -def test_module(): - """ - Comprehensive test of entire profiling module functionality. 
- - This final test runs before module summary to ensure: - - All unit tests pass - - Functions work together correctly - - Module is ready for integration with TinyTorch - """ - print("🧪 RUNNING MODULE INTEGRATION TEST") - print("=" * 50) - - # Run all unit tests - print("Running unit tests...") - test_unit_parameter_counting() - test_unit_flop_counting() - test_unit_memory_measurement() - test_unit_latency_measurement() - test_unit_advanced_profiling() - - print("\nRunning integration scenarios...") - - # Test realistic usage patterns - print("🔬 Integration Test: Complete Profiling Workflow...") - - # Create profiler - profiler = Profiler() - - # Create test model and data - test_model = Tensor(np.random.randn(16, 32)) - test_input = Tensor(np.random.randn(8, 16)) - - # Run complete profiling workflow - print("1. Measuring model characteristics...") - params = profiler.count_parameters(test_model) - flops = profiler.count_flops(test_model, test_input.shape) - memory = profiler.measure_memory(test_model, test_input.shape) - latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5) - - print(f" Parameters: {params}") - print(f" FLOPs: {flops}") - print(f" Memory: {memory['peak_memory_mb']:.2f} MB") - print(f" Latency: {latency:.2f} ms") - - # Test advanced profiling - print("2. Running advanced profiling...") - forward_profile = profiler.profile_forward_pass(test_model, test_input) - backward_profile = profiler.profile_backward_pass(test_model, test_input) - - assert 'gflops_per_second' in forward_profile - assert 'total_latency_ms' in backward_profile - print(f" Forward GFLOP/s: {forward_profile['gflops_per_second']:.2f}") - print(f" Training latency: {backward_profile['total_latency_ms']:.2f} ms") - - # Test bottleneck analysis - print("3. 
Analyzing performance bottlenecks...") - bottleneck = forward_profile['bottleneck'] - efficiency = forward_profile['computational_efficiency'] - print(f" Bottleneck: {bottleneck}") - print(f" Compute efficiency: {efficiency:.3f}") - - # Validate end-to-end workflow - assert params >= 0, "Parameter count should be non-negative" - assert flops >= 0, "FLOP count should be non-negative" - assert memory['peak_memory_mb'] >= 0, "Memory usage should be non-negative" - assert latency >= 0, "Latency should be non-negative" - assert forward_profile['gflops_per_second'] >= 0, "GFLOP/s should be non-negative" - assert backward_profile['total_latency_ms'] >= 0, "Total latency should be non-negative" - assert bottleneck in ['memory', 'compute'], "Bottleneck should be memory or compute" - assert 0 <= efficiency <= 1, "Efficiency should be between 0 and 1" - - print("✅ End-to-end profiling workflow works!") - - # Test production-like scenario - print("4. Testing production profiling scenario...") - - # Simulate larger model analysis - large_input = Tensor(np.random.randn(32, 512)) # Larger model input - large_profile = profiler.profile_forward_pass(large_input, large_input) - - # Verify profile contains optimization insights - assert 'bottleneck' in large_profile, "Profile should identify bottlenecks" - assert 'memory_bandwidth_mbs' in large_profile, "Profile should measure memory bandwidth" - - print(f" Large model analysis: {large_profile['bottleneck']} bottleneck") - print(f" Memory bandwidth: {large_profile['memory_bandwidth_mbs']:.1f} MB/s") - - print("✅ Production profiling scenario works!") - - print("\n" + "=" * 50) - print("🎉 ALL TESTS PASSED! 
Module ready for export.") - print("Run: tito module complete 15") - -# Call before module summary -if __name__ == "__main__": - test_module() - -# %% -if __name__ == "__main__": - print("🚀 Running Profiling module...") - test_module() - print("✅ Module validation complete!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Performance Measurement - -### Question 1: FLOP Analysis -You implemented a profiler that counts FLOPs for different operations. -For a Linear layer with 1000 input features and 500 output features: -- How many FLOPs are required for one forward pass? _____ FLOPs -- If you process a batch of 32 samples, how does this change the per-sample FLOPs? _____ - -### Question 2: Memory Scaling -Your profiler measures memory usage for models and activations. -A transformer model has 125M parameters (500MB at FP32). -During training with batch size 16: -- What's the minimum memory for gradients? _____ MB -- With Adam optimizer, what's the total memory requirement? _____ MB - -### Question 3: Performance Bottlenecks -You built tools to identify compute vs memory bottlenecks. -A model achieves 10 GFLOP/s on hardware with 100 GFLOP/s peak: -- What's the computational efficiency? _____% -- If doubling batch size doesn't improve GFLOP/s, the bottleneck is likely _____ - -### Question 4: Profiling Trade-offs -Your profiler adds measurement overhead to understand performance. -If profiling adds 5× overhead but reveals a 50% speedup opportunity: -- Is the profiling cost justified for development? _____ -- When should you disable profiling in production? _____ -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Profiling - -Congratulations! You've built a comprehensive profiling system for ML performance analysis! 
- -### Key Accomplishments -- Built complete Profiler class with parameter, FLOP, memory, and latency measurement -- Implemented advanced profiling functions for forward and backward pass analysis -- Discovered performance characteristics through scaling and efficiency analysis -- Created production-quality measurement tools for optimization guidance -- All tests pass ✅ (validated by `test_module()`) - -### Systems Insights Gained -- **FLOPs vs Reality**: Theoretical operations don't always predict actual performance -- **Memory Bottlenecks**: Many ML operations are limited by memory bandwidth, not compute -- **Batch Size Effects**: Larger batches improve throughput but increase memory requirements -- **Profiling Overhead**: Measurement tools have costs but enable data-driven optimization - -### Production Skills Developed -- **Performance Detective Work**: Use data, not guesses, to identify bottlenecks -- **Optimization Prioritization**: Focus efforts on actual bottlenecks, not assumptions -- **Resource Planning**: Predict memory and compute requirements for deployment -- **Statistical Rigor**: Handle measurement variance with proper methodology - -### Ready for Next Steps -Your profiling implementation enables Module 16 (Acceleration) to make data-driven optimization decisions. -Export with: `tito module complete 15` - -**Next**: Module 16 will use these profiling tools to implement acceleration techniques and measure their effectiveness! 
-""" diff --git a/modules/source/16_acceleration/acceleration_dev.ipynb b/modules/source/16_acceleration/acceleration_dev.ipynb deleted file mode 100644 index cc39f5f0..00000000 --- a/modules/source/16_acceleration/acceleration_dev.ipynb +++ /dev/null @@ -1,2019 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "6a0bea02", - "metadata": {}, - "outputs": [], - "source": [ - "#| default_exp optimization.acceleration\n", - "#| export" - ] - }, - { - "cell_type": "markdown", - "id": "a9ac4364", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Module 16: Acceleration - Making Models Run Faster\n", - "\n", - "Welcome to Module 16! You're about to master the art of neural network acceleration through vectorization, kernel fusion, and mixed precision training.\n", - "\n", - "## 🔗 Prerequisites & Progress\n", - "**You've Built**: Complete training pipeline with profiling capabilities\n", - "**You'll Build**: Acceleration techniques including vectorization, operation fusion, and mixed precision\n", - "**You'll Enable**: Production-ready optimization for real-world deployment\n", - "\n", - "**Connection Map**:\n", - "```\n", - "Profiling (Module 15) → Acceleration (Module 16) → Quantization (Module 17)\n", - "(measurement) (optimization) (precision reduction)\n", - "```\n", - "\n", - "## Learning Objectives\n", - "By the end of this module, you will:\n", - "1. Implement vectorized operations for maximum throughput\n", - "2. Create fused operations to reduce memory bandwidth\n", - "3. Build mixed precision training for memory efficiency\n", - "4. Understand the relationship between compute and memory bandwidth\n", - "5. 
Analyze acceleration trade-offs in production systems\n", - "\n", - "Let's optimize for speed!\n", - "\n", - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/16_acceleration/acceleration_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.optimization.acceleration`\n", - "\n", - "```python\n", - "# How to use this module:\n", - "from tinytorch.optimization.acceleration import vectorized_matmul, fused_gelu, MixedPrecisionTrainer\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Complete acceleration system in one focused module for deep understanding\n", - "- **Production:** Proper organization like PyTorch's torch.amp and torch.jit with optimization components\n", - "- **Consistency:** All acceleration operations and mixed precision training in optimization.acceleration\n", - "- **Integration:** Works seamlessly with profiling for complete performance optimization" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "59fd81f7", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import time\n", - "from typing import Dict, List, Tuple, Optional, Any, Union\n", - "import warnings" - ] - }, - { - "cell_type": "markdown", - "id": "e350bf3e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 1. Introduction - The Performance Challenge\n", - "\n", - "Modern neural networks face two fundamental bottlenecks that limit their speed:\n", - "\n", - "### The Two Enemies of Performance\n", - "\n", - "**1. Compute Bound Operations:**\n", - "```\n", - "CPU/GPU Cores: [====BUSY====] [====BUSY====] [====BUSY====]\n", - "Memory Bus: [---idle---] [---idle---] [---idle---]\n", - "\n", - "When: Matrix multiplication, convolutions\n", - "Solution: Vectorization, better algorithms\n", - "```\n", - "\n", - "**2. 
Memory Bound Operations:**\n", - "```\n", - "CPU/GPU Cores: [--idle--] [--idle--] [--idle--]\n", - "Memory Bus: [========SATURATED========]\n", - "\n", - "When: Element-wise operations, small tensors\n", - "Solution: Kernel fusion, memory layout optimization\n", - "```\n", - "\n", - "### The Roofline Model - Your Performance Compass\n", - "\n", - "Every processor has fundamental limits:\n", - "\n", - "```\n", - "Performance │ Compute Bound Region\n", - "(GFLOPS) │ ┌─────────────────────\n", - " │ │ Peak Performance\n", - " │ │\n", - " │ ╱│ Memory Bound Region\n", - " │╱ │\n", - " ╱│ │\n", - " ╱ │ │\n", - " ╱ │ │\n", - " ╱───│──│───────────────────────\n", - " ╱ │ │\n", - " ╱ │ │\n", - " ╱──────│──│────────────────── Arithmetic Intensity\n", - " │ │ (FLOPs/Byte)\n", - " Low│ │High\n", - "```\n", - "\n", - "**Key Insight**: Understand where your operations live on this graph to optimize effectively.\n", - "\n", - "### Why This Module Matters\n", - "\n", - "Real-world performance wins:\n", - "- **2-5× speedup** from vectorization\n", - "- **30-50% memory reduction** from mixed precision\n", - "- **2-3× throughput** from kernel fusion\n", - "- **10× scaling improvement** for large models" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8c8b7618", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "tensor-import", - "solution": true - } - }, - "outputs": [], - "source": [ - "# Import required dependencies\n", - "### BEGIN SOLUTION\n", - "# Import tensor from our implementation\n", - "import sys\n", - "import os\n", - "sys.path.append('/Users/VJ/GitHub/TinyTorch')\n", - "\n", - "try:\n", - " # Import from the modules directory structure\n", - " import importlib.util\n", - " spec = importlib.util.spec_from_file_location(\"tensor_dev\", \"/Users/VJ/GitHub/TinyTorch/modules/01_tensor/tensor_dev.py\")\n", - " tensor_module = importlib.util.module_from_spec(spec)\n", - " spec.loader.exec_module(tensor_module)\n", - " Tensor = 
tensor_module.Tensor\n",
- "except (ImportError, FileNotFoundError, AttributeError):\n",
- "    # Fallback for testing (loading by file path raises FileNotFoundError/AttributeError, not ImportError)\n",
- "    class Tensor:\n",
- "        def __init__(self, data, requires_grad=False):\n",
- "            self.data = np.array(data, dtype=np.float32)\n",
- "            self.shape = self.data.shape\n",
- "            self.requires_grad = requires_grad\n",
- "            self.grad = None\n",
- "\n",
- "        def __add__(self, other):\n",
- "            return Tensor(self.data + other.data)\n",
- "\n",
- "        def __mul__(self, other):\n",
- "            return Tensor(self.data * other.data)\n",
- "\n",
- "        def matmul(self, other):\n",
- "            return Tensor(np.dot(self.data, other.data))\n",
- "\n",
- "        def reshape(self, *shape):\n",
- "            return Tensor(self.data.reshape(shape))\n",
- "\n",
- "        def sum(self, axis=None):\n",
- "            return Tensor(self.data.sum(axis=axis))\n",
- "\n",
- "        def backward(self):\n",
- "            pass\n",
- "### END SOLUTION"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9a445584",
- "metadata": {
- "cell_marker": "\"\"\"",
- "lines_to_next_cell": 1
- },
- "source": [
- "## 2. Foundations - Vectorization: From Loops to Lightning\n",
- "\n",
- "### The SIMD Revolution\n",
- "\n",
- "Modern processors can execute **Single Instruction, Multiple Data** operations:\n",
- "\n",
- "```\n",
- "Traditional Loop (Scalar):        SIMD Vectorized:\n",
- "for i in range(4):                ┌─────┐      ┌─────┬─────┬─────┬─────┐\n",
- "    c[i] = a[i] + b[i]            │ ALU │  →   │ALU 0│ALU 1│ALU 2│ALU 3│\n",
- "                                  └─────┘      └─────┴─────┴─────┴─────┘\n",
- "                                  1 element    4 elements per cycle\n",
- "                                  per cycle\n",
- "```\n",
- "\n",
- "### Memory Access Patterns: The Hidden Performance Killer\n",
- "\n",
- "```\n",
- "Sequential Access (FAST):\n",
- "Memory: [A][B][C][D][E][F][G][H]\n",
- "Access:  ↓  ↓  ↓  ↓              → Cache friendly\n",
- "\n",
- "Strided Access (SLOWER):\n",
- "Memory: [A][ ][B][ ][C][ ][D][ ]\n",
- "Access:  ↓     ↓     ↓     ↓     → Cache misses\n",
- "\n",
- "Random Access (SLOWEST):\n",
- "Memory: [A][B][C][D][E][F][G][H]\n",
- "Access:  ↓        ↑  ↓     ↑     → Cache chaos\n",
- "```\n",
- "\n",
- "### Matrix Multiplication: The King of 
Vectorization\n",
- "\n",
- "Matrix multiplication is **perfectly suited** for vectorization:\n",
- "\n",
- "```\n",
- "Matrix A (M×K) × Matrix B (K×N) = Matrix C (M×N)\n",
- "\n",
- "Computation Pattern:\n",
- "┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐\n",
- "│ a₁₁ a₁₂ a₁₃ a₁₄│ × │ b₁₁ b₁₂ b₁₃ b₁₄│ = │ c₁₁ c₁₂ c₁₃ c₁₄│\n",
- "│ a₂₁ a₂₂ a₂₃ a₂₄│   │ b₂₁ b₂₂ b₂₃ b₂₄│   │ c₂₁ c₂₂ c₂₃ c₂₄│\n",
- "│ a₃₁ a₃₂ a₃₃ a₃₄│   │ b₃₁ b₃₂ b₃₃ b₃₄│   │ c₃₁ c₃₂ c₃₃ c₃₄│\n",
- "│ a₄₁ a₄₂ a₄₃ a₄₄│   │ b₄₁ b₄₂ b₄₃ b₄₄│   │ c₄₁ c₄₂ c₄₃ c₄₄│\n",
- "└─────────────────┘   └─────────────────┘   └─────────────────┘\n",
- "\n",
- "For c₁₁: Row₁ · Column₁ = a₁₁×b₁₁ + a₁₂×b₂₁ + a₁₃×b₃₁ + a₁₄×b₄₁\n",
- "                          ↑\n",
- "                          VECTORIZABLE!\n",
- "```\n",
- "\n",
- "**Why vectorization wins:**\n",
- "- **High arithmetic intensity**: 2N³ FLOPs for only 3N² data elements\n",
- "- **Predictable memory access**: Sequential row/column reads\n",
- "- **Parallelizable**: Independent dot products\n",
- "- **Cache-friendly**: Data reuse in inner loops"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "01b0e1a7",
- "metadata": {
- "lines_to_next_cell": 1,
- "nbgrader": {
- "grade": false,
- "grade_id": "vectorized-matmul",
- "solution": true
- }
- },
- "outputs": [],
- "source": [
- "def vectorized_matmul(a: Tensor, b: Tensor) -> Tensor:\n",
- "    \"\"\"\n",
- "    High-performance matrix multiplication using vectorized operations.\n",
- "\n",
- "    This implementation leverages optimized BLAS libraries that use:\n",
- "    - SIMD instructions for parallel computation\n",
- "    - Cache-blocking for memory efficiency\n",
- "    - Multi-threading for CPU parallelization\n",
- "\n",
- "    TODO: Implement production-grade matrix multiplication\n",
- "\n",
- "    APPROACH:\n",
- "    1. Validate shapes are compatible for matrix multiplication\n",
- "    2. Use NumPy's optimized dot product (calls BLAS GEMM)\n",
- "    3. 
Return result wrapped in Tensor\n",
- "\n",
- "    EXAMPLE:\n",
- "    Matrix multiplication visualization:\n",
- "    >>> a = Tensor([[1, 2], [3, 4]])  # 2×2\n",
- "    >>> b = Tensor([[5, 6], [7, 8]])  # 2×2\n",
- "    >>> result = vectorized_matmul(a, b)\n",
- "    >>> print(result.data)\n",
- "    [[19 22]   # [1×5+2×7, 1×6+2×8] = [19, 22]\n",
- "     [43 50]]  # [3×5+4×7, 3×6+4×8] = [43, 50]\n",
- "\n",
- "    PERFORMANCE CHARACTERISTICS:\n",
- "    - Time Complexity: O(N³) but highly optimized\n",
- "    - Space Complexity: O(N²) for result\n",
- "    - Arithmetic Intensity: 2N³ FLOPs / 3N² elements ≈ 2N/3 FLOPs per element (good for large N)\n",
- "\n",
- "    HINTS:\n",
- "    - Check a.shape[-1] == b.shape[-2] for inner dimension match\n",
- "    - Use np.matmul() for batch support and optimization\n",
- "    - Trust BLAS to handle the vectorization magic\n",
- "    \"\"\"\n",
- "    ### BEGIN SOLUTION\n",
- "    # Input validation for matrix multiplication\n",
- "    if len(a.shape) < 2 or len(b.shape) < 2:\n",
- "        raise ValueError(\n",
- "            f\"Matrix multiplication requires 2D+ tensors, got shapes {a.shape} and {b.shape}. \"\n",
- "            f\"💡 HINT: Use reshape() to add dimensions if needed.\"\n",
- "        )\n",
- "\n",
- "    if a.shape[-1] != b.shape[-2]:\n",
- "        raise ValueError(\n",
- "            f\"Matrix multiplication shape mismatch: {a.shape} @ {b.shape}. \"\n",
- "            f\"Inner dimensions must match: a.shape[-1]={a.shape[-1]} != b.shape[-2]={b.shape[-2]}. 
\"\n", - " f\"💡 HINT: For A@B, A's columns must equal B's rows.\"\n", - " )\n", - "\n", - " # Use NumPy's highly optimized matrix multiplication\n", - " # This calls BLAS GEMM (General Matrix Multiply), which uses:\n", - " # - SIMD vectorization for parallel arithmetic\n", - " # - Cache blocking for memory efficiency\n", - " # - Multi-threading on multi-core systems\n", - " result_data = np.matmul(a.data, b.data)\n", - "\n", - " return Tensor(result_data)\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ae44b17e", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-vectorized-matmul", - "locked": true, - "points": 10 - } - }, - "outputs": [], - "source": [ - "def test_unit_vectorized_matmul():\n", - " \"\"\"🔬 Test vectorized matrix multiplication implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Vectorized Matrix Multiplication...\")\n", - "\n", - " # Test basic 2D multiplication\n", - " a = Tensor([[1, 2], [3, 4]])\n", - " b = Tensor([[5, 6], [7, 8]])\n", - " result = vectorized_matmul(a, b)\n", - "\n", - " expected = np.array([[19, 22], [43, 50]])\n", - " assert np.allclose(result.data, expected), f\"Basic matmul failed: expected {expected}, got {result.data}\"\n", - "\n", - " # Test batch multiplication (3D tensors)\n", - " batch_size, m, k, n = 2, 3, 4, 5\n", - " a_batch = Tensor(np.random.randn(batch_size, m, k))\n", - " b_batch = Tensor(np.random.randn(batch_size, k, n))\n", - " result_batch = vectorized_matmul(a_batch, b_batch)\n", - "\n", - " assert result_batch.shape == (batch_size, m, n), f\"Wrong batch shape: {result_batch.shape}\"\n", - "\n", - " # Test broadcasting (different batch dimensions)\n", - " a_single = Tensor(np.random.randn(m, k))\n", - " b_batch = Tensor(np.random.randn(batch_size, k, n))\n", - " result_broadcast = vectorized_matmul(a_single, b_batch)\n", - "\n", - " assert result_broadcast.shape == (batch_size, m, n), f\"Broadcasting failed: {result_broadcast.shape}\"\n", - 
"\n", - " # Test error cases\n", - " try:\n", - " vectorized_matmul(Tensor([1, 2, 3]), Tensor([4, 5])) # 1D tensors\n", - " assert False, \"Should reject 1D tensors\"\n", - " except ValueError as e:\n", - " assert \"2D+\" in str(e)\n", - "\n", - " try:\n", - " vectorized_matmul(Tensor([[1, 2]]), Tensor([[1], [2], [3]])) # Shape mismatch\n", - " assert False, \"Should reject incompatible shapes\"\n", - " except ValueError as e:\n", - " assert \"shape mismatch\" in str(e).lower()\n", - "\n", - " print(\"✅ vectorized_matmul works correctly!\")\n", - "\n", - "test_unit_vectorized_matmul()" - ] - }, - { - "cell_type": "markdown", - "id": "85cd07f9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 3. Implementation - Kernel Fusion: Eliminating Memory Bottlenecks\n", - "\n", - "### The Memory Bandwidth Crisis\n", - "\n", - "Consider this innocent-looking computation: `y = gelu(x * weight + bias)`\n", - "\n", - "**Naive Implementation (Memory Intensive):**\n", - "```\n", - "Step 1: temp1 = x * weight → Write 4GB to memory\n", - "Step 2: temp2 = temp1 + bias → Read 4GB, Write 4GB\n", - "Step 3: y = gelu(temp2) → Read 4GB, Write 4GB\n", - " Total: 20GB memory traffic!\n", - "```\n", - "\n", - "**Fused Implementation (Memory Efficient):**\n", - "```\n", - "Single Step: y = gelu(x * weight + bias) → Read 8GB, Write 4GB\n", - " Total: 12GB memory traffic!\n", - " 60% memory bandwidth reduction!\n", - "```\n", - "\n", - "### Understanding GELU: The Smooth Activation\n", - "\n", - "GELU (Gaussian Error Linear Unit) is used in transformers because it's **smooth** (differentiable everywhere):\n", - "\n", - "```\n", - "Activation Functions Compared:\n", - "\n", - "ReLU: GELU: Sigmoid:\n", - " | | 1 ┌─────\n", - " | | ╱ │\n", - " | ╱───│─── ╱ │\n", - "─────┘ ╱─── │ ───╱ │\n", - " Discontinuous Smooth Curve │ Smooth but saturates\n", - " gradient at 0 everywhere │\n", - "```\n", - "\n", - "**GELU Formula**: `GELU(x) = x * Φ(x)` where Φ is 
the standard normal CDF\n", - "\n", - "**Fast Approximation**: `GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))`\n", - "\n", - "### Kernel Fusion Strategy\n", - "\n", - "```\n", - "Unfused Operations: Fused Operation:\n", - "┌─────────────────┐ ┌─────────────────┐\n", - "│ x³ computation │ → temp1 │ │\n", - "└─────────────────┘ │ │\n", - "┌─────────────────┐ │ │\n", - "│ polynomial part │ → temp2 │ All operations│\n", - "└─────────────────┘ │ combined in │\n", - "┌─────────────────┐ │ single kernel │\n", - "│ tanh computation│ → temp3 │ │\n", - "└─────────────────┘ │ │\n", - "┌─────────────────┐ │ │\n", - "│ final multiply │ → result │ │\n", - "└─────────────────┘ └─────────────────┘\n", - "\n", - "5 memory round-trips 1 memory round-trip\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "085b3c2b", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "fused-gelu", - "solution": true - } - }, - "outputs": [], - "source": [ - "def fused_gelu(x: Tensor) -> Tensor:\n", - " \"\"\"\n", - " Fused GELU activation that combines all operations in a single kernel.\n", - "\n", - " GELU combines the benefits of ReLU and sigmoid:\n", - " - Smooth everywhere (unlike ReLU's discontinuity at 0)\n", - " - Non-saturating for positive values (unlike sigmoid)\n", - " - Probabilistic interpretation: x * P(X ≤ x) where X ~ N(0,1)\n", - "\n", - " Mathematical Definition:\n", - " GELU(x) = x * Φ(x) where Φ(x) is the standard normal CDF\n", - "\n", - " Fast Approximation (used here):\n", - " GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))\n", - "\n", - " TODO: Implement fused GELU to minimize memory bandwidth\n", - "\n", - " APPROACH:\n", - " 1. Compute all intermediate values in a single expression\n", - " 2. Avoid creating temporary arrays\n", - " 3. 
Let NumPy's broadcasting handle vectorization\n", - "\n", - " EXAMPLE:\n", - " >>> x = Tensor([-2, -1, 0, 1, 2])\n", - " >>> result = fused_gelu(x)\n", - " >>> print(result.data)\n", - " [-0.04550026 -0.15865526 0. 0.8413447 1.9544997 ]\n", - " # Notice: smooth transition through 0, positive bias\n", - "\n", - " MEMORY EFFICIENCY:\n", - " - Unfused: 5 temporary arrays × input_size × 4 bytes\n", - " - Fused: 0 temporary arrays, direct computation\n", - " - Bandwidth reduction: ~80% for memory-bound operations\n", - "\n", - " HINTS:\n", - " - Use np.sqrt(2.0 / np.pi) for the constant\n", - " - Keep entire expression in one line for maximum fusion\n", - " - NumPy will optimize the expression tree automatically\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Mathematical constant for GELU approximation\n", - " sqrt_2_over_pi = np.sqrt(2.0 / np.pi)\n", - "\n", - " # Fused GELU computation - all operations in single expression\n", - " # This minimizes memory bandwidth by avoiding intermediate arrays\n", - " # NumPy's expression evaluator will optimize this into efficient machine code\n", - " result_data = 0.5 * x.data * (\n", - " 1.0 + np.tanh(sqrt_2_over_pi * (x.data + 0.044715 * x.data**3))\n", - " )\n", - "\n", - " return Tensor(result_data)\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b205cb72", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-fused-gelu", - "locked": true, - "points": 10 - } - }, - "outputs": [], - "source": [ - "def test_unit_fused_gelu():\n", - " \"\"\"🔬 Test fused GELU activation implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Fused GELU...\")\n", - "\n", - " # Test basic properties\n", - " x = Tensor([-3, -1, 0, 1, 3])\n", - " result = fused_gelu(x)\n", - "\n", - " # GELU(0) = 0 (exact property)\n", - " assert abs(result.data[2]) < 1e-6, f\"GELU(0) should be 0, got {result.data[2]}\"\n", - "\n", - " # GELU is smooth and increasing\n", - " assert result.data[4] > 
result.data[3] > result.data[2], \"GELU should be increasing\"\n", - "\n", - " # GELU has positive bias (unlike ReLU)\n", - " assert result.data[3] > 0.8, \"GELU(1) should be close to 1\"\n", - " assert result.data[1] > -0.2, \"GELU(-1) should be slightly negative\"\n", - "\n", - " # Test numerical stability with extreme values\n", - " x_extreme = Tensor([-10, -5, 0, 5, 10])\n", - " result_extreme = fused_gelu(x_extreme)\n", - "\n", - " assert not np.any(np.isnan(result_extreme.data)), \"No NaN values allowed\"\n", - " assert not np.any(np.isinf(result_extreme.data)), \"No infinite values allowed\"\n", - "\n", - " # Test large tensor processing\n", - " x_large = Tensor(np.random.randn(1000, 1000).astype(np.float32))\n", - " result_large = fused_gelu(x_large)\n", - "\n", - " assert result_large.shape == x_large.shape, \"Shape preservation failed\"\n", - " assert result_large.data.dtype == np.float32, \"Data type preservation failed\"\n", - "\n", - " # Test that positive inputs are mostly preserved (GELU ≈ x for large positive x)\n", - " x_positive = Tensor([5.0])\n", - " result_positive = fused_gelu(x_positive)\n", - " assert result_positive.data[0] > 4.9, \"Large positive values should be nearly preserved\"\n", - "\n", - " print(\"✅ fused_gelu works correctly!\")\n", - "\n", - "test_unit_fused_gelu()" - ] - }, - { - "cell_type": "markdown", - "id": "cb075d6f", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🔬 Performance Analysis: Measuring Fusion Benefits\n", - "\n", - "Let's quantify the impact of kernel fusion by comparing fused vs unfused implementations." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "89558452", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "unfused-gelu", - "solution": true - } - }, - "outputs": [], - "source": [ - "def unfused_gelu(x: Tensor) -> Tensor:\n", - " \"\"\"\n", - " Deliberately unfused GELU implementation for performance comparison.\n", - "\n", - " This version creates multiple intermediate tensors to simulate\n", - " the memory bandwidth overhead of unfused operations.\n", - "\n", - " TODO: Implement GELU with explicit intermediate steps\n", - "\n", - " APPROACH:\n", - " 1. Break computation into individual steps\n", - " 2. Create temporary Tensor objects for each step\n", - " 3. This simulates real memory allocation overhead\n", - "\n", - " PERFORMANCE IMPACT:\n", - " - Creates 7 temporary arrays\n", - " - Each array allocation/deallocation has overhead\n", - " - More memory bandwidth usage\n", - " - Potential cache misses between operations\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Unfused version - creates many intermediate arrays\n", - " sqrt_2_over_pi = np.sqrt(2.0 / np.pi)\n", - "\n", - " # Each operation creates a temporary array (simulating kernel launches)\n", - " temp1 = Tensor(x.data**3) # x³\n", - " temp2 = Tensor(0.044715 * temp1.data) # 0.044715 * x³\n", - " temp3 = Tensor(x.data + temp2.data) # x + 0.044715 * x³\n", - " temp4 = Tensor(sqrt_2_over_pi * temp3.data) # √(2/π) * (...)\n", - " temp5 = Tensor(np.tanh(temp4.data)) # tanh(...)\n", - " temp6 = Tensor(1.0 + temp5.data) # 1 + tanh(...)\n", - " temp7 = Tensor(x.data * temp6.data) # x * (1 + tanh(...))\n", - " result = Tensor(0.5 * temp7.data) # 0.5 * x * (...)\n", - "\n", - " return result\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6a50536a", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-fusion-speedup", - "locked": true, - "points": 10 - } - }, - "outputs": [], 
- "source": [ - "def test_unit_fusion_speedup():\n", - " \"\"\"🔬 Measure the performance impact of kernel fusion.\"\"\"\n", - " print(\"🔬 Unit Test: Kernel Fusion Performance Impact...\")\n", - "\n", - " # Create moderately large tensor for meaningful timing\n", - " size = 2000\n", - " x = Tensor(np.random.randn(size, size).astype(np.float32))\n", - " warmup_iterations = 2\n", - " timing_iterations = 5\n", - "\n", - " # Warmup both implementations\n", - " for _ in range(warmup_iterations):\n", - " _ = unfused_gelu(x)\n", - " _ = fused_gelu(x)\n", - "\n", - " # Time unfused version\n", - " start = time.time()\n", - " for _ in range(timing_iterations):\n", - " result_unfused = unfused_gelu(x)\n", - " unfused_time = time.time() - start\n", - "\n", - " # Time fused version\n", - " start = time.time()\n", - " for _ in range(timing_iterations):\n", - " result_fused = fused_gelu(x)\n", - " fused_time = time.time() - start\n", - "\n", - " # Verify numerical correctness\n", - " assert np.allclose(result_unfused.data, result_fused.data, atol=1e-6), \\\n", - " \"Fused and unfused implementations must be numerically equivalent\"\n", - "\n", - " # Calculate performance metrics\n", - " speedup = unfused_time / fused_time if fused_time > 0 else 1.0\n", - " unfused_per_elem = (unfused_time / timing_iterations) / (size * size) * 1e9 # ns per element\n", - " fused_per_elem = (fused_time / timing_iterations) / (size * size) * 1e9\n", - "\n", - " print(f\"📊 Kernel Fusion Performance Analysis:\")\n", - " print(f\" Tensor size: {size}×{size} = {size*size:,} elements\")\n", - " print(f\" Unfused time: {unfused_time/timing_iterations*1000:.2f} ms\")\n", - " print(f\" Fused time: {fused_time/timing_iterations*1000:.2f} ms\")\n", - " print(f\" Speedup: {speedup:.2f}× faster\")\n", - " print(f\" Per-element: {unfused_per_elem:.1f} ns → {fused_per_elem:.1f} ns\")\n", - "\n", - " # Memory bandwidth estimate\n", - " bytes_per_elem = 4 # float32\n", - " unfused_memory_ops = 7 # 7 intermediate 
arrays\n", - " fused_memory_ops = 2 # read input, write output\n", - "\n", - " unfused_bandwidth = (unfused_memory_ops * size * size * bytes_per_elem) / (unfused_time / timing_iterations) / 1e9\n", - " fused_bandwidth = (fused_memory_ops * size * size * bytes_per_elem) / (fused_time / timing_iterations) / 1e9\n", - "\n", - " print(f\" Memory efficiency: {unfused_memory_ops}→{fused_memory_ops} memory ops\")\n", - " print(f\" Effective bandwidth: {unfused_bandwidth:.1f}→{fused_bandwidth:.1f} GB/s\")\n", - "\n", - " # Interpret results\n", - " if speedup > 1.5:\n", - " print(\"🚀 Excellent! Kernel fusion providing significant speedup\")\n", - " elif speedup > 1.1:\n", - " print(\"✅ Good! Kernel fusion providing measurable benefit\")\n", - " else:\n", - " print(\"⚠️ Limited speedup - may be compute-bound or small tensor size\")\n", - "\n", - " print(\"✅ Fusion performance analysis completed!\")\n", - "\n", - "test_unit_fusion_speedup()" - ] - }, - { - "cell_type": "markdown", - "id": "adb97e5a", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 4. 
Integration - Mixed Precision Training: Memory and Speed\n", - "\n", - "### The Mixed Precision Revolution\n", - "\n", - "Modern GPUs (like V100, A100) have specialized **Tensor Cores** that can perform FP16 operations much faster than FP32:\n", - "\n", - "```\n", - "Performance Comparison (Theoretical Peak):\n", - "┌─────────────────┬────────────────┬────────────────┐\n", - "│ Precision │ V100 TFLOPS │ A100 TFLOPS │\n", - "├─────────────────┼────────────────┼────────────────┤\n", - "│ FP32 (float) │ 15.7 │ 19.5 │\n", - "│ FP16 (half) │ 125.0 │ 312.0 │\n", - "│ Speedup │ 8× │ 16× │\n", - "└─────────────────┴────────────────┴────────────────┘\n", - "```\n", - "\n", - "### The Challenge: FP16 Precision Limitations\n", - "\n", - "FP16 has a much smaller range than FP32:\n", - "\n", - "```\n", - "FP32 (32-bit): FP16 (16-bit):\n", - "┌─────────────────────────────┐ ┌───────────────┐\n", - "│ Sign │ 8-bit │ 23-bit │ │Sign│5-bit│10-bit│\n", - "│ bit │ Exp │ Mantissa │ │bit │ Exp │Mant. │\n", - "└─────────────────────────────┘ └───────────────┘\n", - "Range: ±3.4 × 10³⁸ Range: ±6.5 × 10⁴\n", - "Precision: ~7 decimal digits Precision: ~3 decimal digits\n", - "\n", - "Problem: Small gradients (< 6e-5) become ZERO in FP16!\n", - "```\n", - "\n", - "### The Solution: Automatic Loss Scaling\n", - "\n", - "```\n", - "Training Step Without Scaling: Training Step With Scaling:\n", - "\n", - "Loss = 0.0001 Loss = 0.0001\n", - " ↓ ↓\n", - "Gradients = 0.00001 Scale × 1024\n", - " ↓ ↓\n", - "Convert to FP16 Loss = 0.1024\n", - " ↓ ↓\n", - "Gradients = 0.0 (UNDERFLOW!) Gradients = 0.01024\n", - " ↓ ↓\n", - "No learning! 
Convert to FP16: 0.01024 ✓\n", - " ↓\n", - " Unscale: 0.01024 / 1024 = 0.00001\n", - " ↓\n", - " Successful learning!\n", - "```\n", - "\n", - "### Mixed Precision Memory Benefits\n", - "\n", - "```\n", - "Model Component Breakdown:\n", - "┌─────────────────┬─────────────┬─────────────┬─────────────┐\n", - "│ Component │ FP32 Memory │ FP16 Memory │ Savings │\n", - "├─────────────────┼─────────────┼─────────────┼─────────────┤\n", - "│ Parameters │ 4N │ 4N │ 0% │\n", - "│ Gradients │ 4N │ 2N │ 50% │\n", - "│ Activations │ 4A │ 2A │ 50% │\n", - "│ Optimizer State │ 8N │ 8N │ 0% │\n", - "├─────────────────┼─────────────┼─────────────┼─────────────┤\n", - "│ Total Typical │ ~20N │ ~16N │ 20% │\n", - "│ Activation-Heavy│ ~40N │ ~24N │ 40% │\n", - "└─────────────────┴─────────────┴─────────────┴─────────────┘\n", - "\n", - "N = parameter count, A = activation memory\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7a19b2a6", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "mixed-precision-trainer", - "solution": true - } - }, - "outputs": [], - "source": [ - "class MixedPrecisionTrainer:\n", - " \"\"\"\n", - " Mixed precision trainer with automatic loss scaling.\n", - "\n", - " Implements the same pattern as PyTorch's Automatic Mixed Precision (AMP):\n", - " 1. Forward pass in FP16 for speed and memory efficiency\n", - " 2. Loss scaling to prevent gradient underflow\n", - " 3. Gradient computation and unscaling\n", - " 4. Parameter updates in FP32 for numerical stability\n", - "\n", - " The key insight: keep different parts of training in optimal precision.\n", - " \"\"\"\n", - "\n", - " def __init__(self, model, optimizer, loss_scale: float = 1024.0, max_loss_scale: float = 65536.0):\n", - " \"\"\"\n", - " Initialize mixed precision training infrastructure.\n", - "\n", - " TODO: Set up automatic loss scaling and overflow detection\n", - "\n", - " APPROACH:\n", - " 1. 
Store model and optimizer references\n", - " 2. Initialize dynamic loss scaling parameters\n", - " 3. Set up overflow detection and scale adjustment logic\n", - "\n", - " Args:\n", - " model: Neural network model\n", - " optimizer: Parameter optimizer (SGD, Adam, etc.)\n", - " loss_scale: Initial scaling factor for gradients\n", - " max_loss_scale: Maximum allowed loss scale\n", - "\n", - " LOSS SCALING STRATEGY:\n", - " - Start with reasonable scale (1024)\n", - " - Increase gradually if no overflow (better precision)\n", - " - Decrease immediately on overflow (stability)\n", - " - This balances numerical precision with training stability\n", - "\n", - " HINTS:\n", - " - Track consecutive successful steps for scale increases\n", - " - Use exponential backoff on overflow detection\n", - " - Keep scale within reasonable bounds [1, 65536]\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.model = model\n", - " self.optimizer = optimizer\n", - "\n", - " # Loss scaling parameters\n", - " self.loss_scale = loss_scale\n", - " self.max_loss_scale = max_loss_scale\n", - " self.min_loss_scale = 1.0\n", - "\n", - " # Dynamic scaling parameters\n", - " self.scale_growth_factor = 2.0 # Multiply by 2 when increasing\n", - " self.scale_backoff_factor = 0.5 # Divide by 2 when decreasing\n", - " self.growth_interval = 2000 # Steps between scale increases\n", - " self.steps_since_last_scale_update = 0\n", - "\n", - " # Overflow tracking\n", - " self.overflow_detected = False\n", - " ### END SOLUTION\n", - "\n", - " def scale_loss(self, loss: Tensor) -> Tensor:\n", - " \"\"\"\n", - " Scale loss to prevent gradient underflow in FP16.\n", - "\n", - " The fundamental challenge: FP16 can only represent values ≥ 6e-5.\n", - " Small gradients (common in deep networks) become zero without scaling.\n", - "\n", - " TODO: Apply loss scaling for mixed precision stability\n", - "\n", - " APPROACH:\n", - " 1. Multiply loss by current scale factor\n", - " 2. 
This amplifies gradients proportionally\n", - " 3. Return scaled loss for backward pass\n", - "\n", - " MATHEMATICAL INSIGHT:\n", - " If loss = 1e-6 and scale = 1024:\n", - " scaled_loss = 1e-6 × 1024 = 1.024e-3\n", - "\n", - " After backward pass:\n", - " scaled_gradients = d(scaled_loss)/dparam = 1024 × dloss/dparam = 1024 × gradients\n", - "\n", - " These larger gradients survive FP16 conversion!\n", - "\n", - " EXAMPLE:\n", - " >>> trainer = MixedPrecisionTrainer(model, optimizer)\n", - " >>> loss = Tensor([0.0001]) # Small loss\n", - " >>> scaled = trainer.scale_loss(loss)\n", - " >>> print(scaled.data) # [0.1024] (0.0001 × 1024)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Scale the loss to amplify gradients\n", - " # This prevents gradient underflow in FP16 arithmetic\n", - " scaled_data = loss.data * self.loss_scale\n", - " return Tensor(scaled_data)\n", - " ### END SOLUTION\n", - "\n", - " def unscale_gradients(self, parameters: List[Tensor]) -> bool:\n", - " \"\"\"\n", - " Unscale gradients and detect overflow from FP16 conversion.\n", - "\n", - " After backward pass on scaled loss, gradients are scaled too.\n", - " We must unscale them AND check for overflow/underflow.\n", - "\n", - " TODO: Implement gradient unscaling with overflow detection\n", - "\n", - " APPROACH:\n", - " 1. Divide all gradients by loss scale (restore original magnitude)\n", - " 2. Check for inf/nan values (indicates FP16 overflow)\n", - " 3. 
Return True if gradients are valid, False if overflow detected\n", - "\n", - " OVERFLOW DETECTION:\n", - " inf/nan in gradients indicates:\n", - " - Gradient magnitude too large for FP16\n", - " - Numerical instability in computation\n", - " - Loss scale too aggressive\n", - "\n", - " When overflow occurs:\n", - " - Skip parameter update (unstable gradients)\n", - " - Reduce loss scale for next iteration\n", - " - Continue training with lower scale\n", - "\n", - " HINTS:\n", - " - Use np.isfinite() to detect inf/nan efficiently\n", - " - Process all parameters even if overflow found\n", - " - Set self.overflow_detected flag for scale adjustment\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.overflow_detected = False\n", - "\n", - " # Unscale all gradients and check for overflow\n", - " for param in parameters:\n", - " if param.grad is not None:\n", - " # Unscale gradients to original magnitude\n", - " param.grad.data = param.grad.data / self.loss_scale\n", - "\n", - " # Check for overflow/underflow (inf/nan values)\n", - " if not np.all(np.isfinite(param.grad.data)):\n", - " self.overflow_detected = True\n", - " # Continue processing to unscale all gradients\n", - "\n", - " return not self.overflow_detected\n", - " ### END SOLUTION\n", - "\n", - " def update_loss_scale(self):\n", - " \"\"\"\n", - " Dynamically adjust loss scale based on training stability.\n", - "\n", - " Implements the \"Goldilocks\" principle for loss scaling:\n", - " - Too low: precision loss from small gradients\n", - " - Too high: overflow and instability\n", - " - Just right: maximum precision without overflow\n", - "\n", - " TODO: Implement adaptive loss scale adjustment\n", - "\n", - " APPROACH:\n", - " 1. If overflow detected: reduce scale immediately (stability)\n", - " 2. If no overflow for many steps: increase scale (precision)\n", - " 3. 
Keep scale within reasonable bounds\n", - "\n", - " SCALING STRATEGY:\n", - " - Aggressive reduction on overflow (×0.5)\n", - " - Conservative growth during stability (×2 every 2000 steps)\n", - " - This favors stability over maximum precision\n", - "\n", - " WHY THIS WORKS:\n", - " - Most training is stable (gradual scale increase)\n", - " - Occasional instability (rapid scale decrease)\n", - " - Converges to optimal scale for current training phase\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if self.overflow_detected:\n", - " # Immediately reduce scale on overflow\n", - " self.loss_scale = max(\n", - " self.min_loss_scale,\n", - " self.loss_scale * self.scale_backoff_factor\n", - " )\n", - " self.steps_since_last_scale_update = 0\n", - " else:\n", - " # Gradually increase scale if stable\n", - " self.steps_since_last_scale_update += 1\n", - " if self.steps_since_last_scale_update >= self.growth_interval:\n", - " self.loss_scale = min(\n", - " self.max_loss_scale,\n", - " self.loss_scale * self.scale_growth_factor\n", - " )\n", - " self.steps_since_last_scale_update = 0\n", - " ### END SOLUTION\n", - "\n", - " def train_step(self, batch: Tuple[Tensor, Tensor]) -> Dict[str, float]:\n", - " \"\"\"\n", - " Execute complete mixed precision training step.\n", - "\n", - " Orchestrates the entire mixed precision training process:\n", - " 1. Forward pass (FP16 in real implementation)\n", - " 2. Loss computation and scaling\n", - " 3. Backward pass on scaled loss\n", - " 4. Gradient unscaling and overflow detection\n", - " 5. Conditional parameter update\n", - " 6. Loss scale adjustment\n", - "\n", - " TODO: Implement end-to-end mixed precision training step\n", - "\n", - " APPROACH:\n", - " 1. Clear gradients from previous step\n", - " 2. Forward pass through model\n", - " 3. Compute and scale loss\n", - " 4. Backward pass to compute scaled gradients\n", - " 5. Unscale gradients and check for overflow\n", - " 6. Update parameters only if no overflow\n", - " 7. 
Adjust loss scale based on stability\n", - "\n", - " CRITICAL INSIGHT:\n", - " Skip parameter updates on overflow! Unstable gradients\n", - " would move parameters in wrong direction.\n", - "\n", - " RETURN FORMAT:\n", - " Dictionary with training metrics:\n", - " - loss: unscaled loss value\n", - " - loss_scale: current scaling factor\n", - " - overflow: whether overflow occurred\n", - " - gradients_valid: whether update was applied\n", - "\n", - " HINTS:\n", - " - Use self.optimizer.zero_grad() to clear gradients\n", - " - Get parameters with gradients for unscaling\n", - " - Only call optimizer.step() if gradients are valid\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " inputs, targets = batch\n", - "\n", - " # Clear gradients from previous step\n", - " self.optimizer.zero_grad()\n", - "\n", - " # Forward pass (would use FP16 autocast in real implementation)\n", - " # For simulation, we work in FP32 but apply scaling principles\n", - " outputs = self.model(inputs)\n", - "\n", - " # Compute loss (unscaled)\n", - " loss = self._compute_loss(outputs, targets)\n", - "\n", - " # Scale loss for mixed precision\n", - " scaled_loss = self.scale_loss(loss)\n", - "\n", - " # Backward pass on scaled loss\n", - " scaled_loss.backward()\n", - "\n", - " # Get all parameters with gradients\n", - " parameters = [p for p in self.model.parameters() if p.grad is not None]\n", - "\n", - " # Unscale gradients and detect overflow\n", - " gradients_valid = self.unscale_gradients(parameters)\n", - "\n", - " # Update parameters only if no overflow\n", - " if gradients_valid:\n", - " self.optimizer.step()\n", - "\n", - " # Adjust loss scale based on stability\n", - " self.update_loss_scale()\n", - "\n", - " # Return training metrics\n", - " return {\n", - " 'loss': loss.data.item() if hasattr(loss.data, 'item') else float(loss.data),\n", - " 'loss_scale': self.loss_scale,\n", - " 'overflow': self.overflow_detected,\n", - " 'gradients_valid': gradients_valid\n", - " }\n", - " ### END 
SOLUTION\n", - "\n", - " def _compute_loss(self, outputs: Tensor, targets: Tensor) -> Tensor:\n", - " \"\"\"Simple MSE loss for demonstration purposes.\"\"\"\n", - " diff = Tensor(outputs.data - targets.data)\n", - " return Tensor(np.mean(diff.data**2))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "650bf77c", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-mixed-precision", - "locked": true, - "points": 15 - } - }, - "outputs": [], - "source": [ - "def test_unit_mixed_precision():\n", - " \"\"\"🔬 Test mixed precision training components comprehensively.\"\"\"\n", - " print(\"🔬 Unit Test: Mixed Precision Training...\")\n", - "\n", - " # Create mock model and optimizer for testing\n", - " class MockModel:\n", - " def __init__(self):\n", - " self.weight = Tensor(np.random.randn(10, 5).astype(np.float32))\n", - " self.weight.grad = None\n", - "\n", - " def __call__(self, x):\n", - " return x.matmul(self.weight)\n", - "\n", - " def parameters(self):\n", - " return [self.weight]\n", - "\n", - " class MockOptimizer:\n", - " def __init__(self, params):\n", - " self.params = params\n", - " self.updates_applied = 0\n", - "\n", - " def zero_grad(self):\n", - " for p in self.params:\n", - " p.grad = None\n", - "\n", - " def step(self):\n", - " for p in self.params:\n", - " if p.grad is not None:\n", - " p.data = p.data - 0.01 * p.grad.data\n", - " self.updates_applied += 1\n", - "\n", - " # Initialize mixed precision trainer\n", - " model = MockModel()\n", - " optimizer = MockOptimizer(model.parameters())\n", - " trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=1024.0)\n", - "\n", - " # Test 1: Loss scaling\n", - " print(\" Testing loss scaling...\")\n", - " loss = Tensor([0.001])\n", - " scaled_loss = trainer.scale_loss(loss)\n", - " expected_scaled = 0.001 * 1024.0\n", - " assert np.isclose(scaled_loss.data[0], expected_scaled), \\\n", - " f\"Loss scaling failed: expected {expected_scaled}, got 
{scaled_loss.data[0]}\"\n", - "\n", - " # Test 2: Gradient unscaling (normal case)\n", - " print(\" Testing gradient unscaling...\")\n", - " model.weight.grad = Tensor(np.full((10, 5), 1024.0)) # Simulate scaled gradients\n", - " valid = trainer.unscale_gradients([model.weight])\n", - " assert valid, \"Should detect valid gradients\"\n", - " assert np.allclose(model.weight.grad.data, 1.0), \"Gradient unscaling failed\"\n", - "\n", - " # Test 3: Overflow detection\n", - " print(\" Testing overflow detection...\")\n", - " model.weight.grad = Tensor(np.full((10, 5), np.inf)) # Simulate overflow\n", - " valid = trainer.unscale_gradients([model.weight])\n", - " assert not valid, \"Should detect overflow\"\n", - " assert trainer.overflow_detected, \"Overflow flag not set\"\n", - "\n", - " # Test 4: Loss scale adjustment after overflow\n", - " print(\" Testing loss scale adjustment...\")\n", - " initial_scale = trainer.loss_scale\n", - " trainer.update_loss_scale() # Should reduce scale due to overflow\n", - " assert trainer.loss_scale < initial_scale, \\\n", - " f\"Scale should decrease after overflow: {initial_scale} → {trainer.loss_scale}\"\n", - "\n", - " # Test 5: Loss scale increase during stability\n", - " print(\" Testing loss scale increase...\")\n", - " trainer.overflow_detected = False\n", - " trainer.steps_since_last_scale_update = 2000 # Simulate stable training\n", - " scale_before = trainer.loss_scale\n", - " trainer.update_loss_scale()\n", - " assert trainer.loss_scale > scale_before, \"Scale should increase during stability\"\n", - "\n", - " # Test 6: End-to-end training step\n", - " print(\" Testing complete training step...\")\n", - " inputs = Tensor(np.random.randn(8, 10).astype(np.float32))\n", - " targets = Tensor(np.random.randn(8, 5).astype(np.float32))\n", - "\n", - " initial_updates = optimizer.updates_applied\n", - " metrics = trainer.train_step((inputs, targets))\n", - "\n", - " # Verify metrics structure\n", - " required_keys = ['loss', 
'loss_scale', 'overflow', 'gradients_valid']\n", - " for key in required_keys:\n", - " assert key in metrics, f\"Missing metric: {key}\"\n", - "\n", - " # Verify loss is reasonable\n", - " assert isinstance(metrics['loss'], (int, float)), \"Loss should be numeric\"\n", - " assert metrics['loss'] >= 0, \"Loss should be non-negative\"\n", - "\n", - " # Verify loss scale is positive\n", - " assert metrics['loss_scale'] > 0, \"Loss scale should be positive\"\n", - "\n", - " print(\"✅ Mixed precision training works correctly!\")\n", - "\n", - "test_unit_mixed_precision()" - ] - }, - { - "cell_type": "markdown", - "id": "de9e4b44", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 5. Systems Analysis - Performance Scaling Patterns\n", - "\n", - "Let's analyze how our acceleration techniques perform across different scenarios and understand their scaling characteristics." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2f7edfee", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "analyze-vectorization", - "solution": true - } - }, - "outputs": [], - "source": [ - "def analyze_vectorization_scaling():\n", - " \"\"\"📊 Analyze vectorization performance across different tensor sizes.\"\"\"\n", - " print(\"📊 Analyzing vectorization scaling behavior...\")\n", - "\n", - " # Test sizes spanning different cache regimes\n", - " sizes = [64, 128, 256, 512, 1024, 2048]\n", - "\n", - " print(\"\\n🔍 Vectorization Scaling Analysis:\")\n", - " print(\"┌─────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n", - " print(\"│ Size │ Time (ms) │ GFLOPS │ Bandwidth │ Efficiency │\")\n", - " print(\"│ │ │ │ (GB/s) │ (% of peak) │\")\n", - " print(\"├─────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n", - "\n", - " for size in sizes:\n", - " # Create test matrices\n", - " a = Tensor(np.random.randn(size, size).astype(np.float32))\n", - " b = 
Tensor(np.random.randn(size, size).astype(np.float32))\n", - "\n", - " # Warm up\n", - " for _ in range(2):\n", - " _ = vectorized_matmul(a, b)\n", - "\n", - " # Time vectorized implementation\n", - " iterations = max(1, 100 // (size // 64)) # Fewer iterations for larger sizes\n", - " start = time.time()\n", - " for _ in range(iterations):\n", - " result = vectorized_matmul(a, b)\n", - " elapsed = (time.time() - start) / iterations\n", - "\n", - " # Calculate performance metrics\n", - " flops = 2 * size**3 # 2N³ FLOPs for matrix multiplication\n", - " gflops = flops / (elapsed * 1e9)\n", - "\n", - " bytes_accessed = 3 * size * size * 4 # 3 matrices × size² × 4 bytes\n", - " bandwidth = bytes_accessed / (elapsed * 1e9)\n", - "\n", - " # Estimate efficiency (rough baseline: modern CPU ~100-500 GFLOPS peak)\n", - " estimated_peak_gflops = 200 # Conservative estimate\n", - " efficiency = min(100, gflops / estimated_peak_gflops * 100)\n", - "\n", - " print(f\"│ {size:6d} │ {elapsed*1000:9.2f} │ {gflops:9.1f} │ {bandwidth:9.1f} │ {efficiency:9.1f} │\")\n", - "\n", - " print(\"└─────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n", - "\n", - " print(f\"\\n💡 Vectorization insights:\")\n", - " print(f\" • Small matrices: Limited by overhead and cache effects\")\n", - " print(f\" • Medium matrices: Sweet spot for cache reuse\")\n", - " print(f\" • Large matrices: Memory bandwidth becomes limiting factor\")\n", - " print(f\" • BLAS libraries automatically optimize for each size regime\")\n", - " print(\"🚀 Vectorization effectiveness depends on problem size and hardware\")\n", - "\n", - "analyze_vectorization_scaling()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5972a039", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "analyze-arithmetic-intensity", - "solution": true - } - }, - "outputs": [], - "source": [ - "def analyze_arithmetic_intensity():\n", - " \"\"\"📊 Demonstrate the roofline 
model with different operations.\"\"\"\n", - " print(\"📊 Analyzing arithmetic intensity patterns...\")\n", - "\n", - " size = 1024\n", - " iterations = 10\n", - "\n", - " operations = []\n", - "\n", - " # Create test data\n", - " x = Tensor(np.random.randn(size, size).astype(np.float32))\n", - " y = Tensor(np.random.randn(size, size).astype(np.float32))\n", - "\n", - " print(\"\\n🎯 Arithmetic Intensity Analysis:\")\n", - " print(\"┌─────────────────────┬─────────┬─────────────┬─────────────┬─────────────┐\")\n", - " print(\"│ Operation │ AI │ Time (ms) │ GFLOPS │ GB/s │\")\n", - " print(\"│ │(FLOPs/B)│ │ │ │\")\n", - " print(\"├─────────────────────┼─────────┼─────────────┼─────────────┼─────────────┤\")\n", - "\n", - " # 1. Element-wise addition (very low arithmetic intensity)\n", - " start = time.time()\n", - " for _ in range(iterations):\n", - " _ = Tensor(x.data + y.data)\n", - " add_time = (time.time() - start) / iterations\n", - "\n", - " add_flops = size * size # One addition per element\n", - " add_bytes = 3 * size * size * 4 # Read x, read y, write result\n", - " add_ai = add_flops / add_bytes\n", - " add_gflops = add_flops / (add_time * 1e9)\n", - " add_bandwidth = add_bytes / (add_time * 1e9)\n", - "\n", - " print(f\"│ Element-wise Add │ {add_ai:6.3f} │ {add_time*1000:9.2f} │ {add_gflops:9.1f} │ {add_bandwidth:9.1f} │\")\n", - "\n", - " # 2. Element-wise multiply (still low, but slightly higher)\n", - " start = time.time()\n", - " for _ in range(iterations):\n", - " _ = Tensor(x.data * y.data)\n", - " mul_time = (time.time() - start) / iterations\n", - "\n", - " mul_flops = size * size\n", - " mul_bytes = 3 * size * size * 4\n", - " mul_ai = mul_flops / mul_bytes\n", - " mul_gflops = mul_flops / (mul_time * 1e9)\n", - " mul_bandwidth = mul_bytes / (mul_time * 1e9)\n", - "\n", - " print(f\"│ Element-wise Mult │ {mul_ai:6.3f} │ {mul_time*1000:9.2f} │ {mul_gflops:9.1f} │ {mul_bandwidth:9.1f} │\")\n", - "\n", - " # 3. 
GELU (medium arithmetic intensity)\n", - " start = time.time()\n", - " for _ in range(iterations):\n", - " _ = fused_gelu(x)\n", - " gelu_time = (time.time() - start) / iterations\n", - "\n", - " gelu_flops = size * size * 8 # Approximate: x³, add, mul, tanh, etc.\n", - " gelu_bytes = 2 * size * size * 4 # Read x, write result\n", - " gelu_ai = gelu_flops / gelu_bytes\n", - " gelu_gflops = gelu_flops / (gelu_time * 1e9)\n", - " gelu_bandwidth = gelu_bytes / (gelu_time * 1e9)\n", - "\n", - " print(f\"│ Fused GELU │ {gelu_ai:6.3f} │ {gelu_time*1000:9.2f} │ {gelu_gflops:9.1f} │ {gelu_bandwidth:9.1f} │\")\n", - "\n", - " # 4. Matrix multiplication (high arithmetic intensity)\n", - " start = time.time()\n", - " for _ in range(iterations):\n", - " _ = vectorized_matmul(x, y)\n", - " matmul_time = (time.time() - start) / iterations\n", - "\n", - " matmul_flops = 2 * size**3 # 2N³ FLOPs\n", - " matmul_bytes = 3 * size * size * 4 # 3 matrices\n", - " matmul_ai = matmul_flops / matmul_bytes\n", - " matmul_gflops = matmul_flops / (matmul_time * 1e9)\n", - " matmul_bandwidth = matmul_bytes / (matmul_time * 1e9)\n", - "\n", - " print(f\"│ Matrix Multiply │ {matmul_ai:6.3f} │ {matmul_time*1000:9.2f} │ {matmul_gflops:9.1f} │ {matmul_bandwidth:9.1f} │\")\n", - "\n", - " print(\"└─────────────────────┴─────────┴─────────────┴─────────────┴─────────────┘\")\n", - "\n", - " print(f\"\\n💡 Roofline Model Insights:\")\n", - " print(f\" 📊 Low AI (< 1): Memory bound - limited by bandwidth\")\n", - " print(f\" 📊 Med AI (1-10): Transitional - depends on implementation\")\n", - " print(f\" 📊 High AI (> 10): Compute bound - limited by ALU throughput\")\n", - " print(f\" 🎯 Matrix multiplication ({matmul_ai:.1f} AI) is ideal for GPUs/TPUs\")\n", - " print(f\" ⚡ Element-wise ops ({add_ai:.3f} AI) need memory optimization\")\n", - " print(\"🚀 Design algorithms with high arithmetic intensity for performance\")\n", - "\n", - "analyze_arithmetic_intensity()" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "id": "7a539cd5", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "analyze-mixed-precision-benefits", - "solution": true - } - }, - "outputs": [], - "source": [ - "def analyze_mixed_precision_benefits():\n", - " \"\"\"📊 Quantify mixed precision memory and performance benefits.\"\"\"\n", - " print(\"📊 Analyzing mixed precision benefits across model sizes...\")\n", - "\n", - " # Define representative model configurations\n", - " model_configs = [\n", - " (\"Tiny CNN\", {\"params\": 50_000, \"activations\": 100_000}),\n", - " (\"Small BERT\", {\"params\": 10_000_000, \"activations\": 5_000_000}),\n", - " (\"Medium GPT\", {\"params\": 100_000_000, \"activations\": 50_000_000}),\n", - " (\"Large Transformer\", {\"params\": 1_000_000_000, \"activations\": 500_000_000}),\n", - " ]\n", - "\n", - " print(\"\\n🧮 Mixed Precision Memory Analysis:\")\n", - " print(\"┌─────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n", - " print(\"│ Model Type │ Parameters │ FP32 Memory │ FP16 Memory │ Savings │\")\n", - " print(\"│ │ │ (GB) │ (GB) │ (%) │\")\n", - " print(\"├─────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n", - "\n", - " for name, config in model_configs:\n", - " param_count = config[\"params\"]\n", - " activation_count = config[\"activations\"]\n", - "\n", - " # Memory calculation (bytes)\n", - " # Parameters: always FP32 for stability\n", - " param_memory = param_count * 4\n", - "\n", - " # FP32 training memory\n", - " fp32_activations = activation_count * 4\n", - " fp32_gradients = param_count * 4\n", - " fp32_optimizer = param_count * 8 # Adam: momentum + velocity\n", - " fp32_total = param_memory + fp32_activations + fp32_gradients + fp32_optimizer\n", - "\n", - " # Mixed precision memory\n", - " fp16_activations = activation_count * 2 # FP16 activations\n", - " fp16_gradients = param_count * 2 # FP16 gradients during backward\n", - " mixed_total = param_memory + 
fp16_activations + fp16_gradients + fp32_optimizer\n", - "\n", - " # Calculate savings\n", - " savings_gb = (fp32_total - mixed_total) / 1e9\n", - " savings_pct = (fp32_total - mixed_total) / fp32_total * 100\n", - "\n", - " print(f\"│ {name:14s} │ {param_count:10,d} │ {fp32_total/1e9:9.1f} │ {mixed_total/1e9:9.1f} │ {savings_pct:9.1f} │\")\n", - "\n", - " print(\"└─────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n", - "\n", - " # Performance simulation\n", - " print(f\"\\n⚡ Mixed Precision Performance Simulation:\")\n", - "\n", - " # Simulate different batch sizes to show memory pressure\n", - " batch_sizes = [8, 16, 32, 64]\n", - " hidden_size = 1024\n", - " seq_length = 512\n", - "\n", - " print(\"┌─────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n", - " print(\"│ Batch Size │ FP32 Mem │ FP16 Mem │ Throughput │ Efficiency │\")\n", - " print(\"│ │ (GB) │ (GB) │ Gain │ Gain │\")\n", - " print(\"├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n", - "\n", - " for batch_size in batch_sizes:\n", - " # Memory for activations (dominant for large models)\n", - " elements = batch_size * seq_length * hidden_size\n", - "\n", - " fp32_mem = elements * 4 / 1e9 # 4 bytes per FP32\n", - " fp16_mem = elements * 2 / 1e9 # 2 bytes per FP16\n", - "\n", - " # Simulate throughput gains (based on Tensor Core speedups)\n", - " # Real speedups depend on hardware and operation mix\n", - " throughput_gain = 1.4 # Conservative estimate for mixed workloads\n", - "\n", - " # Memory efficiency enables larger batch sizes\n", - " max_fp32_batch = 32 # Assume memory limit\n", - " max_fp16_batch = 64 # Double capacity with FP16\n", - "\n", - " efficiency_gain = max_fp16_batch / max_fp32_batch if batch_size <= max_fp32_batch else \"OOM\"\n", - " efficiency_str = f\"{efficiency_gain:.1f}×\" if isinstance(efficiency_gain, float) else efficiency_gain\n", - "\n", - " print(f\"│ {batch_size:10d} │ {fp32_mem:9.2f} │ 
{fp16_mem:9.2f} │ {throughput_gain:9.1f}× │ {efficiency_str:9s} │\")\n", - "\n", - " print(\"└─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n", - "\n", - " print(f\"\\n💡 Mixed Precision Key Benefits:\")\n", - " print(f\" 🎯 Memory: 20-40% reduction enables larger models/batches\")\n", - " print(f\" ⚡ Speed: 1.3-2× throughput on modern hardware (V100+)\")\n", - " print(f\" 📈 Scale: Essential for billion-parameter models\")\n", - " print(f\" ⚠️ Complexity: Requires careful loss scaling and overflow handling\")\n", - " print(\"🚀 Mixed precision is crucial for competitive ML training\")\n", - "\n", - "analyze_mixed_precision_benefits()" - ] - }, - { - "cell_type": "markdown", - "id": "d42aa6ff", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 6. Optimization Insights - Production Acceleration Strategy\n", - "\n", - "Understanding when and how to apply different acceleration techniques in real-world scenarios." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "133b1f71", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "acceleration-decision-framework", - "solution": true - } - }, - "outputs": [], - "source": [ - "def analyze_acceleration_decision_framework():\n", - " \"\"\"📊 Decision framework for choosing acceleration techniques.\"\"\"\n", - " print(\"📊 Acceleration Technique Decision Framework...\")\n", - "\n", - " # Define workload characteristics\n", - " workloads = [\n", - " (\"Research Training\", {\n", - " \"memory_pressure\": \"medium\",\n", - " \"latency_sensitive\": False,\n", - " \"stability_critical\": False,\n", - " \"development_speed\": \"high\",\n", - " \"hardware_variety\": \"high\"\n", - " }),\n", - " (\"Production Training\", {\n", - " \"memory_pressure\": \"high\",\n", - " \"latency_sensitive\": False,\n", - " \"stability_critical\": True,\n", - " \"development_speed\": \"medium\",\n", - " \"hardware_variety\": \"low\"\n", - " }),\n", - " 
(\"Real-time Inference\", {\n", - " \"memory_pressure\": \"medium\",\n", - " \"latency_sensitive\": True,\n", - " \"stability_critical\": True,\n", - " \"development_speed\": \"low\",\n", - " \"hardware_variety\": \"medium\"\n", - " }),\n", - " (\"Edge Deployment\", {\n", - " \"memory_pressure\": \"very_high\",\n", - " \"latency_sensitive\": True,\n", - " \"stability_critical\": True,\n", - " \"development_speed\": \"low\",\n", - " \"hardware_variety\": \"very_high\"\n", - " }),\n", - " (\"Batch Inference\", {\n", - " \"memory_pressure\": \"low\",\n", - " \"latency_sensitive\": False,\n", - " \"stability_critical\": True,\n", - " \"development_speed\": \"medium\",\n", - " \"hardware_variety\": \"low\"\n", - " })\n", - " ]\n", - "\n", - " # Define technique characteristics\n", - " techniques = {\n", - " \"Vectorization\": {\n", - " \"implementation_cost\": \"low\",\n", - " \"memory_benefit\": \"none\",\n", - " \"latency_benefit\": \"high\",\n", - " \"stability_risk\": \"none\",\n", - " \"hardware_dependency\": \"low\"\n", - " },\n", - " \"Kernel Fusion\": {\n", - " \"implementation_cost\": \"medium\",\n", - " \"memory_benefit\": \"medium\",\n", - " \"latency_benefit\": \"medium\",\n", - " \"stability_risk\": \"low\",\n", - " \"hardware_dependency\": \"medium\"\n", - " },\n", - " \"Mixed Precision\": {\n", - " \"implementation_cost\": \"high\",\n", - " \"memory_benefit\": \"high\",\n", - " \"latency_benefit\": \"high\",\n", - " \"stability_risk\": \"medium\",\n", - " \"hardware_dependency\": \"high\"\n", - " },\n", - " \"Graph Optimization\": {\n", - " \"implementation_cost\": \"very_high\",\n", - " \"memory_benefit\": \"medium\",\n", - " \"latency_benefit\": \"very_high\",\n", - " \"stability_risk\": \"low\",\n", - " \"hardware_dependency\": \"very_high\"\n", - " }\n", - " }\n", - "\n", - " print(\"\\n🎯 Acceleration Technique Recommendations:\")\n", - " print(\"┌─────────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n", - " print(\"│ 
Workload │ Vectorize │ Fuse Kernels│ Mixed Prec │ Graph Opt │\")\n", - " print(\"├─────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n", - "\n", - " for workload_name, workload_chars in workloads:\n", - " recommendations = []\n", - "\n", - " for technique_name in [\"Vectorization\", \"Kernel Fusion\", \"Mixed Precision\", \"Graph Optimization\"]:\n", - " tech_chars = techniques[technique_name]\n", - " score = 0\n", - "\n", - " # Benefit vs requirement matching\n", - " if workload_chars[\"memory_pressure\"] in [\"high\", \"very_high\"]:\n", - " if tech_chars[\"memory_benefit\"] in [\"medium\", \"high\"]:\n", - " score += 2\n", - "\n", - " if workload_chars[\"latency_sensitive\"]:\n", - " if tech_chars[\"latency_benefit\"] in [\"medium\", \"high\", \"very_high\"]:\n", - " score += 2\n", - "\n", - " # Risk vs tolerance matching\n", - " if workload_chars[\"stability_critical\"]:\n", - " if tech_chars[\"stability_risk\"] in [\"none\", \"low\"]:\n", - " score += 1\n", - " elif tech_chars[\"stability_risk\"] == \"medium\":\n", - " score -= 1\n", - "\n", - " # Implementation cost vs development speed\n", - " if workload_chars[\"development_speed\"] == \"high\":\n", - " if tech_chars[\"implementation_cost\"] in [\"low\", \"medium\"]:\n", - " score += 1\n", - " elif tech_chars[\"implementation_cost\"] in [\"high\", \"very_high\"]:\n", - " score -= 1\n", - "\n", - " # Hardware dependency vs variety\n", - " if workload_chars[\"hardware_variety\"] in [\"high\", \"very_high\"]:\n", - " if tech_chars[\"hardware_dependency\"] in [\"low\", \"medium\"]:\n", - " score += 1\n", - " elif tech_chars[\"hardware_dependency\"] in [\"high\", \"very_high\"]:\n", - " score -= 2\n", - "\n", - " # Convert score to recommendation\n", - " if score >= 3:\n", - " rec = \"✅ High\"\n", - " elif score >= 1:\n", - " rec = \"⚡ Medium\"\n", - " elif score >= 0:\n", - " rec = \"⚠️ Low\"\n", - " else:\n", - " rec = \"❌ Skip\"\n", - "\n", - " recommendations.append(rec)\n", - 
"\n", - " rec_line = \" │ \".join(f\"{rec:10s}\" for rec in recommendations)\n", - " print(f\"│ {workload_name:18s} │ {rec_line} │\")\n", - "\n", - " print(\"└─────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n", - "\n", - " # Implementation priority framework\n", - " print(f\"\\n🛠️ Implementation Priority Framework:\")\n", - " print(f\" 📊 Phase 1 (Always): Vectorization\")\n", - " print(f\" • Low risk, high reward\")\n", - " print(f\" • Works on any hardware\")\n", - " print(f\" • Foundation for other optimizations\")\n", - " print(f\" \")\n", - " print(f\" 📊 Phase 2 (Memory constrained): Kernel Fusion\")\n", - " print(f\" • Targets memory-bound operations\")\n", - " print(f\" • Moderate complexity\")\n", - " print(f\" • Significant wins on element-wise ops\")\n", - " print(f\" \")\n", - " print(f\" 📊 Phase 3 (Large models): Mixed Precision\")\n", - " print(f\" • Essential for large model training\")\n", - " print(f\" • Requires careful validation\")\n", - " print(f\" • Hardware-dependent benefits\")\n", - " print(f\" \")\n", - " print(f\" 📊 Phase 4 (Production): Graph Optimization\")\n", - " print(f\" • Maximum performance extraction\")\n", - " print(f\" • High implementation cost\")\n", - " print(f\" • Deployment-specific tuning\")\n", - "\n", - " print(f\"\\n💡 Key Decision Factors:\")\n", - " print(f\" 🎯 Start simple: Vectorization first, always\")\n", - " print(f\" 📈 Scale up: Add complexity only when needed\")\n", - " print(f\" ⚡ Measure impact: Profile before and after each optimization\")\n", - " print(f\" 🔄 Iterate: Optimization is an ongoing process, not one-time\")\n", - " print(\"🚀 Systematic acceleration beats random optimization\")\n", - "\n", - "analyze_acceleration_decision_framework()" - ] - }, - { - "cell_type": "markdown", - "id": "541be4f4", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 7. 
Module Integration Test\n", - "\n", - "Final validation that all acceleration components work together correctly." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "05244210", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-module", - "locked": true, - "points": 20 - } - }, - "outputs": [], - "source": [ - "def test_module():\n", - " \"\"\"\n", - " Comprehensive test of entire acceleration module functionality.\n", - "\n", - " This final test ensures:\n", - " - All acceleration techniques work correctly\n", - " - Performance improvements are measurable\n", - " - Mixed precision training is stable\n", - " - Components integrate seamlessly\n", - " - Module is ready for production use\n", - " \"\"\"\n", - " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", - " print(\"=\" * 50)\n", - "\n", - " # Run all unit tests\n", - " print(\"Running unit tests...\")\n", - " test_unit_vectorized_matmul()\n", - " test_unit_fused_gelu()\n", - " test_unit_fusion_speedup()\n", - " test_unit_mixed_precision()\n", - "\n", - " print(\"\\nRunning integration scenarios...\")\n", - "\n", - " # Test realistic acceleration pipeline\n", - " print(\"🔬 Integration Test: Complete acceleration pipeline...\")\n", - "\n", - " # Create realistic model scenario\n", - " batch_size, seq_len, hidden_dim = 16, 64, 256\n", - " print(f\" Model config: batch={batch_size}, seq_len={seq_len}, hidden={hidden_dim}\")\n", - "\n", - " # Test data\n", - " x = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))\n", - " weight = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))\n", - " print(f\" Input tensor: {x.shape}, Weight tensor: {weight.shape}\")\n", - "\n", - " # Test complete pipeline: reshape → matmul → activation → mixed precision\n", - " print(\" Testing vectorized operations...\")\n", - "\n", - " # Reshape for matrix multiplication (flatten batch and sequence)\n", - " x_reshaped = Tensor(x.data.reshape(-1, hidden_dim))\n", 
- " assert x_reshaped.shape == (batch_size * seq_len, hidden_dim)\n", - "\n", - " # Vectorized matrix multiplication\n", - " linear_output = vectorized_matmul(x_reshaped, weight)\n", - " assert linear_output.shape == (batch_size * seq_len, hidden_dim)\n", - " print(f\" ✅ Matrix multiplication: {x_reshaped.shape} @ {weight.shape} → {linear_output.shape}\")\n", - "\n", - " # Fused activation\n", - " activated = fused_gelu(linear_output)\n", - " assert activated.shape == linear_output.shape\n", - " print(f\" ✅ Fused GELU activation: {linear_output.shape} → {activated.shape}\")\n", - "\n", - " # Reshape back to original structure\n", - " final_output = Tensor(activated.data.reshape(batch_size, seq_len, hidden_dim))\n", - " assert final_output.shape == x.shape\n", - " print(f\" ✅ Output reshape: {activated.shape} → {final_output.shape}\")\n", - "\n", - " print(\" Testing mixed precision training integration...\")\n", - "\n", - " # Create complete model for mixed precision testing\n", - " class TransformerBlock:\n", - " def __init__(self, hidden_dim):\n", - " self.hidden_dim = hidden_dim\n", - " self.weight1 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))\n", - " self.weight2 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))\n", - " self.weight1.grad = None\n", - " self.weight2.grad = None\n", - "\n", - " def __call__(self, x):\n", - " # Simulate transformer block: linear → activation → linear\n", - " batch_size, seq_len, hidden_dim = x.shape\n", - " x_flat = Tensor(x.data.reshape(-1, hidden_dim))\n", - "\n", - " # First linear layer\n", - " h1 = vectorized_matmul(x_flat, self.weight1)\n", - " h1_activated = fused_gelu(h1)\n", - "\n", - " # Second linear layer\n", - " h2 = vectorized_matmul(h1_activated, self.weight2)\n", - "\n", - " # Reshape back\n", - " output = Tensor(h2.data.reshape(batch_size, seq_len, hidden_dim))\n", - " return output\n", - "\n", - " def parameters(self):\n", - " return [self.weight1, self.weight2]\n", - 
"\n", - " class SimpleOptimizer:\n", - " def __init__(self, params):\n", - " self.params = params\n", - "\n", - " def zero_grad(self):\n", - " for p in self.params:\n", - " p.grad = None\n", - "\n", - " def step(self):\n", - " for p in self.params:\n", - " if p.grad is not None:\n", - " p.data = p.data - 0.001 * p.grad.data\n", - "\n", - " # Initialize model and optimizer\n", - " model = TransformerBlock(hidden_dim)\n", - " optimizer = SimpleOptimizer(model.parameters())\n", - " trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=512.0)\n", - "\n", - " print(f\" Model parameters: {len(model.parameters())}\")\n", - " print(f\" Initial loss scale: {trainer.loss_scale}\")\n", - "\n", - " # Simulate training steps\n", - " print(\" Running training steps...\")\n", - " targets = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))\n", - "\n", - " training_metrics = []\n", - " for step in range(5):\n", - " metrics = trainer.train_step((x, targets))\n", - " training_metrics.append(metrics)\n", - "\n", - " # Verify metrics are reasonable\n", - " assert isinstance(metrics['loss'], (int, float))\n", - " assert metrics['loss'] >= 0\n", - " assert metrics['loss_scale'] > 0\n", - " assert isinstance(metrics['overflow'], bool)\n", - " assert isinstance(metrics['gradients_valid'], bool)\n", - "\n", - " print(f\" ✅ Completed {len(training_metrics)} training steps\")\n", - "\n", - " # Analyze training stability\n", - " losses = [m['loss'] for m in training_metrics]\n", - " overflows = [m['overflow'] for m in training_metrics]\n", - "\n", - " print(f\" Loss range: {min(losses):.6f} - {max(losses):.6f}\")\n", - " print(f\" Overflow rate: {sum(overflows)}/{len(overflows)} steps\")\n", - "\n", - " print(\" Testing performance characteristics...\")\n", - "\n", - " # Verify acceleration provides measurable benefits\n", - " test_sizes = [128, 256]\n", - " for size in test_sizes:\n", - " test_x = Tensor(np.random.randn(size, size).astype(np.float32))\n", - " 
test_y = Tensor(np.random.randn(size, size).astype(np.float32))\n", - "\n", - " # Time operations and verify reasonable performance\n", - " start = time.time()\n", - " _ = vectorized_matmul(test_x, test_y)\n", - " matmul_time = time.time() - start\n", - "\n", - " start = time.time()\n", - " _ = fused_gelu(test_x)\n", - " gelu_time = time.time() - start\n", - "\n", - " # Verify operations complete in reasonable time\n", - " assert matmul_time < 1.0, f\"Matrix multiplication too slow: {matmul_time:.3f}s\"\n", - " assert gelu_time < 0.1, f\"GELU activation too slow: {gelu_time:.3f}s\"\n", - "\n", - " print(f\" ✅ Size {size}: matmul={matmul_time*1000:.1f}ms, gelu={gelu_time*1000:.1f}ms\")\n", - "\n", - " print(\" Testing memory efficiency...\")\n", - "\n", - " # Verify mixed precision reduces memory usage conceptually\n", - " param_count = sum(p.data.size for p in model.parameters())\n", - " activation_count = batch_size * seq_len * hidden_dim\n", - "\n", - " fp32_memory = (param_count + activation_count) * 4 # 4 bytes per FP32\n", - " mixed_memory = param_count * 4 + activation_count * 2 # FP32 params + FP16 activations\n", - " memory_savings = (fp32_memory - mixed_memory) / fp32_memory * 100\n", - "\n", - " print(f\" Memory analysis: {memory_savings:.1f}% savings from mixed precision\")\n", - " assert memory_savings > 0, \"Mixed precision should reduce memory usage\"\n", - "\n", - " print(\"✅ End-to-end acceleration pipeline works!\")\n", - "\n", - " print(\"\\n\" + \"=\" * 50)\n", - " print(\"🎉 ALL TESTS PASSED! 
Module ready for export.\")\n", - " print(\"Run: tito module complete 16\")\n", - "\n", - "# Call the module test\n", - "test_module()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6531eb00", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "main-execution", - "solution": false - } - }, - "outputs": [], - "source": [ - "# Main execution block\n", - "if __name__ == \"__main__\":\n", - " print(\"🚀 Running Acceleration module...\")\n", - " test_module()\n", - " print(\"✅ Module validation complete!\")" - ] - }, - { - "cell_type": "markdown", - "id": "e1054af9", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Acceleration and Performance\n", - "\n", - "### Question 1: Arithmetic Intensity Analysis\n", - "You implemented vectorized matrix multiplication and fused GELU.\n", - "- Matrix multiplication (1024×1024): Performs ~2.1 billion FLOPs, reads ~12 MB data\n", - "- Arithmetic intensity: _____ FLOPs/byte\n", - "- Compared to element-wise addition (0.33 FLOPs/byte): _____× higher intensity\n", - "- Why does this make matrix multiplication ideal for GPUs? _____\n", - "\n", - "### Question 2: Kernel Fusion Memory Benefits\n", - "Your fused_gelu combines 7 operations into a single expression.\n", - "- Unfused version memory accesses: 7 reads + 7 writes = _____ per element\n", - "- Fused version memory accesses: 1 read + 1 write = _____ per element\n", - "- Memory bandwidth reduction: _____%\n", - "- Why is this critical for transformer inference? 
_____\n", - "\n", - "### Question 3: Mixed Precision Memory Calculation\n", - "Your MixedPrecisionTrainer uses FP16 activations, FP32 parameters.\n", - "For a 100M parameter model with 50M activation elements:\n", - "- FP32 memory: (100M + 50M) × 4 bytes = _____ MB\n", - "- Mixed precision memory: 100M × 4 + 50M × 2 = _____ MB\n", - "- Memory reduction: _____%\n", - "\n", - "### Question 4: Loss Scaling Strategy\n", - "Your trainer starts with loss_scale=1024, grows by 2×, shrinks by 0.5×.\n", - "- Minimum FP16 representable value: ~6e-5\n", - "- Without scaling, gradients < _____ become zero\n", - "- With 1024× scaling, gradients down to _____ are preserved\n", - "- Why increase scale gradually but decrease immediately? _____\n", - "\n", - "### Question 5: Production Optimization Strategy\n", - "Based on your decision framework analysis:\n", - "For edge deployment (memory critical, stability required, hardware diverse):\n", - "- Priority 1 technique: _____ (low risk, universal)\n", - "- Priority 2 technique: _____ (memory benefits)\n", - "- Skip technique: _____ (why: _____)\n", - "- What's the primary constraint: memory, compute, or power? _____" - ] - }, - { - "cell_type": "markdown", - "id": "2fcecfae", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Acceleration\n", - "\n", - "Congratulations! 
You've mastered the fundamental techniques for accelerating neural networks!\n", - "\n", - "### Key Accomplishments\n", - "- Built **vectorized operations** leveraging SIMD and optimized BLAS for 2-5× speedups\n", - "- Implemented **kernel fusion** reducing memory bandwidth by 60-80% for element-wise operations\n", - "- Created **mixed precision training** with automatic loss scaling for 20-40% memory savings\n", - "- Analyzed **arithmetic intensity patterns** and their impact on the roofline model\n", - "- Developed **production decision framework** for systematic optimization\n", - "- All tests pass ✅ (validated by `test_module()`)\n", - "\n", - "### Systems Insights Discovered\n", - "- **Roofline Model**: Operations with high arithmetic intensity (FLOPs/byte) scale better\n", - "- **Memory Bandwidth**: Often the limiting factor for modern accelerators\n", - "- **Kernel Fusion**: Critical for memory-bound workloads, reduces intermediate storage overhead\n", - "- **Mixed Precision**: Essential for large model training, requires careful gradient scaling\n", - "- **Optimization Strategy**: Start simple (vectorization), add complexity as needed\n", - "\n", - "### Production Impact\n", - "Your acceleration techniques enable:\n", - "- **Training larger models** within memory constraints\n", - "- **Faster iteration cycles** during research and development\n", - "- **Better hardware utilization** across different deployment targets\n", - "- **Cost reduction** through improved efficiency\n", - "\n", - "### Ready for Next Steps\n", - "Your acceleration implementations provide the foundation for quantization techniques in Module 17.\n", - "The performance analysis skills transfer directly to production optimization workflows.\n", - "\n", - "Export with: `tito module complete 16`\n", - "\n", - "**Next**: Module 17 will add quantization to further reduce memory and increase throughput while maintaining accuracy!" 
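The mixed-precision savings quoted in this summary follow directly from bytes-per-value arithmetic. As a quick standalone sketch (not one of the notebook's graded cells), the accounting below mirrors the integration test earlier in this module: FP32 parameters plus FP16 activations; `mixed_precision_memory_mb` is a name introduced here for illustration only.

```python
# Hypothetical helper (illustrative, not part of the module's exported API):
# compare an all-FP32 footprint against FP32 parameters + FP16 activations,
# the same accounting used by the integration test above.

def mixed_precision_memory_mb(param_count: int, activation_count: int) -> dict:
    """Estimate memory footprints in MB for FP32 vs. mixed precision."""
    fp32_bytes = (param_count + activation_count) * 4      # 4 bytes per FP32 value
    mixed_bytes = param_count * 4 + activation_count * 2   # FP16 activations use 2 bytes
    savings_pct = (fp32_bytes - mixed_bytes) / fp32_bytes * 100
    return {
        "fp32_mb": fp32_bytes / 1e6,
        "mixed_mb": mixed_bytes / 1e6,
        "savings_pct": savings_pct,
    }

# Numbers from the systems-thinking question above: 100M params, 50M activations.
report = mixed_precision_memory_mb(100_000_000, 50_000_000)
print(report)
```

Note that the actual savings depend on the activation-to-parameter ratio, which is why the summary quotes a range rather than a single number.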
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules/source/17_quantization/quantization_dev.ipynb b/modules/source/17_quantization/quantization_dev.ipynb deleted file mode 100644 index d5eb129d..00000000 --- a/modules/source/17_quantization/quantization_dev.ipynb +++ /dev/null @@ -1,2593 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "4c350fb4", - "metadata": {}, - "outputs": [], - "source": [ - "#| default_exp optimization.quantization" - ] - }, - { - "cell_type": "markdown", - "id": "68ad4cba", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Module 17: Quantization - Making Models Smaller and Faster\n", - "\n", - "Welcome to Quantization! Today you'll learn how to reduce model precision from FP32 to INT8 while preserving accuracy.\n", - "\n", - "## 🔗 Prerequisites & Progress\n", - "**You've Built**: Complete ML pipeline with profiling and acceleration techniques\n", - "**You'll Build**: INT8 quantization system with calibration and memory savings\n", - "**You'll Enable**: 4× memory reduction and 2-4× speedup with minimal accuracy loss\n", - "\n", - "**Connection Map**:\n", - "```\n", - "Profiling → Quantization → Compression\n", - "(measure) (reduce bits) (remove weights)\n", - "```\n", - "\n", - "## Learning Objectives\n", - "By the end of this module, you will:\n", - "1. Implement INT8 quantization with proper scaling\n", - "2. Build quantization-aware training for minimal accuracy loss\n", - "3. Apply post-training quantization to existing models\n", - "4. Measure actual memory and compute savings\n", - "5. Understand quantization error and mitigation strategies\n", - "\n", - "Let's make models 4× smaller!" 
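The "4× smaller" claim in this introduction is pure bytes-per-weight arithmetic. A back-of-envelope sketch (standalone, not a graded notebook cell), using the parameter counts this module quotes; per-tensor scales and zero points add a small overhead that is ignored here:

```python
# Back-of-envelope model-size arithmetic behind the "4x smaller" claim.
# Parameter counts are the ones quoted in this module's memory table;
# FP32 stores 4 bytes per weight, INT8 stores 1 byte per weight.
models = {"BERT-Base": 110e6, "GPT-2": 1.5e9, "GPT-3": 175e9}

sizes = {}
for name, params in models.items():
    fp32_gb = params * 4 / 1e9  # 4 bytes per FP32 weight
    int8_gb = params * 1 / 1e9  # 1 byte per INT8 weight
    sizes[name] = (fp32_gb, int8_gb)
    print(f"{name}: {fp32_gb:.2f} GB (FP32) -> {int8_gb:.2f} GB (INT8)")
```

The ratio is exactly 4× for the weights themselves; real deployments see slightly less once quantization parameters and any non-quantized layers are included.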
- ] - }, - { - "cell_type": "markdown", - "id": "ada2f24d", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/17_quantization/quantization_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.optimization.quantization`\n", - "\n", - "```python\n", - "# How to use this module:\n", - "from tinytorch.optimization.quantization import quantize_int8, QuantizedLinear, quantize_model\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Complete quantization system in one focused module for deep understanding\n", - "- **Production:** Proper organization like PyTorch's torch.quantization with all optimization components together\n", - "- **Consistency:** All quantization operations and calibration tools in optimization.quantization\n", - "- **Integration:** Works seamlessly with existing models for complete optimization pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a4314940", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "imports", - "solution": true - } - }, - "outputs": [], - "source": [ - "#| export\n", - "import numpy as np\n", - "import time\n", - "from typing import Tuple, Dict, List, Optional\n", - "import warnings\n", - "\n", - "# Import dependencies from other modules\n", - "from tinytorch.core.tensor import Tensor\n", - "from tinytorch.core.layers import Linear\n", - "from tinytorch.core.activations import ReLU\n", - "\n", - "print(\"✅ Quantization module imports complete\")" - ] - }, - { - "cell_type": "markdown", - "id": "210e964f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 1. Introduction - The Memory Wall Problem\n", - "\n", - "Imagine trying to fit a library in your backpack. 
Neural networks face the same challenge - models are getting huge, but devices have limited memory!\n", - "\n", - "### The Precision Paradox\n", - "\n", - "Modern neural networks use 32-bit floating point numbers with incredible precision:\n", - "\n", - "```\n", - "FP32 Number: 3.14159265359...\n", - " ^^^^^^^^^^^^^^^^\n", - " 32 bits = 4 bytes per weight\n", - "```\n", - "\n", - "But here's the surprising truth: **we don't need all that precision for most AI tasks!**\n", - "\n", - "### The Growing Memory Crisis\n", - "\n", - "```\n", - "Model Memory Requirements (FP32):\n", - "┌─────────────────────────────────────────────────────────────┐\n", - "│ BERT-Base: 110M params × 4 bytes = 440MB │\n", - "│ GPT-2: 1.5B params × 4 bytes = 6GB │\n", - "│ GPT-3: 175B params × 4 bytes = 700GB │\n", - "│ Your Phone: Available RAM = 4-8GB │\n", - "└─────────────────────────────────────────────────────────────┘\n", - " ↑\n", - " Problem!\n", - "```\n", - "\n", - "### The Quantization Solution\n", - "\n", - "What if we could represent each weight with just 8 bits instead of 32?\n", - "\n", - "```\n", - "Before Quantization (FP32):\n", - "┌──────────────────────────────────┐\n", - "│ 3.14159265 │ 2.71828183 │ │ 32 bits each\n", - "└──────────────────────────────────┘\n", - "\n", - "After Quantization (INT8):\n", - "┌────────┬────────┬────────┬────────┐\n", - "│ 98 │ 85 │ 72 │ 45 │ 8 bits each\n", - "└────────┴────────┴────────┴────────┘\n", - " ↑\n", - " 4× less memory!\n", - "```\n", - "\n", - "### Real-World Impact You'll Achieve\n", - "\n", - "**Memory Reduction:**\n", - "- BERT-Base: 440MB → 110MB (4× smaller)\n", - "- Fits on mobile devices!\n", - "- Faster loading from disk\n", - "- More models in GPU memory\n", - "\n", - "**Speed Improvements:**\n", - "- 2-4× faster inference (hardware dependent)\n", - "- Lower power consumption\n", - "- Better user experience\n", - "\n", - "**Accuracy Preservation:**\n", - "- <1% accuracy loss with proper techniques\n", - "- Sometimes even 
improves generalization!\n",
- "\n",
- "**Why This Matters:**\n",
- "- **Mobile AI:** Deploy powerful models on phones\n",
- "- **Edge Computing:** Run AI without cloud connectivity\n",
- "- **Data Centers:** Serve more users with same hardware\n",
- "- **Environmental:** Reduce energy consumption by 2-4×\n",
- "\n",
- "Today you'll build the production-quality quantization system that makes all this possible!"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0927a359",
- "metadata": {
- "cell_marker": "\"\"\""
- },
- "source": [
- "## 2. Foundations - The Mathematics of Compression\n",
- "\n",
- "### Understanding the Core Challenge\n",
- "\n",
- "Think of quantization like converting a smooth analog signal to digital steps. We need to map infinite precision (FP32) to just 256 possible values (INT8).\n",
- "\n",
- "### The Quantization Mapping\n",
- "\n",
- "```\n",
- "The Fundamental Problem:\n",
- "\n",
- "FP32 Numbers (Continuous):        INT8 Numbers (Discrete):\n",
- "  ∞ possible values        →        256 possible values\n",
- "\n",
- "  ... -1.7  -1.2  -0.3   0.0   0.8   1.5   2.1 ...\n",
- "       ↓     ↓     ↓     ↓     ↓     ↓     ↓\n",
- "     -103   -73   -18    0    48    91   127\n",
- "```\n",
- "\n",
- "### The Magic Formula\n",
- "\n",
- "Every quantization system uses this fundamental relationship:\n",
- "\n",
- "```\n",
- "Quantization (FP32 → INT8):\n",
- "┌─────────────────────────────────────────────────────────┐\n",
- "│ quantized = round(float_value / scale) + zero_point     │\n",
- "└─────────────────────────────────────────────────────────┘\n",
- "\n",
- "Dequantization (INT8 → FP32):\n",
- "┌─────────────────────────────────────────────────────────┐\n",
- "│ float_value = scale × (quantized − zero_point)          │\n",
- "└─────────────────────────────────────────────────────────┘\n",
- "```\n",
- "\n",
- "### The Two Critical Parameters\n",
- "\n",
- "**1. 
Scale (s)** - How big each INT8 step is in FP32 space:\n",
- "```\n",
- "Small Scale (high precision):        Large Scale (low precision):\n",
- "  FP32: [0.0, 0.255]                   FP32: [0.0, 25.5]\n",
- "    ↓      ↓      ↓                      ↓      ↓      ↓\n",
- "  INT8: -128   0   127                 INT8: -128   0   127\n",
- "         │     │    │                         │     │     │\n",
- "        0.0  0.128 0.255                     0.0  12.75  25.5\n",
- "\n",
- "  Scale = 0.001 (very precise)         Scale = 0.1 (less precise)\n",
- "```\n",
- "\n",
- "**2. Zero Point (z)** - Which INT8 value represents FP32 zero:\n",
- "```\n",
- "Symmetric Range:                     Asymmetric Range:\n",
- "  FP32: [-2.0, 2.0]                    FP32: [-1.0, 3.0]\n",
- "    ↓      ↓      ↓                      ↓      ↓      ↓\n",
- "  INT8: -128   0   127                 INT8: -128  -64   127\n",
- "         │     │    │                         │     │     │\n",
- "       -2.0   0.0  2.0                      -1.0   0.0   3.0\n",
- "\n",
- "  Zero Point = 0                       Zero Point = -64\n",
- "```\n",
- "\n",
- "### Visual Example: Weight Quantization\n",
- "\n",
- "```\n",
- "Original FP32 Weights:            Quantized INT8 Mapping (scale ≈ 0.0094):\n",
- "┌─────────────────────────┐       ┌─────────────────────────┐\n",
- "│ -0.8  -0.3   0.0   0.5  │  →    │  -85   -32    0    53   │\n",
- "│  0.9   1.2  -0.1   0.7  │       │   95   127  -11    74   │\n",
- "└─────────────────────────┘       └─────────────────────────┘\n",
- "     4 bytes each                      1 byte each\n",
- "     Total: 32 bytes                   Total: 8 bytes\n",
- "                                          ↑\n",
- "                                    4× compression!\n",
- "```\n",
- "\n",
- "### Quantization Error Analysis\n",
- "\n",
- "```\n",
- "Perfect Reconstruction (Impossible):   Quantized Reconstruction (Reality):\n",
- "\n",
- "Original: 0.73                         Original: 0.73\n",
- "    ↓                                      ↓\n",
- "INT8: ? 
(can't represent exactly) INT8: 93 (closest)\n", - " ↓ ↓\n", - "Restored: 0.73 Restored: 0.728\n", - " ↑\n", - " Error: 0.002\n", - "```\n", - "\n", - "**The Quantization Trade-off:**\n", - "- **More bits** = Higher precision, larger memory\n", - "- **Fewer bits** = Lower precision, smaller memory\n", - "- **Goal:** Find the sweet spot where error is acceptable\n", - "\n", - "### Why INT8 is the Sweet Spot\n", - "\n", - "```\n", - "Precision vs Memory Trade-offs:\n", - "\n", - "FP32: ████████████████████████████████ (32 bits) - Overkill precision\n", - "FP16: ████████████████ (16 bits) - Good precision\n", - "INT8: ████████ (8 bits) - Sufficient precision ← Sweet spot!\n", - "INT4: ████ (4 bits) - Often too little\n", - "\n", - "Memory: 100% 50% 25% 12.5%\n", - "Accuracy: 100% 99.9% 99.5% 95%\n", - "```\n", - "\n", - "INT8 gives us 4× memory reduction with <1% accuracy loss - the perfect balance for production systems!" - ] - }, - { - "cell_type": "markdown", - "id": "6639cbe4", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 3. 
Implementation - Building the Quantization Engine\n", - "\n", - "### Our Implementation Strategy\n", - "\n", - "We'll build quantization in logical layers, each building on the previous:\n", - "\n", - "```\n", - "Quantization System Architecture:\n", - "\n", - "┌─────────────────────────────────────────────────────────────┐\n", - "│ Layer 4: Model Quantization │\n", - "│ quantize_model() - Convert entire neural networks │\n", - "├─────────────────────────────────────────────────────────────┤\n", - "│ Layer 3: Layer Quantization │\n", - "│ QuantizedLinear - Quantized linear transformations │\n", - "├─────────────────────────────────────────────────────────────┤\n", - "│ Layer 2: Tensor Operations │\n", - "│ quantize_int8() - Core quantization algorithm │\n", - "│ dequantize_int8() - Restore to floating point │\n", - "├─────────────────────────────────────────────────────────────┤\n", - "│ Layer 1: Foundation │\n", - "│ Scale & Zero Point Calculation - Parameter optimization │\n", - "└─────────────────────────────────────────────────────────────┘\n", - "```\n", - "\n", - "### What We're About to Build\n", - "\n", - "**Core Functions:**\n", - "- `quantize_int8()` - Convert FP32 tensors to INT8\n", - "- `dequantize_int8()` - Convert INT8 back to FP32\n", - "- `QuantizedLinear` - Quantized version of Linear layers\n", - "- `quantize_model()` - Quantize entire neural networks\n", - "\n", - "**Key Features:**\n", - "- **Automatic calibration** - Find optimal quantization parameters\n", - "- **Error minimization** - Preserve accuracy during compression\n", - "- **Memory tracking** - Measure actual savings achieved\n", - "- **Production patterns** - Industry-standard algorithms\n", - "\n", - "Let's start with the fundamental building block!" 
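Before the graded implementation that follows, here is a minimal NumPy-only preview of the full round trip. This is a sketch under the conventions used in this module (map `[min, max]` onto `[-128, 127]` with `q = round(x/scale) + zero_point` and reconstruct with `x_hat = scale * (q - zero_point)`); `quantize_dequantize` is an illustrative name, not part of the module's exported API.

```python
import numpy as np

# Minimal preview of the quantize -> dequantize round trip built below.
# Convention assumed: q = round(x/scale) + zero_point, clipped to [-128, 127];
# reconstruction x_hat = scale * (q - zero_point).

def quantize_dequantize(x: np.ndarray) -> np.ndarray:
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0          # one INT8 step in FP32 units
    zero_point = int(np.clip(round(-128 - lo / scale), -128, 127))
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.5, 0.2, 2.8], dtype=np.float32)
x_hat = quantize_dequantize(x)
print(np.abs(x - x_hat).max())  # round-trip error is bounded by about scale/2 ≈ 0.008
```

The graded `quantize_int8`/`dequantize_int8` functions below implement exactly this logic, split into separate functions with explicit scale and zero-point bookkeeping.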
- ]
- },
- {
- "cell_type": "markdown",
- "id": "26bdadc6",
- "metadata": {
- "cell_marker": "\"\"\"",
- "lines_to_next_cell": 1
- },
- "source": [
- "### INT8 Quantization - The Foundation\n",
- "\n",
- "This is the core function that converts any FP32 tensor to INT8. Think of it as a smart compression algorithm that preserves the most important information.\n",
- "\n",
- "```\n",
- "Quantization Process Visualization:\n",
- "\n",
- "Step 1: Analyze Range          Step 2: Calculate Parameters    Step 3: Apply Formula\n",
- "┌─────────────────────────┐    ┌─────────────────────────┐    ┌──────────────────────────┐\n",
- "│ Input: [-1.5, 0.2, 2.8] │    │ Min: -1.5               │    │ quantized = round(       │\n",
- "│                         │    │ Max: 2.8                │    │   value / scale          │\n",
- "│ Find min/max values     │ →  │ Range: 4.3              │ →  │   + zero_point)          │\n",
- "│                         │    │ Scale: 4.3/255 = 0.0169 │    │                          │\n",
- "│                         │    │ Zero Point: -39         │    │ Result: [-128, -27, 127] │\n",
- "└─────────────────────────┘    └─────────────────────────┘    └──────────────────────────┘\n",
- "```\n",
- "\n",
- "**Key Challenges This Function Solves:**\n",
- "- **Dynamic Range:** Each tensor has different min/max values\n",
- "- **Precision Loss:** Map 4 billion FP32 values to just 256 INT8 values\n",
- "- **Zero Preservation:** Ensure FP32 zero maps exactly to an INT8 value\n",
- "- **Affine Mapping:** Distribute quantization levels across the full observed range\n",
- "\n",
- "**Why This Algorithm:**\n",
- "- **Linear mapping** preserves relative relationships between values\n",
- "- **Asymmetric (affine) quantization** handles weight ranges that are not centered on zero\n",
- "- **Clipping to [-128, 127]** ensures valid INT8 range\n",
- "- **Round-to-nearest** minimizes quantization error\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "68d91dc9",
- "metadata": {
- "nbgrader": {
- "grade": false,
- "grade_id": "quantize_int8",
- "solution": true
- }
- },
- "outputs": [],
- "source": [
- "def quantize_int8(tensor: Tensor) -> Tuple[Tensor, float, int]:\n",
- "    \"\"\"\n",
- "    Quantize FP32 tensor to 
INT8 using asymmetric (affine) quantization.\n",
- "\n",
- "    TODO: Implement INT8 quantization with scale and zero_point calculation\n",
- "\n",
- "    APPROACH:\n",
- "    1. Find min/max values in tensor data\n",
- "    2. Calculate scale: (max_val - min_val) / 255 (INT8 range: -128 to 127)\n",
- "    3. Calculate zero_point: the INT8 offset that aligns FP32 zero with the INT8 grid\n",
- "    4. Apply quantization formula: round(value / scale + zero_point)\n",
- "    5. Clamp to INT8 range [-128, 127]\n",
- "\n",
- "    EXAMPLE:\n",
- "    >>> tensor = Tensor([[-1.0, 0.0, 2.0], [0.5, 1.5, -0.5]])\n",
- "    >>> q_tensor, scale, zero_point = quantize_int8(tensor)\n",
- "    >>> print(f\"Scale: {scale:.4f}, Zero point: {zero_point}\")\n",
- "    Scale: 0.0118, Zero point: -43\n",
- "\n",
- "    HINTS:\n",
- "    - Use np.round() for quantization\n",
- "    - Clamp with np.clip(values, -128, 127)\n",
- "    - Handle edge case where min_val == max_val (set scale=1.0)\n",
- "    \"\"\"\n",
- "    ### BEGIN SOLUTION\n",
- "    data = tensor.data\n",
- "\n",
- "    # Step 1: Find dynamic range\n",
- "    min_val = float(np.min(data))\n",
- "    max_val = float(np.max(data))\n",
- "\n",
- "    # Step 2: Handle edge case (constant tensor)\n",
- "    if abs(max_val - min_val) < 1e-8:\n",
- "        scale = 1.0\n",
- "        zero_point = 0\n",
- "        quantized_data = np.zeros_like(data, dtype=np.int8)\n",
- "        return Tensor(quantized_data), scale, zero_point\n",
- "\n",
- "    # Step 3: Calculate scale and zero_point for asymmetric quantization\n",
- "    # Map [min_val, max_val] to [-128, 127] (INT8 range)\n",
- "    scale = (max_val - min_val) / 255.0\n",
- "    zero_point = int(np.round(-128 - min_val / scale))\n",
- "\n",
- "    # Clamp zero_point to valid INT8 range\n",
- "    zero_point = int(np.clip(zero_point, -128, 127))\n",
- "\n",
- "    # Step 4: Apply quantization formula: q = round(x / scale + zero_point)\n",
- "    quantized_data = np.round(data / scale + zero_point)\n",
- "\n",
- "    # Step 5: Clamp to INT8 range and convert to int8\n",
- "    quantized_data = np.clip(quantized_data, -128, 
127).astype(np.int8)\n", - "\n", - " return Tensor(quantized_data), scale, zero_point\n", - " ### END SOLUTION\n", - "\n", - "def test_unit_quantize_int8():\n", - " \"\"\"🔬 Test INT8 quantization implementation.\"\"\"\n", - " print(\"🔬 Unit Test: INT8 Quantization...\")\n", - "\n", - " # Test basic quantization\n", - " tensor = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])\n", - " q_tensor, scale, zero_point = quantize_int8(tensor)\n", - "\n", - " # Verify quantized values are in INT8 range\n", - " assert np.all(q_tensor.data >= -128)\n", - " assert np.all(q_tensor.data <= 127)\n", - " assert isinstance(scale, float)\n", - " assert isinstance(zero_point, int)\n", - "\n", - " # Test dequantization preserves approximate values\n", - " dequantized = scale * (q_tensor.data - zero_point)\n", - " error = np.mean(np.abs(tensor.data - dequantized))\n", - " assert error < 0.2, f\"Quantization error too high: {error}\"\n", - "\n", - " # Test edge case: constant tensor\n", - " constant_tensor = Tensor([[2.0, 2.0], [2.0, 2.0]])\n", - " q_const, scale_const, zp_const = quantize_int8(constant_tensor)\n", - " assert scale_const == 1.0\n", - "\n", - " print(\"✅ INT8 quantization works correctly!\")\n", - "\n", - "test_unit_quantize_int8()" - ] - }, - { - "cell_type": "markdown", - "id": "4dc13ff2", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### INT8 Dequantization - Restoring Precision\n", - "\n", - "Dequantization is the inverse process - converting compressed INT8 values back to usable FP32. 
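- ,
- "\n",
- "# Usage sketch (illustrative values): quantize a small weight matrix and check\n",
- "# that the round-trip reconstruction error stays within one quantization step.\n",
- "w = Tensor(np.linspace(-1.0, 1.0, 9).reshape(3, 3))\n",
- "qw, s, zp = quantize_int8(w)\n",
- "restored = s * (qw.data.astype(np.float32) - zp)\n",
- "assert np.max(np.abs(w.data - restored)) <= s  # error bounded by the scale\n",
- "print(f\"round-trip error bounded by scale = {s:.4f}\")"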
This is where we \"decompress\" our quantized data.\n",
- "\n",
- "```\n",
- "Dequantization Process:\n",
- "\n",
- "INT8 Values + Parameters → FP32 Reconstruction\n",
- "\n",
- "┌───────────────────────────────┐\n",
- "│ Quantized: [-128, -27, 127]   │\n",
- "│ Scale: 0.0169                 │\n",
- "│ Zero Point: -39               │\n",
- "└───────────────────────────────┘\n",
- "               │\n",
- "               ▼ Apply Formula\n",
- "┌───────────────────────────────┐\n",
- "│ FP32 = scale ×                │\n",
- "│        (quantized − zero_point)│\n",
- "└───────────────────────────────┘\n",
- "               │\n",
- "               ▼\n",
- "┌───────────────────────────────┐\n",
- "│ Result: [-1.501, 0.202, 2.799]│\n",
- "│ Original: [-1.5, 0.2, 2.8]    │\n",
- "│ Error: [0.001, 0.002, 0.001]  │\n",
- "└───────────────────────────────┘\n",
- "               ↑\n",
- "     Excellent approximation!\n",
- "```\n",
- "\n",
- "**Why This Step Is Critical:**\n",
- "- **Neural networks expect FP32** - raw INT8 codes are meaningless without their scale and zero point\n",
- "- **Preserves computation compatibility** - works with existing matrix operations\n",
- "- **Controlled precision loss** - error is bounded and predictable\n",
- "- **Hardware flexibility** - can use FP32 or specialized INT8 operations\n",
- "\n",
- "**When Dequantization Happens:**\n",
- "- **During forward pass** - before matrix multiplications\n",
- "- **For gradient computation** - during backward pass\n",
- "- **Educational approach** - production uses INT8 GEMM directly"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c54cf336",
- "metadata": {
- "nbgrader": {
- "grade": false,
- "grade_id": "dequantize_int8",
- "solution": true
- }
- },
- "outputs": [],
- "source": [
- "def dequantize_int8(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:\n",
- "    \"\"\"\n",
- "    Dequantize INT8 tensor back to FP32.\n",
- "\n",
- "    TODO: Implement dequantization using the inverse formula\n",
- "\n",
- "    APPROACH:\n",
- "    1. Apply inverse quantization: scale * (quantized_value - zero_point)\n",
- "    2. Return as new FP32 Tensor\n",
- "\n",
- "    EXAMPLE:\n",
- "    >>> q_tensor = Tensor([[-42, 0, 85]])  # INT8 values\n",
- "    >>> scale, zero_point = 0.0314, 0\n",
- "    >>> fp32_tensor = dequantize_int8(q_tensor, scale, zero_point)\n",
- "    >>> print(fp32_tensor.data)\n",
- "    [[-1.32, 0.0, 2.67]]  # Approximate original values\n",
- "\n",
- "    HINT:\n",
- "    - Formula: dequantized = scale * (quantized - zero_point)\n",
- "    - Cast to float before subtracting to avoid INT8 overflow\n",
- "    \"\"\"\n",
- "    ### BEGIN SOLUTION\n",
- "    # Apply inverse quantization formula (cast first: int8 arithmetic would overflow)\n",
- "    dequantized_data = scale * (q_tensor.data.astype(np.float32) - zero_point)\n",
- "    return Tensor(dequantized_data.astype(np.float32))\n",
- "    ### END SOLUTION\n",
- "\n",
- "def test_unit_dequantize_int8():\n",
- "    \"\"\"🔬 Test INT8 dequantization implementation.\"\"\"\n",
- "    print(\"🔬 Unit Test: INT8 Dequantization...\")\n",
- "\n",
- "    # Test round-trip: quantize → dequantize\n",
- "    original = Tensor([[-1.5, 0.0, 3.2], [1.1, -0.8, 2.7]])\n",
- "    q_tensor, scale, zero_point = quantize_int8(original)\n",
- "    restored = dequantize_int8(q_tensor, scale, zero_point)\n",
- "\n",
- "    # Verify round-trip error is within one quantization step\n",
- "    error = np.mean(np.abs(original.data - restored.data))\n",
- "    assert error < 0.05, f\"Round-trip error too high: {error}\"\n",
- "\n",
- "    # Verify output is float32\n",
- "    assert restored.data.dtype == np.float32\n",
- "\n",
- "    print(\"✅ INT8 dequantization works correctly!\")\n",
- "\n",
- "test_unit_dequantize_int8()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "457c4bca",
- "metadata": {
- "cell_marker": "\"\"\"",
- "lines_to_next_cell": 1
- },
- "source": [
- "## Quantization Quality - Understanding the Impact\n",
- "\n",
- "### Why Distribution Matters\n",
- "\n",
- "Different types of data quantize differently. 
Let's understand how various weight distributions affect quantization quality.\n", - "\n", - "```\n", - "Quantization Quality Factors:\n", - "\n", - "┌─────────────────┬─────────────────┬─────────────────┐\n", - "│ Distribution │ Scale Usage │ Error Level │\n", - "├─────────────────┼─────────────────┼─────────────────┤\n", - "│ Uniform │ ████████████████ │ Low │\n", - "│ Normal │ ██████████████ │ Medium │\n", - "│ With Outliers │ ████ │ High │\n", - "│ Sparse (zeros) │ ████ │ High │\n", - "└─────────────────┴─────────────────┴─────────────────┘\n", - "```\n", - "\n", - "### The Scale Utilization Problem\n", - "\n", - "```\n", - "Good Quantization (Uniform): Bad Quantization (Outliers):\n", - "\n", - "Values: [-1.0 ... +1.0] Values: [-10.0, -0.1...+0.1, +10.0]\n", - " ↓ ↓\n", - "INT8: -128 ......... +127 INT8: -128 ... 0 ... +127\n", - " ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑\n", - " All levels used Most levels wasted!\n", - "\n", - "Scale: 0.0078 (good precision) Scale: 0.078 (poor precision)\n", - "Error: ~0.004 Error: ~0.04 (10× worse!)\n", - "```\n", - "\n", - "**Key Insight:** Outliers waste quantization levels and hurt precision for normal values." 
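- ,
- "\n",
- "The outlier effect can be reproduced in a few lines of plain NumPy (a minimal sketch, independent of this module's Tensor class; the helper name fake_quant is made up for illustration):\n",
- "\n",
- "```python\n",
- "import numpy as np\n",
- "\n",
- "def fake_quant(x):\n",
- "    # quantize to INT8 codes, then immediately dequantize\n",
- "    s = (x.max() - x.min()) / 255.0\n",
- "    zp = int(np.round(-128 - x.min() / s))\n",
- "    q = np.clip(np.round(x / s + zp), -128, 127)\n",
- "    return s * (q - zp)\n",
- "\n",
- "rng = np.random.default_rng(0)\n",
- "clean = rng.uniform(-1, 1, 1000)\n",
- "dirty = np.append(clean, 10.0)  # a single outlier stretches the range ~5×\n",
- "print(\"clean max error:\", np.max(np.abs(clean - fake_quant(clean))))\n",
- "print(\"dirty max error:\", np.max(np.abs(dirty - fake_quant(dirty))))\n",
- "```\n",
- "\n",
- "One extreme value inflates the scale for every other entry, so the per-value error grows in proportion.\n",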
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a28c45a7", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "analyze_quantization_error", - "solution": true - } - }, - "outputs": [], - "source": [ - "def analyze_quantization_error():\n", - " \"\"\"📊 Analyze quantization error across different distributions.\"\"\"\n", - " print(\"📊 Analyzing Quantization Error Across Distributions...\")\n", - "\n", - " distributions = {\n", - " 'uniform': np.random.uniform(-1, 1, (1000,)),\n", - " 'normal': np.random.normal(0, 0.5, (1000,)),\n", - " 'outliers': np.concatenate([np.random.normal(0, 0.1, (900,)),\n", - " np.random.uniform(-2, 2, (100,))]),\n", - " 'sparse': np.random.choice([0, 0, 0, 1], size=(1000,)) * np.random.normal(0, 1, (1000,))\n", - " }\n", - "\n", - " results = {}\n", - "\n", - " for name, data in distributions.items():\n", - " # Quantize and measure error\n", - " original = Tensor(data)\n", - " q_tensor, scale, zero_point = quantize_int8(original)\n", - " restored = dequantize_int8(q_tensor, scale, zero_point)\n", - "\n", - " # Calculate metrics\n", - " mse = np.mean((original.data - restored.data) ** 2)\n", - " max_error = np.max(np.abs(original.data - restored.data))\n", - "\n", - " results[name] = {\n", - " 'mse': mse,\n", - " 'max_error': max_error,\n", - " 'scale': scale,\n", - " 'range_ratio': (np.max(data) - np.min(data)) / scale if scale > 0 else 0\n", - " }\n", - "\n", - " print(f\"{name:8}: MSE={mse:.6f}, Max Error={max_error:.4f}, Scale={scale:.4f}\")\n", - "\n", - " print(\"\\n💡 Insights:\")\n", - " print(\"- Uniform: Low error, good scale utilization\")\n", - " print(\"- Normal: Higher error at distribution tails\")\n", - " print(\"- Outliers: Poor quantization due to extreme values\")\n", - " print(\"- Sparse: Wasted quantization levels on zeros\")\n", - "\n", - " return results\n", - "\n", - "# Analyze quantization quality\n", - "error_analysis = analyze_quantization_error()" - ] - }, - { - "cell_type": 
"markdown", - "id": "5f4bf7b6", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## QuantizedLinear - The Heart of Efficient Networks\n", - "\n", - "### Why We Need Quantized Layers\n", - "\n", - "A quantized model isn't just about storing weights in INT8 - we need layers that can work efficiently with quantized data.\n", - "\n", - "```\n", - "Regular Linear Layer: QuantizedLinear Layer:\n", - "\n", - "┌─────────────────────┐ ┌─────────────────────┐\n", - "│ Input: FP32 │ │ Input: FP32 │\n", - "│ Weights: FP32 │ │ Weights: INT8 │\n", - "│ Computation: FP32 │ VS │ Computation: Mixed │\n", - "│ Output: FP32 │ │ Output: FP32 │\n", - "│ Memory: 4× more │ │ Memory: 4× less │\n", - "└─────────────────────┘ └─────────────────────┘\n", - "```\n", - "\n", - "### The Quantized Forward Pass\n", - "\n", - "```\n", - "Quantized Linear Layer Forward Pass:\n", - "\n", - " Input (FP32) Quantized Weights (INT8)\n", - " │ │\n", - " ▼ ▼\n", - "┌─────────────────┐ ┌─────────────────┐\n", - "│ Calibrate │ │ Dequantize │\n", - "│ (optional) │ │ Weights │\n", - "└─────────────────┘ └─────────────────┘\n", - " │ │\n", - " ▼ ▼\n", - " Input (FP32) Weights (FP32)\n", - " │ │\n", - " └───────────────┬───────────────┘\n", - " ▼\n", - " ┌─────────────────┐\n", - " │ Matrix Multiply │\n", - " │ (FP32 GEMM) │\n", - " └─────────────────┘\n", - " │\n", - " ▼\n", - " Output (FP32)\n", - "\n", - "Memory Saved: 4× for weights storage!\n", - "Speed: Depends on dequantization overhead vs INT8 GEMM support\n", - "```\n", - "\n", - "### Calibration - Finding Optimal Input Quantization\n", - "\n", - "```\n", - "Calibration Process:\n", - "\n", - " Step 1: Collect Sample Inputs Step 2: Analyze Distribution Step 3: Optimize Parameters\n", - " ┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐\n", - " │ input_1: [-0.5, 0.2, ..] │ │ Min: -0.8 │ │ Scale: 0.00627 │\n", - " │ input_2: [-0.3, 0.8, ..] 
│ → │ Max: +0.8 │ → │ Zero Point: 0 │\n", - " │ input_3: [-0.1, 0.5, ..] │ │ Range: 1.6 │ │ Optimal for this data │\n", - " │ ... │ │ Distribution: Normal │ │ range and distribution │\n", - " └─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘\n", - "```\n", - "\n", - "**Why Calibration Matters:**\n", - "- **Without calibration:** Generic quantization parameters may waste precision\n", - "- **With calibration:** Parameters optimized for actual data distribution\n", - "- **Result:** Better accuracy preservation with same memory savings" - ] - }, - { - "cell_type": "markdown", - "id": "6b6a464e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### QuantizedLinear Class - Efficient Neural Network Layer\n", - "\n", - "This class replaces regular Linear layers with quantized versions that use 4× less memory while preserving functionality.\n", - "\n", - "```\n", - "QuantizedLinear Architecture:\n", - "\n", - "Creation Time: Runtime:\n", - "┌─────────────────────────┐ ┌─────────────────────────┐\n", - "│ Regular Linear Layer │ │ Input (FP32) │\n", - "│ ↓ │ │ ↓ │\n", - "│ Quantize weights → INT8 │ │ Optional: quantize input│\n", - "│ Quantize bias → INT8 │ → │ ↓ │\n", - "│ Store quantization params │ │ Dequantize weights │\n", - "│ Ready for deployment! │ │ ↓ │\n", - "└─────────────────────────┘ │ Matrix multiply (FP32) │\n", - " One-time cost │ ↓ │\n", - " │ Output (FP32) │\n", - " └─────────────────────────┘\n", - " Per-inference cost\n", - "```\n", - "\n", - "**Key Design Decisions:**\n", - "\n", - "1. **Store original layer reference** - for debugging and comparison\n", - "2. **Separate quantization parameters** - weights and bias may need different scales\n", - "3. **Calibration support** - optimize input quantization using real data\n", - "4. **FP32 computation** - educational approach, production uses INT8 GEMM\n", - "5. 
**Memory tracking** - measure actual compression achieved\n", - "\n", - "**Memory Layout Comparison:**\n", - "```\n", - "Regular Linear Layer: QuantizedLinear Layer:\n", - "┌─────────────────────────┐ ┌─────────────────────────┐\n", - "│ weights: FP32 × N │ │ q_weights: INT8 × N │\n", - "│ bias: FP32 × M │ │ q_bias: INT8 × M │\n", - "│ │ → │ weight_scale: 1 float │\n", - "│ Total: 4×(N+M) bytes │ │ weight_zero_point: 1 int│\n", - "└─────────────────────────┘ │ bias_scale: 1 float │\n", - " │ bias_zero_point: 1 int │\n", - " │ │\n", - " │ Total: (N+M) + 16 bytes │\n", - " └─────────────────────────┘\n", - " ↑\n", - " ~4× smaller!\n", - "```\n", - "\n", - "**Production vs Educational Trade-off:**\n", - "- **Our approach:** Dequantize → FP32 computation (easier to understand)\n", - "- **Production:** INT8 GEMM operations (faster, more complex)\n", - "- **Both achieve:** Same memory savings, similar accuracy" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b518a3e4", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "quantized_linear", - "solution": true - } - }, - "outputs": [], - "source": [ - "class QuantizedLinear:\n", - " \"\"\"Quantized version of Linear layer using INT8 arithmetic.\"\"\"\n", - "\n", - " def __init__(self, linear_layer: Linear):\n", - " \"\"\"\n", - " Create quantized version of existing linear layer.\n", - "\n", - " TODO: Quantize weights and bias, store quantization parameters\n", - "\n", - " APPROACH:\n", - " 1. Quantize weights using quantize_int8\n", - " 2. Quantize bias if it exists\n", - " 3. Store original layer reference for forward pass\n", - " 4. 
Store quantization parameters for dequantization\n", - "\n", - " IMPLEMENTATION STRATEGY:\n", - " - Store quantized weights, scales, and zero points\n", - " - Implement forward pass using dequantized computation (educational approach)\n", - " - Production: Would use INT8 matrix multiplication libraries\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.original_layer = linear_layer\n", - "\n", - " # Quantize weights\n", - " self.q_weight, self.weight_scale, self.weight_zero_point = quantize_int8(linear_layer.weight)\n", - "\n", - " # Quantize bias if it exists\n", - " if linear_layer.bias is not None:\n", - " self.q_bias, self.bias_scale, self.bias_zero_point = quantize_int8(linear_layer.bias)\n", - " else:\n", - " self.q_bias = None\n", - " self.bias_scale = None\n", - " self.bias_zero_point = None\n", - "\n", - " # Store input quantization parameters (set during calibration)\n", - " self.input_scale = None\n", - " self.input_zero_point = None\n", - " ### END SOLUTION\n", - "\n", - " def calibrate(self, sample_inputs: List[Tensor]):\n", - " \"\"\"\n", - " Calibrate input quantization parameters using sample data.\n", - "\n", - " TODO: Calculate optimal input quantization parameters\n", - "\n", - " APPROACH:\n", - " 1. Collect statistics from sample inputs\n", - " 2. Calculate optimal scale and zero_point for inputs\n", - " 3. 
Store for use in forward pass\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Collect all input values\n", - " all_values = []\n", - " for inp in sample_inputs:\n", - " all_values.extend(inp.data.flatten())\n", - "\n", - " all_values = np.array(all_values)\n", - "\n", - " # Calculate input quantization parameters\n", - " min_val = float(np.min(all_values))\n", - " max_val = float(np.max(all_values))\n", - "\n", - " if abs(max_val - min_val) < 1e-8:\n", - " self.input_scale = 1.0\n", - " self.input_zero_point = 0\n", - " else:\n", - " self.input_scale = (max_val - min_val) / 255.0\n", - " self.input_zero_point = int(np.round(-128 - min_val / self.input_scale))\n", - " self.input_zero_point = np.clip(self.input_zero_point, -128, 127)\n", - " ### END SOLUTION\n", - "\n", - " def forward(self, x: Tensor) -> Tensor:\n", - " \"\"\"\n", - " Forward pass with quantized computation.\n", - "\n", - " TODO: Implement quantized forward pass\n", - "\n", - " APPROACH:\n", - " 1. Quantize input (if calibrated)\n", - " 2. Dequantize weights and input for computation (educational approach)\n", - " 3. Perform matrix multiplication\n", - " 4. 
Return FP32 result\n",
- "\n",
- "        NOTE: Production quantization uses INT8 GEMM libraries for speed\n",
- "        \"\"\"\n",
- "        ### BEGIN SOLUTION\n",
- "        # For educational purposes, we dequantize and compute in FP32\n",
- "        # Production systems use specialized INT8 GEMM operations\n",
- "\n",
- "        # Step 1: If calibrated, simulate input quantization (fake quantization)\n",
- "        # so the layer sees the same precision loss it would at INT8 deployment\n",
- "        if self.input_scale is not None:\n",
- "            q_input = np.clip(np.round(x.data / self.input_scale + self.input_zero_point), -128, 127)\n",
- "            x = Tensor((self.input_scale * (q_input - self.input_zero_point)).astype(np.float32))\n",
- "\n",
- "        # Step 2: Dequantize weights\n",
- "        weight_fp32 = dequantize_int8(self.q_weight, self.weight_scale, self.weight_zero_point)\n",
- "\n",
- "        # Step 3: Perform computation (same as original layer)\n",
- "        result = x.matmul(weight_fp32)\n",
- "\n",
- "        # Step 4: Add bias if it exists\n",
- "        if self.q_bias is not None:\n",
- "            bias_fp32 = dequantize_int8(self.q_bias, self.bias_scale, self.bias_zero_point)\n",
- "            result = Tensor(result.data + bias_fp32.data)\n",
- "\n",
- "        return result\n",
- "        ### END SOLUTION\n",
- "\n",
- "    def __call__(self, x: Tensor) -> Tensor:\n",
- "        \"\"\"Allows the quantized linear layer to be called like a function.\"\"\"\n",
- "        return self.forward(x)\n",
- "\n",
- "    def parameters(self) -> List[Tensor]:\n",
- "        \"\"\"Return quantized parameters.\"\"\"\n",
- "        params = [self.q_weight]\n",
- "        if self.q_bias is not None:\n",
- "            params.append(self.q_bias)\n",
- "        return params\n",
- "\n",
- "    def memory_usage(self) -> Dict[str, float]:\n",
- "        \"\"\"Calculate memory usage in bytes.\"\"\"\n",
- "        ### BEGIN SOLUTION\n",
- "        # Original FP32 usage\n",
- "        original_weight_bytes = self.original_layer.weight.data.size * 4  # 4 bytes per FP32\n",
- "        original_bias_bytes = 0\n",
- "        if self.original_layer.bias is not None:\n",
- "            original_bias_bytes = self.original_layer.bias.data.size * 4\n",
- "\n",
- "        # Quantized INT8 usage\n",
- "        quantized_weight_bytes = self.q_weight.data.size * 1  # 1 byte per INT8\n",
- "        quantized_bias_bytes = 0\n",
- "        if self.q_bias is not None:\n",
- "            quantized_bias_bytes = self.q_bias.data.size * 1\n",
- "\n",
- "        # Add overhead for scales and zero points (small)\n",
- "        overhead_bytes = 8 * 2  # 2 floats + 2 ints for weight/bias quantization params\n",
- "\n", - " return {\n", - " 'original_bytes': original_weight_bytes + original_bias_bytes,\n", - " 'quantized_bytes': quantized_weight_bytes + quantized_bias_bytes + overhead_bytes,\n", - " 'compression_ratio': (original_weight_bytes + original_bias_bytes) /\n", - " (quantized_weight_bytes + quantized_bias_bytes + overhead_bytes)\n", - " }\n", - " ### END SOLUTION\n", - "\n", - "def test_unit_quantized_linear():\n", - " \"\"\"🔬 Test QuantizedLinear implementation.\"\"\"\n", - " print(\"🔬 Unit Test: QuantizedLinear...\")\n", - "\n", - " # Create original linear layer\n", - " original = Linear(4, 3)\n", - " original.weight = Tensor(np.random.randn(4, 3) * 0.5) # Smaller range for testing\n", - " original.bias = Tensor(np.random.randn(3) * 0.1)\n", - "\n", - " # Create quantized version\n", - " quantized = QuantizedLinear(original)\n", - "\n", - " # Test forward pass\n", - " x = Tensor(np.random.randn(2, 4) * 0.5)\n", - "\n", - " # Original forward pass\n", - " original_output = original.forward(x)\n", - "\n", - " # Quantized forward pass\n", - " quantized_output = quantized.forward(x)\n", - "\n", - " # Compare outputs (should be close but not identical due to quantization)\n", - " error = np.mean(np.abs(original_output.data - quantized_output.data))\n", - " assert error < 1.0, f\"Quantization error too high: {error}\"\n", - "\n", - " # Test memory usage\n", - " memory_info = quantized.memory_usage()\n", - " assert memory_info['compression_ratio'] > 3.0, \"Should achieve ~4× compression\"\n", - "\n", - " print(f\" Memory reduction: {memory_info['compression_ratio']:.1f}×\")\n", - " print(\"✅ QuantizedLinear works correctly!\")\n", - "\n", - "test_unit_quantized_linear()" - ] - }, - { - "cell_type": "markdown", - "id": "557295a5", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 4. 
Integration - Scaling to Full Neural Networks\n", - "\n", - "### The Model Quantization Challenge\n", - "\n", - "Quantizing individual tensors is useful, but real applications need to quantize entire neural networks with multiple layers, activations, and complex data flows.\n", - "\n", - "```\n", - "Model Quantization Process:\n", - "\n", - "Original Model: Quantized Model:\n", - "┌─────────────────────────────┐ ┌─────────────────────────────┐\n", - "│ Linear(784, 128) [FP32] │ │ QuantizedLinear(784, 128) │\n", - "│ ReLU() [FP32] │ │ ReLU() [FP32] │\n", - "│ Linear(128, 64) [FP32] │ → │ QuantizedLinear(128, 64) │\n", - "│ ReLU() [FP32] │ │ ReLU() [FP32] │\n", - "│ Linear(64, 10) [FP32] │ │ QuantizedLinear(64, 10) │\n", - "└─────────────────────────────┘ └─────────────────────────────┘\n", - " Memory: 100% Memory: ~25%\n", - " Speed: Baseline Speed: 2-4× faster\n", - "```\n", - "\n", - "### Smart Layer Selection\n", - "\n", - "Not all layers benefit equally from quantization:\n", - "\n", - "```\n", - "Layer Quantization Strategy:\n", - "\n", - "┌─────────────────┬─────────────────┬─────────────────────────────┐\n", - "│ Layer Type │ Quantize? 
│ Reason                       │\n",
- "├─────────────────┼─────────────────┼──────────────────────────────┤\n",
- "│ Linear/Dense    │ ✅ YES          │ Most parameters, big savings │\n",
- "│ Convolution     │ ✅ YES          │ Many weights, good candidate │\n",
- "│ Embedding       │ ✅ YES          │ Large lookup tables          │\n",
- "│ ReLU/Sigmoid    │ ❌ NO           │ No parameters to quantize    │\n",
- "│ BatchNorm       │ 🤔 MAYBE        │ Few params, may hurt         │\n",
- "│ First Layer     │ 🤔 MAYBE        │ Often sensitive to precision │\n",
- "│ Last Layer      │ 🤔 MAYBE        │ Output quality critical      │\n",
- "└─────────────────┴─────────────────┴──────────────────────────────┘\n",
- "```\n",
- "\n",
- "### Calibration Data Flow\n",
- "\n",
- "```\n",
- "End-to-End Calibration:\n",
- "\n",
- "Calibration Input          Layer-by-Layer Processing\n",
- "      │                              │\n",
- "      ▼                              ▼\n",
- "┌─────────────┐    ┌──────────────────────────────────────────┐\n",
- "│ Sample Data │ →  │ Layer 1: Collect activation statistics   │\n",
- "│ [batch of   │    │          ↓                               │\n",
- "│  real data] │    │ Layer 2: Collect activation statistics   │\n",
- "└─────────────┘    │          ↓                               │\n",
- "                   │ Layer 3: Collect activation statistics   │\n",
- "                   │          ↓                               │\n",
- "                   │ Optimize quantization parameters         │\n",
- "                   └──────────────────────────────────────────┘\n",
- "                              │\n",
- "                              ▼\n",
- "                    Ready for deployment!\n",
- "```\n",
- "\n",
- "### Memory Impact Visualization\n",
- "\n",
- "```\n",
- "Model Memory Breakdown:\n",
- "\n",
- "Before Quantization:        After Quantization:\n",
- "┌─────────────────────┐     ┌─────────────────────┐\n",
- "│ Layer 1:  3.1MB     │     │ Layer 1:  0.8MB     │ (-75%)\n",
- "│ Layer 2:  0.5MB     │  →  │ Layer 2:  0.1MB     │ (-75%)\n",
- "│ Layer 3:  0.3MB     │     │ Layer 3:  0.1MB     │ (-75%)\n",
- "│ Total:    3.9MB     │     │ Total:    1.0MB     │ (-74%)\n",
- "└─────────────────────┘     └─────────────────────┘\n",
- "\n",
- "  Typical mobile phone memory: 4-8GB\n",
- "  At 1.0MB, roughly 4,000 copies of the model now fit in 4GB (4× more than before)\n",
- "```\n",
- "\n",
- "Now let's implement the functions that make this transformation possible!"
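- ,
- "\n",
- "As a sanity check, the memory arithmetic above can be reproduced for the 784-128-64-10 MLP from the diagram (a back-of-envelope sketch; the 16 bytes of per-layer overhead for scales and zero points is an assumption):\n",
- "\n",
- "```python\n",
- "shapes = [(784, 128), (128, 64), (64, 10)]\n",
- "fp32_bytes = sum((i * o + o) * 4 for i, o in shapes)  # 4 bytes per parameter\n",
- "int8_bytes = sum((i * o + o) * 1 + 16 for i, o in shapes)  # 1 byte + overhead\n",
- "print(f\"FP32: {fp32_bytes/1e6:.2f} MB, INT8: {int8_bytes/1e6:.2f} MB, \"\n",
- "      f\"ratio: {fp32_bytes/int8_bytes:.2f}x\")\n",
- "```"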
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d881be8c",
- "metadata": {
- "cell_marker": "\"\"\"",
- "lines_to_next_cell": 1
- },
- "source": [
- "### Model Quantization - Scaling to Full Networks\n",
- "\n",
- "This function transforms entire neural networks from FP32 to quantized versions. It's like upgrading a whole building to be more energy efficient!\n",
- "\n",
- "```\n",
- "Model Transformation Process:\n",
- "\n",
- "Input Model:                       Quantized Model:\n",
- "┌─────────────────────────────┐    ┌─────────────────────────────┐\n",
- "│ layers[0]: Linear(784, 128) │    │ layers[0]: QuantizedLinear  │\n",
- "│ layers[1]: ReLU()           │    │ layers[1]: ReLU()           │\n",
- "│ layers[2]: Linear(128, 64)  │ →  │ layers[2]: QuantizedLinear  │\n",
- "│ layers[3]: ReLU()           │    │ layers[3]: ReLU()           │\n",
- "│ layers[4]: Linear(64, 10)   │    │ layers[4]: QuantizedLinear  │\n",
- "└─────────────────────────────┘    └─────────────────────────────┘\n",
- "  Memory: 100%                       Memory: ~25%\n",
- "  Interface: Same                    Interface: Identical\n",
- "```\n",
- "\n",
- "**Smart Layer Selection Logic:**\n",
- "```\n",
- "Quantization Decision Tree:\n",
- "\n",
- "For each layer in model:\n",
- "  │\n",
- "  ├── Is it a Linear layer?\n",
- "  │     │\n",
- "  │     └── YES → Replace with QuantizedLinear\n",
- "  │\n",
- "  └── Is it ReLU/another activation?\n",
- "        │\n",
- "        └── YES → Keep unchanged (no parameters to quantize)\n",
- "```\n",
- "\n",
- "**Calibration Integration:**\n",
- "```\n",
- "Calibration Data Flow:\n",
- "\n",
- "   Input Data                 Layer-by-Layer Processing\n",
- "       │                                │\n",
- "       ▼                                ▼\n",
- " ┌─────────────────┐    ┌───────────────────────────────────────────────────────────┐\n",
- " │ Sample Batch 1  │    │ Layer 0: Forward → Collect activation statistics          │\n",
- " │ Sample Batch 2  │ →  │          ↓                                                │\n",
- " │ ... 
│ │ Layer 2: Forward → Collect activation statistics │\n", - " │ Sample Batch N │ │ ↓ │\n", - " └─────────────────┘ │ Layer 4: Forward → Collect activation statistics │\n", - " │ ↓ │\n", - " │ For each layer: calibrate optimal quantization │\n", - " └───────────────────────────────────────────────────────────┘\n", - "```\n", - "\n", - "**Why In-Place Modification:**\n", - "- **Preserves model structure** - Same interface, same behavior\n", - "- **Memory efficient** - No copying of large tensors\n", - "- **Drop-in replacement** - Existing code works unchanged\n", - "- **Gradual quantization** - Can selectively quantize sensitive layers\n", - "\n", - "**Deployment Benefits:**\n", - "```\n", - "Before Quantization: After Quantization:\n", - "┌─────────────────────────┐ ┌─────────────────────────┐\n", - "│ ❌ Can't fit on phone │ │ ✅ Fits on mobile device │\n", - "│ ❌ Slow cloud deployment │ │ ✅ Fast edge inference │\n", - "│ ❌ High memory usage │ → │ ✅ 4× memory efficiency │\n", - "│ ❌ Expensive to serve │ │ ✅ Lower serving costs │\n", - "│ ❌ Battery drain │ │ ✅ Extended battery life │\n", - "└─────────────────────────┘ └─────────────────────────┘\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "813db571", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "quantize_model", - "solution": true - } - }, - "outputs": [], - "source": [ - "def quantize_model(model, calibration_data: Optional[List[Tensor]] = None) -> None:\n", - " \"\"\"\n", - " Quantize all Linear layers in a model in-place.\n", - "\n", - " TODO: Replace all Linear layers with QuantizedLinear versions\n", - "\n", - " APPROACH:\n", - " 1. Find all Linear layers in the model\n", - " 2. Replace each with QuantizedLinear version\n", - " 3. If calibration data provided, calibrate input quantization\n", - " 4. 
Handle Sequential containers properly\n", - "\n", - " EXAMPLE:\n", - " >>> model = Sequential(Linear(10, 5), ReLU(), Linear(5, 2))\n", - " >>> quantize_model(model)\n", - " >>> # Now model uses quantized layers\n", - "\n", - " HINT:\n", - " - Handle Sequential.layers list for layer replacement\n", - " - Use isinstance(layer, Linear) to identify layers to quantize\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if hasattr(model, 'layers'): # Sequential model\n", - " for i, layer in enumerate(model.layers):\n", - " if isinstance(layer, Linear):\n", - " # Replace with quantized version\n", - " quantized_layer = QuantizedLinear(layer)\n", - "\n", - " # Calibrate if data provided\n", - " if calibration_data is not None:\n", - " # Run forward passes to get intermediate activations\n", - " sample_inputs = []\n", - " for data in calibration_data[:10]: # Use first 10 samples for efficiency\n", - " # Forward through layers up to this point\n", - " x = data\n", - " for j in range(i):\n", - " if hasattr(model.layers[j], 'forward'):\n", - " x = model.layers[j].forward(x)\n", - " sample_inputs.append(x)\n", - "\n", - " quantized_layer.calibrate(sample_inputs)\n", - "\n", - " model.layers[i] = quantized_layer\n", - "\n", - " elif isinstance(model, Linear): # Single Linear layer\n", - " # Can't replace in-place for single layer, user should handle\n", - " raise ValueError(\"Cannot quantize single Linear layer in-place. 
Use QuantizedLinear directly.\")\n", - "\n", - " else:\n", - " raise ValueError(f\"Unsupported model type: {type(model)}\")\n", - " ### END SOLUTION\n", - "\n", - "def test_unit_quantize_model():\n", - " \"\"\"🔬 Test model quantization implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Model Quantization...\")\n", - "\n", - " # Create test model\n", - " model = Sequential(\n", - " Linear(4, 8),\n", - " ReLU(),\n", - " Linear(8, 3)\n", - " )\n", - "\n", - " # Initialize weights\n", - " model.layers[0].weight = Tensor(np.random.randn(4, 8) * 0.5)\n", - " model.layers[0].bias = Tensor(np.random.randn(8) * 0.1)\n", - " model.layers[2].weight = Tensor(np.random.randn(8, 3) * 0.5)\n", - " model.layers[2].bias = Tensor(np.random.randn(3) * 0.1)\n", - "\n", - " # Test original model\n", - " x = Tensor(np.random.randn(2, 4))\n", - " original_output = model.forward(x)\n", - "\n", - " # Create calibration data\n", - " calibration_data = [Tensor(np.random.randn(1, 4)) for _ in range(5)]\n", - "\n", - " # Quantize model\n", - " quantize_model(model, calibration_data)\n", - "\n", - " # Verify layers were replaced\n", - " assert isinstance(model.layers[0], QuantizedLinear)\n", - " assert isinstance(model.layers[1], ReLU) # Should remain unchanged\n", - " assert isinstance(model.layers[2], QuantizedLinear)\n", - "\n", - " # Test quantized model\n", - " quantized_output = model.forward(x)\n", - "\n", - " # Compare outputs\n", - " error = np.mean(np.abs(original_output.data - quantized_output.data))\n", - " print(f\" Model quantization error: {error:.4f}\")\n", - " assert error < 2.0, f\"Model quantization error too high: {error}\"\n", - "\n", - " print(\"✅ Model quantization works correctly!\")\n", - "\n", - "test_unit_quantize_model()" - ] - }, - { - "cell_type": "markdown", - "id": "3769f169", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Model Size Comparison - Measuring the Impact\n", - "\n", - "This function provides detailed 
analysis of memory savings achieved through quantization. It's like a before/after comparison for model efficiency.\n",
- "\n",
- "```\n",
- "Memory Analysis Framework:\n",
- "\n",
- "┌─────────────────┬─────────────────┬──────────────────┬─────────────────┐\n",
- "│ Component       │ Original (FP32) │ Quantized (INT8) │ Savings         │\n",
- "├─────────────────┼─────────────────┼──────────────────┼─────────────────┤\n",
- "│ Layer 1 weights │ 12.8 MB         │ 3.2 MB           │ 9.6 MB   (75%)  │\n",
- "│ Layer 1 bias    │ 0.5 MB          │ 0.1 MB           │ 0.4 MB   (75%)  │\n",
- "│ Layer 2 weights │ 2.0 MB          │ 0.5 MB           │ 1.5 MB   (75%)  │\n",
- "│ Layer 2 bias    │ 0.3 MB          │ 0.1 MB           │ 0.2 MB   (67%)  │\n",
- "│ Overhead        │ 0.0 MB          │ 0.02 MB          │ -0.02 MB        │\n",
- "├─────────────────┼─────────────────┼──────────────────┼─────────────────┤\n",
- "│ TOTAL           │ 15.6 MB         │ 3.92 MB          │ 11.7 MB  (74%)  │\n",
- "└─────────────────┴─────────────────┴──────────────────┴─────────────────┘\n",
- "                                            ↑\n",
- "                                4× compression ratio!\n",
- "```\n",
- "\n",
- "**Comprehensive Metrics Provided:**\n",
- "```\n",
- "Output Dictionary:\n",
- "{\n",
- "    'original_params': 4000000,      # Total parameter count\n",
- "    'quantized_params': 4000000,     # Same count, different precision\n",
- "    'original_bytes': 16000000,      # 4 bytes per FP32 parameter\n",
- "    'quantized_bytes': 4000016,      # 1 byte per INT8 + overhead\n",
- "    'compression_ratio': 4.00,       # Nearly 4× compression\n",
- "    'memory_saved_mb': 11.44,        # Absolute savings in MB (11,999,984 bytes)\n",
- "    'memory_saved_percent': 75.0     # Relative savings percentage\n",
- "}\n",
- "```\n",
- "\n",
- "**Why These Metrics Matter:**\n",
- "\n",
- "**For Developers:**\n",
- "- **compression_ratio** - How much smaller is the model?\n",
- "- **memory_saved_mb** - Actual bytes freed up\n",
- "- **memory_saved_percent** - Efficiency improvement\n",
- "\n",
- "**For Deployment:**\n",
- "- **Model fits in device memory?** Check 
memory_saved_mb\n", - "- **Network transfer time?** Reduced by compression_ratio\n", - "- **Disk storage savings?** Shown by memory_saved_percent\n", - "\n", - "**For Business:**\n", - "- **Cloud costs** reduced by compression_ratio\n", - "- **User experience** improved (faster downloads)\n", - "- **Device support** expanded (fits on more devices)\n", - "\n", - "**Validation Checks:**\n", - "- **Parameter count preservation** - same functionality\n", - "- **Reasonable compression ratio** - should be ~4× for INT8\n", - "- **Minimal overhead** - quantization parameters are tiny" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "67b85991", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "compare_model_sizes", - "solution": true - } - }, - "outputs": [], - "source": [ - "def compare_model_sizes(original_model, quantized_model) -> Dict[str, float]:\n", - " \"\"\"\n", - " Compare memory usage between original and quantized models.\n", - "\n", - " TODO: Calculate comprehensive memory comparison\n", - "\n", - " APPROACH:\n", - " 1. Count parameters in both models\n", - " 2. Calculate bytes used (FP32 vs INT8)\n", - " 3. Include quantization overhead\n", - " 4. 
Return comparison metrics\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Count original model parameters\n", - " original_params = 0\n", - " original_bytes = 0\n", - "\n", - " if hasattr(original_model, 'layers'):\n", - " for layer in original_model.layers:\n", - " if hasattr(layer, 'parameters'):\n", - " params = layer.parameters()\n", - " for param in params:\n", - " original_params += param.data.size\n", - " original_bytes += param.data.size * 4 # 4 bytes per FP32\n", - "\n", - " # Count quantized model parameters\n", - " quantized_params = 0\n", - " quantized_bytes = 0\n", - "\n", - " if hasattr(quantized_model, 'layers'):\n", - " for layer in quantized_model.layers:\n", - " if isinstance(layer, QuantizedLinear):\n", - " memory_info = layer.memory_usage()\n", - " quantized_bytes += memory_info['quantized_bytes']\n", - " params = layer.parameters()\n", - " for param in params:\n", - " quantized_params += param.data.size\n", - " elif hasattr(layer, 'parameters'):\n", - " # Non-quantized layers\n", - " params = layer.parameters()\n", - " for param in params:\n", - " quantized_params += param.data.size\n", - " quantized_bytes += param.data.size * 4\n", - "\n", - " compression_ratio = original_bytes / quantized_bytes if quantized_bytes > 0 else 1.0\n", - " memory_saved = original_bytes - quantized_bytes\n", - "\n", - " return {\n", - " 'original_params': original_params,\n", - " 'quantized_params': quantized_params,\n", - " 'original_bytes': original_bytes,\n", - " 'quantized_bytes': quantized_bytes,\n", - " 'compression_ratio': compression_ratio,\n", - " 'memory_saved_mb': memory_saved / (1024 * 1024),\n", - " 'memory_saved_percent': (memory_saved / original_bytes) * 100 if original_bytes > 0 else 0\n", - " }\n", - " ### END SOLUTION\n", - "\n", - "def test_unit_compare_model_sizes():\n", - " \"\"\"🔬 Test model size comparison.\"\"\"\n", - " print(\"🔬 Unit Test: Model Size Comparison...\")\n", - "\n", - " # Create and quantize a model for testing\n", - " 
original_model = Sequential(Linear(100, 50), ReLU(), Linear(50, 10))\n", - " original_model.layers[0].weight = Tensor(np.random.randn(100, 50))\n", - " original_model.layers[0].bias = Tensor(np.random.randn(50))\n", - " original_model.layers[2].weight = Tensor(np.random.randn(50, 10))\n", - " original_model.layers[2].bias = Tensor(np.random.randn(10))\n", - "\n", - " # Create quantized copy\n", - " quantized_model = Sequential(Linear(100, 50), ReLU(), Linear(50, 10))\n", - " quantized_model.layers[0].weight = Tensor(np.random.randn(100, 50))\n", - " quantized_model.layers[0].bias = Tensor(np.random.randn(50))\n", - " quantized_model.layers[2].weight = Tensor(np.random.randn(50, 10))\n", - " quantized_model.layers[2].bias = Tensor(np.random.randn(10))\n", - "\n", - " quantize_model(quantized_model)\n", - "\n", - " # Compare sizes\n", - " comparison = compare_model_sizes(original_model, quantized_model)\n", - "\n", - " # Verify compression achieved\n", - " assert comparison['compression_ratio'] > 2.0, \"Should achieve significant compression\"\n", - " assert comparison['memory_saved_percent'] > 50, \"Should save >50% memory\"\n", - "\n", - " print(f\" Compression ratio: {comparison['compression_ratio']:.1f}×\")\n", - " print(f\" Memory saved: {comparison['memory_saved_percent']:.1f}%\")\n", - " print(\"✅ Model size comparison works correctly!\")\n", - "\n", - "test_unit_compare_model_sizes()" - ] - }, - { - "cell_type": "markdown", - "id": "028fd2f1", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 5. Systems Analysis - Real-World Performance Impact\n", - "\n", - "### Understanding Production Trade-offs\n", - "\n", - "Quantization isn't just about smaller models - it's about enabling entirely new deployment scenarios. 
Let's measure the real impact across different model scales.\n", - "\n", - "```\n", - "Production Deployment Scenarios:\n", - "\n", - "┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐\n", - "│ Deployment │ Memory Limit │ Speed Needs │ Quantization Fit │\n", - "├──────────────────┼──────────────────┼──────────────────┼──────────────────┤\n", - "│ Mobile Phone │ 100-500MB │ <100ms latency │ ✅ Essential │\n", - "│ Edge Device │ 50-200MB │ Real-time │ ✅ Critical │\n", - "│ Cloud GPU │ 16-80GB │ High throughput │ 🤔 Optional │\n", - "│ Embedded MCU │ 1-10MB │ Ultra-low power │ ✅ Mandatory │\n", - "└──────────────────┴──────────────────┴──────────────────┴──────────────────┘\n", - "```\n", - "\n", - "### The Performance Testing Framework\n", - "\n", - "We'll measure quantization impact across three critical dimensions:\n", - "\n", - "```\n", - "Performance Analysis Framework:\n", - "\n", - "1. Memory Efficiency 2. Inference Speed 3. Accuracy Preservation\n", - "┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐\n", - "│ • Model size (MB) │ │ • Forward pass time │ │ • MSE vs original │\n", - "│ • Compression ratio │ │ • Throughput (fps) │ │ • Relative error │\n", - "│ • Memory bandwidth │ │ • Latency (ms) │ │ • Distribution │\n", - "└─────────────────────┘ └─────────────────────┘ └─────────────────────┘\n", - "```\n", - "\n", - "### Expected Results Preview\n", - "\n", - "```\n", - "Typical Quantization Results:\n", - "\n", - "Model Size: Small (1-10MB) Medium (10-100MB) Large (100MB+)\n", - " ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐\n", - "Compression: │ 3.8× reduction │ │ 3.9× reduction │ │ 4.0× reduction │\n", - "Speed: │ 1.2× faster │ │ 2.1× faster │ │ 3.2× faster │\n", - "Accuracy: │ 0.1% loss │ │ 0.3% loss │ │ 0.5% loss │\n", - " └─────────────────┘ └─────────────────┘ └─────────────────┘\n", - "\n", - "Key Insight: Larger models benefit more from quantization!\n", - "```\n", - "\n", - "Let's run 
comprehensive tests to validate these expectations and understand the underlying patterns." - ] - }, - { - "cell_type": "markdown", - "id": "a1f6212a", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Performance Analysis - Real-World Benchmarking\n", - "\n", - "This comprehensive analysis measures quantization impact across the three critical dimensions: memory, speed, and accuracy.\n", - "\n", - "```\n", - "Performance Testing Strategy:\n", - "\n", - "┌────────────────────────────────────────────────────────────────────────────────────┐\n", - "│ Test Model Configurations │\n", - "├────────────────────────────┬────────────────────────────┬────────────────────────────┤\n", - "│ Model Type │ Architecture │ Use Case │\n", - "├────────────────────────────┼────────────────────────────┼────────────────────────────┤\n", - "│ Small MLP │ 64 → 32 → 10 │ Edge Device │\n", - "│ Medium MLP │ 512 → 256 → 128 → 10 │ Mobile App │\n", - "│ Large MLP │ 2048 → 1024 → 512 → 10│ Server Deployment │\n", - "└────────────────────────────┴────────────────────────────┴────────────────────────────┘\n", - "```\n", - "\n", - "**Performance Measurement Pipeline:**\n", - "```\n", - "For Each Model Configuration:\n", - "\n", - " Create Original Model Create Quantized Model Comparative Analysis\n", - " │ │ │\n", - " ▼ ▼ ▼\n", - " ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐\n", - " │ Initialize weights │ │ Copy weights │ │ Memory analysis │\n", - " │ Random test data │ │ Apply quantization│ │ Speed benchmarks │\n", - " │ Forward pass │ │ Calibrate layers │ │ Accuracy testing │\n", - " │ Timing measurements│ │ Forward pass │ │ Trade-off analysis│\n", - " └─────────────────┘ └─────────────────┘ └─────────────────┘\n", - "```\n", - "\n", - "**Expected Performance Patterns:**\n", - "```\n", - "Model Scaling Effects:\n", - "\n", - " Memory Usage Inference Speed Accuracy Loss\n", - " │ │ │\n", - " ▼ ▼ ▼\n", - "\n", - "4× │ ############### 
FP32 3× │ INT8 1% │ ####\n", - " │ │ ############### FP32 │\n", - "3× │ 2× │ 0.5% │ ##\n", - " │ ######### INT8 │ ########### INT8 │\n", - "2× │ 1× │ 0.1% │ #\n", - " │ │ ####### │\n", - "1× │ │ 0% └────────────────────────────────────────────────────\n", - " └──────────────────────────────────────────────────── └──────────────────────────────────────────────────── Small Medium Large\n", - " Small Medium Large Small Medium Large\n", - "\n", - "Key Insight: Larger models benefit more from quantization!\n", - "```\n", - "\n", - "**Real-World Impact Translation:**\n", - "- **Memory savings** → More models fit on device, lower cloud costs\n", - "- **Speed improvements** → Better user experience, real-time applications\n", - "- **Accuracy preservation** → Maintains model quality, no retraining needed" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "88001546", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "analyze_quantization_performance", - "solution": true - } - }, - "outputs": [], - "source": [ - "def analyze_quantization_performance():\n", - " \"\"\"📊 Comprehensive analysis of quantization benefits and trade-offs.\"\"\"\n", - " print(\"📊 Analyzing Quantization Performance Across Model Sizes...\")\n", - "\n", - " # Test different model configurations\n", - " configs = [\n", - " {'name': 'Small MLP', 'layers': [64, 32, 10], 'batch_size': 32},\n", - " {'name': 'Medium MLP', 'layers': [512, 256, 128, 10], 'batch_size': 64},\n", - " {'name': 'Large MLP', 'layers': [2048, 1024, 512, 10], 'batch_size': 128},\n", - " ]\n", - "\n", - " results = []\n", - "\n", - " for config in configs:\n", - " print(f\"\\n🔍 Testing {config['name']}...\")\n", - "\n", - " # Create original model\n", - " layers = []\n", - " for i in range(len(config['layers']) - 1):\n", - " layers.append(Linear(config['layers'][i], config['layers'][i+1]))\n", - " if i < len(config['layers']) - 2: # Add ReLU except for last layer\n", - " layers.append(ReLU())\n", - 
"\n", - " original_model = Sequential(*layers)\n", - "\n", - " # Initialize weights\n", - " for layer in original_model.layers:\n", - " if isinstance(layer, Linear):\n", - " layer.weight = Tensor(np.random.randn(*layer.weight.shape) * 0.1)\n", - " layer.bias = Tensor(np.random.randn(*layer.bias.shape) * 0.01)\n", - "\n", - " # Create quantized copy\n", - " quantized_model = Sequential(*layers)\n", - " for i, layer in enumerate(original_model.layers):\n", - " if isinstance(layer, Linear):\n", - " quantized_model.layers[i].weight = Tensor(layer.weight.data.copy())\n", - " quantized_model.layers[i].bias = Tensor(layer.bias.data.copy())\n", - "\n", - " # Generate calibration data\n", - " input_size = config['layers'][0]\n", - " calibration_data = [Tensor(np.random.randn(1, input_size)) for _ in range(10)]\n", - "\n", - " # Quantize model\n", - " quantize_model(quantized_model, calibration_data)\n", - "\n", - " # Measure performance\n", - " test_input = Tensor(np.random.randn(config['batch_size'], input_size))\n", - "\n", - " # Time original model\n", - " start_time = time.time()\n", - " for _ in range(10):\n", - " original_output = original_model.forward(test_input)\n", - " original_time = (time.time() - start_time) / 10\n", - "\n", - " # Time quantized model\n", - " start_time = time.time()\n", - " for _ in range(10):\n", - " quantized_output = quantized_model.forward(test_input)\n", - " quantized_time = (time.time() - start_time) / 10\n", - "\n", - " # Calculate accuracy preservation (using MSE as proxy)\n", - " mse = np.mean((original_output.data - quantized_output.data) ** 2)\n", - " relative_error = np.sqrt(mse) / (np.std(original_output.data) + 1e-8)\n", - "\n", - " # Memory comparison\n", - " memory_comparison = compare_model_sizes(original_model, quantized_model)\n", - "\n", - " result = {\n", - " 'name': config['name'],\n", - " 'original_time': original_time * 1000, # Convert to ms\n", - " 'quantized_time': quantized_time * 1000,\n", - " 'speedup': 
original_time / quantized_time if quantized_time > 0 else 1.0,\n", - " 'compression_ratio': memory_comparison['compression_ratio'],\n", - " 'relative_error': relative_error,\n", - " 'memory_saved_mb': memory_comparison['memory_saved_mb']\n", - " }\n", - "\n", - " results.append(result)\n", - "\n", - " print(f\" Speedup: {result['speedup']:.1f}×\")\n", - " print(f\" Compression: {result['compression_ratio']:.1f}×\")\n", - " print(f\" Error: {result['relative_error']:.1%}\")\n", - " print(f\" Memory saved: {result['memory_saved_mb']:.1f}MB\")\n", - "\n", - " # Summary analysis\n", - " print(f\"\\n📈 QUANTIZATION PERFORMANCE SUMMARY\")\n", - " print(\"=\" * 50)\n", - "\n", - " avg_speedup = np.mean([r['speedup'] for r in results])\n", - " avg_compression = np.mean([r['compression_ratio'] for r in results])\n", - " avg_error = np.mean([r['relative_error'] for r in results])\n", - " total_memory_saved = sum([r['memory_saved_mb'] for r in results])\n", - "\n", - " print(f\"Average speedup: {avg_speedup:.1f}×\")\n", - " print(f\"Average compression: {avg_compression:.1f}×\")\n", - " print(f\"Average relative error: {avg_error:.1%}\")\n", - " print(f\"Total memory saved: {total_memory_saved:.1f}MB\")\n", - "\n", - " print(f\"\\n💡 Key Insights:\")\n", - " print(f\"- Quantization achieves ~{avg_compression:.0f}× memory reduction\")\n", - " print(f\"- Typical speedup: {avg_speedup:.1f}× (varies by hardware)\")\n", - " print(f\"- Accuracy loss: <{avg_error:.1%} for well-calibrated models\")\n", - " print(f\"- Best for: Memory-constrained deployment\")\n", - "\n", - " return results\n", - "\n", - "# Run comprehensive performance analysis\n", - "performance_results = analyze_quantization_performance()" - ] - }, - { - "cell_type": "markdown", - "id": "a81e0afc", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Quantization Error Visualization - Seeing the Impact\n", - "\n", - "### Understanding Distribution Effects\n", - "\n", - "Different weight distributions 
quantize with varying quality. Let's visualize this to understand when quantization works well and when it struggles.\n", - "\n", - "```\n", - "Visualization Strategy:\n", - "\n", - "┌─────────────────────────────────────────────────────────────────────────────┐\n", - "│ Weight Distribution Analysis │\n", - "├─────────────────────┬─────────────────────┬─────────────────────────────────┤\n", - "│ Distribution Type │ Expected Quality │ Key Challenge │\n", - "├─────────────────────┼─────────────────────┼─────────────────────────────────┤\n", - "│ Normal (Gaussian) │ Good │ Tail values may be clipped │\n", - "│ Uniform │ Excellent │ Perfect scale utilization │\n", - "│ Sparse (many zeros) │ Poor │ Wasted quantization levels │\n", - "│ Heavy-tailed │ Very Poor │ Outliers dominate scale │\n", - "└─────────────────────┴─────────────────────┴─────────────────────────────────┘\n", - "```\n", - "\n", - "### Quantization Quality Patterns\n", - "\n", - "```\n", - "Ideal Quantization: Problematic Quantization:\n", - "\n", - "Original: [████████████████████] Original: [██ ████ ██]\n", - " ↓ ↓\n", - "Quantized: [████████████████████] Quantized: [██....████....██]\n", - " Perfect reconstruction Lost precision\n", - "\n", - "Scale efficiently used Scale poorly used\n", - "Low quantization error High quantization error\n", - "```\n", - "\n", - "**What We'll Visualize:**\n", - "- **Before/After histograms** - See how distributions change\n", - "- **Error metrics** - Quantify the precision loss\n", - "- **Scale utilization** - Understand efficiency\n", - "- **Real examples** - Connect to practical scenarios\n", - "\n", - "This visualization will help you understand which types of neural network weights quantize well and which need special handling." 
- ] - }, - { - "cell_type": "markdown", - "id": "8f54d705", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Quantization Effects Visualization - Understanding Distribution Impact\n", - "\n", - "This visualization reveals how different weight distributions respond to quantization, helping you understand when quantization works well and when it struggles.\n", - "\n", - "```\n", - "Visualization Strategy:\n", - "\n", - "┌────────────────────────────────────────────────────────────────────────────────────┐\n", - "│ Distribution Analysis Grid │\n", - "├─────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┤\n", - "│ Normal (Good) │ Uniform (Best) │ Sparse (Bad) │ Heavy-Tailed (Worst)│\n", - "├─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┤\n", - "│ /\\ │ ┌──────────┐ │ | | | │ /\\ │\n", - "│ / \\ │ │ │ │ | | | │ / \\ /\\ │\n", - "│ / \\ │ │ Flat │ │ |||| | |||| │ / \\/ \\ │\n", - "│ / \\ │ │ │ │ zeros sparse │ / \\ │\n", - "│ / \\ │ └──────────┘ │ values │ / huge \\ │\n", - "│ / \\ │ │ │ / outliers \\ │\n", - "├─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┤\n", - "│ MSE: 0.001 │ MSE: 0.0001 │ MSE: 0.01 │ MSE: 0.1 │\n", - "│ Scale Usage: 80% │ Scale Usage: 100% │ Scale Usage: 10% │ Scale Usage: 5% │\n", - "└─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┘\n", - "```\n", - "\n", - "**Visual Comparison Strategy:**\n", - "```\n", - "For Each Distribution Type:\n", - " │\n", - " ├── Generate sample weights (1000 values)\n", - " │\n", - " ├── Quantize to INT8\n", - " │\n", - " ├── Dequantize back to FP32\n", - " │\n", - " ├── Plot overlaid histograms:\n", - " │ ├── Original distribution (blue)\n", - " │ └── Quantized distribution (red)\n", - " │\n", - " └── Calculate and display error metrics:\n", - " ├── Mean Squared Error (MSE)\n", - " ├── Scale utilization efficiency\n", - 
" └── Quantization scale value\n", - "```\n", - "\n", - "**Key Insights You'll Discover:**\n", - "\n", - "**1. Normal Distribution (Most Common):**\n", - " - Smooth bell curve preserved reasonably well\n", - " - Tail values may be clipped slightly\n", - " - Good compromise for most neural networks\n", - "\n", - "**2. Uniform Distribution (Ideal Case):**\n", - " - Perfect scale utilization\n", - " - Minimal quantization error\n", - " - Best-case scenario for quantization\n", - "\n", - "**3. Sparse Distribution (Problematic):**\n", - " - Many zeros waste quantization levels\n", - " - Poor precision for non-zero values\n", - " - Common in pruned networks\n", - "\n", - "**4. Heavy-Tailed Distribution (Worst Case):**\n", - " - Outliers dominate scale calculation\n", - " - Most values squeezed into narrow range\n", - " - Requires special handling (clipping, per-channel)\n", - "\n", - "**Practical Implications:**\n", - "- **Model design:** Prefer batch normalization to reduce outliers\n", - "- **Training:** Techniques to encourage uniform weight distributions\n", - "- **Deployment:** Advanced quantization for sparse/heavy-tailed weights" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7d286a68", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "visualize_quantization_effects", - "solution": true - } - }, - "outputs": [], - "source": [ - "def visualize_quantization_effects():\n", - " \"\"\"📊 Visualize the effects of quantization on weight distributions.\"\"\"\n", - " print(\"📊 Visualizing Quantization Effects on Weight Distributions...\")\n", - "\n", - " # Create sample weight tensors with different characteristics\n", - " weight_types = {\n", - " 'Normal': np.random.normal(0, 0.1, (1000,)),\n", - " 'Uniform': np.random.uniform(-0.2, 0.2, (1000,)),\n", - " 'Sparse': np.random.choice([0, 0, 0, 1], (1000,)) * np.random.normal(0, 0.15, (1000,)),\n", - " 'Heavy-tailed': np.concatenate([\n", - " np.random.normal(0, 0.05, (800,)),\n", - " 
np.random.uniform(-0.5, 0.5, (200,))\n", - " ])\n", - " }\n", - "\n", - " fig, axes = plt.subplots(2, 2, figsize=(12, 8))\n", - " axes = axes.flatten()\n", - "\n", - " for idx, (name, weights) in enumerate(weight_types.items()):\n", - " # Original weights\n", - " original_tensor = Tensor(weights)\n", - "\n", - " # Quantize and dequantize\n", - " q_tensor, scale, zero_point = quantize_int8(original_tensor)\n", - " restored_tensor = dequantize_int8(q_tensor, scale, zero_point)\n", - "\n", - " # Plot histograms\n", - " ax = axes[idx]\n", - " ax.hist(weights, bins=50, alpha=0.6, label='Original', density=True)\n", - " ax.hist(restored_tensor.data, bins=50, alpha=0.6, label='Quantized', density=True)\n", - " ax.set_title(f'{name} Weights\\nScale: {scale:.4f}')\n", - " ax.set_xlabel('Weight Value')\n", - " ax.set_ylabel('Density')\n", - " ax.legend()\n", - " ax.grid(True, alpha=0.3)\n", - "\n", - " # Calculate and display error metrics\n", - " mse = np.mean((weights - restored_tensor.data) ** 2)\n", - " ax.text(0.02, 0.98, f'MSE: {mse:.6f}', transform=ax.transAxes,\n", - " verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))\n", - "\n", - " plt.tight_layout()\n", - " # Save next to the notebook (an absolute path fails if its directory is missing)\n", - " plt.savefig('quantization_effects.png', dpi=100, bbox_inches='tight')\n", - " plt.show()\n", - "\n", - " print(\"💡 Observations:\")\n", - " print(\"- Normal: Smooth quantization, good preservation\")\n", - " print(\"- Uniform: Excellent quantization, full range utilized\")\n", - " print(\"- Sparse: Many wasted quantization levels on zeros\")\n", - " print(\"- Heavy-tailed: Outliers dominate scale, poor precision for small weights\")\n", - "\n", - "# Visualize quantization effects\n", - "visualize_quantization_effects()" - ] - }, - { - "cell_type": "markdown", - "id": "784b58ca", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 6. 
Optimization Insights - Production Quantization Strategies\n", - "\n", - "### Beyond Basic Quantization\n", - "\n", - "Our INT8 per-tensor quantization is just the beginning. Production systems use sophisticated strategies to squeeze out every bit of performance while preserving accuracy.\n", - "\n", - "```\n", - "Quantization Strategy Evolution:\n", - "\n", - " Basic (What we built) Advanced (Production) Cutting-Edge (Research)\n", - "┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐\n", - "│ • Per-tensor scale │ │ • Per-channel scale │ │ • Dynamic ranges │\n", - "│ • Uniform INT8 │ → │ • Mixed precision │ → │ • Adaptive bitwidth │\n", - "│ • Post-training │ │ • Quantization-aware│ │ • Learned quantizers│\n", - "│ • Simple calibration│ │ • Advanced calib. │ │ • Neural compression│\n", - "└─────────────────────┘ └─────────────────────┘ └─────────────────────┘\n", - " Good baseline Production systems Future research\n", - "```\n", - "\n", - "### Strategy Comparison Framework\n", - "\n", - "```\n", - "Quantization Strategy Trade-offs:\n", - "\n", - "┌─────────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\n", - "│ Strategy │ Accuracy │ Complexity │ Memory Use │ Speed Gain │\n", - "├─────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\n", - "│ Per-Tensor (Ours) │ ████████░░ │ ██░░░░░░░░ │ ████████░░ │ ███████░░░ │\n", - "│ Per-Channel │ █████████░ │ █████░░░░░ │ ████████░░ │ ██████░░░░ │\n", - "│ Mixed Precision │ ██████████ │ ████████░░ │ ███████░░░ │ ████████░░ │\n", - "│ Quantization-Aware │ ██████████ │ ██████████ │ ████████░░ │ ███████░░░ │\n", - "└─────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\n", - "```\n", - "\n", - "### The Three Advanced Strategies We'll Analyze\n", - "\n", - "**1. 
Per-Channel Quantization:**\n", - "```\n", - "Per-Tensor: Per-Channel:\n", - "┌─────────────────────────┐ ┌─────────────────────────┐\n", - "│ [W₁₁ W₁₂ W₁₃] │ │ [W₁₁ W₁₂ W₁₃] scale₁ │\n", - "│ [W₂₁ W₂₂ W₂₃] scale │ VS │ [W₂₁ W₂₂ W₂₃] scale₂ │\n", - "│ [W₃₁ W₃₂ W₃₃] │ │ [W₃₁ W₃₂ W₃₃] scale₃ │\n", - "└─────────────────────────┘ └─────────────────────────┘\n", - " One scale for all Separate scale per channel\n", - " May waste precision Better precision per channel\n", - "```\n", - "\n", - "**2. Mixed Precision:**\n", - "```\n", - "Sensitive Layers (FP32): Regular Layers (INT8):\n", - "┌─────────────────────────┐ ┌─────────────────────────┐\n", - "│ Input Layer │ │ Hidden Layer 1 │\n", - "│ (preserve input quality)│ │ (can tolerate error) │\n", - "├─────────────────────────┤ ├─────────────────────────┤\n", - "│ Output Layer │ │ Hidden Layer 2 │\n", - "│ (preserve output) │ │ (bulk of computation) │\n", - "└─────────────────────────┘ └─────────────────────────┘\n", - " Keep high precision Maximize compression\n", - "```\n", - "\n", - "**3. Calibration Strategies:**\n", - "```\n", - "Basic Calibration: Advanced Calibration:\n", - "┌─────────────────────────┐ ┌─────────────────────────┐\n", - "│ • Use min/max range │ │ • Percentile clipping │\n", - "│ • Simple statistics │ │ • KL-divergence │\n", - "│ • Few samples │ VS │ • Multiple datasets │\n", - "│ • Generic approach │ │ • Layer-specific tuning │\n", - "└─────────────────────────┘ └─────────────────────────┘\n", - " Fast but suboptimal Optimal but expensive\n", - "```\n", - "\n", - "Let's implement and compare these strategies to understand their practical trade-offs!" 
- ] - }, - { - "cell_type": "markdown", - "id": "1d4fc886", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Advanced Quantization Strategies - Production Techniques\n", - "\n", - "This analysis compares different quantization approaches used in production systems, revealing the trade-offs between accuracy, complexity, and performance.\n", - "\n", - "```\n", - "Strategy Comparison Framework:\n", - "\n", - "┌────────────────────────────────────────────────────────────────────────────────────┐\n", - "│ Three Advanced Strategies │\n", - "├────────────────────────────┬────────────────────────────┬────────────────────────────┤\n", - "│ Strategy 1 │ Strategy 2 │ Strategy 3 │\n", - "│ Per-Tensor (Ours) │ Per-Channel Scale │ Mixed Precision │\n", - "├────────────────────────────┼────────────────────────────┼────────────────────────────┤\n", - "│ │ │ │\n", - "│ ┌──────────────────────┐ │ ┌──────────────────────┐ │ ┌──────────────────────┐ │\n", - "│ │ Weights: │ │ │ Channel 1: scale₁ │ │ │ Sensitive: FP32 │ │\n", - "│ │ [W₁₁ W₁₂ W₁₃] │ │ │ Channel 2: scale₂ │ │ │ Regular: INT8 │ │\n", - "│ │ [W₂₁ W₂₂ W₂₃] scale │ │ │ Channel 3: scale₃ │ │ │ │ │\n", - "│ │ [W₃₁ W₃₂ W₃₃] │ │ │ │ │ │ Input: FP32 │ │\n", - "│ └──────────────────────┘ │ │ Better precision │ │ │ Output: FP32 │ │\n", - "│ │ │ per channel │ │ │ Hidden: INT8 │ │\n", - "│ Simple, fast │ └──────────────────────┘ │ └──────────────────────┘ │\n", - "│ Good baseline │ │ │\n", - "│ │ More complex │ Optimal accuracy │\n", - "│ │ Better accuracy │ Selective compression │\n", - "└────────────────────────────┴────────────────────────────┴────────────────────────────┘\n", - "```\n", - "\n", - "**Strategy 1: Per-Tensor Quantization (Our Implementation)**\n", - "```\n", - "Weight Matrix: Scale Calculation:\n", - "┌─────────────────────────┐ ┌─────────────────────────┐\n", - "│ 0.1 -0.3 0.8 0.2 │ │ Global min: -0.5 │\n", - "│-0.2 0.5 -0.1 0.7 │ → │ Global max: +0.8 │\n", - "│ 0.4 -0.5 
0.3 -0.4 │ │ Scale: 1.3/255 = 0.0051 │\n", - "└─────────────────────────┘ └─────────────────────────┘\n", - "\n", - "Pros: Simple, fast Cons: May waste precision\n", - "```\n", - "\n", - "**Strategy 2: Per-Channel Quantization (Advanced)**\n", - "```\n", - "Weight Matrix: Scale Calculation:\n", - "┌─────────────────────────┐ ┌─────────────────────────┐\n", - "│ 0.1 -0.3 0.8 0.2 │ │ Col 1: [-0.2,0.4] → s₁ │\n", - "│-0.2 0.5 -0.1 0.7 │ → │ Col 2: [-0.5,0.5] → s₂ │\n", - "│ 0.4 -0.5 0.3 -0.4 │ │ Col 3: [-0.1,0.8] → s₃ │\n", - "└─────────────────────────┘ │ Col 4: [-0.4,0.7] → s₄ │\n", - " └─────────────────────────┘\n", - "\n", - "Pros: Better precision Cons: More complex\n", - "```\n", - "\n", - "**Strategy 3: Mixed Precision (Production)**\n", - "```\n", - "Model Architecture: Precision Assignment:\n", - "┌─────────────────────────┐ ┌─────────────────────────┐\n", - "│ Input Layer (sensitive) │ │ Keep in FP32 (precision) │\n", - "│ Hidden 1 (bulk) │ → │ Quantize to INT8 │\n", - "│ Hidden 2 (bulk) │ │ Quantize to INT8 │\n", - "│ Output Layer (sensitive)│ │ Keep in FP32 (quality) │\n", - "└─────────────────────────┘ └─────────────────────────┘\n", - "\n", - "Pros: Optimal trade-off Cons: Requires expertise\n", - "```\n", - "\n", - "**Experimental Design:**\n", - "```\n", - "Comparative Testing Protocol:\n", - "\n", - "1. Create identical test model → 2. Apply each strategy → 3. Measure results\n", - " ┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐\n", - " │ 128 → 64 → 10 MLP │ │ Per-tensor quantization │ │ MSE error calculation │\n", - " │ Identical weights │ │ Per-channel simulation │ │ Compression measurement│\n", - " │ Same test input │ │ Mixed precision setup │ │ Speed comparison │\n", - " └───────────────────────┘ └───────────────────────┘ └───────────────────────┘\n", - "```\n", - "\n", - "**Expected Strategy Rankings:**\n", - "1. **Mixed Precision** - Best accuracy, moderate complexity\n", - "2. 
**Per-Channel** - Good accuracy, higher complexity\n", - "3. **Per-Tensor** - Baseline accuracy, simplest implementation\n", - "\n", - "This analysis reveals which strategies work best for different deployment scenarios and accuracy requirements." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5d474888", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "analyze_quantization_strategies", - "solution": true - } - }, - "outputs": [], - "source": [ - "def analyze_quantization_strategies():\n", - " \"\"\"📊 Compare different quantization strategies and their trade-offs.\"\"\"\n", - " print(\"📊 Analyzing Advanced Quantization Strategies...\")\n", - "\n", - " # Create test model and data\n", - " model = Sequential(Linear(128, 64), ReLU(), Linear(64, 10))\n", - " model.layers[0].weight = Tensor(np.random.randn(128, 64) * 0.1)\n", - " model.layers[0].bias = Tensor(np.random.randn(64) * 0.01)\n", - " model.layers[2].weight = Tensor(np.random.randn(64, 10) * 0.1)\n", - " model.layers[2].bias = Tensor(np.random.randn(10) * 0.01)\n", - "\n", - " test_input = Tensor(np.random.randn(32, 128))\n", - " original_output = model.forward(test_input)\n", - "\n", - " strategies = {}\n", - "\n", - " # Strategy 1: Per-tensor quantization (what we implemented)\n", - " print(\"\\n🔍 Strategy 1: Per-Tensor Quantization\")\n", - " model_copy = Sequential(Linear(128, 64), ReLU(), Linear(64, 10))\n", - " for i, layer in enumerate(model.layers):\n", - " if isinstance(layer, Linear):\n", - " model_copy.layers[i].weight = Tensor(layer.weight.data.copy())\n", - " model_copy.layers[i].bias = Tensor(layer.bias.data.copy())\n", - "\n", - " quantize_model(model_copy)\n", - " output1 = model_copy.forward(test_input)\n", - " error1 = np.mean((original_output.data - output1.data) ** 2)\n", - " strategies['per_tensor'] = {'mse': error1, 'description': 'Single scale per tensor'}\n", - " print(f\" MSE: {error1:.6f}\")\n", - "\n", - " # Strategy 2: Per-channel quantization 
simulation\n",
- "    print(\"\\n🔍 Strategy 2: Per-Channel Quantization (simulated)\")\n",
- "    # Simulate by quantizing each output channel separately\n",
- "    def per_channel_quantize(tensor):\n",
- "        \"\"\"Simulate per-channel quantization for 2D weight matrices.\"\"\"\n",
- "        if len(tensor.shape) < 2:\n",
- "            # Keep the return type consistent: wrap scalars in single-element lists\n",
- "            q, scale, zp = quantize_int8(tensor)\n",
- "            return q, [scale], [zp]\n",
- "\n",
- "        quantized_data = np.zeros_like(tensor.data, dtype=np.int8)\n",
- "        scales = []\n",
- "        zero_points = []\n",
- "\n",
- "        for i in range(tensor.shape[1]):  # Per output channel\n",
- "            channel_tensor = Tensor(tensor.data[:, i:i+1])\n",
- "            q_channel, scale, zp = quantize_int8(channel_tensor)\n",
- "            quantized_data[:, i] = q_channel.data.flatten()\n",
- "            scales.append(scale)\n",
- "            zero_points.append(zp)\n",
- "\n",
- "        return Tensor(quantized_data), scales, zero_points\n",
- "\n",
- "    # Apply per-channel quantization to weights\n",
- "    total_error = 0\n",
- "    for layer in model.layers:\n",
- "        if isinstance(layer, Linear):\n",
- "            q_weight, scales, zps = per_channel_quantize(layer.weight)\n",
- "            # Simulate dequantization and error: dequantize as (q - zero_point) * scale\n",
- "            for i in range(layer.weight.shape[1]):\n",
- "                original_channel = layer.weight.data[:, i]\n",
- "                restored_channel = scales[i] * (q_weight.data[:, i] - zps[i])\n",
- "                total_error += np.mean((original_channel - restored_channel) ** 2)\n",
- "\n",
- "    strategies['per_channel'] = {'mse': total_error, 'description': 'Scale per output channel'}\n",
- "    print(f\"   MSE: {total_error:.6f}\")\n",
- "\n",
- "    # Strategy 3: Mixed precision simulation\n",
- "    print(\"\\n🔍 Strategy 3: Mixed Precision\")\n",
- "    # Keep sensitive layers in FP32, quantize others\n",
- "    sensitive_layers = [0]  # First layer often most sensitive\n",
- "    mixed_error = 0\n",
- "\n",
- "    for i, layer in enumerate(model.layers):\n",
- "        if isinstance(layer, Linear):\n",
- "            if i in sensitive_layers:\n",
- "                # Keep in FP32 (no quantization error)\n",
- "                pass\n",
- "            else:\n",
- "                # Quantize layer\n",
- "                q_weight, scale, zp = quantize_int8(layer.weight)\n",
- "                restored = dequantize_int8(q_weight, scale, zp)\n",
- "                mixed_error += np.mean((layer.weight.data - restored.data) ** 2)\n",
- "\n",
- "    strategies['mixed_precision'] = {'mse': mixed_error, 'description': 'FP32 sensitive + INT8 others'}\n",
- "    print(f\"   MSE: {mixed_error:.6f}\")\n",
- "\n",
- "    # Compare strategies\n",
- "    print(f\"\\n📊 QUANTIZATION STRATEGY COMPARISON\")\n",
- "    print(\"=\" * 60)\n",
- "    for name, info in strategies.items():\n",
- "        print(f\"{name:15}: MSE={info['mse']:.6f} | {info['description']}\")\n",
- "\n",
- "    # Find lowest-error strategy (caveat: Strategy 1 measures output-space MSE\n",
- "    # while Strategies 2-3 measure weight-space MSE, so this ranking is rough)\n",
- "    best_strategy = min(strategies.items(), key=lambda x: x[1]['mse'])\n",
- "    print(f\"\\n🏆 Best Strategy: {best_strategy[0]} (MSE: {best_strategy[1]['mse']:.6f})\")\n",
- "\n",
- "    print(f\"\\n💡 Production Insights:\")\n",
- "    print(\"- Per-channel: Better accuracy, more complex implementation\")\n",
- "    print(\"- Mixed precision: Optimal accuracy/efficiency trade-off\")\n",
- "    print(\"- Per-tensor: Simplest, good for most applications\")\n",
- "    print(\"- Hardware support varies: INT8 GEMM, per-channel scales\")\n",
- "\n",
- "    return strategies\n",
- "\n",
- "# Analyze quantization strategies\n",
- "strategy_analysis = analyze_quantization_strategies()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "720002d7",
- "metadata": {
- "cell_marker": "\"\"\"",
- "lines_to_next_cell": 1
- },
- "source": [
- "## 7. Module Integration Test\n",
- "\n",
- "Final validation that our quantization system works correctly across all components."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d28702bc", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test_module", - "points": 20 - } - }, - "outputs": [], - "source": [ - "def test_module():\n", - " \"\"\"\n", - " Comprehensive test of entire quantization module functionality.\n", - "\n", - " This final test runs before module summary to ensure:\n", - " - All quantization functions work correctly\n", - " - Model quantization preserves functionality\n", - " - Memory savings are achieved\n", - " - Module is ready for integration with TinyTorch\n", - " \"\"\"\n", - " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", - " print(\"=\" * 50)\n", - "\n", - " # Run all unit tests\n", - " print(\"Running unit tests...\")\n", - " test_unit_quantize_int8()\n", - " test_unit_dequantize_int8()\n", - " test_unit_quantized_linear()\n", - " test_unit_quantize_model()\n", - " test_unit_compare_model_sizes()\n", - "\n", - " print(\"\\nRunning integration scenarios...\")\n", - "\n", - " # Test realistic usage scenario\n", - " print(\"🔬 Integration Test: End-to-end quantization workflow...\")\n", - "\n", - " # Create a realistic model\n", - " model = Sequential(\n", - " Linear(784, 128), # MNIST-like input\n", - " ReLU(),\n", - " Linear(128, 64),\n", - " ReLU(),\n", - " Linear(64, 10) # 10-class output\n", - " )\n", - "\n", - " # Initialize with realistic weights\n", - " for layer in model.layers:\n", - " if isinstance(layer, Linear):\n", - " # Xavier initialization\n", - " fan_in, fan_out = layer.weight.shape\n", - " std = np.sqrt(2.0 / (fan_in + fan_out))\n", - " layer.weight = Tensor(np.random.randn(fan_in, fan_out) * std)\n", - " layer.bias = Tensor(np.zeros(fan_out))\n", - "\n", - " # Generate realistic calibration data\n", - " calibration_data = [Tensor(np.random.randn(1, 784) * 0.1) for _ in range(20)]\n", - "\n", - " # Test original model\n", - " test_input = Tensor(np.random.randn(8, 784) * 0.1)\n", - " original_output = 
model.forward(test_input)\n", - "\n", - " # Quantize the model\n", - " quantize_model(model, calibration_data)\n", - "\n", - " # Test quantized model\n", - " quantized_output = model.forward(test_input)\n", - "\n", - " # Verify functionality is preserved\n", - " assert quantized_output.shape == original_output.shape, \"Output shape mismatch\"\n", - "\n", - " # Verify reasonable accuracy preservation\n", - " mse = np.mean((original_output.data - quantized_output.data) ** 2)\n", - " relative_error = np.sqrt(mse) / (np.std(original_output.data) + 1e-8)\n", - " assert relative_error < 0.1, f\"Accuracy degradation too high: {relative_error:.3f}\"\n", - "\n", - " # Verify memory savings\n", - " # Create equivalent original model for comparison\n", - " original_model = Sequential(\n", - " Linear(784, 128),\n", - " ReLU(),\n", - " Linear(128, 64),\n", - " ReLU(),\n", - " Linear(64, 10)\n", - " )\n", - "\n", - " for i, layer in enumerate(model.layers):\n", - " if isinstance(layer, QuantizedLinear):\n", - " # Restore original weights for comparison\n", - " original_model.layers[i].weight = dequantize_int8(\n", - " layer.q_weight, layer.weight_scale, layer.weight_zero_point\n", - " )\n", - " if layer.q_bias is not None:\n", - " original_model.layers[i].bias = dequantize_int8(\n", - " layer.q_bias, layer.bias_scale, layer.bias_zero_point\n", - " )\n", - "\n", - " memory_comparison = compare_model_sizes(original_model, model)\n", - " assert memory_comparison['compression_ratio'] > 2.0, \"Insufficient compression achieved\"\n", - "\n", - " print(f\"✅ Compression achieved: {memory_comparison['compression_ratio']:.1f}×\")\n", - " print(f\"✅ Accuracy preserved: {relative_error:.1%} relative error\")\n", - " print(f\"✅ Memory saved: {memory_comparison['memory_saved_mb']:.1f}MB\")\n", - "\n", - " # Test edge cases\n", - " print(\"🔬 Testing edge cases...\")\n", - "\n", - " # Test constant tensor quantization\n", - " constant_tensor = Tensor([[1.0, 1.0], [1.0, 1.0]])\n", - " q_const, 
scale_const, zp_const = quantize_int8(constant_tensor)\n", - " assert scale_const == 1.0, \"Constant tensor quantization failed\"\n", - "\n", - " # Test zero tensor\n", - " zero_tensor = Tensor([[0.0, 0.0], [0.0, 0.0]])\n", - " q_zero, scale_zero, zp_zero = quantize_int8(zero_tensor)\n", - " restored_zero = dequantize_int8(q_zero, scale_zero, zp_zero)\n", - " assert np.allclose(restored_zero.data, 0.0, atol=1e-6), \"Zero tensor restoration failed\"\n", - "\n", - " print(\"✅ Edge cases handled correctly!\")\n", - "\n", - " print(\"\\n\" + \"=\" * 50)\n", - " print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", - " print(\"📈 Quantization system provides:\")\n", - " print(f\" • {memory_comparison['compression_ratio']:.1f}× memory reduction\")\n", - " print(f\" • <{relative_error:.1%} accuracy loss\")\n", - " print(f\" • Production-ready INT8 quantization\")\n", - " print(\"Run: tito module complete 17\")\n", - "\n", - "# Call the comprehensive test\n", - "test_module()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "84871dfd", - "metadata": {}, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " print(\"🚀 Running Quantization module...\")\n", - " test_module()\n", - " print(\"✅ Module validation complete!\")" - ] - }, - { - "cell_type": "markdown", - "id": "c093e91d", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🏁 Consolidated Quantization Classes for Export\n", - "\n", - "Now that we've implemented all quantization components, let's create consolidated classes\n", - "for export to the tinytorch package. This allows milestones to use the complete quantization system." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cab2e3a1", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "quantization_export", - "solution": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class QuantizationComplete:\n", - " \"\"\"\n", - " Complete quantization system for milestone use.\n", - " \n", - " Provides INT8 quantization with calibration for 4× memory reduction.\n", - " \"\"\"\n", - " \n", - " @staticmethod\n", - " def quantize_tensor(tensor: Tensor) -> Tuple[Tensor, float, int]:\n", - " \"\"\"Quantize FP32 tensor to INT8.\"\"\"\n", - " data = tensor.data\n", - " min_val = float(np.min(data))\n", - " max_val = float(np.max(data))\n", - " \n", - " if abs(max_val - min_val) < 1e-8:\n", - " return Tensor(np.zeros_like(data, dtype=np.int8)), 1.0, 0\n", - " \n", - " scale = (max_val - min_val) / 255.0\n", - " zero_point = int(np.round(-128 - min_val / scale))\n", - " zero_point = int(np.clip(zero_point, -128, 127))\n", - " \n", - " quantized_data = np.round(data / scale + zero_point)\n", - " quantized_data = np.clip(quantized_data, -128, 127).astype(np.int8)\n", - " \n", - " return Tensor(quantized_data), scale, zero_point\n", - " \n", - " @staticmethod\n", - " def dequantize_tensor(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:\n", - " \"\"\"Dequantize INT8 tensor back to FP32.\"\"\"\n", - " dequantized_data = (q_tensor.data.astype(np.float32) - zero_point) * scale\n", - " return Tensor(dequantized_data)\n", - " \n", - " @staticmethod\n", - " def quantize_model(model, calibration_data: Optional[List[Tensor]] = None) -> Dict[str, any]:\n", - " \"\"\"\n", - " Quantize all Linear layers in a model.\n", - " \n", - " Returns dictionary with quantization info and memory savings.\n", - " \"\"\"\n", - " quantized_layers = {}\n", - " original_size = 0\n", - " quantized_size = 0\n", - " \n", - " # Iterate through model parameters\n", - " if hasattr(model, 
'parameters'):\n",
- "            for i, param in enumerate(model.parameters()):\n",
- "                param_size = param.data.nbytes\n",
- "                original_size += param_size\n",
- "\n",
- "                # Quantize parameter\n",
- "                q_param, scale, zp = QuantizationComplete.quantize_tensor(param)\n",
- "                quantized_size += q_param.data.nbytes\n",
- "\n",
- "                quantized_layers[f'param_{i}'] = {\n",
- "                    'quantized': q_param,\n",
- "                    'scale': scale,\n",
- "                    'zero_point': zp,\n",
- "                    'original_shape': param.data.shape\n",
- "                }\n",
- "\n",
- "        return {\n",
- "            'quantized_layers': quantized_layers,\n",
- "            'original_size_mb': original_size / (1024 * 1024),\n",
- "            'quantized_size_mb': quantized_size / (1024 * 1024),\n",
- "            'compression_ratio': original_size / quantized_size if quantized_size > 0 else 1.0\n",
- "        }\n",
- "\n",
- "    @staticmethod\n",
- "    def compare_models(original_model, quantized_info: Dict) -> Dict[str, float]:\n",
- "        \"\"\"Compare memory usage between original and quantized models.\"\"\"\n",
- "        return {\n",
- "            'original_mb': quantized_info['original_size_mb'],\n",
- "            'quantized_mb': quantized_info['quantized_size_mb'],\n",
- "            'compression_ratio': quantized_info['compression_ratio'],\n",
- "            'memory_saved_mb': quantized_info['original_size_mb'] - quantized_info['quantized_size_mb']\n",
- "        }\n",
- "\n",
- "# Convenience functions for backward compatibility\n",
- "def quantize_int8(tensor: Tensor) -> Tuple[Tensor, float, int]:\n",
- "    \"\"\"Quantize FP32 tensor to INT8.\"\"\"\n",
- "    return QuantizationComplete.quantize_tensor(tensor)\n",
- "\n",
- "def dequantize_int8(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:\n",
- "    \"\"\"Dequantize INT8 tensor back to FP32.\"\"\"\n",
- "    return QuantizationComplete.dequantize_tensor(q_tensor, scale, zero_point)\n",
- "\n",
- "def quantize_model(model, calibration_data: Optional[List[Tensor]] = None) -> Dict[str, Any]:\n",
- "    \"\"\"Quantize entire model to INT8 (calibration_data is accepted for API\n",
- "    compatibility but unused by this min/max-based implementation).\"\"\"\n",
- "    return QuantizationComplete.quantize_model(model, 
calibration_data)" - ] - }, - { - "cell_type": "markdown", - "id": "b3d77ac1", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Quantization in Production\n", - "\n", - "### Question 1: Memory Architecture Impact\n", - "You implemented INT8 quantization that reduces each parameter from 4 bytes to 1 byte.\n", - "For a model with 100M parameters:\n", - "- Original memory usage: _____ GB\n", - "- Quantized memory usage: _____ GB\n", - "- Memory bandwidth reduction when loading from disk: _____ ×\n", - "\n", - "### Question 2: Quantization Error Analysis\n", - "Your quantization maps a continuous range to 256 discrete values (INT8).\n", - "For weights uniformly distributed in [-0.1, 0.1]:\n", - "- Quantization scale: _____\n", - "- Maximum quantization error: _____\n", - "- Signal-to-noise ratio approximately: _____ dB\n", - "\n", - "### Question 3: Hardware Efficiency\n", - "Modern processors have specialized INT8 instructions (like AVX-512 VNNI).\n", - "Compared to FP32 operations:\n", - "- How many INT8 operations fit in one SIMD instruction vs FP32? _____ × more\n", - "- Why might actual speedup be less than this theoretical maximum? _____\n", - "- What determines whether quantization improves or hurts performance? _____\n", - "\n", - "### Question 4: Calibration Strategy Trade-offs\n", - "Your calibration process finds optimal scales using sample data.\n", - "- Too little calibration data: Risk of _____\n", - "- Too much calibration data: Cost of _____\n", - "- Per-channel vs per-tensor quantization trades _____ for _____\n", - "\n", - "### Question 5: Production Deployment\n", - "In mobile/edge deployment scenarios:\n", - "- When is 4× memory reduction worth <1% accuracy loss? _____\n", - "- Why might you keep certain layers in FP32? _____\n", - "- How does quantization affect battery life? 
_____" - ] - }, - { - "cell_type": "markdown", - "id": "5b20dcf9", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Quantization\n", - "\n", - "Congratulations! You've built a complete INT8 quantization system that can reduce model size by 4× with minimal accuracy loss!\n", - "\n", - "### Key Accomplishments\n", - "- **Built INT8 quantization** with proper scaling and zero-point calculation\n", - "- **Implemented QuantizedLinear** layer with calibration support\n", - "- **Created model-level quantization** for complete neural networks\n", - "- **Analyzed quantization trade-offs** across different distributions and strategies\n", - "- **Measured real memory savings** and performance improvements\n", - "- All tests pass ✅ (validated by `test_module()`)\n", - "\n", - "### Real-World Impact\n", - "Your quantization implementation achieves:\n", - "- **4× memory reduction** (FP32 → INT8)\n", - "- **2-4× inference speedup** (hardware dependent)\n", - "- **<1% accuracy loss** with proper calibration\n", - "- **Production deployment readiness** for mobile/edge applications\n", - "\n", - "### What You've Mastered\n", - "- **Quantization mathematics** - scale and zero-point calculations\n", - "- **Calibration techniques** - optimizing quantization parameters\n", - "- **Error analysis** - understanding and minimizing quantization noise\n", - "- **Systems optimization** - memory vs accuracy trade-offs\n", - "\n", - "### Ready for Next Steps\n", - "Your quantization system enables efficient model deployment on resource-constrained devices.\n", - "Export with: `tito module complete 17`\n", - "\n", - "**Next**: Module 18 will add model compression through pruning - removing unnecessary weights entirely!\n", - "\n", - "---\n", - "\n", - "**🏆 Achievement Unlocked**: You can now deploy 4× smaller models with production-quality quantization! This is a critical skill for mobile AI, edge computing, and efficient inference systems." 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules/source/18_compression/compression_dev.ipynb b/modules/source/18_compression/compression_dev.ipynb deleted file mode 100644 index 0b2e90af..00000000 --- a/modules/source/18_compression/compression_dev.ipynb +++ /dev/null @@ -1,1728 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "7c0b2b14", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Module 18: Compression - Making Models Smaller\n", - "\n", - "Welcome to Module 18! You're about to build model compression techniques that make neural networks smaller and more efficient while preserving their intelligence.\n", - "\n", - "## 🔗 Prerequisites & Progress\n", - "**You've Built**: Full TinyGPT pipeline with profiling, acceleration, and quantization\n", - "**You'll Build**: Pruning (magnitude & structured), knowledge distillation, and low-rank approximation\n", - "**You'll Enable**: Compressed models that maintain accuracy while using dramatically less storage and memory\n", - "\n", - "**Connection Map**:\n", - "```\n", - "Quantization → Compression → Benchmarking\n", - "(precision) (sparsity) (evaluation)\n", - "```\n", - "\n", - "## Learning Objectives\n", - "By the end of this module, you will:\n", - "1. Implement magnitude-based and structured pruning\n", - "2. Build knowledge distillation for model compression\n", - "3. Create low-rank approximations of weight matrices\n", - "4. Measure compression ratios and sparsity levels\n", - "5. 
Understand structured vs unstructured sparsity trade-offs\n", - "\n", - "Let's get started!\n", - "\n", - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/18_compression/compression_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.optimization.compression`\n", - "\n", - "```python\n", - "# How to use this module:\n", - "from tinytorch.optimization.compression import magnitude_prune, structured_prune, measure_sparsity\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Complete compression system in one focused module for deep understanding\n", - "- **Production:** Proper organization like real compression libraries with all techniques together\n", - "- **Consistency:** All compression operations and sparsity management in optimization.compression\n", - "- **Integration:** Works seamlessly with models and quantization for complete optimization pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "37872416", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "imports", - "solution": true - } - }, - "outputs": [], - "source": [ - "#| default_exp optimization.compression\n", - "#| export\n", - "\n", - "import numpy as np\n", - "import copy\n", - "from typing import List, Dict, Any, Tuple, Optional\n", - "import time\n", - "\n", - "# Import from previous modules\n", - "# Note: In the full package, these would be imports like:\n", - "# from tinytorch.core.tensor import Tensor\n", - "# from tinytorch.core.layers import Linear\n", - "# For development, we'll create minimal implementations\n", - "\n", - "class Tensor:\n", - " \"\"\"Minimal Tensor class for compression development - imports from Module 01 in practice.\"\"\"\n", - " def __init__(self, data, requires_grad=False):\n", - " self.data = np.array(data)\n", - " self.shape = self.data.shape\n", - " self.size = self.data.size\n", - " self.requires_grad = 
requires_grad\n", - " self.grad = None\n", - "\n", - " def __add__(self, other):\n", - " if isinstance(other, Tensor):\n", - " return Tensor(self.data + other.data)\n", - " return Tensor(self.data + other)\n", - "\n", - " def __mul__(self, other):\n", - " if isinstance(other, Tensor):\n", - " return Tensor(self.data * other.data)\n", - " return Tensor(self.data * other)\n", - "\n", - " def matmul(self, other):\n", - " return Tensor(np.dot(self.data, other.data))\n", - "\n", - " def abs(self):\n", - " return Tensor(np.abs(self.data))\n", - "\n", - " def sum(self, axis=None):\n", - " return Tensor(self.data.sum(axis=axis))\n", - "\n", - " def __repr__(self):\n", - " return f\"Tensor(shape={self.shape})\"\n", - "\n", - "class Linear:\n", - " \"\"\"Minimal Linear layer for compression development - imports from Module 03 in practice.\"\"\"\n", - " def __init__(self, in_features, out_features, bias=True):\n", - " self.in_features = in_features\n", - " self.out_features = out_features\n", - " # Initialize with He initialization\n", - " self.weight = Tensor(np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features))\n", - " self.bias = Tensor(np.zeros(out_features)) if bias else None\n", - "\n", - " def forward(self, x):\n", - " output = x.matmul(self.weight)\n", - " if self.bias is not None:\n", - " output = output + self.bias\n", - " return output\n", - "\n", - " def parameters(self):\n", - " params = [self.weight]\n", - " if self.bias is not None:\n", - " params.append(self.bias)\n", - " return params\n", - "\n", - "class Sequential:\n", - " \"\"\"Minimal Sequential container for model compression.\"\"\"\n", - " def __init__(self, *layers):\n", - " self.layers = list(layers)\n", - "\n", - " def forward(self, x):\n", - " for layer in self.layers:\n", - " x = layer.forward(x)\n", - " return x\n", - "\n", - " def parameters(self):\n", - " params = []\n", - " for layer in self.layers:\n", - " if hasattr(layer, 'parameters'):\n", - " 
params.extend(layer.parameters())\n", - " return params" - ] - }, - { - "cell_type": "markdown", - "id": "252e20ce", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 1. Introduction: What is Model Compression?\n", - "\n", - "Imagine you have a massive library with millions of books, but you only reference 10% of them regularly. Model compression is like creating a curated collection that keeps the essential knowledge while dramatically reducing storage space.\n", - "\n", - "Model compression reduces the size and computational requirements of neural networks while preserving their intelligence. It's the bridge between powerful research models and practical deployment.\n", - "\n", - "### Why Compression Matters in ML Systems\n", - "\n", - "**The Storage Challenge:**\n", - "- Modern language models: 100GB+ (GPT-3 scale)\n", - "- Mobile devices: <1GB available for models\n", - "- Edge devices: <100MB realistic limits\n", - "- Network bandwidth: Slow downloads kill user experience\n", - "\n", - "**The Speed Challenge:**\n", - "- Research models: Designed for accuracy, not efficiency\n", - "- Production needs: Sub-second response times\n", - "- Battery life: Energy consumption matters for mobile\n", - "- Cost scaling: Inference costs grow with model size\n", - "\n", - "### The Compression Landscape\n", - "\n", - "```\n", - "Neural Network Compression Techniques:\n", - "\n", - "┌─────────────────────────────────────────────────────────────┐\n", - "│ COMPRESSION METHODS │\n", - "├─────────────────────────────────────────────────────────────┤\n", - "│ WEIGHT-BASED │ ARCHITECTURE-BASED │\n", - "│ ┌─────────────────────────────┐ │ ┌─────────────────────┐ │\n", - "│ │ Magnitude Pruning │ │ │ Knowledge Distillation│ │\n", - "│ │ • Remove small weights │ │ │ • Teacher → Student │ │\n", - "│ │ • 90% sparsity achievable │ │ │ • 10x size reduction │ │\n", - "│ │ │ │ │ │ │\n", - "│ │ Structured Pruning │ │ │ Neural Architecture │ │\n", - "│ │ • Remove entire 
channels │ │ │ Search (NAS) │ │\n", - "│ │ • Hardware-friendly │ │ │ • Automated design │ │\n", - "│ │ │ │ │ │ │\n", - "│ │ Low-Rank Approximation │ │ │ Early Exit │ │\n", - "│ │ • Matrix factorization │ │ │ • Adaptive compute │ │\n", - "│ │ • SVD decomposition │ │ │ │ │\n", - "│ └─────────────────────────────┘ │ └─────────────────────┘ │\n", - "└─────────────────────────────────────────────────────────────┘\n", - "```\n", - "\n", - "Think of compression like optimizing a recipe - you want to keep the essential ingredients that create the flavor while removing anything that doesn't contribute to the final dish." - ] - }, - { - "cell_type": "markdown", - "id": "30325dfe", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 2. Foundations: Mathematical Background\n", - "\n", - "Understanding the mathematics behind compression helps us choose the right technique for each situation and predict their effects on model performance.\n", - "\n", - "### Magnitude-Based Pruning: The Simple Approach\n", - "\n", - "The core insight: small weights contribute little to the final prediction. 
Magnitude pruning removes weights based on their absolute values.\n", - "\n", - "```\n", - "Mathematical Foundation:\n", - "For weight w_ij in layer l:\n", - " If |w_ij| < threshold_l → w_ij = 0\n", - "\n", - "Threshold Selection:\n", - "- Global: One threshold for entire model\n", - "- Layer-wise: Different threshold per layer\n", - "- Percentile-based: Remove bottom k% of weights\n", - "\n", - "Sparsity Calculation:\n", - " Sparsity = (Zero weights / Total weights) × 100%\n", - "```\n", - "\n", - "### Structured Pruning: Hardware-Friendly Compression\n", - "\n", - "Unlike magnitude pruning which creates scattered zeros, structured pruning removes entire computational units (neurons, channels, attention heads).\n", - "\n", - "```\n", - "Channel Importance Metrics:\n", - "\n", - "Method 1: L2 Norm\n", - " Importance(channel_i) = ||W[:,i]||₂ = √(Σⱼ W²ⱼᵢ)\n", - "\n", - "Method 2: Gradient-based\n", - " Importance(channel_i) = |∂Loss/∂W[:,i]|\n", - "\n", - "Method 3: Activation-based\n", - " Importance(channel_i) = E[|activations_i|]\n", - "\n", - "Pruning Decision:\n", - " Remove bottom k% of channels based on importance ranking\n", - "```\n", - "\n", - "### Knowledge Distillation: Learning from Teachers\n", - "\n", - "Knowledge distillation transfers knowledge from a large \"teacher\" model to a smaller \"student\" model. 
The student learns not just the correct answers, but the teacher's reasoning process.\n", - "\n", - "```\n", - "Distillation Loss Function:\n", - " L_total = α × L_soft + (1-α) × L_hard\n", - "\n", - "Where:\n", - " L_soft = KL_divergence(σ(z_s/T), σ(z_t/T)) # Soft targets\n", - " L_hard = CrossEntropy(σ(z_s), y_true) # Hard targets\n", - "\n", - " σ(z/T) = Softmax with temperature T\n", - " z_s = Student logits, z_t = Teacher logits\n", - " α = Balance parameter (typically 0.7)\n", - " T = Temperature parameter (typically 3-5)\n", - "\n", - "Temperature Effect:\n", - " T=1: Standard softmax (sharp probabilities)\n", - " T>1: Softer distributions (reveals teacher's uncertainty)\n", - "```\n", - "\n", - "### Low-Rank Approximation: Matrix Compression\n", - "\n", - "Large weight matrices often have redundancy that can be captured with lower-rank approximations using Singular Value Decomposition (SVD).\n", - "\n", - "```\n", - "SVD Decomposition:\n", - " W_{m×n} = U_{m×k} × Σ_{k×k} × V^T_{k×n}\n", - "\n", - "Parameter Reduction:\n", - " Original: m × n parameters\n", - " Compressed: (m × k) + k + (k × n) = k(m + n + 1) parameters\n", - "\n", - " Compression achieved when: k < mn/(m+n+1)\n", - "\n", - "Reconstruction Error:\n", - " ||W - W_approx||_F = √(Σᵢ₌ₖ₊₁ʳ σᵢ²)\n", - "\n", - " Where σᵢ are singular values, r = rank(W)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "ce0801cd", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 3. Sparsity Measurement - Understanding Model Density\n", - "\n", - "Before we can compress models, we need to understand how dense they are. Sparsity measurement tells us what percentage of weights are zero (or effectively zero).\n", - "\n", - "### Understanding Sparsity\n", - "\n", - "Sparsity is like measuring how much of a parking lot is empty. 
A 90% sparse model means 90% of its weights are zero - only 10% of the \"parking spaces\" are occupied.\n", - "\n", - "```\n", - "Sparsity Visualization:\n", - "\n", - "Dense Matrix (0% sparse): Sparse Matrix (75% sparse):\n", - "┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐\n", - "│ 2.1 1.3 0.8 1.9 2.4 1.1 0.7 │ │ 2.1 0.0 0.0 1.9 0.0 0.0 0.0 │\n", - "│ 1.5 2.8 1.2 0.9 1.6 2.2 1.4 │ │ 0.0 2.8 0.0 0.0 0.0 2.2 0.0 │\n", - "│ 0.6 1.7 2.5 1.1 0.8 1.3 2.0 │ │ 0.0 0.0 2.5 0.0 0.0 0.0 2.0 │\n", - "│ 1.9 1.0 1.6 2.3 1.8 0.9 1.2 │ │ 1.9 0.0 0.0 2.3 0.0 0.0 0.0 │\n", - "└─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘ └─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘\n", - "All weights active Only 7/28 weights active\n", - "Storage: 28 values Storage: 7 values + indices\n", - "```\n", - "\n", - "Why this matters: Sparsity directly relates to memory savings, but achieving speedup requires special sparse computation libraries." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4440ec7a", - "metadata": {}, - "outputs": [], - "source": [ - "def measure_sparsity(model) -> float:\n", - " \"\"\"\n", - " Calculate the percentage of zero weights in a model.\n", - "\n", - " TODO: Count zero weights and total weights across all layers\n", - "\n", - " APPROACH:\n", - " 1. Iterate through all model parameters\n", - " 2. Count zeros using np.sum(weights == 0)\n", - " 3. Count total parameters\n", - " 4. 
Return percentage: zeros / total * 100\n", - "\n", - " EXAMPLE:\n", - " >>> model = Sequential(Linear(10, 5), Linear(5, 2))\n", - " >>> sparsity = measure_sparsity(model)\n", - " >>> print(f\"Model sparsity: {sparsity:.1f}%\")\n", - " Model sparsity: 0.0% # Before pruning\n", - "\n", - " HINT: Use np.sum() to count zeros efficiently\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " total_params = 0\n", - " zero_params = 0\n", - "\n", - " for param in model.parameters():\n", - " total_params += param.size\n", - " zero_params += np.sum(param.data == 0)\n", - "\n", - " if total_params == 0:\n", - " return 0.0\n", - "\n", - " return (zero_params / total_params) * 100.0\n", - " ### END SOLUTION\n", - "\n", - "def test_unit_measure_sparsity():\n", - " \"\"\"🔬 Test sparsity measurement functionality.\"\"\"\n", - " print(\"🔬 Unit Test: Measure Sparsity...\")\n", - "\n", - " # Test with dense model\n", - " model = Sequential(Linear(4, 3), Linear(3, 2))\n", - " initial_sparsity = measure_sparsity(model)\n", - " assert initial_sparsity == 0.0, f\"Expected 0% sparsity, got {initial_sparsity}%\"\n", - "\n", - " # Test with manually sparse model\n", - " model.layers[0].weight.data[0, 0] = 0\n", - " model.layers[0].weight.data[1, 1] = 0\n", - " sparse_sparsity = measure_sparsity(model)\n", - " assert sparse_sparsity > 0, f\"Expected >0% sparsity, got {sparse_sparsity}%\"\n", - "\n", - " print(\"✅ measure_sparsity works correctly!\")\n", - "\n", - "test_unit_measure_sparsity()" - ] - }, - { - "cell_type": "markdown", - "id": "fc5fb46e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 4. Magnitude-Based Pruning - Removing Small Weights\n", - "\n", - "Magnitude pruning is the simplest and most intuitive compression technique. 
It's based on the observation that weights with small magnitudes contribute little to the model's output.\n", - "\n", - "### How Magnitude Pruning Works\n", - "\n", - "Think of magnitude pruning like editing a document - you remove words that don't significantly change the meaning. In neural networks, we remove weights that don't significantly affect predictions.\n", - "\n", - "```\n", - "Magnitude Pruning Process:\n", - "\n", - "Step 1: Collect All Weights\n", - "┌──────────────────────────────────────────────────┐\n", - "│ Layer 1: [2.1, 0.1, -1.8, 0.05, 3.2, -0.02] │\n", - "│ Layer 2: [1.5, -0.03, 2.8, 0.08, -2.1, 0.01] │\n", - "│ Layer 3: [0.7, 2.4, -0.06, 1.9, 0.04, -1.3] │\n", - "└──────────────────────────────────────────────────┘\n", - " ↓\n", - "Step 2: Calculate Magnitudes\n", - "┌──────────────────────────────────────────────────┐\n", - "│ Magnitudes: [2.1, 0.1, 1.8, 0.05, 3.2, 0.02, │\n", - "│ 1.5, 0.03, 2.8, 0.08, 2.1, 0.01, │\n", - "│ 0.7, 2.4, 0.06, 1.9, 0.04, 1.3] │\n", - "└──────────────────────────────────────────────────┘\n", - " ↓\n", - "Step 3: Find Threshold (e.g., 70th percentile)\n", - "┌──────────────────────────────────────────────────┐\n", - "│ Sorted: [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, │\n", - "│ 0.08, 0.1, 0.7, 1.3, 1.5, 1.8, │ Threshold: 1.9\n", - "│ 1.9, 2.1, 2.1, 2.4, 2.8, 3.2] │ (12 of 18 weights removed)\n", - "└──────────────────────────────────────────────────┘\n", - " ↓\n", - "Step 4: Apply Pruning Mask\n", - "┌──────────────────────────────────────────────────┐\n", - "│ Layer 1: [2.1, 0.0, 0.0, 0.0, 3.2, 0.0] │\n", - "│ Layer 2: [0.0, 0.0, 2.8, 0.0, -2.1, 0.0] │ 12/18 weights → 0\n", - "│ Layer 3: [0.0, 2.4, 0.0, 1.9, 0.0, 0.0] │ 6 largest preserved\n", - "└──────────────────────────────────────────────────┘\n", - "\n", - "Memory Impact:\n", - "- Dense storage: 18 values\n", - "- Sparse storage: 6 values + 6 indices = 12 values (33% savings)\n", - "- Theoretical limit: 67% savings with perfect sparse format\n", - "```\n", - "\n", - 
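"The four-step process above can be reproduced in a few lines of NumPy (a minimal sketch using the 18 example weights from the diagram; note that NumPy's interpolated 70th percentile for this data lands at about 1.89):\n", - "\n", - "```python\n", - "import numpy as np\n", - "\n", - "# The 18 example weights from the three layers above\n", - "weights = np.array([2.1, 0.1, -1.8, 0.05, 3.2, -0.02,\n", - "                    1.5, -0.03, 2.8, 0.08, -2.1, 0.01,\n", - "                    0.7, 2.4, -0.06, 1.9, 0.04, -1.3])\n", - "\n", - "# Steps 2-3: magnitudes, then a single global threshold\n", - "threshold = np.percentile(np.abs(weights), 70)  # ~1.89 for this data\n", - "\n", - "# Step 4: zero out every weight whose magnitude falls below the threshold\n", - "pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)\n", - "\n", - "sparsity = np.mean(pruned == 0) * 100\n", - "print(f\"threshold={threshold:.2f}, sparsity={sparsity:.1f}%\")\n", - "```\n", - "\n", - "Because the threshold is computed over all layers at once, layers whose weights run systematically smaller lose disproportionately many of them - the trade-off discussed next.\n", - 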
"### Why Global Thresholding Works\n", - "\n", - "Global thresholding treats the entire model as one big collection of weights, finding a single threshold that achieves the target sparsity across all layers.\n", - "\n", - "**Advantages:**\n", - "- Simple to implement and understand\n", - "- Preserves overall model capacity\n", - "- Works well for uniform network architectures\n", - "\n", - "**Disadvantages:**\n", - "- May over-prune some layers, under-prune others\n", - "- Doesn't account for layer-specific importance\n", - "- Can hurt performance if layers have very different weight distributions" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d8f12c15", - "metadata": {}, - "outputs": [], - "source": [ - "def magnitude_prune(model, sparsity=0.9):\n", - " \"\"\"\n", - " Remove weights with smallest magnitudes to achieve target sparsity.\n", - "\n", - " TODO: Implement global magnitude-based pruning\n", - "\n", - " APPROACH:\n", - " 1. Collect all weights from the model\n", - " 2. Calculate absolute values to get magnitudes\n", - " 3. Find threshold at desired sparsity percentile\n", - " 4. 
Set weights below threshold to zero (in-place)\n", - "\n", - " EXAMPLE:\n", - " >>> model = Sequential(Linear(100, 50), Linear(50, 10))\n", - " >>> original_params = sum(p.size for p in model.parameters())\n", - " >>> magnitude_prune(model, sparsity=0.8)\n", - " >>> final_sparsity = measure_sparsity(model)\n", - " >>> print(f\"Achieved {final_sparsity:.1f}% sparsity\")\n", - " Achieved 80.0% sparsity\n", - "\n", - " HINTS:\n", - " - Use np.percentile() to find threshold\n", - " - Modify model parameters in-place\n", - " - Consider only weight matrices, not biases\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Collect all weights (excluding biases)\n", - " all_weights = []\n", - " weight_params = []\n", - "\n", - " for param in model.parameters():\n", - " # Skip biases (typically 1D)\n", - " if len(param.shape) > 1:\n", - " all_weights.extend(param.data.flatten())\n", - " weight_params.append(param)\n", - "\n", - " if not all_weights:\n", - " return\n", - "\n", - " # Calculate magnitude threshold\n", - " magnitudes = np.abs(all_weights)\n", - " threshold = np.percentile(magnitudes, sparsity * 100)\n", - "\n", - " # Apply pruning to each weight parameter\n", - " for param in weight_params:\n", - " mask = np.abs(param.data) >= threshold\n", - " param.data = param.data * mask\n", - " ### END SOLUTION\n", - "\n", - "def test_unit_magnitude_prune():\n", - " \"\"\"🔬 Test magnitude-based pruning functionality.\"\"\"\n", - " print(\"🔬 Unit Test: Magnitude Prune...\")\n", - "\n", - " # Create test model with known weights\n", - " model = Sequential(Linear(4, 3), Linear(3, 2))\n", - "\n", - " # Set specific weight values for predictable testing\n", - " model.layers[0].weight.data = np.array([\n", - " [1.0, 2.0, 3.0],\n", - " [0.1, 0.2, 0.3],\n", - " [4.0, 5.0, 6.0],\n", - " [0.01, 0.02, 0.03]\n", - " ])\n", - "\n", - " initial_sparsity = measure_sparsity(model)\n", - " assert initial_sparsity == 0.0, \"Model should start with no sparsity\"\n", - "\n", - " # Apply 50% 
pruning\n", - " magnitude_prune(model, sparsity=0.5)\n", - " final_sparsity = measure_sparsity(model)\n", - "\n", - " # Should achieve approximately 50% sparsity\n", - " assert 40 <= final_sparsity <= 60, f\"Expected ~50% sparsity, got {final_sparsity}%\"\n", - "\n", - " # Verify largest weights survived\n", - " remaining_weights = model.layers[0].weight.data[model.layers[0].weight.data != 0]\n", - " assert len(remaining_weights) > 0, \"Some weights should remain\"\n", - " assert np.all(np.abs(remaining_weights) >= 0.1), \"Large weights should survive\"\n", - "\n", - " print(\"✅ magnitude_prune works correctly!\")\n", - "\n", - "test_unit_magnitude_prune()" - ] - }, - { - "cell_type": "markdown", - "id": "8ddc8e18", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 5. Structured Pruning - Hardware-Friendly Compression\n", - "\n", - "While magnitude pruning creates scattered zeros throughout the network, structured pruning removes entire computational units (channels, neurons, heads). This creates sparsity patterns that modern hardware can actually accelerate.\n", - "\n", - "### Why Structured Pruning Matters\n", - "\n", - "Think of the difference between removing random words from a paragraph versus removing entire sentences. 
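"To make the words-versus-sentences analogy concrete, here is a small NumPy sketch (a hypothetical 4x6 weight matrix): unstructured pruning leaves the shape untouched, while dropping whole output channels produces a genuinely smaller dense matrix.\n", - "\n", - "```python\n", - "import numpy as np\n", - "\n", - "rng = np.random.default_rng(0)\n", - "W = rng.normal(size=(4, 6))  # hypothetical layer: 4 inputs, 6 output channels\n", - "\n", - "# Unstructured: scattered zeros - the shape is unchanged, so a dense\n", - "# matmul still touches every entry and sees no speedup\n", - "unstructured = np.where(np.abs(W) > 0.5, W, 0.0)\n", - "\n", - "# Structured: drop the two output channels with the smallest L2 norms -\n", - "# the survivors form a smaller dense matrix standard kernels use directly\n", - "keep = np.sort(np.linalg.norm(W, axis=0).argsort()[2:])\n", - "structured = W[:, keep]\n", - "\n", - "print(unstructured.shape, structured.shape)  # (4, 6) (4, 4)\n", - "```\n", - 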
Structured pruning removes entire \"sentences\" (channels) rather than random \"words\" (individual weights).\n", - "\n", - "```\n", - "Unstructured vs Structured Sparsity:\n", - "\n", - "UNSTRUCTURED (Magnitude Pruning):\n", - "┌─────────────────────────────────────────────┐\n", - "│ Channel 0: [2.1, 0.0, 1.8, 0.0, 3.2] │ ← Sparse weights\n", - "│ Channel 1: [0.0, 2.8, 0.0, 2.1, 0.0] │ ← Sparse weights\n", - "│ Channel 2: [1.5, 0.0, 2.4, 0.0, 1.9] │ ← Sparse weights\n", - "│ Channel 3: [0.0, 1.7, 0.0, 2.0, 0.0] │ ← Sparse weights\n", - "└─────────────────────────────────────────────┘\n", - "Issues: Irregular memory access, no hardware speedup\n", - "\n", - "STRUCTURED (Channel Pruning):\n", - "┌─────────────────────────────────────────────┐\n", - "│ Channel 0: [2.1, 1.3, 1.8, 0.9, 3.2] │ ← Fully preserved\n", - "│ Channel 1: [0.0, 0.0, 0.0, 0.0, 0.0] │ ← Fully removed\n", - "│ Channel 2: [1.5, 2.2, 2.4, 1.1, 1.9] │ ← Fully preserved\n", - "│ Channel 3: [0.0, 0.0, 0.0, 0.0, 0.0] │ ← Fully removed\n", - "└─────────────────────────────────────────────┘\n", - "Benefits: Regular patterns, hardware acceleration possible\n", - "```\n", - "\n", - "### Channel Importance Ranking\n", - "\n", - "How do we decide which channels to remove? We rank them by importance using various metrics:\n", - "\n", - "```\n", - "Channel Importance Metrics:\n", - "\n", - "Method 1: L2 Norm (Most Common)\n", - " For each output channel i:\n", - " Importance_i = ||W[:, i]||_2 = √(Σⱼ w²ⱼᵢ)\n", - "\n", - " Intuition: Channels with larger weights have bigger impact\n", - "\n", - "Method 2: Activation-Based\n", - " Importance_i = E[|activation_i|] over dataset\n", - "\n", - " Intuition: Channels that activate more are more important\n", - "\n", - "Method 3: Gradient-Based\n", - " Importance_i = |∂Loss/∂W[:, i]|\n", - "\n", - " Intuition: Channels with larger gradients affect loss more\n", - "\n", - "Ranking Process:\n", - " 1. Calculate importance for all channels\n", - " 2. 
Sort channels by importance (ascending)\n", - " 3. Remove bottom k% (least important)\n", - " 4. Zero out entire channels, not individual weights\n", - "```\n", - "\n", - "### Hardware Benefits of Structured Sparsity\n", - "\n", - "Structured sparsity enables real hardware acceleration because:\n", - "\n", - "1. **Memory Coalescing**: Accessing contiguous memory chunks is faster\n", - "2. **SIMD Operations**: Can process multiple remaining channels in parallel\n", - "3. **No Indexing Overhead**: Don't need to track locations of sparse weights\n", - "4. **Cache Efficiency**: Better spatial locality of memory access" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ede3f6c9", - "metadata": {}, - "outputs": [], - "source": [ - "def structured_prune(model, prune_ratio=0.5):\n", - " \"\"\"\n", - " Remove entire channels/neurons based on L2 norm importance.\n", - "\n", - " TODO: Implement structured pruning for Linear layers\n", - "\n", - " APPROACH:\n", - " 1. For each Linear layer, calculate L2 norm of each output channel\n", - " 2. Rank channels by importance (L2 norm)\n", - " 3. Remove lowest importance channels by setting to zero\n", - " 4. 
This creates block sparsity that's hardware-friendly\n", - "\n", - " EXAMPLE:\n", - " >>> model = Sequential(Linear(100, 50), Linear(50, 10))\n", - " >>> original_shape = model.layers[0].weight.shape\n", - " >>> structured_prune(model, prune_ratio=0.3)\n", - " >>> # 30% of channels are now completely zero\n", - " >>> final_sparsity = measure_sparsity(model)\n", - " >>> print(f\"Structured sparsity: {final_sparsity:.1f}%\")\n", - " Structured sparsity: 30.0%\n", - "\n", - " HINTS:\n", - " - Calculate L2 norm along input dimension for each output channel\n", - " - Use np.linalg.norm(weights[:, channel]) for channel importance\n", - " - Set entire channels to zero (not just individual weights)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " for layer in model.layers:\n", - " if isinstance(layer, Linear) and hasattr(layer, 'weight'):\n", - " weight = layer.weight.data\n", - "\n", - " # Calculate L2 norm for each output channel (column)\n", - " channel_norms = np.linalg.norm(weight, axis=0)\n", - "\n", - " # Find channels to prune (lowest importance)\n", - " num_channels = weight.shape[1]\n", - " num_to_prune = int(num_channels * prune_ratio)\n", - "\n", - " if num_to_prune > 0:\n", - " # Get indices of channels to prune (smallest norms)\n", - " prune_indices = np.argpartition(channel_norms, num_to_prune)[:num_to_prune]\n", - "\n", - " # Zero out entire channels\n", - " weight[:, prune_indices] = 0\n", - "\n", - " # Also zero corresponding bias elements if bias exists\n", - " if layer.bias is not None:\n", - " layer.bias.data[prune_indices] = 0\n", - " ### END SOLUTION\n", - "\n", - "def test_unit_structured_prune():\n", - " \"\"\"🔬 Test structured pruning functionality.\"\"\"\n", - " print(\"🔬 Unit Test: Structured Prune...\")\n", - "\n", - " # Create test model\n", - " model = Sequential(Linear(4, 6), Linear(6, 2))\n", - "\n", - " # Set predictable weights for testing\n", - " model.layers[0].weight.data = np.array([\n", - " [1.0, 0.1, 2.0, 0.05, 3.0, 0.01], # 
Channels with varying importance\n", - " [1.1, 0.11, 2.1, 0.06, 3.1, 0.02],\n", - " [1.2, 0.12, 2.2, 0.07, 3.2, 0.03],\n", - " [1.3, 0.13, 2.3, 0.08, 3.3, 0.04]\n", - " ])\n", - "\n", - " initial_sparsity = measure_sparsity(model)\n", - " assert initial_sparsity == 0.0, \"Model should start with no sparsity\"\n", - "\n", - " # Apply 33% structured pruning (2 out of 6 channels)\n", - " structured_prune(model, prune_ratio=0.33)\n", - " final_sparsity = measure_sparsity(model)\n", - "\n", - " # Check that some channels are completely zero\n", - " weight = model.layers[0].weight.data\n", - " zero_channels = np.sum(np.all(weight == 0, axis=0))\n", - " assert zero_channels >= 1, f\"Expected at least 1 zero channel, got {zero_channels}\"\n", - "\n", - " # Check that non-zero channels are completely preserved\n", - " for col in range(weight.shape[1]):\n", - " channel = weight[:, col]\n", - " assert np.all(channel == 0) or np.all(channel != 0), \"Channels should be fully zero or fully non-zero\"\n", - "\n", - " print(\"✅ structured_prune works correctly!\")\n", - "\n", - "test_unit_structured_prune()" - ] - }, - { - "cell_type": "markdown", - "id": "74c8202f", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 6. Low-Rank Approximation - Matrix Compression Through Factorization\n", - "\n", - "Low-rank approximation discovers that large weight matrices often contain redundant information that can be captured with much smaller matrices through mathematical decomposition.\n", - "\n", - "### The Intuition Behind Low-Rank Approximation\n", - "\n", - "Imagine you're storing a massive spreadsheet where many columns are highly correlated. 
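"The correlated-columns intuition is easy to verify numerically: build a matrix whose five columns are combinations of just two basis columns, and a rank-2 SVD reconstruction recovers it exactly (a minimal sketch with made-up data):\n", - "\n", - "```python\n", - "import numpy as np\n", - "\n", - "rng = np.random.default_rng(0)\n", - "basis = rng.normal(size=(4, 2))    # two \"basis\" columns\n", - "coeffs = rng.normal(size=(2, 5))   # how to mix them into five columns\n", - "W = basis @ coeffs                 # 4x5 matrix with rank 2\n", - "\n", - "U, S, Vt = np.linalg.svd(W, full_matrices=False)\n", - "print(np.round(S, 4))  # only the first two singular values are non-zero\n", - "\n", - "# Keeping just those two components reconstructs W exactly\n", - "W2 = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, :]\n", - "print(np.allclose(W, W2))  # True\n", - "```\n", - "\n", - "Real weight matrices are never exactly low-rank, so truncating their SVD trades a small reconstruction error for the parameter savings.\n", - 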
Instead of storing all columns separately, you could store a few \"basis\" columns and coefficients for how to combine them to recreate the original data.\n", - "\n", - "```\n", - "Low-Rank Decomposition Visualization:\n", - "\n", - "Original Matrix W (large): Factorized Form (smaller):\n", - "┌─────────────────────────┐ ┌──────────┐ ┌─────────────────────────┐\n", - "│ 2.1 1.3 0.8 1.9 2.4 │ │ 1.1 0.4 │ │ 1.9 1.2 0.7 0.6 1.4 │\n", - "│ 1.5 2.8 1.2 0.9 1.6 │ ≈ │ 2.4 0.9 │ @ │ 0.5 1.2 0.5 2.1 0.9 │\n", - "│ 0.6 1.7 2.5 1.1 0.8 │ │ 0.8 1.6 │ └─────────────────────────┘\n", - "│ 1.9 1.0 1.6 2.3 1.8 │ │ 1.6 0.3 │\n", - "└─────────────────────────┘ └──────────┘\n", - " W (4×5) = 20 params U (4×2)=8 + V (2×5)=10 = 18 params\n", - "\n", - "Parameter Reduction:\n", - "- Original: 4 × 5 = 20 parameters\n", - "- Compressed: (4 × 2) + (2 × 5) = 18 parameters\n", - "- Compression ratio: 18/20 = 0.9 (10% savings)\n", - "\n", - "For larger matrices, savings become dramatic:\n", - "- W (1000×1000): 1M parameters → U (1000×100) + V (100×1000): 200K parameters\n", - "- Compression ratio: 0.2 (80% savings)\n", - "```\n", - "\n", - "### SVD: The Mathematical Foundation\n", - "\n", - "Singular Value Decomposition (SVD) finds the optimal low-rank approximation by identifying the most important \"directions\" in the data:\n", - "\n", - "```\n", - "SVD Decomposition:\n", - " W = U × Σ × V^T\n", - "\n", - "Where:\n", - " U: Left singular vectors (input patterns)\n", - " Σ: Singular values (importance weights)\n", - " V^T: Right singular vectors (output patterns)\n", - "\n", - "Truncated SVD (Rank-k approximation):\n", - " W ≈ U[:,:k] × Σ[:k] × V^T[:k,:]\n", - "\n", - "Quality vs Compression Trade-off:\n", - " Higher k → Better approximation, less compression\n", - " Lower k → More compression, worse approximation\n", - "\n", - "Choosing Optimal Rank:\n", - " Method 1: Fixed ratio (k = ratio × min(m,n))\n", - " Method 2: Energy threshold (keep 90% of singular value energy)\n", - " Method 3: Error threshold (reconstruction 
error < threshold)\n", - "```\n", - "\n", - "### When Low-Rank Works Best\n", - "\n", - "Low-rank approximation works well when:\n", - "- **Matrices are large**: Compression benefits scale with size\n", - "- **Data has structure**: Correlated patterns enable compression\n", - "- **Moderate accuracy loss acceptable**: Some precision traded for efficiency\n", - "\n", - "It works poorly when:\n", - "- **Matrices are already small**: Overhead exceeds benefits\n", - "- **Data is random**: No patterns to exploit\n", - "- **High precision required**: SVD introduces approximation error" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bdbedbf4", - "metadata": {}, - "outputs": [], - "source": [ - "def low_rank_approximate(weight_matrix, rank_ratio=0.5):\n", - " \"\"\"\n", - " Approximate weight matrix using low-rank decomposition (SVD).\n", - "\n", - " TODO: Implement SVD-based low-rank approximation\n", - "\n", - " APPROACH:\n", - " 1. Perform SVD: W = U @ S @ V^T\n", - " 2. Keep only top k singular values where k = rank_ratio * min(dimensions)\n", - " 3. Reconstruct: W_approx = U[:,:k] @ diag(S[:k]) @ V[:k,:]\n", - " 4. 
Return decomposed matrices for memory savings\n", - "\n", - " EXAMPLE:\n", - " >>> weight = np.random.randn(100, 50)\n", - " >>> U, S, V = low_rank_approximate(weight, rank_ratio=0.3)\n", - " >>> # Original: 100*50 = 5000 params\n", - " >>> # Compressed: 100*15 + 15*50 = 2250 params (55% reduction)\n", - "\n", - " HINTS:\n", - " - Use np.linalg.svd() for decomposition\n", - " - Choose k = int(rank_ratio * min(m, n))\n", - " - Return U[:,:k], S[:k], V[:k,:] for reconstruction\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " m, n = weight_matrix.shape\n", - "\n", - " # Perform SVD\n", - " U, S, V = np.linalg.svd(weight_matrix, full_matrices=False)\n", - "\n", - " # Determine target rank\n", - " max_rank = min(m, n)\n", - " target_rank = max(1, int(rank_ratio * max_rank))\n", - "\n", - " # Truncate to target rank\n", - " U_truncated = U[:, :target_rank]\n", - " S_truncated = S[:target_rank]\n", - " V_truncated = V[:target_rank, :]\n", - "\n", - " return U_truncated, S_truncated, V_truncated\n", - " ### END SOLUTION\n", - "\n", - "def test_unit_low_rank_approximate():\n", - " \"\"\"🔬 Test low-rank approximation functionality.\"\"\"\n", - " print(\"🔬 Unit Test: Low-Rank Approximate...\")\n", - "\n", - " # Create test weight matrix\n", - " original_weight = np.random.randn(20, 15)\n", - " original_params = original_weight.size\n", - "\n", - " # Apply low-rank approximation\n", - " U, S, V = low_rank_approximate(original_weight, rank_ratio=0.4)\n", - "\n", - " # Check dimensions\n", - " target_rank = int(0.4 * min(20, 15)) # min(20,15) = 15, so 0.4*15 = 6\n", - " assert U.shape == (20, target_rank), f\"Expected U shape (20, {target_rank}), got {U.shape}\"\n", - " assert S.shape == (target_rank,), f\"Expected S shape ({target_rank},), got {S.shape}\"\n", - " assert V.shape == (target_rank, 15), f\"Expected V shape ({target_rank}, 15), got {V.shape}\"\n", - "\n", - " # Check parameter reduction\n", - " compressed_params = U.size + S.size + V.size\n", - " compression_ratio 
= compressed_params / original_params\n", - " assert compression_ratio < 1.0, f\"Should compress, but ratio is {compression_ratio}\"\n", - "\n", - " # Check reconstruction quality\n", - " reconstructed = U @ np.diag(S) @ V\n", - " reconstruction_error = np.linalg.norm(original_weight - reconstructed)\n", - " relative_error = reconstruction_error / np.linalg.norm(original_weight)\n", - " assert relative_error < 0.5, f\"Reconstruction error too high: {relative_error}\"\n", - "\n", - " print(\"✅ low_rank_approximate works correctly!\")\n", - "\n", - "test_unit_low_rank_approximate()" - ] - }, - { - "cell_type": "markdown", - "id": "a51cbe39", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 7. Knowledge Distillation - Learning from Teacher Models\n", - "\n", - "Knowledge distillation is like having an expert teacher simplify complex concepts for a student. The large \"teacher\" model shares its knowledge with a smaller \"student\" model, achieving similar performance with far fewer parameters.\n", - "\n", - "### The Teacher-Student Learning Process\n", - "\n", - "Unlike traditional training where models learn from hard labels (cat/dog), knowledge distillation uses \"soft\" targets that contain richer information about the teacher's decision-making process.\n", - "\n", - "```\n", - "Knowledge Distillation Process:\n", - "\n", - " TEACHER MODEL (Large)\n", - " ┌─────────────────────┐\n", - "Input Data ────────→│ 100M parameters │\n", - " │ 95% accuracy │\n", - " │ 500ms inference │\n", - " └─────────────────────┘\n", - " │\n", - " ↓ Soft Targets\n", - " ┌─────────────────────┐\n", - " │ Logits: [2.1, 0.3, │\n", - " │ 0.8, 4.2] │ ← Rich information\n", - " └─────────────────────┘\n", - " │\n", - " ↓ Distillation Loss\n", - " ┌─────────────────────┐\n", - "Input Data ────────→│ STUDENT MODEL │\n", - "Hard Labels ───────→│ 10M parameters │ ← 10x smaller\n", - " │ 93% accuracy │ ← 2% loss\n", - " │ 50ms inference │ ← 10x 
faster\n", - " └─────────────────────┘\n", - "\n", - "Benefits:\n", - "• Size: 10x smaller models\n", - "• Speed: 10x faster inference\n", - "• Accuracy: Only 2-5% degradation\n", - "• Knowledge transfer: Student learns teacher's \"reasoning\"\n", - "```\n", - "\n", - "### Temperature Scaling: Softening Decisions\n", - "\n", - "Temperature scaling is a key innovation that makes knowledge distillation effective. It \"softens\" the teacher's confidence, revealing uncertainty that helps the student learn.\n", - "\n", - "```\n", - "Temperature Effect on Probability Distributions:\n", - "\n", - "Without Temperature (T=1): With Temperature (T=3):\n", - "Teacher Logits: [1.0, 2.0, 0.5] Teacher Logits: [1.0, 2.0, 0.5]\n", - " ↓ ↓ ÷ 3\n", - "Softmax: [0.23, 0.63, 0.14] Logits/T: [0.33, 0.67, 0.17]\n", - " ^ ^ ^ ↓\n", - " Med High Low Softmax: [0.31, 0.43, 0.26]\n", - " ^ ^ ^\n", - "Sharp decisions (hard to learn) Soft decisions (easier to learn)\n", - "\n", - "Why Soft Targets Help:\n", - "1. Reveal teacher's uncertainty about similar classes\n", - "2. Provide richer gradients for student learning\n", - "3. Transfer knowledge about class relationships\n", - "4. 
Reduce overfitting to hard labels\n", - "```\n", - "\n", - "### Loss Function Design\n", - "\n", - "The distillation loss balances learning from both the teacher's soft knowledge and the ground truth hard labels:\n", - "\n", - "```\n", - "Combined Loss Function:\n", - "\n", - "L_total = α × L_soft + (1-α) × L_hard\n", - "\n", - "Where:\n", - " L_soft = KL_divergence(Student_soft, Teacher_soft)\n", - " │\n", - " └─ Measures how well student mimics teacher\n", - "\n", - " L_hard = CrossEntropy(Student_predictions, True_labels)\n", - " │\n", - " └─ Ensures student still learns correct answers\n", - "\n", - "Balance Parameter α:\n", - "• α = 0.7: Focus mainly on teacher (typical)\n", - "• α = 0.9: Almost pure distillation\n", - "• α = 0.3: Balance teacher and ground truth\n", - "• α = 0.0: Ignore teacher (regular training)\n", - "\n", - "Temperature T:\n", - "• T = 1: No softening (standard softmax)\n", - "• T = 3-5: Good balance (typical range)\n", - "• T = 10+: Very soft (may lose information)\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bf1a9ab1", - "metadata": {}, - "outputs": [], - "source": [ - "class KnowledgeDistillation:\n", - " \"\"\"\n", - " Knowledge distillation for model compression.\n", - "\n", - " Train a smaller student model to mimic a larger teacher model.\n", - " \"\"\"\n", - "\n", - " def __init__(self, teacher_model, student_model, temperature=3.0, alpha=0.7):\n", - " \"\"\"\n", - " Initialize knowledge distillation.\n", - "\n", - " TODO: Set up teacher and student models with distillation parameters\n", - "\n", - " APPROACH:\n", - " 1. Store teacher and student models\n", - " 2. Set temperature for softening probability distributions\n", - " 3. 
Set alpha for balancing hard vs soft targets\n", - "\n", - " Args:\n", - " teacher_model: Large, pre-trained model\n", - " student_model: Smaller model to train\n", - " temperature: Softening parameter for distributions\n", - " alpha: Weight for soft target loss (1-alpha for hard targets)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.teacher_model = teacher_model\n", - " self.student_model = student_model\n", - " self.temperature = temperature\n", - " self.alpha = alpha\n", - " ### END SOLUTION\n", - "\n", - " def distillation_loss(self, student_logits, teacher_logits, true_labels):\n", - " \"\"\"\n", - " Calculate combined distillation loss.\n", - "\n", - " TODO: Implement knowledge distillation loss function\n", - "\n", - " APPROACH:\n", - " 1. Calculate hard target loss (student vs true labels)\n", - " 2. Calculate soft target loss (student vs teacher, with temperature)\n", - " 3. Combine losses: alpha * soft_loss + (1-alpha) * hard_loss\n", - "\n", - " EXAMPLE:\n", - " >>> kd = KnowledgeDistillation(teacher, student)\n", - " >>> loss = kd.distillation_loss(student_out, teacher_out, labels)\n", - " >>> print(f\"Distillation loss: {loss:.4f}\")\n", - "\n", - " HINTS:\n", - " - Use temperature to soften distributions: logits/temperature\n", - " - Soft targets use KL divergence or cross-entropy\n", - " - Hard targets use standard classification loss\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert to numpy for this implementation\n", - " if hasattr(student_logits, 'data'):\n", - " student_logits = student_logits.data\n", - " if hasattr(teacher_logits, 'data'):\n", - " teacher_logits = teacher_logits.data\n", - " if hasattr(true_labels, 'data'):\n", - " true_labels = true_labels.data\n", - "\n", - " # Soften distributions with temperature\n", - " student_soft = self._softmax(student_logits / self.temperature)\n", - " teacher_soft = self._softmax(teacher_logits / self.temperature)\n", - "\n", - " # Soft target loss (KL divergence)\n", - " 
soft_loss = self._kl_divergence(student_soft, teacher_soft)\n", - "\n", - " # Hard target loss (cross-entropy)\n", - " student_hard = self._softmax(student_logits)\n", - " hard_loss = self._cross_entropy(student_hard, true_labels)\n", - "\n", - " # Combined loss\n", - " total_loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss\n", - "\n", - " return total_loss\n", - " ### END SOLUTION\n", - "\n", - " def _softmax(self, logits):\n", - " \"\"\"Compute softmax with numerical stability.\"\"\"\n", - " exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))\n", - " return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)\n", - "\n", - " def _kl_divergence(self, p, q):\n", - " \"\"\"Compute KL divergence between distributions.\"\"\"\n", - " return np.sum(p * np.log(p / (q + 1e-8) + 1e-8))\n", - "\n", - " def _cross_entropy(self, predictions, labels):\n", - " \"\"\"Compute cross-entropy loss.\"\"\"\n", - " # Simple implementation for integer labels\n", - " if labels.ndim == 1:\n", - " return -np.mean(np.log(predictions[np.arange(len(labels)), labels] + 1e-8))\n", - " else:\n", - " return -np.mean(np.sum(labels * np.log(predictions + 1e-8), axis=1))\n", - "\n", - "def test_unit_knowledge_distillation():\n", - " \"\"\"🔬 Test knowledge distillation functionality.\"\"\"\n", - " print(\"🔬 Unit Test: Knowledge Distillation...\")\n", - "\n", - " # Create teacher and student models\n", - " teacher = Sequential(Linear(10, 20), Linear(20, 5))\n", - " student = Sequential(Linear(10, 5)) # Smaller model\n", - "\n", - " # Initialize knowledge distillation\n", - " kd = KnowledgeDistillation(teacher, student, temperature=3.0, alpha=0.7)\n", - "\n", - " # Create dummy data\n", - " input_data = Tensor(np.random.randn(8, 10)) # Batch of 8\n", - " true_labels = np.array([0, 1, 2, 3, 4, 0, 1, 2]) # Class labels\n", - "\n", - " # Forward passes\n", - " teacher_output = teacher.forward(input_data)\n", - " student_output = student.forward(input_data)\n", - "\n", - " # 
Calculate distillation loss\n", - " loss = kd.distillation_loss(student_output, teacher_output, true_labels)\n", - "\n", - " # Verify loss is reasonable\n", - " assert isinstance(loss, (float, np.floating)), f\"Loss should be float, got {type(loss)}\"\n", - " assert loss > 0, f\"Loss should be positive, got {loss}\"\n", - " assert not np.isnan(loss), \"Loss should not be NaN\"\n", - "\n", - " print(\"✅ knowledge_distillation works correctly!\")\n", - "\n", - "test_unit_knowledge_distillation()" - ] - }, - { - "cell_type": "markdown", - "id": "bea12725", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 8. Integration: Complete Compression Pipeline\n", - "\n", - "Now let's combine all our compression techniques into a unified system that can apply multiple methods and track their cumulative effects.\n", - "\n", - "### Compression Strategy Design\n", - "\n", - "Real-world compression often combines multiple techniques in sequence, each targeting different types of redundancy:\n", - "\n", - "```\n", - "Multi-Stage Compression Pipeline:\n", - "\n", - "Original Model (100MB, 100% accuracy)\n", - " │\n", - " ↓ Stage 1: Magnitude Pruning (remove 80% of small weights)\n", - "Sparse Model (20MB, 98% accuracy)\n", - " │\n", - " ↓ Stage 2: Structured Pruning (remove 30% of channels)\n", - "Compact Model (14MB, 96% accuracy)\n", - " │\n", - " ↓ Stage 3: Low-Rank Approximation (compress large layers)\n", - "Factorized Model (10MB, 95% accuracy)\n", - " │\n", - " ↓ Stage 4: Knowledge Distillation (train smaller architecture)\n", - "Student Model (5MB, 93% accuracy)\n", - "\n", - "Final Result: 20x size reduction, 7% accuracy loss\n", - "```\n", - "\n", - "### Compression Configuration\n", - "\n", - "Different deployment scenarios require different compression strategies:\n", - "\n", - "```\n", - "Deployment Scenarios and Strategies:\n", - "\n", - "MOBILE APP (Aggressive compression needed):\n", - 
"┌─────────────────────────────────────────┐\n", - "│ Target: <10MB, <100ms inference │\n", - "│ Strategy: │\n", - "│ • Magnitude pruning: 95% sparsity │\n", - "│ • Structured pruning: 50% channels │\n", - "│ • Knowledge distillation: 10x reduction │\n", - "│ • Quantization: 8-bit weights │\n", - "└─────────────────────────────────────────┘\n", - "\n", - "EDGE DEVICE (Balanced compression):\n", - "┌─────────────────────────────────────────┐\n", - "│ Target: <50MB, <200ms inference │\n", - "│ Strategy: │\n", - "│ • Magnitude pruning: 80% sparsity │\n", - "│ • Structured pruning: 30% channels │\n", - "│ • Low-rank: 50% rank reduction │\n", - "│ • Quantization: 16-bit weights │\n", - "└─────────────────────────────────────────┘\n", - "\n", - "CLOUD SERVICE (Minimal compression):\n", - "┌─────────────────────────────────────────┐\n", - "│ Target: Maintain accuracy, reduce cost │\n", - "│ Strategy: │\n", - "│ • Magnitude pruning: 50% sparsity │\n", - "│ • Structured pruning: 10% channels │\n", - "│ • Dynamic batching optimization │\n", - "│ • Mixed precision inference │\n", - "└─────────────────────────────────────────┘\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "68de6767", - "metadata": {}, - "outputs": [], - "source": [ - "def compress_model(model, compression_config):\n", - " \"\"\"\n", - " Apply comprehensive model compression based on configuration.\n", - "\n", - " TODO: Implement complete compression pipeline\n", - "\n", - " APPROACH:\n", - " 1. Apply magnitude pruning if specified\n", - " 2. Apply structured pruning if specified\n", - " 3. Apply low-rank approximation if specified\n", - " 4. Return compression statistics\n", - "\n", - " EXAMPLE:\n", - " >>> config = {\n", - " ... 'magnitude_prune': 0.8,\n", - " ... 'structured_prune': 0.3,\n", - " ... 'low_rank': 0.5\n", - " ... 
}\n", - " >>> stats = compress_model(model, config)\n", - " >>> print(f\"Final sparsity: {stats['sparsity']:.1f}%\")\n", - " Final sparsity: 85.0%\n", - "\n", - " HINT: Apply techniques sequentially and measure results\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " original_params = sum(p.size for p in model.parameters())\n", - " original_sparsity = measure_sparsity(model)\n", - "\n", - " stats = {\n", - " 'original_params': original_params,\n", - " 'original_sparsity': original_sparsity,\n", - " 'applied_techniques': []\n", - " }\n", - "\n", - " # Apply magnitude pruning\n", - " if 'magnitude_prune' in compression_config:\n", - " sparsity = compression_config['magnitude_prune']\n", - " magnitude_prune(model, sparsity=sparsity)\n", - " stats['applied_techniques'].append(f'magnitude_prune_{sparsity}')\n", - "\n", - " # Apply structured pruning\n", - " if 'structured_prune' in compression_config:\n", - " ratio = compression_config['structured_prune']\n", - " structured_prune(model, prune_ratio=ratio)\n", - " stats['applied_techniques'].append(f'structured_prune_{ratio}')\n", - "\n", - " # Apply low-rank approximation (conceptually - would need architecture changes)\n", - " if 'low_rank' in compression_config:\n", - " ratio = compression_config['low_rank']\n", - " # For demo, we'll just record that it would be applied\n", - " stats['applied_techniques'].append(f'low_rank_{ratio}')\n", - "\n", - " # Final measurements\n", - " final_sparsity = measure_sparsity(model)\n", - " stats['final_sparsity'] = final_sparsity\n", - " stats['sparsity_increase'] = final_sparsity - original_sparsity\n", - "\n", - " return stats\n", - " ### END SOLUTION\n", - "\n", - "def test_unit_compress_model():\n", - " \"\"\"🔬 Test comprehensive model compression.\"\"\"\n", - " print(\"🔬 Unit Test: Compress Model...\")\n", - "\n", - " # Create test model\n", - " model = Sequential(Linear(20, 15), Linear(15, 10), Linear(10, 5))\n", - "\n", - " # Define compression configuration\n", - " config = 
{\n", - " 'magnitude_prune': 0.7,\n", - " 'structured_prune': 0.2\n", - " }\n", - "\n", - " # Apply compression\n", - " stats = compress_model(model, config)\n", - "\n", - " # Verify statistics\n", - " assert 'original_params' in stats, \"Should track original parameter count\"\n", - " assert 'final_sparsity' in stats, \"Should track final sparsity\"\n", - " assert 'applied_techniques' in stats, \"Should track applied techniques\"\n", - "\n", - " # Verify compression was applied\n", - " assert stats['final_sparsity'] > stats['original_sparsity'], \"Sparsity should increase\"\n", - " assert len(stats['applied_techniques']) == 2, \"Should apply both techniques\"\n", - "\n", - " # Verify model still has reasonable structure\n", - " remaining_params = sum(np.count_nonzero(p.data) for p in model.parameters())\n", - " assert remaining_params > 0, \"Model should retain some parameters\"\n", - "\n", - " print(\"✅ compress_model works correctly!\")\n", - "\n", - "test_unit_compress_model()" - ] - }, - { - "cell_type": "markdown", - "id": "78b4d5fb", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 9. Systems Analysis: Compression Performance and Trade-offs\n", - "\n", - "Understanding how compression techniques affect real-world deployment metrics like storage, memory, speed, and accuracy.\n", - "\n", - "### Compression Effectiveness Analysis\n", - "\n", - "Different techniques excel in different scenarios. Let's measure their effectiveness across various model sizes and architectures." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f8025b3f", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def analyze_compression_ratios():\n", - " \"\"\"📊 Analyze compression ratios for different techniques.\"\"\"\n", - " print(\"📊 Analyzing Compression Ratios...\")\n", - "\n", - " # Create test models of different sizes\n", - " models = {\n", - " 'Small': Sequential(Linear(50, 30), Linear(30, 10)),\n", - " 'Medium': Sequential(Linear(200, 128), Linear(128, 64), Linear(64, 10)),\n", - " 'Large': Sequential(Linear(500, 256), Linear(256, 128), Linear(128, 10))\n", - " }\n", - "\n", - " compression_techniques = [\n", - " ('Magnitude 50%', {'magnitude_prune': 0.5}),\n", - " ('Magnitude 90%', {'magnitude_prune': 0.9}),\n", - " ('Structured 30%', {'structured_prune': 0.3}),\n", - " ('Combined', {'magnitude_prune': 0.8, 'structured_prune': 0.2})\n", - " ]\n", - "\n", - " print(f\"{'Model':<8} {'Technique':<15} {'Original':<10} {'Final':<10} {'Reduction':<10}\")\n", - " print(\"-\" * 65)\n", - "\n", - " for model_name, model in models.items():\n", - " original_params = sum(p.size for p in model.parameters())\n", - "\n", - " for tech_name, config in compression_techniques:\n", - " # Create fresh copy for each test\n", - " test_model = copy.deepcopy(model)\n", - "\n", - " # Apply compression\n", - " stats = compress_model(test_model, config)\n", - "\n", - " # Calculate compression ratio\n", - " remaining_params = sum(np.count_nonzero(p.data) for p in test_model.parameters())\n", - " reduction = (1 - remaining_params / original_params) * 100\n", - "\n", - " print(f\"{model_name:<8} {tech_name:<15} {original_params:<10} {remaining_params:<10} {reduction:<9.1f}%\")\n", - "\n", - " print(\"\\n💡 Key Insights:\")\n", - " print(\"• Magnitude pruning achieves predictable sparsity levels\")\n", - " print(\"• Structured pruning creates hardware-friendly sparsity\")\n", - " print(\"• Combined techniques offer maximum 
compression\")\n", - " print(\"• Larger models compress better (more redundancy)\")\n", - "\n", - "analyze_compression_ratios()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f29e9dc0", - "metadata": {}, - "outputs": [], - "source": [ - "def analyze_compression_speed():\n", - " \"\"\"📊 Analyze inference speed with different compression levels.\"\"\"\n", - " print(\"📊 Analyzing Compression Speed Impact...\")\n", - "\n", - " # Create test model\n", - " model = Sequential(Linear(512, 256), Linear(256, 128), Linear(128, 10))\n", - " test_input = Tensor(np.random.randn(100, 512)) # Batch of 100\n", - "\n", - " def time_inference(model, input_data, iterations=50):\n", - " \"\"\"Time model inference.\"\"\"\n", - " times = []\n", - " for _ in range(iterations):\n", - " start = time.time()\n", - " _ = model.forward(input_data)\n", - " times.append(time.time() - start)\n", - " return np.mean(times[5:]) # Skip first few for warmup\n", - "\n", - " # Test different compression levels\n", - " compression_levels = [\n", - " ('Original', {}),\n", - " ('Light Pruning', {'magnitude_prune': 0.5}),\n", - " ('Heavy Pruning', {'magnitude_prune': 0.9}),\n", - " ('Structured', {'structured_prune': 0.3}),\n", - " ('Combined', {'magnitude_prune': 0.8, 'structured_prune': 0.2})\n", - " ]\n", - "\n", - " print(f\"{'Compression':<15} {'Sparsity':<10} {'Time (ms)':<12} {'Speedup':<10}\")\n", - " print(\"-\" * 50)\n", - "\n", - " baseline_time = None\n", - "\n", - " for name, config in compression_levels:\n", - " # Create fresh model copy\n", - " test_model = copy.deepcopy(model)\n", - "\n", - " # Apply compression\n", - " if config:\n", - " compress_model(test_model, config)\n", - "\n", - " # Measure performance\n", - " sparsity = measure_sparsity(test_model)\n", - " inference_time = time_inference(test_model, test_input) * 1000 # Convert to ms\n", - "\n", - " if baseline_time is None:\n", - " baseline_time = inference_time\n", - " speedup = 1.0\n", - " else:\n", - " 
speedup = baseline_time / inference_time\n", - "\n", - " print(f\"{name:<15} {sparsity:<9.1f}% {inference_time:<11.2f} {speedup:<9.2f}x\")\n", - "\n", - " print(\"\\n💡 Speed Insights:\")\n", - " print(\"• Dense matrix operations show minimal speedup from unstructured sparsity\")\n", - " print(\"• Structured sparsity enables better hardware acceleration\")\n", - " print(\"• Real speedups require sparse-optimized libraries (e.g., NVIDIA 2:4 sparsity)\")\n", - " print(\"• Memory bandwidth often more important than parameter count\")\n", - "\n", - "analyze_compression_speed()" - ] - }, - { - "cell_type": "markdown", - "id": "e6c5926b", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 10. Optimization Insights: Production Compression Strategy\n", - "\n", - "Understanding the real-world implications of compression choices and how to design compression strategies for different deployment scenarios.\n", - "\n", - "### Accuracy vs Compression Trade-offs\n", - "\n", - "The fundamental challenge in model compression is balancing three competing objectives: model size, inference speed, and prediction accuracy." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "351bffdb", - "metadata": {}, - "outputs": [], - "source": [ - "def analyze_compression_accuracy_tradeoff():\n", - " \"\"\"📊 Analyze accuracy vs compression trade-offs.\"\"\"\n", - " print(\"📊 Analyzing Accuracy vs Compression Trade-offs...\")\n", - "\n", - " # Simulate accuracy degradation (in practice, would need real training/testing)\n", - " def simulate_accuracy_loss(sparsity, technique_type):\n", - " \"\"\"Simulate realistic accuracy loss patterns.\"\"\"\n", - " if technique_type == 'magnitude':\n", - " # Magnitude pruning: gradual degradation\n", - " return max(0, sparsity * 0.3 + np.random.normal(0, 0.05))\n", - " elif technique_type == 'structured':\n", - " # Structured pruning: more aggressive early loss\n", - " return max(0, sparsity * 0.5 + np.random.normal(0, 0.1))\n", - " elif technique_type == 'knowledge_distillation':\n", - " # Knowledge distillation: better preservation\n", - " return max(0, sparsity * 0.1 + np.random.normal(0, 0.02))\n", - " else:\n", - " return sparsity * 0.4\n", - "\n", - " # Test different compression strategies\n", - " strategies = [\n", - " ('Magnitude Only', 'magnitude'),\n", - " ('Structured Only', 'structured'),\n", - " ('Knowledge Distillation', 'knowledge_distillation'),\n", - " ('Combined Approach', 'combined')\n", - " ]\n", - "\n", - " sparsity_levels = np.arange(0.1, 1.0, 0.1)\n", - "\n", - " print(f\"{'Strategy':<20} {'Sparsity':<10} {'Accuracy Loss':<15}\")\n", - " print(\"-\" * 50)\n", - "\n", - " for strategy_name, strategy_type in strategies:\n", - " print(f\"\\n{strategy_name}:\")\n", - " for sparsity in sparsity_levels:\n", - " if strategy_type == 'combined':\n", - " # Combined approach uses multiple techniques\n", - " loss = min(\n", - " simulate_accuracy_loss(sparsity * 0.7, 'magnitude'),\n", - " simulate_accuracy_loss(sparsity * 0.3, 'structured')\n", - " )\n", - " else:\n", - " loss = simulate_accuracy_loss(sparsity, strategy_type)\n", - 
"\n", - " print(f\"{'':20} {sparsity:<9.1f} {loss:<14.3f}\")\n", - "\n", - " print(\"\\n💡 Trade-off Insights:\")\n", - " print(\"• Knowledge distillation preserves accuracy best at high compression\")\n", - " print(\"• Magnitude pruning offers gradual degradation curve\")\n", - " print(\"• Structured pruning enables hardware acceleration but higher accuracy loss\")\n", - " print(\"• Combined approaches balance multiple objectives\")\n", - " print(\"• Early stopping based on accuracy threshold is crucial\")\n", - "\n", - "analyze_compression_accuracy_tradeoff()" - ] - }, - { - "cell_type": "markdown", - "id": "8a67dffa", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 11. Module Integration Test\n", - "\n", - "Final validation that all compression techniques work together correctly." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4d51b541", - "metadata": {}, - "outputs": [], - "source": [ - "def test_module():\n", - " \"\"\"\n", - " Comprehensive test of entire compression module functionality.\n", - "\n", - " This final test runs before module summary to ensure:\n", - " - All unit tests pass\n", - " - Functions work together correctly\n", - " - Module is ready for integration with TinyTorch\n", - " \"\"\"\n", - " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", - " print(\"=\" * 50)\n", - "\n", - " # Run all unit tests\n", - " print(\"Running unit tests...\")\n", - " test_unit_measure_sparsity()\n", - " test_unit_magnitude_prune()\n", - " test_unit_structured_prune()\n", - " test_unit_low_rank_approximate()\n", - " test_unit_knowledge_distillation()\n", - " test_unit_compress_model()\n", - "\n", - " print(\"\\nRunning integration scenarios...\")\n", - "\n", - " # Test 1: Complete compression pipeline\n", - " print(\"🔬 Integration Test: Complete compression pipeline...\")\n", - "\n", - " # Create a realistic model\n", - " model = Sequential(\n", - " Linear(784, 512), # Input layer (like 
MNIST)\n", - "        Linear(512, 256),  # Hidden layer 1\n", - "        Linear(256, 128),  # Hidden layer 2\n", - "        Linear(128, 10)    # Output layer\n", - "    )\n", - "\n", - "    original_params = sum(p.size for p in model.parameters())\n", - "    print(f\"Original model: {original_params:,} parameters\")\n", - "\n", - "    # Apply comprehensive compression\n", - "    compression_config = {\n", - "        'magnitude_prune': 0.8,\n", - "        'structured_prune': 0.3\n", - "    }\n", - "\n", - "    stats = compress_model(model, compression_config)\n", - "    final_sparsity = measure_sparsity(model)\n", - "\n", - "    # Validate compression results\n", - "    assert final_sparsity > 70, f\"Expected >70% sparsity, got {final_sparsity:.1f}%\"\n", - "    assert stats['sparsity_increase'] > 70, \"Should achieve significant compression\"\n", - "    assert len(stats['applied_techniques']) == 2, \"Should apply both techniques\"\n", - "\n", - "    print(f\"✅ Achieved {final_sparsity:.1f}% sparsity with {len(stats['applied_techniques'])} techniques\")\n", - "\n", - "    # Test 2: Knowledge distillation setup\n", - "    print(\"🔬 Integration Test: Knowledge distillation...\")\n", - "\n", - "    teacher = Sequential(Linear(100, 200), Linear(200, 50))\n", - "    student = Sequential(Linear(100, 50))  # ~6x fewer parameters (5,050 vs 30,250)\n", - "\n", - "    kd = KnowledgeDistillation(teacher, student, temperature=4.0, alpha=0.8)\n", - "\n", - "    # Verify setup\n", - "    teacher_params = sum(p.size for p in teacher.parameters())\n", - "    student_params = sum(p.size for p in student.parameters())\n", - "    compression_ratio = student_params / teacher_params\n", - "\n", - "    assert compression_ratio < 0.5, f\"Student should be <50% of teacher size, got {compression_ratio:.2f}\"\n", - "    assert kd.temperature == 4.0, \"Temperature should be set correctly\"\n", - "    assert kd.alpha == 0.8, \"Alpha should be set correctly\"\n", - "\n", - "    print(f\"✅ Knowledge distillation: {teacher_params / student_params:.1f}x size reduction\")\n", - "\n", - "    # Test 3: Low-rank approximation\n", - "    print(\"🔬 
Integration Test: Low-rank approximation...\")\n", - "\n", - "    large_matrix = np.random.randn(200, 150)\n", - "    U, S, V = low_rank_approximate(large_matrix, rank_ratio=0.3)\n", - "\n", - "    original_size = large_matrix.size\n", - "    compressed_size = U.size + S.size + V.size\n", - "    compression_ratio = compressed_size / original_size\n", - "\n", - "    assert compression_ratio < 0.7, f\"Should achieve compression, got ratio {compression_ratio:.2f}\"\n", - "\n", - "    # Test reconstruction\n", - "    reconstructed = U @ np.diag(S) @ V\n", - "    error = np.linalg.norm(large_matrix - reconstructed) / np.linalg.norm(large_matrix)\n", - "    assert error < 0.5, f\"Reconstruction error too high: {error:.3f}\"\n", - "\n", - "    print(f\"✅ Low-rank: {1 / compression_ratio:.1f}x compression, {error:.3f} error\")\n", - "\n", - "    print(\"\\n\" + \"=\" * 50)\n", - "    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", - "    print(\"Run: tito module complete 18\")\n", - "\n", - "# Call the integration test\n", - "test_module()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8445b205", - "metadata": {}, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - "    print(\"🚀 Running Compression module...\")\n", - "    test_module()\n", - "    print(\"✅ Module validation complete!\")" - ] - }, - { - "cell_type": "markdown", - "id": "eb215fc2", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Compression Foundations\n", - "\n", - "### Question 1: Compression Trade-offs\n", - "You implemented magnitude pruning that removes 90% of weights from a 10M parameter model.\n", - "- How many parameters remain active? _____ M parameters\n", - "- If the original model was 40MB, what's the theoretical minimum storage? _____ MB\n", - "- Why might actual speedup be less than 10x? 
_____________\n", - "\n", - "### Question 2: Structured vs Unstructured Sparsity\n", - "Your structured pruning removes entire channels, while magnitude pruning creates scattered zeros.\n", - "- Which enables better hardware acceleration? _____________\n", - "- Which preserves accuracy better at high sparsity? _____________\n", - "- Which creates more predictable memory access patterns? _____________\n", - "\n", - "### Question 3: Knowledge Distillation Efficiency\n", - "A teacher model has 100M parameters, student has 10M parameters, both achieve 85% accuracy.\n", - "- What's the compression ratio? _____x\n", - "- If teacher inference takes 100ms, student takes 15ms, what's the speedup? _____x\n", - "- Why is the speedup less than the compression ratio? _____________\n", - "\n", - "### Question 4: Low-Rank Decomposition\n", - "You approximate a (512, 256) weight matrix with rank 64 using SVD.\n", - "- Original parameter count: _____ parameters\n", - "- Decomposed parameter count: _____ parameters\n", - "- Compression ratio: _____x\n", - "- At what rank does compression become ineffective? rank > _____" - ] - }, - { - "cell_type": "markdown", - "id": "0506c01f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Compression\n", - "\n", - "Congratulations! 
You've built a comprehensive model compression system that can dramatically reduce model size while preserving intelligence!\n", - "\n", - "### Key Accomplishments\n", - "- Built magnitude-based and structured pruning techniques with clear sparsity patterns\n", - "- Implemented knowledge distillation for teacher-student compression with temperature scaling\n", - "- Created low-rank approximation using SVD decomposition for matrix factorization\n", - "- Developed sparsity measurement and comprehensive compression pipeline\n", - "- Analyzed compression trade-offs between size, speed, and accuracy with real measurements\n", - "- All tests pass ✅ (validated by `test_module()`)\n", - "\n", - "### Systems Insights Gained\n", - "- **Structured vs Unstructured**: Hardware-friendly sparsity patterns vs maximum compression ratios\n", - "- **Compression Cascading**: Multiple techniques compound benefits but require careful sequencing\n", - "- **Accuracy Preservation**: Knowledge distillation maintains performance better than pruning alone\n", - "- **Memory vs Speed**: Parameter reduction doesn't guarantee proportional speedup without sparse libraries\n", - "- **Deployment Strategy**: Different scenarios (mobile, edge, cloud) require different compression approaches\n", - "\n", - "### Technical Mastery\n", - "- **Sparsity Measurement**: Calculate and track zero weight percentages across models\n", - "- **Magnitude Pruning**: Global thresholding based on weight importance ranking\n", - "- **Structured Pruning**: Channel-wise removal using L2 norm importance metrics\n", - "- **Knowledge Distillation**: Teacher-student training with temperature-scaled soft targets\n", - "- **Low-Rank Approximation**: SVD-based matrix factorization for parameter reduction\n", - "- **Pipeline Integration**: Sequential application of multiple compression techniques\n", - "\n", - "### Ready for Next Steps\n", - "Your compression implementation enables efficient model deployment across diverse 
hardware constraints!\n", - "Export with: `tito module complete 18`\n", - "\n", - "**Next**: Module 19 will add comprehensive benchmarking to evaluate all optimization techniques together, measuring the cumulative effects of quantization, acceleration, and compression!" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}