mirror of https://github.com/MLSysBook/TinyTorch.git
synced 2026-03-25 22:59:40 -05:00
refactor: Remove old module and chapter files after reorganization
Cleanup of renamed files:
- Deleted old module source files (14_kvcaching, 15_profiling, 16_acceleration, etc.)
- Deleted old chapter markdown files
- These have been replaced by reorganized versions in previous commits
---
title: "Spatial - Convolutional Neural Networks"
description: "Build CNNs from scratch for computer vision and spatial pattern recognition"
difficulty: 3
time_estimate: "6-8 hours"
prerequisites: ["Tensor", "Activations", "Layers", "DataLoader"]
next_steps: ["Tokenization"]
learning_objectives:
  - "Implement convolution as sliding window operations with weight sharing"
  - "Design CNN architectures with feature extraction and classification components"
  - "Understand translation invariance and hierarchical feature learning"
  - "Build pooling operations for spatial downsampling and invariance"
  - "Apply computer vision principles to image classification tasks"
---
# 09. Spatial (CNNs)

**🏛️ ARCHITECTURE TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 6-8 hours

## Overview

Implement convolutional neural networks (CNNs) from scratch. This module teaches you how convolution transformed computer vision from hand-crafted features to learned hierarchical representations that power everything from image classification to autonomous driving.

## Learning Objectives

By completing this module, you will be able to:

1. **Implement convolution** as sliding window operations with explicit loops, understanding weight sharing and local connectivity
2. **Design CNN architectures** by composing convolutional, pooling, and dense layers for image classification
3. **Understand translation invariance** and why CNNs outperform dense networks on spatial data
4. **Build pooling operations** (MaxPool, AvgPool) for spatial downsampling and feature invariance
5. **Apply computer vision principles** to achieve >75% accuracy on CIFAR-10 image classification

## Why This Matters

### Production Context

CNNs are the backbone of modern computer vision systems:

- **Meta's Vision AI** uses CNN architectures to tag 2 billion photos daily across Facebook and Instagram
- **Tesla Autopilot** processes camera feeds through CNN backbones for object detection and lane recognition
- **Google Photos** built a CNN-based system that automatically organizes billions of images
- **Medical Imaging** systems use CNNs to detect cancer in X-rays and MRIs, matching or exceeding specialist accuracy on specific tasks

### Historical Context

The convolution revolution transformed computer vision:

- **LeNet (1998)**: Yann LeCun's CNN read zip codes on mail; convolution proved viable but was limited by compute
- **AlexNet (2012)**: Won ImageNet with 15.3% top-5 error (vs 26.2% for the runner-up); GPUs + convolution sparked the computer vision revolution
- **ResNet (2015)**: A 152-layer CNN reached 3.57% top-5 error, surpassing the ~5% human benchmark, and proved depth matters
- **Modern Era (2020+)**: CNNs power production vision systems processing trillions of images daily

The patterns you're implementing revolutionized how machines see.

## Pedagogical Pattern: Build → Use → Analyze

### 1. Build

Implement from first principles:

- Convolution as an explicit sliding window operation
- Conv2D layer with learnable filters and weight sharing
- MaxPool2D and AvgPool2D for spatial downsampling
- Flatten layer to connect spatial and dense layers
- Complete CNN architecture with feature extraction and classification

### 2. Use

Apply to real problems:

- Build a CNN for CIFAR-10 image classification
- Extract and visualize learned feature maps
- Compare CNN vs MLP performance on spatial data
- Achieve >75% accuracy with proper architecture
- Understand the impact of kernel size, stride, and padding

### 3. Analyze

Deep-dive into architectural choices:

- Why does weight sharing reduce parameters dramatically?
- How do early vs late layers learn different features?
- What's the trade-off between depth and width in CNNs?
- Why are pooling operations crucial for translation invariance?
- How does spatial structure preservation improve learning?

## Implementation Guide

### Core Components

**Conv2D Layer - The Heart of Computer Vision**

```python
class Conv2D:
    """2D convolutional layer with learnable filters.

    Implements sliding window convolution:
    - Applies the same filter across all spatial positions (weight sharing)
    - Each filter learns to detect different features (edges, textures, objects)
    - Output is a feature map showing where the filter activates strongly

    Args:
        in_channels: Number of input channels (3 for RGB, 16 for feature maps)
        out_channels: Number of learned filters (feature detectors)
        kernel_size: Size of sliding window (typically 3 or 5)
        stride: Step size when sliding (1 = no downsampling)
        padding: Border padding to preserve spatial dimensions
    """
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0):
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        # Initialize learnable filters
        self.weight = Tensor(shape=(out_channels, in_channels, kernel_size, kernel_size))
        self.bias = Tensor(shape=(out_channels,))

    def forward(self, x):
        # x shape: (batch, in_channels, height, width)
        # (assumes x has already been zero-padded when self.padding > 0)
        batch, _, H, W = x.shape
        kh = kw = self.kernel_size

        # Calculate output dimensions
        out_h = (H + 2 * self.padding - kh) // self.stride + 1
        out_w = (W + 2 * self.padding - kw) // self.stride + 1

        # Sliding window convolution
        output = Tensor(shape=(batch, self.out_channels, out_h, out_w))
        for b in range(batch):
            for oc in range(self.out_channels):
                for i in range(out_h):
                    for j in range(out_w):
                        # Extract local patch
                        i_start = i * self.stride
                        j_start = j * self.stride
                        patch = x[b, :, i_start:i_start+kh, j_start:j_start+kw]

                        # Convolution: element-wise multiply and sum
                        output[b, oc, i, j] = (patch * self.weight[oc]).sum() + self.bias[oc]

        return output
```
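The same sliding-window loop can be sanity-checked with a minimal NumPy version. This is a hypothetical standalone helper for a single image, not TinyTorch API, but the indexing mirrors the loop above:

```python
import numpy as np

def conv2d_naive(x, weight, bias, stride=1, padding=0):
    """Naive sliding-window convolution over a (C, H, W) input."""
    if padding > 0:
        x = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    out_c, in_c, kh, kw = weight.shape
    _, H, W = x.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_c, out_h, out_w))
    for oc in range(out_c):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[oc, i, j] = (patch * weight[oc]).sum() + bias[oc]
    return out

# 1-channel 4x4 input, a single all-ones 3x3 filter
x = np.arange(16, dtype=float).reshape(1, 4, 4)
w = np.ones((1, 1, 3, 3))
y = conv2d_naive(x, w, np.zeros(1))
print(y.shape)  # (1, 2, 2)
```

With an all-ones filter each output is just the sum of its 3x3 patch, which makes results easy to verify by hand.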

**Pooling Layers - Spatial Downsampling**

```python
class MaxPool2D:
    """Max pooling for spatial downsampling and translation invariance.

    Takes the maximum value in each local region:
    - Reduces spatial dimensions while preserving important features
    - Provides invariance to small translations
    - Reduces computation in later layers
    """
    def __init__(self, kernel_size=2, stride=None):
        self.kernel_size = kernel_size
        self.stride = stride or kernel_size

    def forward(self, x):
        batch, channels, H, W = x.shape
        kh = kw = self.kernel_size

        out_h = (H - kh) // self.stride + 1
        out_w = (W - kw) // self.stride + 1

        output = Tensor(shape=(batch, channels, out_h, out_w))
        for b in range(batch):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        i_start = i * self.stride
                        j_start = j * self.stride
                        patch = x[b, c, i_start:i_start+kh, j_start:j_start+kw]
                        output[b, c, i, j] = patch.max()

        return output
```
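The step-by-step guide later asks for AvgPool2D as well. It is the same loop with `mean()` in place of `max()`; a minimal NumPy sketch (hypothetical helper, single image):

```python
import numpy as np

def avg_pool2d(x, kernel_size=2, stride=None):
    """Average pooling over a (C, H, W) array: mean of each local window."""
    stride = stride or kernel_size
    C, H, W = x.shape
    out_h = (H - kernel_size) // stride + 1
    out_w = (W - kernel_size) // stride + 1
    out = np.zeros((C, out_h, out_w))
    for c in range(C):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[c, i*stride:i*stride+kernel_size,
                             j*stride:j*stride+kernel_size]
                out[c, i, j] = patch.mean()
    return out

x = np.arange(16, dtype=float).reshape(1, 4, 4)
print(avg_pool2d(x))  # each 2x2 block replaced by its mean
```

Max pooling keeps the strongest activation in each window; average pooling smooths it, which is why max is the usual choice between conv blocks while average pooling often appears at the very end of a network.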

**Complete CNN Architecture**

```python
class SimpleCNN:
    """CNN for CIFAR-10 classification.

    Architecture:
        Conv(3→32, 3x3) → ReLU → MaxPool(2x2)   # 32x32 → 16x16
        Conv(32→64, 3x3) → ReLU → MaxPool(2x2)  # 16x16 → 8x8
        Flatten → Dense(64*8*8 → 128) → ReLU
        Dense(128 → 10) → Softmax
    """
    def __init__(self):
        self.conv1 = Conv2D(3, 32, kernel_size=3, padding=1)
        self.relu1 = ReLU()
        self.pool1 = MaxPool2D(kernel_size=2)

        self.conv2 = Conv2D(32, 64, kernel_size=3, padding=1)
        self.relu2 = ReLU()
        self.pool2 = MaxPool2D(kernel_size=2)

        self.flatten = Flatten()
        self.fc1 = Linear(64 * 8 * 8, 128)
        self.relu3 = ReLU()
        self.fc2 = Linear(128, 10)

    def forward(self, x):
        # Feature extraction
        x = self.pool1(self.relu1(self.conv1(x)))  # (B, 32, 16, 16)
        x = self.pool2(self.relu2(self.conv2(x)))  # (B, 64, 8, 8)

        # Classification
        x = self.flatten(x)          # (B, 4096)
        x = self.relu3(self.fc1(x))  # (B, 128)
        x = self.fc2(x)              # (B, 10)
        return x
```
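SimpleCNN relies on a Flatten layer that, in array terms, is just a batch-preserving reshape. A minimal NumPy sketch (assuming array-backed tensors):

```python
import numpy as np

def flatten(x):
    """Reshape (B, C, H, W) to (B, C*H*W), keeping the batch dimension."""
    return x.reshape(x.shape[0], -1)

# After pool2, SimpleCNN's activations are (B, 64, 8, 8)
x = np.zeros((32, 64, 8, 8))
print(flatten(x).shape)  # (32, 4096)
```

The only subtlety is preserving the batch dimension: `reshape(-1)` would merge the batch into one long vector and break the downstream Linear layer.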

### Step-by-Step Implementation

1. **Implement Conv2D Forward Pass**
   - Create sliding window iteration over spatial dimensions
   - Apply weight sharing: same filter at all positions
   - Handle batch processing efficiently
   - Verify output shape calculation

2. **Build Pooling Operations**
   - Implement MaxPool2D with maximum extraction
   - Add AvgPool2D for average pooling
   - Handle stride and kernel size correctly
   - Test spatial dimension reduction

3. **Create Flatten Layer**
   - Convert (B, C, H, W) to (B, C*H*W)
   - Prepare spatial features for dense layers
   - Preserve the batch dimension
   - Enable gradient flow backward

4. **Design Complete CNN**
   - Stack Conv → ReLU → Pool blocks for feature extraction
   - Add Flatten → Dense for classification
   - Calculate dimensions at each layer
   - Test the end-to-end forward pass

5. **Train on CIFAR-10**
   - Load CIFAR-10 using Module 08's DataLoader
   - Train with cross-entropy loss and SGD
   - Track accuracy on the test set
   - Achieve >75% accuracy
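Step 4's "calculate dimensions at each layer" reduces to repeatedly applying the output-size formula from the Conv2D forward pass. A quick sketch tracing SimpleCNN's spatial sizes:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size for a conv or pool layer."""
    return (size + 2 * padding - kernel) // stride + 1

s = 32                                 # CIFAR-10 input: 32x32
s = conv_out(s, kernel=3, padding=1)   # conv1: padding=1 preserves size -> 32
s = conv_out(s, kernel=2, stride=2)    # pool1 -> 16
s = conv_out(s, kernel=3, padding=1)   # conv2 -> 16
s = conv_out(s, kernel=2, stride=2)    # pool2 -> 8
print(s, 64 * s * s)  # 8 4096, matching Linear(64 * 8 * 8, 128)
```

Running this before wiring up the Flatten → Linear boundary catches the most common CNN bug: a dense layer whose input size does not match the flattened feature map.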

## Testing

### Inline Tests (During Development)

Run inline tests while building:

```bash
cd modules/source/09_spatial
python spatial_dev.py
```

Expected output:

```
Unit Test: Conv2D implementation...
✅ Sliding window convolution works correctly
✅ Weight sharing applied at all positions
✅ Output shapes match expected dimensions
Progress: Conv2D ✓

Unit Test: MaxPool2D implementation...
✅ Maximum extraction works correctly
✅ Spatial dimensions reduced properly
✅ Translation invariance verified
Progress: Pooling ✓

Unit Test: Complete CNN architecture...
✅ Forward pass through all layers successful
✅ Output shape: (32, 10) for 10 classes
✅ Parameter count reasonable: ~500K parameters
Progress: CNN Architecture ✓
```

### Export and Validate

After completing the module:

```bash
# Export to tinytorch package
tito export 09_spatial

# Run integration tests
tito test 09_spatial
```

### CIFAR-10 Training Test

```bash
# Train simple CNN on CIFAR-10
python tests/integration/test_cnn_cifar10.py
```

Expected results:

- Epoch 1: ~35% accuracy
- Epoch 5: ~60% accuracy
- Epoch 10: ~75% accuracy

## Where This Code Lives

```
tinytorch/
├── nn/
│   └── spatial.py    # Conv2D, MaxPool2D, etc.
└── __init__.py       # Exposes CNN components
```

Usage in other modules:

```python
>>> from tinytorch.nn import Conv2D, MaxPool2D
>>> conv = Conv2D(3, 32, kernel_size=3)
>>> pool = MaxPool2D(kernel_size=2)
```

## Systems Thinking Questions

1. **Parameter Efficiency**: A Conv2D(3, 32, 3) has ~900 parameters. How many parameters would a Dense layer need to connect a 32x32 image to 32 outputs? Why is this difference critical for scaling?

2. **Translation Invariance**: Why does a CNN detect a cat regardless of whether it's in the top-left or bottom-right of an image? How does weight sharing enable this property?

3. **Hierarchical Features**: Early CNN layers detect edges and textures. Later layers detect objects and faces. How does this emerge from stacking convolutions? Why doesn't this happen in dense networks?

4. **Receptive Field Growth**: A single Conv2D(kernel=3) sees a 3x3 region. After two Conv2D layers, what region does each output see? How do deep CNNs build global context from local operations?

5. **Compute vs Memory Trade-offs**: A single 7x7 kernel has 49 weights per channel pair; a stack of three 3x3 convolutions covers the same 7x7 receptive field with only 27, at the cost of more layers and intermediate activations. Which is better, and why?
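For question 1, the arithmetic is worth doing explicitly (the exact dense-layer answer is left for you to reason about; this just shows how to set up the count):

```python
# Conv2D(3, 32, kernel_size=3): weight tensor (32, 3, 3, 3) plus 32 biases
conv_params = 32 * 3 * 3 * 3 + 32

# Dense layer from a flattened 32x32 RGB image (3072 inputs) to 32 outputs
dense_params = (32 * 32 * 3) * 32 + 32

print(conv_params, dense_params, dense_params // conv_params)
```

The convolution's parameter count is independent of the image size, while the dense layer's grows with every pixel; that independence is what makes convolution scale to megapixel inputs.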

## Real-World Connections

### Industry Applications

**Autonomous Vehicles (Tesla, Waymo)**
- Multi-camera CNN systems process 30 FPS at 1920x1200 resolution
- Feature maps from CNNs feed into object detection and segmentation
- Real-time requirements demand efficient Conv2D implementations

**Medical Imaging (PathAI, Zebra Medical)**
- CNNs analyze X-rays and CT scans for diagnostic assistance
- Match or exceed specialist performance on specific tasks (e.g., diabetic retinopathy detection)
- Architecture design is critical for the accuracy-interpretability trade-off

**Face Recognition (Apple Face ID, Facebook DeepFace)**
- CNN embeddings enable accurate face matching at billion-user scale
- Lightweight CNN architectures run on mobile devices in real time
- Privacy concerns drive on-device processing

### Research Impact

This module implements patterns from:

- LeNet-5 (1998): First widely successful CNN, applied to digit recognition
- AlexNet (2012): Sparked the deep learning revolution with CNNs + GPUs
- VGG (2014): Showed deeper is better with simple 3x3 convolutions
- ResNet (2015): Enabled training 152-layer CNNs with skip connections

## What's Next?

In **Module 10: Tokenization**, you'll shift from processing images to processing text:

- Learn how to convert text into numerical representations
- Implement tokenization strategies (character, word, subword)
- Build vocabulary management systems
- Prepare text data for transformers in Module 13

This completes the vision half of the Intelligence Tier. Next, you'll tackle language!

---

**Ready to build CNNs from scratch?** Open `modules/source/09_spatial/spatial_dev.py` and start implementing.
---
title: "KV Caching - Optimizing Transformer Inference"
description: "Cache attention key-value pairs for 10-100x faster autoregressive generation"
difficulty: 3
time_estimate: "4-5 hours"
prerequisites: ["Attention", "Transformers"]
next_steps: ["Profiling"]
learning_objectives:
  - "Implement KV caching to eliminate redundant attention computations"
  - "Design cache management systems for multi-turn conversations"
  - "Understand memory-speed trade-offs in production inference"
  - "Optimize transformer latency from O(n²) to O(n) per token"
  - "Apply caching patterns used in ChatGPT and production LLMs"
---
# 14. KV Caching

**🏛️ ARCHITECTURE TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 4-5 hours

## Overview

Implement KV (key-value) caching to optimize transformer inference. This critical production optimization reduces latency by 10-100× for autoregressive generation by caching attention keys and values, eliminating redundant recomputation.

## Learning Objectives

By completing this module, you will be able to:

1. **Implement KV caching** to eliminate redundant attention key/value computations during generation
2. **Design cache management systems** for efficient multi-turn conversation handling
3. **Understand memory-speed trade-offs** between caching everything vs recomputing on the fly
4. **Optimize transformer latency** from O(n²) to O(n) per generated token
5. **Apply caching patterns** used in ChatGPT, Claude, and all production language models

## Why This Matters

### Production Context

KV caching is mandatory for production LLM serving:

- **ChatGPT** uses KV caching for all multi-turn conversations; without it, latency would be unusable
- **Claude** caches up to 100K tokens of context, enabling long document processing
- **GitHub Copilot** caches code context to provide real-time completions
- **Google Gemini** uses multi-level caching to serve billions of requests daily

### Historical Context

Caching evolved with transformer deployment:

- **Early Transformers (2017-2019)**: No caching; research focused on training, not inference
- **GPT-2 Deployment (2019)**: KV caching implemented; enabled practical text generation
- **Production Scale (2020+)**: Multi-level caching (KV + intermediate layers); critical for economics
- **Modern Systems (2023+)**: Distributed caching across GPUs; 100K+ token contexts

Without KV caching, ChatGPT would be 50-100× slower and economically infeasible.

## Pedagogical Pattern: Build → Use → Optimize

### 1. Build

Implement from first principles:

- KV cache data structure for attention
- Cache management (append, reuse, clear)
- Cached attention forward pass
- Multi-turn conversation caching
- Memory-efficient cache storage

### 2. Use

Apply to real problems:

- Optimize a GPT decoder for text generation
- Cache conversation history for multi-turn chat
- Measure latency improvement (10-100× speedup)
- Profile memory usage vs cache size
- Compare cached vs non-cached inference

### 3. Optimize

Production-ready enhancements:

- Implement cache eviction policies (LRU, FIFO)
- Add distributed caching across GPUs
- Optimize memory layout for cache hits
- Compress cached values (quantization)
- Build cache warmup strategies

## Implementation Guide

### Core Components

**Understanding the Problem - Why Caching Helps**

```python
# WITHOUT KV caching (naive autoregressive generation):
#   Generate token 1: compute attention for [t0]
#   Generate token 2: compute attention for [t0, t1]      <- recomputes t0
#   Generate token 3: compute attention for [t0, t1, t2]  <- recomputes t0, t1
#   Generate token n: compute attention for [t0, ..., tn] <- recomputes everything
#
# Complexity: O(n²), quadratic in sequence length.
# For 100 tokens: ~5000 attention operations

# WITH KV caching:
#   Generate token 1: compute K,V for [t0], cache them
#   Generate token 2: reuse cached K,V for t0, compute only for t1
#   Generate token 3: reuse cached K,V for t0, t1, compute only for t2
#   Generate token n: reuse all cached K,V, compute only for tn
#
# Complexity: O(n), linear in sequence length.
# For 100 tokens: ~100 attention operations (a ~50× speedup!)
```
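The operation counts in the comments above can be checked numerically; the uncached total is the sum 1 + 2 + ... + n, i.e., n(n+1)/2:

```python
n = 100
without_cache = sum(t + 1 for t in range(n))  # attend over the full prefix at each step
with_cache = n                                # one new position per step
print(without_cache, with_cache, without_cache / with_cache)
```

For n = 100 this gives 5050 uncached position-attention operations versus 100 cached ones, and the gap widens linearly with sequence length.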

**KV Cache Data Structure**

```python
class KVCache:
    """Cache for attention keys and values.

    Stores computed K,V matrices to avoid recomputation during
    autoregressive generation.

    Memory layout:
        keys:   (num_layers, batch, num_heads, seq_len, d_k)
        values: (num_layers, batch, num_heads, seq_len, d_v)

    For GPT-2 (12 layers, 12 heads, 1024 positions, 64 dims):
        ~9.4M elements each for keys and values (~19M total)
        At FP16 (2 bytes per element): ~36MB per batch item
    """
    def __init__(self, num_layers, batch_size, num_heads, d_k, d_v, max_seq_len):
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.num_heads = num_heads
        self.max_seq_len = max_seq_len

        # Per-layer caches, created lazily on first append
        self.keys = {}    # {layer_idx: (batch, heads, seq_len, d_k)}
        self.values = {}  # {layer_idx: (batch, heads, seq_len, d_v)}

        # Track current sequence length
        self.seq_len = 0

    def append(self, layer_idx, new_keys, new_values):
        """Append new keys/values to the cache for a layer.

        Args:
            layer_idx: Which transformer layer
            new_keys: (batch, heads, t, d_k) - one or more new positions
            new_values: (batch, heads, t, d_v) - one or more new positions
        """
        if layer_idx not in self.keys:
            # Initialize cache for this layer
            self.keys[layer_idx] = new_keys
            self.values[layer_idx] = new_values
        else:
            # Concatenate with existing cache
            self.keys[layer_idx] = concat([self.keys[layer_idx], new_keys], dim=2)
            self.values[layer_idx] = concat([self.values[layer_idx], new_values], dim=2)

        # Update sequence length (same across all layers)
        self.seq_len = self.keys[layer_idx].shape[2]

    def get(self, layer_idx):
        """Retrieve cached keys/values for a layer.

        Returns:
            keys: (batch, heads, seq_len, d_k)
            values: (batch, heads, seq_len, d_v)
        """
        return self.keys.get(layer_idx), self.values.get(layer_idx)

    def clear(self):
        """Clear all cached data."""
        self.keys.clear()
        self.values.clear()
        self.seq_len = 0

    def memory_usage(self):
        """Calculate cache memory usage in bytes (assuming FP16 storage)."""
        total_elements = 0
        for k, v in zip(self.keys.values(), self.values.values()):
            total_elements += k.numel() + v.numel()
        # FP16: 2 bytes per element
        return total_elements * 2
```
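The docstring's memory estimate is quick to reproduce; note that keys and values each occupy a full (layers × heads × seq × d_k) tensor, so both must be counted:

```python
layers, heads, seq_len, d_k = 12, 12, 1024, 64   # GPT-2 small, full context

elements_per_matrix = layers * heads * seq_len * d_k  # one of K or V, per batch item
kv_bytes = 2 * elements_per_matrix * 2                # keys + values, FP16 = 2 bytes
print(f"{kv_bytes / 2**20:.1f} MiB per batch item")
```

This prints 36.0 MiB per sequence at full context; a serving batch of 32 therefore holds over a gigabyte of cache before any activations or weights.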

**Cached Attention Layer**

```python
class CachedMultiHeadAttention(MultiHeadAttention):
    """Multi-head attention with KV caching support.

    Extends MultiHeadAttention to cache K,V matrices during generation.
    """
    def forward(self, query, key=None, value=None, kv_cache=None, layer_idx=None):
        """Forward pass with optional KV caching.

        Args:
            query: (batch, t, d_model) - new position(s); t > 1 on the initial prompt pass
            key: (batch, seq_len, d_model) - optional, for uncached positions
            value: (batch, seq_len, d_model) - optional, for uncached positions
            kv_cache: KVCache object
            layer_idx: Which layer (for cache indexing)

        Returns:
            output: (batch, t, d_model) - attended output
            attention_weights: (batch, heads, t, seq_len) - for analysis
        """
        batch_size = query.shape[0]

        # Project query for the new position(s)
        Q = self.W_q(query)  # (batch, t, d_model)
        Q = Q.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Q: (batch, heads, t, d_k)

        if kv_cache is not None and layer_idx is not None:
            # Check if a cache exists for this layer
            cached_K, cached_V = kv_cache.get(layer_idx)

            if cached_K is None:
                # Initial pass: compute K,V for the whole prompt and cache them
                K = self.W_k(key)
                V = self.W_v(value)
                K = K.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
                V = V.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

                # Cache for future tokens
                kv_cache.append(layer_idx, K, V)
            else:
                # Subsequent tokens: compute K,V only for the new position
                new_K = self.W_k(key)  # key holds just the new position
                new_V = self.W_v(value)
                new_K = new_K.reshape(batch_size, 1, self.num_heads, self.d_k).transpose(1, 2)
                new_V = new_V.reshape(batch_size, 1, self.num_heads, self.d_k).transpose(1, 2)

                # Append to cache
                kv_cache.append(layer_idx, new_K, new_V)

                # Use the full cached K,V
                K, V = kv_cache.get(layer_idx)
        else:
            # No caching: regular attention
            K = self.W_k(key)
            V = self.W_v(value)
            K = K.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            V = V.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Compute attention against (possibly cached) K,V
        attended, attention_weights = scaled_dot_product_attention(Q, K, V)

        # Reshape output
        attended = attended.transpose(1, 2).reshape(batch_size, -1, self.d_model)
        output = self.W_o(attended)

        return output, attention_weights
```
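The correctness claim behind caching — that the cached path produces exactly the same attention output as full recomputation — can be checked with a small NumPy experiment. This strips away projections and heads (a single head, raw K/V matrices, hypothetical setup) to isolate the cache logic:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
K_full = rng.normal(size=(5, d))   # keys for positions 0..4
V_full = rng.normal(size=(5, d))   # values for positions 0..4
q = rng.normal(size=(1, d))        # query for the newest position (t = 4)

# Non-cached path: recompute K and V for the whole sequence every step
full_out = softmax(q @ K_full.T / np.sqrt(d)) @ V_full

# Cached path: K,V for positions 0..3 were stored earlier;
# only position 4's K,V are computed now and appended
K_cache, V_cache = K_full[:4], V_full[:4]
K = np.concatenate([K_cache, K_full[4:]], axis=0)
V = np.concatenate([V_cache, V_full[4:]], axis=0)
cached_out = softmax(q @ K.T / np.sqrt(d)) @ V

print(np.allclose(full_out, cached_out))  # True
```

The equality holds because K and V for past positions depend only on past inputs, which never change during generation; the cache stores bit-identical values, so only the cost changes, never the result.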

**Cached Generation - The Full Pipeline**

```python
def generate_with_cache(model, start_tokens, max_new_tokens, temperature=1.0):
    """Autoregressive generation with KV caching.

    Achieves 10-100× speedup over non-cached generation.

    Args:
        model: Transformer with KV cache support
        start_tokens: (batch, start_len) initial sequence
        max_new_tokens: Number of tokens to generate
        temperature: Sampling temperature

    Returns:
        generated: (batch, start_len + max_new_tokens) full sequence
    """
    batch_size = start_tokens.shape[0]
    generated = start_tokens

    # Initialize KV cache
    kv_cache = KVCache(
        num_layers=model.num_layers,
        batch_size=batch_size,
        num_heads=model.num_heads,
        d_k=model.d_k,
        d_v=model.d_k,  # this model uses d_v == d_k
        max_seq_len=start_tokens.shape[1] + max_new_tokens
    )

    # Process the initial sequence (fills the cache)
    _ = model.forward(start_tokens, kv_cache=kv_cache)

    # Generate tokens one at a time (uses the cache)
    for _ in range(max_new_tokens):
        # Forward pass on ONLY the last token;
        # the cache provides context from all previous tokens
        last_token = generated[:, -1:]  # (batch, 1)
        logits = model.forward(last_token, kv_cache=kv_cache)  # (batch, 1, vocab_size)

        # Sample the next token
        next_token_logits = logits[:, -1, :] / temperature
        probs = softmax(next_token_logits, dim=-1)
        next_token = sample(probs)

        # Append to the sequence
        generated = concat([generated, next_token], dim=1)

    return generated
```

### Step-by-Step Implementation

1. **Design KV Cache Structure**
   - Create storage for keys and values per layer
   - Support appending new keys/values efficiently
   - Add retrieval and clearing methods
   - Calculate memory usage

2. **Modify Attention for Caching**
   - Add a KV cache parameter to the forward pass
   - Check if a cache exists for the current layer
   - Compute only the new K,V when a cache is present
   - Concat new K,V with cached values

3. **Implement Cached Generation**
   - Initialize the cache before the generation loop
   - Process initial tokens (fill the cache)
   - Generate new tokens using cached context
   - Measure speedup vs non-cached

4. **Add Cache Management**
   - Implement cache clearing between conversations
   - Add cache size limits and eviction
   - Support batch processing with caching
   - Handle variable sequence lengths

5. **Optimize Memory Layout**
   - Use contiguous tensors for cache hits
   - Implement FP16 caching for memory savings
   - Add cache compression (quantization)
   - Profile memory bandwidth bottlenecks
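Step 4's eviction policy can be prototyped independently of the attention code. A minimal LRU sketch using `collections.OrderedDict` (the pool, names, and string stand-ins for KVCache objects are all hypothetical):

```python
from collections import OrderedDict

class LRUCachePool:
    """Hold at most `capacity` per-conversation KV caches; evict the
    least recently used. Cache objects here are opaque placeholders."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pool = OrderedDict()

    def get(self, conv_id):
        if conv_id in self.pool:
            self.pool.move_to_end(conv_id)  # mark as most recently used
            return self.pool[conv_id]
        return None

    def put(self, conv_id, cache):
        self.pool[conv_id] = cache
        self.pool.move_to_end(conv_id)
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)   # evict the least recently used entry

pool = LRUCachePool(capacity=2)
pool.put("a", "cache_a")
pool.put("b", "cache_b")
pool.get("a")             # touching "a" makes "b" the LRU entry
pool.put("c", "cache_c")  # inserting "c" evicts "b"
print(list(pool.pool))    # ['a', 'c']
```

In a real server the evicted entry's prompt must be re-encoded if that conversation returns, so eviction trades memory for occasional recomputation, which is exactly the memory-speed trade-off this module studies.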

## Testing

### Inline Tests (During Development)

Run inline tests while building:

```bash
cd modules/source/14_kvcaching
python kvcaching_dev.py
```

Expected output:

```
Unit Test: KV cache data structure...
✅ Cache initialization successful
✅ Append and retrieval work correctly
✅ Memory usage calculated: ~36MB per batch item
Progress: KV Cache ✓

Unit Test: Cached attention...
✅ First token: K,V computed and cached
✅ Subsequent tokens: reuse cached K,V
✅ Attention output matches non-cached version
Progress: Cached Attention ✓

Unit Test: Generation with caching...
✅ Generated 100 tokens with caching
✅ Speedup: 47× faster than without cache
✅ Output quality: identical to non-cached
Progress: Cached Generation ✓
```

### Export and Validate

After completing the module:

```bash
# Export to tinytorch package
tito export 14_kvcaching

# Run integration tests
tito test 14_kvcaching
```

## Where This Code Lives

```
tinytorch/
├── nn/
│   └── kvcache.py    # Your implementation goes here
└── __init__.py       # Exposes KVCache, CachedMultiHeadAttention
```

Usage in other modules:

```python
>>> from tinytorch.nn import KVCache, CachedMultiHeadAttention
>>> cache = KVCache(num_layers=12, batch_size=1, num_heads=12, d_k=64, d_v=64, max_seq_len=1024)
>>> generated = generate_with_cache(model, start_tokens, max_new_tokens=100)
```

## Systems Thinking Questions

1. **Memory-Speed Trade-off**: The GPT-2 KV cache takes roughly 36MB per sequence at full context (keys and values, FP16). For batch=32, that's over 1.1GB. If you have an 8GB GPU, how many concurrent users can you serve? What's the trade-off?

2. **Cache Invalidation**: In multi-turn chat, when should you clear the cache? What if the context exceeds max_seq_len? How do production systems handle this?

3. **Distributed Caching**: For models too large for one GPU, you need tensor parallelism. How do you partition the KV cache across GPUs? What's the communication overhead?

4. **Quantized Caching**: Storing the cache in INT8 instead of FP16 saves 50% memory. What's the accuracy impact? When is this worth it?

5. **Speculation and Prefetching**: What if you predict the next query and pre-compute the KV cache? How would you implement speculative caching?
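A back-of-envelope for question 1 can be set up in a few lines (the ~124M parameter count for GPT-2 small and the FP16 assumption are approximations, and activation memory is ignored):

```python
gpu_bytes = 8 * 2**30                      # 8 GiB GPU
kv_per_seq = 2 * 12 * 12 * 1024 * 64 * 2   # K + V, GPT-2 small, full context, FP16
model_bytes = 124_000_000 * 2              # ~124M params at FP16 (approximate)

budget = gpu_bytes - model_bytes
print(budget // kv_per_seq)  # concurrent full-length sequences that fit
```

This yields on the order of a couple hundred concurrent full-context sequences, which is why production servers care so deeply about eviction, quantized caches, and paged allocation.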
|
||||
|
||||
## Real-World Connections

### Industry Applications

**Conversational AI (OpenAI ChatGPT, Anthropic Claude)**
- KV caching for all multi-turn conversations
- Cache eviction policies for context window limits
- Memory-speed trade-offs define pricing ($/1M tokens)
- Without caching, latency would be 50-100× worse

**Code Completion (GitHub Copilot, Cursor)**
- Real-time caching of code context
- Incremental updates as the user types
- Low-latency requirements (< 100ms) mandate caching
- Cache hit rates directly impact user experience

**Search and Retrieval (Perplexity, Bing AI)**
- Cache document embeddings and attention
- Multi-stage caching (retrieval + generation)
- Distributed caching across data centers
- Cache warmup for popular queries

### Research Impact

This module implements patterns from:
- GPT-2 (2019): First large-scale use of KV caching
- Megatron-LM (2020): Distributed KV caching across GPUs
- FlashAttention (2022): Memory-efficient attention without full caching
- PagedAttention (2023): Virtual memory for KV cache management

## What's Next?

In **Module 15: Profiling**, you'll measure where time goes in your transformer:

- Profile attention, feedforward, and embedding operations
- Identify computational bottlenecks beyond caching
- Measure FLOPs, memory bandwidth, and latency
- Understand performance characteristics across architectures

The caching you implemented solves the biggest inference bottleneck—now let's find what else to optimize!

---

**Ready to implement production-critical caching?** Open `modules/source/14_kvcaching/kvcaching_dev.py` and start implementing.
@@ -1,451 +0,0 @@
---
title: "Profiling - Performance Analysis and Optimization"
description: "Build profilers to identify bottlenecks and guide optimization decisions"
difficulty: 3
time_estimate: "5-6 hours"
prerequisites: ["All modules 01-14"]
next_steps: ["Acceleration"]
learning_objectives:
- "Implement timing profilers with statistical rigor for accurate measurements"
- "Design memory profilers to track allocation patterns and identify leaks"
- "Build FLOP counters to measure computational complexity"
- "Understand performance bottlenecks across different architectures"
- "Apply data-driven analysis to guide optimization priorities"
---

# 15. Profiling

**⚡ OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours

## Overview

Build comprehensive profiling tools to measure where time and memory go in your ML systems. This module implements timing profilers, memory trackers, and FLOP counters that reveal bottlenecks and guide optimization decisions.

## Learning Objectives

By completing this module, you will be able to:

1. **Implement timing profilers** with statistical rigor (multiple runs, confidence intervals) for accurate measurements
2. **Design memory profilers** to track allocation patterns and peak usage, and identify memory leaks
3. **Build FLOP counters** to measure the theoretical computational complexity of different operations
4. **Understand performance bottlenecks** by comparing MLPs, CNNs, and Transformers systematically
5. **Apply data-driven analysis** to prioritize optimization efforts based on actual impact

## Why This Matters

### Production Context

Profiling is mandatory for production ML systems:

- **Google TPU teams** profile every operation to optimize hardware utilization
- **OpenAI** profiles GPT training to identify millions of dollars in compute savings
- **Meta** profiles inference to serve billions of requests per day efficiently
- **NVIDIA** uses profiling to optimize cuDNN kernels for peak performance

### Historical Context

Profiling evolved with ML scale:

- **Early ML (pre-2012)**: Ad-hoc timing with `time.time()`; no systematic profiling
- **Deep Learning Era (2012-2017)**: NVIDIA profilers and TensorBoard timing; focus on GPU utilization
- **Production Scale (2018+)**: Comprehensive profiling (compute, memory, I/O, network); optimization critical for economics
- **Modern Systems (2020+)**: Automated profiling and optimization; ML compilers driven by profiling data

Without profiling, you're optimizing blind—profiling shows you where to focus.
## Pedagogical Pattern: Build → Use → Optimize

### 1. Build

Implement from first principles:
- High-precision timing with multiple runs
- Statistical analysis (mean, std, confidence intervals)
- Memory profiler tracking allocations and deallocations
- FLOP counter for theoretical complexity
- Comparative profiler across architectures

### 2. Use

Apply to real problems:
- Profile attention vs feedforward in transformers
- Compare MLP vs CNN vs Transformer efficiency
- Identify memory bottlenecks in training loops
- Measure the impact of batch size on throughput
- Analyze scaling behavior with model size

### 3. Optimize

Production insights:
- Prioritize optimizations by impact (80/20 rule)
- Measure before and after each optimization
- Understand hardware utilization (CPU vs GPU)
- Identify memory bandwidth vs compute bottlenecks
- Build an optimization roadmap based on data
## Implementation Guide

### Core Components

**High-Precision Timer**
```python
import time
import numpy as np
from scipy import stats

class Timer:
    """High-precision timing with statistical analysis.

    Each `with` block records one timed run; repeat the block
    (after a few untimed warmup runs) to accumulate statistics.
    Reports mean, std, and confidence intervals.

    Example:
        timer = Timer(num_runs=10, warmup_runs=3)
        for _ in range(timer.warmup_runs):
            model.forward(x)          # warmup, not timed
        for _ in range(timer.num_runs):
            with timer:
                model.forward(x)      # timed
        print(timer.report())
    """
    def __init__(self, num_runs=10, warmup_runs=3):
        self.num_runs = num_runs
        self.warmup_runs = warmup_runs
        self.times = []

    def __enter__(self):
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, *args):
        elapsed = time.perf_counter() - self.start_time
        self.times.append(elapsed * 1000)  # convert to ms

    @property
    def mean(self):
        return np.mean(self.times)

    @property
    def std(self):
        return np.std(self.times)

    def confidence_interval(self, confidence=0.95):
        """Confidence interval for the mean using the t-distribution."""
        return stats.t.interval(confidence, len(self.times) - 1,
                                loc=self.mean, scale=stats.sem(self.times))

    def report(self):
        ci = self.confidence_interval()
        return (f"{self.mean:.3f}ms ± {self.std:.3f}ms "
                f"(95% CI: [{ci[0]:.3f}, {ci[1]:.3f}])")
```
**Memory Profiler**
```python
import time
import psutil

class MemoryProfiler:
    """Track memory allocations and peak usage.

    Monitors process memory throughout execution to identify:
    - Peak memory usage
    - Memory leaks
    - Allocation patterns
    - Memory bandwidth bottlenecks
    """
    def __init__(self):
        self.snapshots = []
        self.peak_memory = 0

    def snapshot(self, label=""):
        """Take a memory snapshot at the current point."""
        mem_info = psutil.Process().memory_info()

        snapshot = {
            'label': label,
            'rss': mem_info.rss / 1024**2,  # resident set size, MB
            'vms': mem_info.vms / 1024**2,  # virtual memory size, MB
            'timestamp': time.time()
        }
        self.snapshots.append(snapshot)
        self.peak_memory = max(self.peak_memory, snapshot['rss'])

        return snapshot

    def report(self):
        """Generate a memory usage report."""
        print(f"Peak Memory: {self.peak_memory:.2f} MB")
        print("\nMemory Timeline:")
        for snap in self.snapshots:
            print(f"  {snap['label']:30s}: {snap['rss']:8.2f} MB")

        # Calculate memory growth across the run
        if len(self.snapshots) >= 2:
            growth = self.snapshots[-1]['rss'] - self.snapshots[0]['rss']
            print(f"\nTotal Growth: {growth:+.2f} MB")

            # Flag a potential memory leak
            if growth > 100:  # arbitrary threshold (MB)
                print("⚠️  Potential memory leak detected!")
```
**FLOP Counter**
```python
class FLOPCounter:
    """Count floating-point operations for complexity analysis.

    Provides theoretical computational complexity independent of hardware.
    Useful for comparing different architectural choices.
    """
    def __init__(self):
        self.total_flops = 0
        self.op_counts = {}

    def count_matmul(self, A_shape, B_shape):
        """Count FLOPs for matrix multiplication.

        C = A @ B where A is (m, k) and B is (k, n)
        FLOPs = 2*m*k*n (a multiply and an add per output element per k step)
        """
        m, k = A_shape
        k2, n = B_shape
        assert k == k2, "Invalid matmul dimensions"

        flops = 2 * m * k * n
        self.total_flops += flops
        self.op_counts['matmul'] = self.op_counts.get('matmul', 0) + flops
        return flops

    def count_attention(self, batch, seq_len, d_model, num_heads):
        """Count FLOPs for multi-head attention.

        Components:
        - Q,K,V projections: 3 * (batch * seq_len * d_model * d_model)
        - Attention scores: batch * heads * seq_len * seq_len * d_k
        - Attention weighting: batch * heads * seq_len * seq_len * d_k
        - Output projection: batch * seq_len * d_model * d_model
        """
        d_k = d_model // num_heads

        # QKV projections (count each of the three matmuls)
        qkv_flops = sum(
            self.count_matmul((batch * seq_len, d_model), (d_model, d_model))
            for _ in range(3)
        )

        # Attention computation (scores Q@K^T, then weighting A@V)
        scores_flops = batch * num_heads * seq_len * seq_len * d_k * 2
        weights_flops = batch * num_heads * seq_len * seq_len * d_k * 2
        attention_flops = scores_flops + weights_flops
        self.total_flops += attention_flops  # matmul FLOPs were added above

        # Output projection
        output_flops = self.count_matmul((batch * seq_len, d_model), (d_model, d_model))

        total = qkv_flops + attention_flops + output_flops
        self.op_counts['attention'] = self.op_counts.get('attention', 0) + total
        return total

    def report(self):
        """Generate a FLOP report with breakdown."""
        print(f"Total FLOPs: {self.total_flops / 1e9:.2f} GFLOPs")
        print("\nBreakdown by operation:")
        for op, flops in sorted(self.op_counts.items(), key=lambda x: x[1], reverse=True):
            percentage = (flops / self.total_flops) * 100
            print(f"  {op:20s}: {flops/1e9:8.2f} GFLOPs ({percentage:5.1f}%)")
```
**Architecture Profiler - Comparative Analysis**
```python
class ArchitectureProfiler:
    """Compare performance across different architectures.

    Profiles MLP, CNN, and Transformer on the same task to understand
    compute/memory trade-offs.
    """
    def __init__(self):
        self.results = {}

    def profile_model(self, model, input_data, model_name):
        """Profile a model comprehensively."""
        result = {
            'model_name': model_name,
            # count_parameters: helper assumed from earlier modules
            'parameters': count_parameters(model),
            'timing': {},
            'memory': {},
            'flops': {}
        }

        # Timing profile: warmup runs first (untimed), then timed runs
        timer = Timer(num_runs=10)
        for _ in range(timer.warmup_runs):
            model.forward(input_data)
        for _ in range(timer.num_runs):
            with timer:
                output = model.forward(input_data)
        result['timing']['forward'] = timer.mean

        # Memory profile
        mem = MemoryProfiler()
        mem.snapshot("Before forward")
        output = model.forward(input_data)
        mem.snapshot("After forward")
        result['memory']['peak'] = mem.peak_memory

        # FLOP count (populate based on the model's architecture)
        flop_counter = FLOPCounter()
        result['flops']['total'] = flop_counter.total_flops

        self.results[model_name] = result
        return result

    def compare(self):
        """Generate a comparative report."""
        print("\nArchitecture Comparison")
        print("=" * 80)

        for name, result in self.results.items():
            print(f"\n{name}:")
            print(f"  Parameters: {result['parameters']/1e6:.2f}M")
            print(f"  Forward time: {result['timing']['forward']:.3f}ms")
            print(f"  Peak memory: {result['memory']['peak']:.2f}MB")
            print(f"  FLOPs: {result['flops']['total']/1e9:.2f}GFLOPs")
```
### Step-by-Step Implementation

1. **Build High-Precision Timer**
   - Use `time.perf_counter()` for nanosecond precision
   - Implement multiple runs with warmup
   - Calculate mean, std, and confidence intervals
   - Test with known delays

2. **Implement Memory Profiler**
   - Track memory at key points (before/after operations)
   - Calculate peak memory usage
   - Identify memory growth patterns
   - Detect potential leaks

3. **Create FLOP Counter**
   - Count operations for matmul, convolution, and attention
   - Build hierarchical counting (operation → layer → model)
   - Compare theoretical vs actual performance
   - Identify compute-bound vs memory-bound operations

4. **Build Architecture Profiler**
   - Profile MLP on MNIST/CIFAR
   - Profile CNN on CIFAR
   - Profile Transformer on text
   - Generate comparative reports

5. **Analyze Results**
   - Identify bottleneck operations (Pareto principle)
   - Compare efficiency across architectures
   - Understand scaling behavior
   - Prioritize optimization opportunities
## Testing

### Inline Tests

Run inline tests while building:
```bash
cd modules/source/15_profiling
python profiling_dev.py
```

Expected output:
```
Unit Test: Timer with statistical analysis...
✅ Multiple runs produce consistent results
✅ Confidence intervals computed correctly
✅ Warmup runs excluded from statistics
Progress: Timing Profiler ✓

Unit Test: Memory profiler...
✅ Snapshots capture memory correctly
✅ Peak memory tracked accurately
✅ Memory growth detected
Progress: Memory Profiler ✓

Unit Test: FLOP counter...
✅ Matmul FLOPs: 2*m*k*n verified
✅ Attention FLOPs match theoretical
✅ Operation breakdown correct
Progress: FLOP Counter ✓
```

### Export and Validate

```bash
tito export 15_profiling
tito test 15_profiling
```

## Where This Code Lives

```
tinytorch/
├── profiler/
│   └── profiling.py      # Your implementation goes here
└── __init__.py           # Exposes Timer, MemoryProfiler, etc.

Usage:
>>> from tinytorch.profiler import Timer, MemoryProfiler, FLOPCounter
>>> timer = Timer()
>>> with timer:
>>>     model.forward(x)
>>> print(timer.report())
```
## Systems Thinking Questions

1. **Amdahl's Law**: If attention is 70% of compute and you optimize it 2×, what's the overall speedup? Why can't you get a 2× end-to-end speedup?

2. **Memory vs Compute Bottlenecks**: Your GPU can do 100 TFLOP/s but memory bandwidth is 900 GB/s. For FP32 operations needing 4 bytes per FLOP, what's the bottleneck? When?

3. **Batch Size Impact**: Doubling batch size doesn't double throughput. Why? What's the relationship between batch size, memory, and throughput?

4. **Profiling Overhead**: Your profiler adds 5% overhead. Is this acceptable? When would you use sampling profilers vs instrumentation profilers?

5. **Hardware Differences**: Your code runs 10× slower on CPU than GPU for large matrices, but only 2× slower for small ones. Why? What's the crossover point?
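Question 1 can be checked directly with Amdahl's law, speedup = 1 / ((1 − p) + p/s), where p is the optimized fraction and s its local speedup. A quick sketch:

```python
def amdahl_speedup(p, s):
    """Overall speedup when fraction p of the work is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Attention is 70% of compute, optimized 2x:
print(f"{amdahl_speedup(0.70, 2.0):.2f}x")  # 1.54x, not 2x
```

The untouched 30% caps the end-to-end gain, which is why the answer falls well short of 2×.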
## Real-World Connections

### Industry Applications

**Google TPU Optimization**
- Profile every kernel to maximize TPU utilization
- Optimize for both FLOPs and memory bandwidth
- Use profiling to guide hardware design decisions
- Achieve 40-50% utilization (very high for accelerators)

**OpenAI Training Optimization**
- Profile GPT training to find millions of dollars in savings
- Identify gradient checkpointing opportunities
- Optimize data loading pipelines
- Achieve 50%+ MFU (model FLOPs utilization)

**Meta Inference Serving**
- Profile PyTorch models for production deployment
- Identify operator fusion opportunities
- Optimize for latency (p50, p99), not just throughput
- Serve billions of requests per day efficiently

### Research Impact

This module implements patterns from:
- TensorBoard Profiler (Google, 2019): Visual profiling for TensorFlow
- PyTorch Profiler (Meta, 2020): Comprehensive profiling for PyTorch
- NVIDIA Nsight (2021): GPU-specific profiling and optimization
- MLPerf (2022): Standardized benchmarking and profiling

## What's Next?

In **Module 16: Acceleration**, you'll use your profiling data to actually optimize:

- Implement operator fusion based on profiling insights
- Optimize memory access patterns
- Apply algorithmic improvements to bottlenecks
- Measure the impact of each optimization

Profiling shows you *what* to optimize—acceleration shows you *how* to optimize it!

---

**Ready to become a performance detective?** Open `modules/source/15_profiling/profiling_dev.py` and start implementing.
@@ -1,148 +0,0 @@
---
title: "Acceleration - Hardware-Aware Optimization"
description: "Optimize ML operations with SIMD, cache-friendly algorithms, and parallel computing"
difficulty: 4
time_estimate: "6-8 hours"
prerequisites: ["Profiling"]
next_steps: ["Quantization"]
learning_objectives:
- "Implement cache-friendly algorithms for matrix operations"
- "Apply SIMD vectorization for parallel data processing"
- "Design multi-core parallelization strategies for batch operations"
- "Understand hardware bottlenecks (compute vs memory bandwidth)"
- "Optimize ML kernels based on profiling data from Module 15"
---

# 16. Acceleration

**⚡ OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐⭐ (4/4) | Time: 6-8 hours

## Overview

Optimize ML operations through hardware-aware programming. This module implements cache-friendly algorithms, SIMD vectorization, and multi-core parallelization to achieve significant speedups based on profiling insights from Module 15.

## Learning Objectives

By completing this module, you will be able to:

1. **Implement cache-friendly algorithms** for matrix multiplication and convolution using blocked algorithms
2. **Apply SIMD vectorization** to parallelize element-wise operations across data
3. **Design multi-core parallelization strategies** for batch processing and data parallelism
4. **Understand hardware bottlenecks** (compute-bound vs memory-bound operations)
5. **Optimize ML kernels** based on actual profiling data, achieving measurable speedups

## Why This Matters

### Production Context

Hardware optimization is critical for production ML:

- **PyTorch** uses custom CUDA kernels and CPU vectorization; 100× faster than naive Python
- **TensorFlow XLA** compiles models to optimized machine code; reduces latency by 2-5×
- **ONNX Runtime** applies hardware-specific optimizations; powers Microsoft/Azure ML serving
- **Apple Neural Engine** uses custom accelerators; enables on-device ML on iPhones

### Historical Context

Hardware optimization evolved with ML scale:

- **Pre-Deep Learning (pre-2010)**: Hand-written assembly for critical loops; library implementations
- **GPU Era (2010-2017)**: CUDA kernels dominate; cuDNN becomes standard; 10-100× speedups
- **Specialized Hardware (2018+)**: TPUs and custom ASICs; compiler-based optimization
- **Modern Systems (2020+)**: ML compilers (TVM, XLA); automated kernel generation and tuning

Understanding hardware optimization separates production engineers from researchers.
## Pedagogical Pattern: Build → Use → Optimize

### 1. Build

Implement from first principles:
- Blocked matrix multiplication for cache efficiency
- SIMD-vectorized element-wise operations
- Multi-threaded batch processing
- Memory-aligned data structures
- Profiling integration

### 2. Use

Apply to real problems:
- Optimize bottlenecks identified in Module 15
- Accelerate attention computation
- Speed up convolutional operations
- Parallelize data loading pipelines
- Measure actual speedups

### 3. Optimize

Production techniques:
- Auto-tuning for different hardware
- Mixed-precision computation (FP16/FP32)
- Operator fusion to reduce memory traffic
- Batch processing for amortized overhead
- Hardware-specific code paths

## Implementation Guide

### Core Patterns

**Cache-Friendly Matrix Multiplication**
- Block matrices into cache-sized tiles
- Reuse data while it is in cache (temporal locality)
- Access memory sequentially (spatial locality)
- Typical speedup: 2-5× over a naive implementation
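The blocking idea above can be sketched in a few lines of NumPy (illustrative only, since NumPy's `@` already dispatches to an optimized BLAS; the point is the tile-reuse access pattern):

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Tiled matrix multiply: each (block x block) tile of A and B
    is reused while it is still hot in cache."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "Invalid matmul dimensions"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, block):          # tile rows of C
        for j in range(0, n, block):      # tile columns of C
            for p in range(0, k, block):  # accumulate over the shared dim
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C

A = np.random.rand(128, 96)
B = np.random.rand(96, 64)
assert np.allclose(blocked_matmul(A, B, block=32), A @ B)
```

Slicing past the array end is safe in NumPy, so ragged edge tiles are handled automatically.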
**SIMD Vectorization**
- Process multiple data elements simultaneously
- Use Numba/Cython for automatic vectorization
- Align data to vector boundaries (16/32/64 bytes)
- Typical speedup: 2-8× for element-wise ops

**Multi-Core Parallelization**
- Divide work across CPU cores
- Use thread pools for batch processing
- Minimize synchronization overhead
- Typical speedup: 0.5-0.8× the number of cores (overhead limits scaling)

## Testing

```bash
cd modules/source/16_acceleration
python acceleration_dev.py
tito export 16_acceleration
tito test 16_acceleration
```

## Where This Code Lives

```
tinytorch/
├── acceleration/
│   └── kernels.py        # Optimized implementations
└── __init__.py
```
## Systems Thinking Questions

1. **Roofline Model**: Your operation needs 1000 FLOPs and moves 100 bytes. At 100 GFLOP/s compute and 10 GB/s bandwidth, what's the bottleneck?

2. **Amdahl's Law Applied**: You parallelize 90% of the code perfectly across 8 cores. What's the maximum speedup? Why not 8×?

3. **Cache Hierarchy**: L1 cache is 10× faster than L2, which is 10× faster than RAM. How does blocking matrix multiplication exploit this?
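Question 1 works out directly from the two time estimates the roofline model compares; a sketch:

```python
def roofline_bottleneck(flops, bytes_moved, peak_flops, peak_bw):
    """Return which resource limits the operation under the roofline model."""
    compute_time = flops / peak_flops       # seconds spent on arithmetic
    memory_time = bytes_moved / peak_bw     # seconds spent moving data
    return "compute-bound" if compute_time >= memory_time else "memory-bound"

# 1000 FLOPs, 100 bytes, 100 GFLOP/s, 10 GB/s:
# compute: 1000 / 100e9 = 10 ns; memory: 100 / 10e9 = 10 ns -- exactly balanced
print(roofline_bottleneck(1000, 100, 100e9, 10e9))
```

The example sits exactly on the roofline's ridge point: any more data movement per FLOP and it becomes memory-bound.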
## Real-World Connections

**PyTorch/TensorFlow**: Custom CUDA kernels for all operations
**ONNX Runtime**: Hardware-specific optimization for production serving
**Apple ML**: Metal shaders and the Neural Engine for on-device inference

## What's Next?

In **Module 17: Quantization**, you'll reduce precision for even more speedups:
- INT8 quantization for 4× memory reduction
- Mixed-precision training and inference
- Calibration and accuracy preservation

---

**Ready to optimize for hardware?** Open `modules/source/16_acceleration/acceleration_dev.py` and start implementing.
@@ -1,113 +0,0 @@
---
title: "Quantization - Reduced Precision for Efficiency"
description: "INT8 quantization, calibration, and mixed-precision strategies"
difficulty: 3
time_estimate: "5-6 hours"
prerequisites: ["Acceleration"]
next_steps: ["Compression"]
learning_objectives:
- "Implement INT8 quantization for weights and activations"
- "Design calibration strategies to minimize accuracy loss"
- "Apply mixed-precision training and inference patterns"
- "Understand quantization-aware training vs post-training quantization"
- "Measure memory and speed improvements from reduced precision"
---

# 17. Quantization

**⚡ OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours

## Overview

Reduce model precision from FP32 to INT8 for a 4× memory reduction and a 2-4× inference speedup. This module implements quantization, calibration, and mixed-precision strategies used in production deployment.

## Learning Objectives

By completing this module, you will be able to:

1. **Implement INT8 quantization** for model weights and activations with scale/zero-point parameters
2. **Design calibration strategies** using representative data to minimize accuracy degradation
3. **Apply mixed-precision training** (FP16/FP32) for faster training with maintained accuracy
4. **Understand quantization-aware training** vs post-training quantization trade-offs
5. **Measure memory and speed improvements** while tracking accuracy impact

## Why This Matters

### Production Context

Quantization is mandatory for edge deployment:

- **TensorFlow Lite** uses INT8 quantization for mobile deployment; 4× smaller models
- **ONNX Runtime** supports INT8 inference; 2-4× faster on CPUs
- **Apple Core ML** quantizes models for the iPhone Neural Engine; enables on-device ML
- **Google Edge TPU** requires INT8; optimized hardware for quantized operations

### Historical Context

- **Pre-2017**: FP32 standard; quantization for special cases only
- **2017-2019**: INT8 post-training quantization; TensorFlow Lite adoption
- **2019-2021**: Quantization-aware training; maintains accuracy better
- **2021+**: INT4, mixed-precision, dynamic quantization; aggressive compression

Quantization enables deployment where FP32 models wouldn't fit or run fast enough.
## Implementation Guide

### Core Components

**Symmetric INT8 Quantization**
```
Quantization:   x_int8 = round(x_fp32 / scale)
Dequantization: x_fp32 = x_int8 * scale

where scale = max(|x|) / 127
```

**Asymmetric Quantization (with zero-point)**
```
Quantization:   x_int8 = round(x_fp32 / scale) + zero_point
Dequantization: x_fp32 = (x_int8 - zero_point) * scale
```

**Calibration**: Use representative data to find optimal scale/zero-point parameters
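The symmetric scheme above in runnable form. This is a minimal sketch: production kernels also handle per-channel scales and the all-zero-tensor edge case, neither of which is covered here:

```python
import numpy as np

def quantize_symmetric(x):
    """Symmetric INT8 quantization: scale maps max |x| to 127.
    Assumes x is not all zeros (scale would be 0)."""
    scale = np.abs(x).max() / 127.0
    x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_int8, scale

def dequantize(x_int8, scale):
    return x_int8.astype(np.float32) * scale

w = np.random.randn(256).astype(np.float32)
w_int8, scale = quantize_symmetric(w)
err = np.abs(dequantize(w_int8, scale) - w).max()
print(f"max abs error: {err:.5f}")  # bounded by scale / 2
```

The round-trip error is at most half a quantization step, which is the source of the small accuracy drop discussed in the questions below.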
## Testing

```bash
tito export 17_quantization
tito test 17_quantization
```

## Where This Code Lives

```
tinytorch/
├── quantization/
│   └── quantize.py
└── __init__.py
```
## Systems Thinking Questions

1. **Accuracy vs Efficiency**: INT8 loses precision. When is a <1% accuracy drop acceptable? When must you use quantization-aware training (QAT)?

2. **Per-Tensor vs Per-Channel**: Per-channel quantization preserves accuracy better but increases complexity. When is it worth it?

3. **Quantized Operations**: INT8 matmul is faster, but quantize/dequantize adds overhead. When does quantization win overall?

## Real-World Connections

**Mobile Deployment**: TensorFlow Lite and Core ML use INT8 for on-device inference
**Cloud Serving**: ONNX Runtime and TensorRT use INT8 for cost-effective serving
**Edge AI**: INT8 is required for Coral Edge TPU and Jetson Nano deployment

## What's Next?

In **Module 18: Compression**, you'll combine quantization with pruning:
- Remove unimportant weights (pruning)
- Quantize the remaining weights (INT8)
- Achieve 10-50× compression with minimal accuracy loss

---

**Ready to quantize models?** Open `modules/source/17_quantization/quantization_dev.py` and start implementing.
@@ -1,121 +0,0 @@
---
title: "Compression - Pruning and Model Compression"
description: "Prune unnecessary weights and compress models for deployment"
difficulty: 3
time_estimate: "5-6 hours"
prerequisites: ["Quantization"]
next_steps: ["Benchmarking"]
learning_objectives:
- "Implement magnitude-based pruning to remove unimportant weights"
- "Design structured pruning strategies (channel, layer-wise)"
- "Apply iterative pruning with fine-tuning for accuracy preservation"
- "Combine pruning with quantization for maximum compression"
- "Measure compression ratios and inference speedups"
---

# 18. Compression

**⚡ OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours

## Overview

Compress neural networks through pruning (removing weights) combined with quantization. This module implements techniques to achieve 10-50× compression with minimal accuracy loss, enabling deployment on resource-constrained devices.

## Learning Objectives

By completing this module, you will be able to:

1. **Implement magnitude-based pruning** to identify and remove unimportant weights
2. **Design structured pruning strategies** (channel pruning, layer-wise) for actual speedups
3. **Apply iterative pruning** with fine-tuning to maintain model accuracy
4. **Combine pruning with quantization** for maximum compression (50-100× possible)
5. **Measure compression ratios** and verify inference speedup vs accuracy trade-offs

## Why This Matters

### Production Context

Compression enables practical deployment:

- **BERT Distillation (DistilBERT)**: 40% smaller, 60% faster, 97% accuracy retention
- **MobileNet**: Structured pruning + quantization for mobile deployment
- **Lottery Ticket Hypothesis**: Sparse subnetworks can train as well as dense ones
- **GPT-3 Distillation**: Smaller models approaching GPT-3 performance

### Historical Context

- **Pre-2015**: Limited compression work; models small enough for the hardware
- **2015-2017**: Magnitude pruning (Han et al.)
- **2018-2020**: Lottery Ticket Hypothesis; structured pruning; distillation; BERT compression
- **2020+**: Extreme compression (100×); sparse transformers; efficient architectures

Compression is now standard for deployment, not optional.
## Implementation Guide

### Core Techniques

**Magnitude Pruning**
- Sort weights by absolute value
- Remove the smallest X% (typically 50-90%)
- Fine-tune the remaining weights
- Can achieve 10× compression with <1% accuracy loss
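One-shot magnitude pruning as a binary mask, a minimal sketch (in practice the mask is kept and reapplied after each fine-tuning step so that pruned weights stay zero):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-|w| fraction of weights; return (pruned, mask)."""
    threshold = np.quantile(np.abs(weights), sparsity)  # keep top (1 - sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

w = np.random.randn(64, 64)
pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"kept {mask.mean():.1%} of weights")  # ~10%
```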
**Structured Pruning**
- Remove entire channels/neurons
- Achieves actual speedup (vs unstructured sparsity)
- Typically 2-5× compression
- Larger accuracy impact

**Iterative Pruning**
- Prune gradually (10% at a time)
- Fine-tune after each pruning step
- Better accuracy than one-shot pruning
- Higher training cost

**Pruning + Quantization**
- Prune 90% of weights → 10× reduction
- Quantize FP32 → INT8 → 4× reduction
- Combined: 40× compression
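The combined ratio quoted above is just multiplicative; as a sketch (ignoring the overhead of storing sparse indices, which reduces the realized ratio in practice):

```python
def compression_ratio(sparsity, fp_bits=32, q_bits=8):
    """Combined compression from pruning (dense -> sparse) and quantization."""
    pruning_factor = 1.0 / (1.0 - sparsity)  # 90% pruned -> 10x
    quant_factor = fp_bits / q_bits          # FP32 -> INT8  -> 4x
    return pruning_factor * quant_factor

print(f"{compression_ratio(0.9):.0f}x")  # 40x
```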
## Testing

```bash
tito export 18_compression
tito test 18_compression
```

## Where This Code Lives

```
tinytorch/
├── compression/
│   └── prune.py
└── __init__.py
```

## Systems Thinking Questions

1. **Lottery Ticket Hypothesis**: Why can pruned networks retrain to full accuracy? What does this say about overparameterization?

2. **Structured vs Unstructured**: Unstructured pruning achieves better compression but no speedup on dense hardware. Why? When is sparse computation actually faster?

3. **Distillation vs Pruning**: Both compress models. When would you use each? Can you combine them?

## Real-World Connections

**DistilBERT**: A 40% smaller BERT with 97% of its performance
**MobileNetV2**: Efficient architectures + pruning for mobile
**NVIDIA TensorRT**: Automatic pruning + quantization for deployment

## What's Next?

In **Module 19: Benchmarking**, you'll measure everything you've built:
- Fair comparison across optimizations
- Statistical significance testing
- MLPerf-style benchmarking protocols
- Comprehensive performance reports

---

**Ready to compress models?** Open `modules/source/18_compression/compression_dev.py` and start implementing.
File diff suppressed because it is too large