Mirror of https://github.com/MLSysBook/TinyTorch.git (synced 2026-05-01 03:57:30 -05:00)
Milestone 04 - CNN Revolution:
✅ Complete 5-Act narrative structure (Challenge → Reflection)
✅ SimpleCNN architecture: Conv2d → ReLU → MaxPool → Linear
✅ Trains on 8x8 digits dataset (1,437 train, 360 test)
✅ Achieves 84.2% accuracy with only 810 parameters
✅ Demonstrates spatial operations preserve structure
✅ Beautiful visual output with progress tracking

Key Features:
- Conv2d (1→8 channels, 3×3 kernel) detects local patterns
- MaxPool2d (2×2) provides translation invariance
- 100× fewer parameters than equivalent MLP
- Training completes in ~105 seconds (50 epochs)
- Sample predictions table shows 9/10 correct

Module 09 Spatial Improvements:
- Removed ugly try/except import pattern
- Clean imports: 'from tinytorch.core.tensor import Tensor'
- Matches PyTorch style (simple and professional)
- No fallback logic needed

All 4 milestones now follow consistent 5-Act structure!
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a742161d",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"# Module 09: Spatial - Processing Images with Convolutions\n",
|
||
"\n",
|
||
"Welcome to Module 09! You'll implement spatial operations that transform machine learning from working with simple vectors to understanding images and spatial patterns.\n",
|
||
"\n",
|
||
"## 🔗 Prerequisites & Progress\n",
|
||
"**You've Built**: Complete training pipeline with MLPs, optimizers, and data loaders\n",
|
||
"**You'll Build**: Spatial operations - Conv2d, MaxPool2d, AvgPool2d for image processing\n",
|
||
"**You'll Enable**: Convolutional Neural Networks (CNNs) for computer vision\n",
|
||
"\n",
|
||
"**Connection Map**:\n",
|
||
"```\n",
|
||
"Training Pipeline → Spatial Operations → CNN (Milestone 03)\n",
|
||
" (MLPs) (Conv/Pool) (Computer Vision)\n",
|
||
"```\n",
|
||
"\n",
|
||
"## Learning Objectives\n",
|
||
"By the end of this module, you will:\n",
|
||
"1. Implement Conv2d with explicit loops to understand O(N²M²K²) complexity\n",
|
||
"2. Build pooling operations (Max and Average) for spatial reduction\n",
|
||
"3. Understand receptive fields and spatial feature extraction\n",
|
||
"4. Analyze memory vs computation trade-offs in spatial operations\n",
|
||
"\n",
|
||
"Let's get started!\n",
|
||
"\n",
|
||
"## 📦 Where This Code Lives in the Final Package\n",
|
||
"\n",
|
||
"**Learning Side:** You work in `modules/09_spatial/spatial_dev.py` \n",
|
||
"**Building Side:** Code exports to `tinytorch.core.spatial`\n",
|
||
"\n",
|
||
"```python\n",
|
||
"# How to use this module:\n",
|
||
"from tinytorch.core.spatial import Conv2d, MaxPool2d, AvgPool2d\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Why this matters:**\n",
|
||
"- **Learning:** Complete spatial processing system in one focused module for deep understanding\n",
|
||
"- **Production:** Proper organization like PyTorch's torch.nn.Conv2d with all spatial operations together\n",
|
||
"- **Consistency:** All convolution and pooling operations in core.spatial\n",
|
||
"- **Integration:** Works seamlessly with existing layers for complete CNN architectures"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "26448ded",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "spatial-setup",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"\n",
|
||
"\n",
|
||
"#| default_exp core.spatial\n",
|
||
"\n",
|
||
"#| export\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"from tinytorch.core.tensor import Tensor"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "eae6c314",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 1. Introduction - What are Spatial Operations?\n",
|
||
"\n",
|
||
"Spatial operations transform machine learning from working with simple vectors to understanding images and spatial patterns. When you look at a photo, your brain naturally processes spatial relationships - edges, textures, objects. Spatial operations give neural networks this same capability.\n",
|
||
"\n",
|
||
"### The Two Core Spatial Operations\n",
|
||
"\n",
|
||
"**Convolution**: Detects local patterns by sliding filters across the input\n",
|
||
"**Pooling**: Reduces spatial dimensions while preserving important features\n",
|
||
"\n",
|
||
"### Visual Example: How Convolution Works\n",
|
||
"\n",
|
||
"```\n",
|
||
"Input Image (5×5): Kernel (3×3): Output (3×3):\n",
|
||
"┌─────────────────┐ ┌─────────┐ ┌─────────┐\n",
|
||
"│ 1 2 3 4 5 │ │ 1 0 -1 │ │ ? ? ? │\n",
|
||
"│ 6 7 8 9 0 │ * │ 1 0 -1 │ = │ ? ? ? │\n",
|
||
"│ 1 2 3 4 5 │ │ 1 0 -1 │ │ ? ? ? │\n",
|
||
"│ 6 7 8 9 0 │ └─────────┘ └─────────┘\n",
|
||
"│ 1 2 3 4 5 │\n",
|
||
"└─────────────────┘\n",
|
||
"\n",
|
||
"Sliding Window Process:\n",
|
||
"Position (0,0): [1,2,3] Position (0,1): [2,3,4] Position (0,2): [3,4,5]\n",
|
||
" [6,7,8] * [7,8,9] * [8,9,0] *\n",
|
||
" [1,2,3] [2,3,4] [3,4,5]\n",
|
||
" = Output[0,0] = Output[0,1] = Output[0,2]\n",
|
||
"```\n",
|
||
"\n",
|
||
"Each output pixel summarizes a local neighborhood, allowing the network to detect patterns like edges, corners, and textures.\n",
|
||
"\n",
|
||
"### Why Spatial Operations Transform ML\n",
|
||
"\n",
|
||
"```\n",
|
||
"Without Convolution: With Convolution:\n",
|
||
"32×32×3 image = 3,072 inputs 32×32×3 → Conv → 32×32×16\n",
|
||
"↓ ↓ ↓\n",
|
||
"Dense(3072 → 1000) = 3M parameters Shared 3×3 kernel = 432 parameters\n",
|
||
"↓ ↓ ↓\n",
|
||
"Memory explosion + no spatial awareness Efficient + preserves spatial structure\n",
|
||
"```\n",
|
||
"\n",
|
"Convolution achieves dramatic parameter reduction (roughly 7,000× fewer weights in this example!) while preserving the spatial relationships that matter for visual understanding."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5d723557",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 2. Mathematical Foundations\n",
|
||
"\n",
|
||
"### Understanding Convolution Step by Step\n",
|
||
"\n",
|
||
"Convolution sounds complex, but it's just \"sliding window multiplication and summation.\" Let's see exactly how it works:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Step 1: Position the kernel over input\n",
|
||
"Input: Kernel:\n",
|
||
"┌─────────┐ ┌─────┐\n",
|
||
"│ 1 2 3 4 │ │ 1 0 │ ← Place kernel at position (0,0)\n",
|
||
"│ 5 6 7 8 │ × │ 0 1 │\n",
|
||
"│ 9 0 1 2 │ └─────┘\n",
|
||
"└─────────┘\n",
|
||
"\n",
|
||
"Step 2: Multiply corresponding elements\n",
|
||
"Overlap: Computation:\n",
|
||
"┌─────┐ 1×1 + 2×0 + 5×0 + 6×1 = 1 + 0 + 0 + 6 = 7\n",
|
||
"│ 1 2 │\n",
|
||
"│ 5 6 │\n",
|
||
"└─────┘\n",
|
||
"\n",
|
||
"Step 3: Slide kernel and repeat\n",
|
||
"Position (0,1): Position (1,0): Position (1,1):\n",
|
||
"┌─────┐ ┌─────┐ ┌─────┐\n",
|
||
"│ 2 3 │ │ 5 6 │ │ 6 7 │\n",
|
||
"│ 6 7 │ │ 9 0 │ │ 0 1 │\n",
|
||
"└─────┘ └─────┘ └─────┘\n",
|
"Result: 9              Result: 5              Result: 7\n",
|
||
"\n",
|
"Sliding the kernel across every valid position of the 3×4 input gives a 2×3 output\n",
"(positions (0,2) and (1,2) follow the same pattern):\n",
"\n",
"Final Output: ┌──────────┐\n",
"              │ 7  9  11 │\n",
"              │ 5  7   9 │\n",
"              └──────────┘\n",
|
||
"```\n",
|
||
"\n",
|
||
"### The Mathematical Formula\n",
|
||
"\n",
|
||
"For 2D convolution, we slide kernel K across input I:\n",
|
||
"```\n",
|
||
"O[i,j] = Σ Σ I[i+m, j+n] × K[m,n]\n",
|
||
" m n\n",
|
||
"```\n",
|
||
"\n",
|
||
"This formula captures the \"multiply and sum\" operation for each kernel position.\n",
|
||
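"\n",
"The sketch below (an illustration only, not part of the module's exports) applies this formula to the 3×4 input and 2×2 kernel from the walkthrough above, using plain NumPy:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# O[i,j] = sum over m,n of I[i+m, j+n] * K[m,n]\n",
"I = np.array([[1, 2, 3, 4],\n",
"              [5, 6, 7, 8],\n",
"              [9, 0, 1, 2]], dtype=float)\n",
"K = np.array([[1, 0],\n",
"              [0, 1]], dtype=float)\n",
"\n",
"out_h = I.shape[0] - K.shape[0] + 1   # 2\n",
"out_w = I.shape[1] - K.shape[1] + 1   # 3\n",
"O = np.zeros((out_h, out_w))\n",
"for i in range(out_h):\n",
"    for j in range(out_w):\n",
"        O[i, j] = np.sum(I[i:i+K.shape[0], j:j+K.shape[1]] * K)\n",
"\n",
"print(O)   # rows [7, 9, 11] and [5, 7, 9], matching the walkthrough\n",
"```\n",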
"\n",
|
||
"### Pooling: Spatial Summarization\n",
|
||
"\n",
|
||
"```\n",
|
||
"Max Pooling Example (2×2 window):\n",
|
||
"Input: Output:\n",
|
||
"┌───────────┐ ┌─────┐\n",
|
||
"│ 1 3 2 4 │ │ 6 8 │ ← max([1,3,5,6])=6, max([2,4,7,8])=8\n",
|
"│ 5 6 7 8 │     →     │ 9 9 │ ← max([2,9,0,1])=9, max([1,3,9,3])=9\n",
|
||
"│ 2 9 1 3 │ └─────┘\n",
|
||
"│ 0 1 9 3 │\n",
|
||
"└───────────┘\n",
|
||
"\n",
|
||
"Average Pooling (same window):\n",
|
"┌──────────┐ ← avg([1,3,5,6])=3.75, avg([2,4,7,8])=5.25\n",
"│3.75 5.25 │\n",
"│3.0  4.0  │ ← avg([2,9,0,1])=3.0, avg([1,3,9,3])=4.0\n",
"└──────────┘\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Why This Complexity Matters\n",
|
||
"\n",
|
||
"For convolution with input (1, 3, 224, 224) and kernel (64, 3, 3, 3):\n",
|
||
"- **Operations**: 1 × 64 × 3 × 3 × 3 × 224 × 224 = 86.7 million multiply-adds\n",
|
||
"- **Memory**: Input (600KB) + Weights (6.9KB) + Output (12.8MB) = ~13.4MB\n",
|
||
"\n",
|
||
"This is why kernel size matters enormously - a 7×7 kernel would require 5.4× more computation!\n",
|
||
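"\n",
"A quick back-of-the-envelope check of these numbers (illustrative only; it assumes float32 values and padding that keeps the output at 224×224):\n",
"\n",
"```python\n",
"# Input (1, 3, 224, 224), kernel (64, 3, 3, 3)\n",
"B, C_in, H, W = 1, 3, 224, 224\n",
"C_out, K = 64, 3\n",
"\n",
"macs = B * C_out * C_in * K * K * H * W        # ~86.7 million multiply-adds\n",
"input_mb  = C_in * H * W * 4 / 1e6             # ~0.6 MB\n",
"weight_kb = C_out * C_in * K * K * 4 / 1e3     # ~6.9 KB\n",
"output_mb = C_out * H * W * 4 / 1e6            # ~12.8 MB\n",
"\n",
"print(macs, input_mb, weight_kb, output_mb)\n",
"print((7 * 7) / (3 * 3))                       # ~5.4x more work for a 7x7 kernel\n",
"```\n",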
"\n",
|
||
"### Key Properties That Enable Deep Learning\n",
|
||
"\n",
|
||
"**Translation Equivariance**: Move the cat → detection moves the same way\n",
|
||
"**Parameter Sharing**: Same edge detector works everywhere in the image\n",
|
||
"**Local Connectivity**: Each output only looks at nearby inputs (like human vision)\n",
|
||
"**Hierarchical Features**: Early layers detect edges → later layers detect objects"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7d8b6461",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 3. Implementation - Building Spatial Operations\n",
|
||
"\n",
|
||
"Now we'll implement convolution step by step, using explicit loops so you can see and feel the computational complexity. This helps you understand why modern optimizations matter!\n",
|
||
"\n",
|
||
"### Conv2d: Detecting Patterns with Sliding Windows\n",
|
||
"\n",
|
||
"Convolution slides a small filter (kernel) across the entire input, computing weighted sums at each position. Think of it like using a template to find matching patterns everywhere in an image.\n",
|
||
"\n",
|
||
"```\n",
|
||
"Convolution Visualization:\n",
|
||
"Input (4×4): Kernel (3×3): Output (2×2):\n",
|
||
"┌─────────────┐ ┌─────────┐ ┌─────────┐\n",
|
||
"│ a b c d │ │ k1 k2 k3│ │ o1 o2 │\n",
|
||
"│ e f g h │ × │ k4 k5 k6│ = │ o3 o4 │\n",
|
||
"│ i j k l │ │ k7 k8 k9│ └─────────┘\n",
|
||
"│ m n o p │ └─────────┘\n",
|
||
"└─────────────┘\n",
|
||
"\n",
|
||
"Computation Details:\n",
|
||
"o1 = a×k1 + b×k2 + c×k3 + e×k4 + f×k5 + g×k6 + i×k7 + j×k8 + k×k9\n",
|
||
"o2 = b×k1 + c×k2 + d×k3 + f×k4 + g×k5 + h×k6 + j×k7 + k×k8 + l×k9\n",
|
||
"o3 = e×k1 + f×k2 + g×k3 + i×k4 + j×k5 + k×k6 + m×k7 + n×k8 + o×k9\n",
|
||
"o4 = f×k1 + g×k2 + h×k3 + j×k4 + k×k5 + l×k6 + n×k7 + o×k8 + p×k9\n",
|
||
"```\n",
|
||
"\n",
|
"### The Seven Nested Loops of Convolution\n",
|
||
"\n",
|
||
"Our implementation will use explicit loops to show exactly where the computational cost comes from:\n",
|
||
"\n",
|
||
"```\n",
|
||
"for batch in range(B): # Loop 1: Process each sample\n",
|
||
" for out_ch in range(C_out): # Loop 2: Generate each output channel\n",
|
||
" for out_h in range(H_out): # Loop 3: Each output row\n",
|
||
" for out_w in range(W_out): # Loop 4: Each output column\n",
|
||
" for k_h in range(K_h): # Loop 5: Each kernel row\n",
|
||
" for k_w in range(K_w): # Loop 6: Each kernel column\n",
|
||
" for in_ch in range(C_in): # Loop 7: Each input channel\n",
|
||
" # The actual multiply-accumulate operation\n",
|
||
" result += input[...] * kernel[...]\n",
|
||
"```\n",
|
||
"\n",
|
||
"Total operations: B × C_out × H_out × W_out × K_h × K_w × C_in\n",
|
||
"\n",
|
||
"For typical values (B=32, C_out=64, H_out=224, W_out=224, K_h=3, K_w=3, C_in=3):\n",
|
||
"That's 32 × 64 × 224 × 224 × 3 × 3 × 3 = **2.8 billion operations** per forward pass!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c2453317",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### Conv2d Implementation - Building the Core of Computer Vision\n",
|
||
"\n",
|
||
"Conv2d is the workhorse of computer vision. It slides learned filters across images to detect patterns like edges, textures, and eventually complex objects.\n",
|
||
"\n",
|
||
"#### How Conv2d Transforms Machine Learning\n",
|
||
"\n",
|
||
"```\n",
|
||
"Before Conv2d (Dense Only): After Conv2d (Spatial Aware):\n",
|
||
"Input: 32×32×3 = 3,072 values Input: 32×32×3 structured as image\n",
|
||
" ↓ ↓\n",
|
||
"Dense(3072→1000) = 3M params Conv2d(3→16, 3×3) = 448 params\n",
|
||
" ↓ ↓\n",
|
||
"No spatial awareness Preserves spatial relationships\n",
|
||
"Massive parameter count Parameter sharing across space\n",
|
||
"```\n",
|
||
"\n",
|
||
"#### Weight Initialization: He Initialization for ReLU Networks\n",
|
||
"\n",
|
||
"Our Conv2d uses He initialization, specifically designed for ReLU activations:\n",
|
||
"- **Problem**: Wrong initialization → vanishing/exploding gradients\n",
|
||
"- **Solution**: std = sqrt(2 / fan_in) where fan_in = channels × kernel_height × kernel_width\n",
|
||
"- **Why it works**: Maintains variance through ReLU nonlinearity\n",
|
||
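"\n",
"A minimal NumPy sketch of this initialization (the same formula the Conv2d below uses; the shapes here are just an example):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"in_channels, out_channels, kernel_h, kernel_w = 3, 16, 3, 3\n",
"fan_in = in_channels * kernel_h * kernel_w           # 27\n",
"std = np.sqrt(2.0 / fan_in)                          # ~0.27\n",
"weight = np.random.normal(0.0, std,\n",
"                          (out_channels, in_channels, kernel_h, kernel_w))\n",
"print(weight.std())   # close to 0.27 for a large enough sample\n",
"```\n",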
"\n",
|
"#### The 7-Loop Implementation Strategy\n",
|
||
"\n",
|
||
"We'll implement convolution with explicit loops to show the true computational cost:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Nested Loop Structure:\n",
|
||
"for batch: ← Process each sample in parallel (in practice)\n",
|
||
" for out_channel: ← Generate each output feature map\n",
|
||
" for out_h: ← Each row of output\n",
|
||
" for out_w: ← Each column of output\n",
|
||
" for k_h: ← Each row of kernel\n",
|
||
" for k_w: ← Each column of kernel\n",
|
||
" for in_ch: ← Accumulate across input channels\n",
|
||
" result += input[...] * weight[...]\n",
|
||
"```\n",
|
||
"\n",
|
||
"This reveals why convolution is expensive: O(B×C_out×H×W×K_h×K_w×C_in) operations!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "9d90c81a",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "conv2d-class",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"\n",
|
||
"#| export\n",
|
||
"\n",
|
||
"class Conv2d:\n",
|
||
" \"\"\"\n",
|
||
" 2D Convolution layer for spatial feature extraction.\n",
|
||
"\n",
|
||
" Implements convolution with explicit loops to demonstrate\n",
|
||
" computational complexity and memory access patterns.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" in_channels: Number of input channels\n",
|
||
" out_channels: Number of output feature maps\n",
|
||
" kernel_size: Size of convolution kernel (int or tuple)\n",
|
||
" stride: Stride of convolution (default: 1)\n",
|
||
" padding: Zero-padding added to input (default: 0)\n",
|
||
" bias: Whether to add learnable bias (default: True)\n",
|
||
" \"\"\"\n",
|
||
"\n",
|
||
" def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True):\n",
|
||
" \"\"\"\n",
|
||
" Initialize Conv2d layer with proper weight initialization.\n",
|
||
"\n",
|
||
" TODO: Complete Conv2d initialization\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Store hyperparameters (channels, kernel_size, stride, padding)\n",
|
||
" 2. Initialize weights using He initialization for ReLU compatibility\n",
|
||
" 3. Initialize bias (if enabled) to zeros\n",
|
||
" 4. Use proper shapes: weight (out_channels, in_channels, kernel_h, kernel_w)\n",
|
||
"\n",
|
||
" WEIGHT INITIALIZATION:\n",
|
||
" - He init: std = sqrt(2 / (in_channels * kernel_h * kernel_w))\n",
|
||
" - This prevents vanishing/exploding gradients with ReLU\n",
|
||
"\n",
|
||
" HINT: Convert kernel_size to tuple if it's an integer\n",
|
||
" \"\"\"\n",
|
||
" super().__init__()\n",
|
||
"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.in_channels = in_channels\n",
|
||
" self.out_channels = out_channels\n",
|
||
"\n",
|
||
" # Handle kernel_size as int or tuple\n",
|
||
" if isinstance(kernel_size, int):\n",
|
||
" self.kernel_size = (kernel_size, kernel_size)\n",
|
||
" else:\n",
|
||
" self.kernel_size = kernel_size\n",
|
||
"\n",
|
||
" self.stride = stride\n",
|
||
" self.padding = padding\n",
|
||
"\n",
|
||
" # He initialization for ReLU networks\n",
|
||
" kernel_h, kernel_w = self.kernel_size\n",
|
||
" fan_in = in_channels * kernel_h * kernel_w\n",
|
||
" std = np.sqrt(2.0 / fan_in)\n",
|
||
"\n",
|
||
" # Weight shape: (out_channels, in_channels, kernel_h, kernel_w)\n",
|
||
" self.weight = Tensor(np.random.normal(0, std,\n",
|
||
" (out_channels, in_channels, kernel_h, kernel_w)))\n",
|
||
"\n",
|
||
" # Bias initialization\n",
|
||
" if bias:\n",
|
||
" self.bias = Tensor(np.zeros(out_channels))\n",
|
||
" else:\n",
|
||
" self.bias = None\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
" def forward(self, x):\n",
|
||
" \"\"\"\n",
|
||
" Forward pass through Conv2d layer.\n",
|
||
"\n",
|
||
" TODO: Implement convolution with explicit loops\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Extract input dimensions and validate\n",
|
||
" 2. Calculate output dimensions\n",
|
||
" 3. Apply padding if needed\n",
|
" 4. Implement 7 nested loops for full convolution\n",
|
||
" 5. Add bias if present\n",
|
||
"\n",
|
||
" LOOP STRUCTURE:\n",
|
||
" for batch in range(batch_size):\n",
|
||
" for out_ch in range(out_channels):\n",
|
||
" for out_h in range(out_height):\n",
|
||
" for out_w in range(out_width):\n",
|
||
" for k_h in range(kernel_height):\n",
|
||
" for k_w in range(kernel_width):\n",
|
||
" for in_ch in range(in_channels):\n",
|
||
" # Accumulate: out += input * weight\n",
|
||
"\n",
|
||
" EXAMPLE:\n",
|
||
" >>> conv = Conv2d(3, 16, kernel_size=3, padding=1)\n",
|
||
" >>> x = Tensor(np.random.randn(2, 3, 32, 32)) # batch=2, RGB, 32x32\n",
|
||
" >>> out = conv(x)\n",
|
||
" >>> print(out.shape) # Should be (2, 16, 32, 32)\n",
|
||
"\n",
|
||
" HINTS:\n",
|
||
" - Handle padding by creating padded input array\n",
|
||
" - Watch array bounds in inner loops\n",
|
||
" - Accumulate products for each output position\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" # Input validation and shape extraction\n",
|
||
" if len(x.shape) != 4:\n",
|
||
" raise ValueError(f\"Expected 4D input (batch, channels, height, width), got {x.shape}\")\n",
|
||
"\n",
|
||
" batch_size, in_channels, in_height, in_width = x.shape\n",
|
||
" out_channels = self.out_channels\n",
|
||
" kernel_h, kernel_w = self.kernel_size\n",
|
||
"\n",
|
||
" # Calculate output dimensions\n",
|
||
" out_height = (in_height + 2 * self.padding - kernel_h) // self.stride + 1\n",
|
||
" out_width = (in_width + 2 * self.padding - kernel_w) // self.stride + 1\n",
|
||
"\n",
|
||
" # Apply padding if needed\n",
|
||
" if self.padding > 0:\n",
|
||
" padded_input = np.pad(x.data,\n",
|
||
" ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),\n",
|
||
" mode='constant', constant_values=0)\n",
|
||
" else:\n",
|
||
" padded_input = x.data\n",
|
||
"\n",
|
||
" # Initialize output\n",
|
||
" output = np.zeros((batch_size, out_channels, out_height, out_width))\n",
|
||
"\n",
|
" # Explicit 7-nested loop convolution to show complexity\n",
|
||
" for b in range(batch_size):\n",
|
||
" for out_ch in range(out_channels):\n",
|
||
" for out_h in range(out_height):\n",
|
||
" for out_w in range(out_width):\n",
|
||
" # Calculate input region for this output position\n",
|
||
" in_h_start = out_h * self.stride\n",
|
||
" in_w_start = out_w * self.stride\n",
|
||
"\n",
|
||
" # Accumulate convolution result\n",
|
||
" conv_sum = 0.0\n",
|
||
" for k_h in range(kernel_h):\n",
|
||
" for k_w in range(kernel_w):\n",
|
||
" for in_ch in range(in_channels):\n",
|
||
" # Get input and weight values\n",
|
||
" input_val = padded_input[b, in_ch,\n",
|
||
" in_h_start + k_h,\n",
|
||
" in_w_start + k_w]\n",
|
||
" weight_val = self.weight.data[out_ch, in_ch, k_h, k_w]\n",
|
||
"\n",
|
||
" # Accumulate\n",
|
||
" conv_sum += input_val * weight_val\n",
|
||
"\n",
|
||
" # Store result\n",
|
||
" output[b, out_ch, out_h, out_w] = conv_sum\n",
|
||
"\n",
|
||
" # Add bias if present\n",
|
||
" if self.bias is not None:\n",
|
||
" # Broadcast bias across spatial dimensions\n",
|
||
" for out_ch in range(out_channels):\n",
|
||
" output[:, out_ch, :, :] += self.bias.data[out_ch]\n",
|
||
"\n",
|
||
" return Tensor(output)\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
" def parameters(self):\n",
|
||
" \"\"\"Return trainable parameters.\"\"\"\n",
|
||
" params = [self.weight]\n",
|
||
" if self.bias is not None:\n",
|
||
" params.append(self.bias)\n",
|
||
" return params\n",
|
||
"\n",
|
||
" def __call__(self, x):\n",
|
||
" \"\"\"Enable model(x) syntax.\"\"\"\n",
|
||
" return self.forward(x)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "2a1949dc",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: Conv2d Implementation\n",
|
||
"This test validates our convolution implementation with different configurations.\n",
|
||
"**What we're testing**: Shape preservation, padding, stride effects\n",
|
||
"**Why it matters**: Convolution is the foundation of computer vision\n",
|
||
"**Expected**: Correct output shapes and reasonable value ranges"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "ad42d2bb",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-conv2d",
|
||
"locked": true,
|
||
"points": 15
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"\n",
|
||
"\n",
|
||
"def test_unit_conv2d():\n",
|
||
" \"\"\"🔬 Test Conv2d implementation with multiple configurations.\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Conv2d...\")\n",
|
||
"\n",
|
||
" # Test 1: Basic convolution without padding\n",
|
||
" print(\" Testing basic convolution...\")\n",
|
||
" conv1 = Conv2d(in_channels=3, out_channels=16, kernel_size=3)\n",
|
||
" x1 = Tensor(np.random.randn(2, 3, 32, 32))\n",
|
||
" out1 = conv1(x1)\n",
|
||
"\n",
|
||
" expected_h = (32 - 3) + 1 # 30\n",
|
||
" expected_w = (32 - 3) + 1 # 30\n",
|
||
" assert out1.shape == (2, 16, expected_h, expected_w), f\"Expected (2, 16, 30, 30), got {out1.shape}\"\n",
|
||
"\n",
|
||
" # Test 2: Convolution with padding (same size)\n",
|
||
" print(\" Testing convolution with padding...\")\n",
|
||
" conv2 = Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)\n",
|
||
" x2 = Tensor(np.random.randn(1, 3, 28, 28))\n",
|
||
" out2 = conv2(x2)\n",
|
||
"\n",
|
||
" # With padding=1, output should be same size as input\n",
|
||
" assert out2.shape == (1, 8, 28, 28), f\"Expected (1, 8, 28, 28), got {out2.shape}\"\n",
|
||
"\n",
|
||
" # Test 3: Convolution with stride\n",
|
||
" print(\" Testing convolution with stride...\")\n",
|
||
" conv3 = Conv2d(in_channels=1, out_channels=4, kernel_size=3, stride=2)\n",
|
||
" x3 = Tensor(np.random.randn(1, 1, 16, 16))\n",
|
||
" out3 = conv3(x3)\n",
|
||
"\n",
|
||
" expected_h = (16 - 3) // 2 + 1 # 7\n",
|
||
" expected_w = (16 - 3) // 2 + 1 # 7\n",
|
||
" assert out3.shape == (1, 4, expected_h, expected_w), f\"Expected (1, 4, 7, 7), got {out3.shape}\"\n",
|
||
"\n",
|
||
" # Test 4: Parameter counting\n",
|
||
" print(\" Testing parameter counting...\")\n",
|
||
" conv4 = Conv2d(in_channels=64, out_channels=128, kernel_size=3, bias=True)\n",
|
||
" params = conv4.parameters()\n",
|
||
"\n",
|
||
" # Weight: (128, 64, 3, 3) = 73,728 parameters\n",
|
||
" # Bias: (128,) = 128 parameters\n",
|
||
" # Total: 73,856 parameters\n",
|
||
" weight_params = 128 * 64 * 3 * 3\n",
|
||
" bias_params = 128\n",
|
||
" total_params = weight_params + bias_params\n",
|
||
"\n",
|
||
" actual_weight_params = np.prod(conv4.weight.shape)\n",
|
||
" actual_bias_params = np.prod(conv4.bias.shape) if conv4.bias is not None else 0\n",
|
||
" actual_total = actual_weight_params + actual_bias_params\n",
|
||
"\n",
|
||
" assert actual_total == total_params, f\"Expected {total_params} parameters, got {actual_total}\"\n",
|
||
" assert len(params) == 2, f\"Expected 2 parameter tensors, got {len(params)}\"\n",
|
||
"\n",
|
||
" # Test 5: No bias configuration\n",
|
||
" print(\" Testing no bias configuration...\")\n",
|
||
" conv5 = Conv2d(in_channels=3, out_channels=16, kernel_size=5, bias=False)\n",
|
||
" params5 = conv5.parameters()\n",
|
||
" assert len(params5) == 1, f\"Expected 1 parameter tensor (no bias), got {len(params5)}\"\n",
|
||
" assert conv5.bias is None, \"Bias should be None when bias=False\"\n",
|
||
"\n",
|
||
" print(\"✅ Conv2d works correctly!\")\n",
|
||
"\n",
|
||
"if __name__ == \"__main__\":\n",
|
||
" test_unit_conv2d()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "2bac6b87",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 4. Pooling Operations - Spatial Dimension Reduction\n",
|
||
"\n",
|
||
"Pooling operations compress spatial information while keeping the most important features. Think of them as creating \"thumbnail summaries\" of local regions.\n",
|
||
"\n",
|
||
"### MaxPool2d: Keeping the Strongest Signals\n",
|
||
"\n",
|
||
"Max pooling finds the strongest activation in each window, preserving sharp features like edges and corners.\n",
|
||
"\n",
|
||
"```\n",
|
||
"MaxPool2d Example (2×2 kernel, stride=2):\n",
|
||
"Input (4×4): Windows: Output (2×2):\n",
|
||
"┌─────────────┐ ┌─────┬─────┐ ┌─────┐\n",
|
||
"│ 1 3 │ 2 8 │ │ 1 3 │ 2 8 │ │ 6 8 │\n",
|
||
"│ 5 6 │ 7 4 │ → │ 5 6 │ 7 4 │ → │ 9 7 │\n",
|
||
"├─────┼─────┤ ├─────┼─────┤ └─────┘\n",
|
||
"│ 2 9 │ 1 7 │ │ 2 9 │ 1 7 │\n",
|
||
"│ 0 1 │ 3 6 │ │ 0 1 │ 3 6 │\n",
|
||
"└─────────────┘ └─────┴─────┘\n",
|
||
"\n",
|
||
"Window Computations:\n",
|
||
"Top-left: max(1,3,5,6) = 6 Top-right: max(2,8,7,4) = 8\n",
|
||
"Bottom-left: max(2,9,0,1) = 9 Bottom-right: max(1,7,3,6) = 7\n",
|
||
"```\n",
|
||
"\n",
|
||
"### AvgPool2d: Smoothing Local Features\n",
|
||
"\n",
|
||
"Average pooling computes the mean of each window, creating smoother, more general features.\n",
|
||
"\n",
|
||
"```\n",
|
||
"AvgPool2d Example (same 2×2 kernel, stride=2):\n",
|
||
"Input (4×4): Output (2×2):\n",
|
||
"┌─────────────┐ ┌──────────┐\n",
|
||
"│ 1 3 │ 2 8 │ │ 3.75 5.25│\n",
|
||
"│ 5 6 │ 7 4 │ → │ 3.0 4.25│\n",
|
||
"├─────┼─────┤ └──────────┘\n",
|
||
"│ 2 9 │ 1 7 │\n",
|
||
"│ 0 1 │ 3 6 │\n",
|
||
"└─────────────┘\n",
|
||
"\n",
|
||
"Window Computations:\n",
|
||
"Top-left: (1+3+5+6)/4 = 3.75 Top-right: (2+8+7+4)/4 = 5.25\n",
|
||
"Bottom-left: (2+9+0+1)/4 = 3.0 Bottom-right: (1+7+3+6)/4 = 4.25\n",
|
||
"```\n",
|
||
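"\n",
"The two tables above can be checked with a few lines of NumPy (illustration only; the MaxPool2d and AvgPool2d classes are built later in this module):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"x = np.array([[1, 3, 2, 8],\n",
"              [5, 6, 7, 4],\n",
"              [2, 9, 1, 7],\n",
"              [0, 1, 3, 6]], dtype=float)\n",
"\n",
"# Split into non-overlapping 2x2 windows, then reduce each window.\n",
"windows = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)\n",
"print(windows.max(axis=-1))    # [[6, 8], [9, 7]]            (max pooling)\n",
"print(windows.mean(axis=-1))   # [[3.75, 5.25], [3.0, 4.25]] (average pooling)\n",
"```\n",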
"\n",
|
||
"### Why Pooling Matters for Computer Vision\n",
|
||
"\n",
|
||
"```\n",
|
||
"Memory Impact:\n",
|
||
"Input: 224×224×64 = 3.2M values After 2×2 pooling: 112×112×64 = 0.8M values\n",
|
||
"Memory reduction: 4× less! Computation reduction: 4× less!\n",
|
||
"\n",
|
||
"Information Trade-off:\n",
|
||
"✅ Preserves important features ⚠️ Loses fine spatial detail\n",
|
||
"✅ Provides translation invariance ⚠️ Reduces localization precision\n",
|
||
"✅ Reduces overfitting ⚠️ May lose small objects\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Sliding Window Pattern\n",
|
||
"\n",
|
||
"Both pooling operations follow the same sliding window pattern:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Sliding 2×2 window with stride=2:\n",
|
||
"Step 1: Step 2: Step 3: Step 4:\n",
|
||
"┌──┐ ┌──┐\n",
|
||
"│▓▓│ │▓▓│\n",
|
||
"└──┘ └──┘ ┌──┐ ┌──┐\n",
|
||
" │▓▓│ │▓▓│\n",
|
||
" └──┘ └──┘\n",
|
||
"\n",
|
||
"Non-overlapping windows → Each input pixel used exactly once\n",
|
||
"Stride=2 → Output dimensions halved in each direction\n",
|
||
"```\n",
|
||
"\n",
|
||
"The key difference: MaxPool takes max(window), AvgPool takes mean(window)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "24ac0d1f",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### MaxPool2d Implementation - Preserving Strong Features\n",
|
||
"\n",
|
||
"MaxPool2d finds the strongest activation in each spatial window, creating a compressed representation that keeps the most important information.\n",
|
||
"\n",
|
||
"#### Why Max Pooling Works for Computer Vision\n",
|
||
"\n",
|
||
"```\n",
|
||
"Edge Detection Example:\n",
|
||
"Input Window (2×2): Max Pooling Result:\n",
|
||
"┌─────┬─────┐\n",
|
||
"│ 0.1 │ 0.8 │ ← Strong edge signal\n",
|
||
"├─────┼─────┤\n",
|
||
"│ 0.2 │ 0.1 │ Output: 0.8 (preserves edge)\n",
|
||
"└─────┴─────┘\n",
|
||
"\n",
|
||
"Noise Reduction Example:\n",
|
||
"Input Window (2×2):\n",
|
||
"┌─────┬─────┐\n",
|
||
"│ 0.9 │ 0.1 │ ← Feature + noise\n",
|
||
"├─────┼─────┤\n",
|
||
"│ 0.2 │ 0.1 │ Output: 0.9 (removes noise)\n",
|
||
"└─────┴─────┘\n",
|
||
"```\n",
|
||
"\n",
|
||
"#### The Sliding Window Pattern\n",
|
||
"\n",
|
||
"```\n",
|
||
"MaxPool with 2×2 kernel, stride=2:\n",
|
||
"\n",
|
||
"Input (4×4): Output (2×2):\n",
|
||
"┌───┬───┬───┬───┐ ┌───────┬───────┐\n",
|
||
"│ a │ b │ c │ d │ │max(a,b│max(c,d│\n",
|
||
"├───┼───┼───┼───┤ → │ e,f)│ g,h)│\n",
|
||
"│ e │ f │ g │ h │ ├───────┼───────┤\n",
|
||
"├───┼───┼───┼───┤ │max(i,j│max(k,l│\n",
|
||
"│ i │ j │ k │ l │ │ m,n)│ o,p)│\n",
|
||
"├───┼───┼───┼───┤ └───────┴───────┘\n",
|
||
"│ m │ n │ o │ p │\n",
|
||
"└───┴───┴───┴───┘\n",
|
||
"\n",
|
||
"Benefits:\n",
|
||
"✓ Translation invariance (cat moved 1 pixel still detected)\n",
|
||
"✓ Computational efficiency (4× fewer values to process)\n",
|
||
"✓ Hierarchical feature building (next layer sees larger receptive field)\n",
|
||
"```\n",
|
||
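"\n",
"A tiny NumPy illustration of the translation-invariance point (hypothetical activation values, not code from this module): a strong activation can move within a pooling window without changing the pooled output.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"a = np.array([[0.0, 0.9],\n",
"              [0.0, 0.0]])   # strong activation in the top-right corner\n",
"b = np.array([[0.0, 0.0],\n",
"              [0.0, 0.9]])   # same activation shifted by one pixel\n",
"\n",
"print(a.max(), b.max())      # 0.9 0.9 -> identical max-pooled value\n",
"```\n",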
"\n",
|
||
"#### Memory and Computation Impact\n",
|
||
"\n",
|
||
"For input (1, 64, 224, 224) with 2×2 pooling:\n",
|
||
"- **Input memory**: 64 × 224 × 224 × 4 bytes = 12.8 MB\n",
|
||
"- **Output memory**: 64 × 112 × 112 × 4 bytes = 3.2 MB\n",
|
||
"- **Memory reduction**: 4× less memory needed\n",
|
||
"- **Computation**: No parameters, minimal compute cost"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "fce4d432",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "maxpool2d-class",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"\n",
|
||
"#| export\n",
|
||
"\n",
|
||
"class MaxPool2d:\n",
|
||
" \"\"\"\n",
|
||
" 2D Max Pooling layer for spatial dimension reduction.\n",
|
||
"\n",
|
||
" Applies maximum operation over spatial windows, preserving\n",
|
||
" the strongest activations while reducing computational load.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" kernel_size: Size of pooling window (int or tuple)\n",
|
||
" stride: Stride of pooling operation (default: same as kernel_size)\n",
|
||
" padding: Zero-padding added to input (default: 0)\n",
|
||
" \"\"\"\n",
|
||
"\n",
|
||
" def __init__(self, kernel_size, stride=None, padding=0):\n",
|
||
" \"\"\"\n",
|
||
" Initialize MaxPool2d layer.\n",
|
||
"\n",
|
||
" TODO: Store pooling parameters\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Convert kernel_size to tuple if needed\n",
|
||
" 2. Set stride to kernel_size if not provided (non-overlapping)\n",
|
||
" 3. Store padding parameter\n",
|
||
"\n",
|
||
" HINT: Default stride equals kernel_size for non-overlapping windows\n",
|
||
" \"\"\"\n",
|
||
" super().__init__()\n",
|
||
"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" # Handle kernel_size as int or tuple\n",
|
||
" if isinstance(kernel_size, int):\n",
|
||
" self.kernel_size = (kernel_size, kernel_size)\n",
|
||
" else:\n",
|
||
" self.kernel_size = kernel_size\n",
|
||
"\n",
|
||
" # Default stride equals kernel_size (non-overlapping)\n",
|
||
" if stride is None:\n",
|
||
" self.stride = self.kernel_size[0]\n",
|
||
" else:\n",
|
||
" self.stride = stride\n",
|
||
"\n",
|
||
" self.padding = padding\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
" def forward(self, x):\n",
|
||
" \"\"\"\n",
|
||
" Forward pass through MaxPool2d layer.\n",
|
||
"\n",
|
||
" TODO: Implement max pooling with explicit loops\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Extract input dimensions\n",
|
||
" 2. Calculate output dimensions\n",
|
||
" 3. Apply padding if needed\n",
|
||
" 4. Implement nested loops for pooling windows\n",
|
||
" 5. Find maximum value in each window\n",
|
||
"\n",
|
||
" LOOP STRUCTURE:\n",
|
||
" for batch in range(batch_size):\n",
|
||
" for channel in range(channels):\n",
|
||
" for out_h in range(out_height):\n",
|
||
" for out_w in range(out_width):\n",
|
||
" # Find max in window [in_h:in_h+k_h, in_w:in_w+k_w]\n",
|
||
" max_val = -infinity\n",
|
||
" for k_h in range(kernel_height):\n",
|
||
" for k_w in range(kernel_width):\n",
|
||
" max_val = max(max_val, input[...])\n",
|
||
"\n",
|
||
" EXAMPLE:\n",
|
||
" >>> pool = MaxPool2d(kernel_size=2, stride=2)\n",
|
||
" >>> x = Tensor(np.random.randn(1, 3, 8, 8))\n",
|
||
" >>> out = pool(x)\n",
|
||
" >>> print(out.shape) # Should be (1, 3, 4, 4)\n",
|
||
"\n",
|
||
" HINTS:\n",
|
||
" - Initialize max_val to negative infinity\n",
|
||
" - Handle stride correctly when accessing input\n",
|
||
" - No parameters to update (pooling has no weights)\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" # Input validation and shape extraction\n",
|
||
" if len(x.shape) != 4:\n",
|
||
" raise ValueError(f\"Expected 4D input (batch, channels, height, width), got {x.shape}\")\n",
|
||
"\n",
|
||
" batch_size, channels, in_height, in_width = x.shape\n",
|
||
" kernel_h, kernel_w = self.kernel_size\n",
|
||
"\n",
|
||
" # Calculate output dimensions\n",
|
||
" out_height = (in_height + 2 * self.padding - kernel_h) // self.stride + 1\n",
|
||
" out_width = (in_width + 2 * self.padding - kernel_w) // self.stride + 1\n",
|
||
"\n",
|
||
" # Apply padding if needed\n",
|
||
" if self.padding > 0:\n",
|
||
" padded_input = np.pad(x.data,\n",
|
||
" ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),\n",
|
||
" mode='constant', constant_values=-np.inf)\n",
|
||
" else:\n",
|
||
" padded_input = x.data\n",
|
||
"\n",
|
||
" # Initialize output\n",
|
||
" output = np.zeros((batch_size, channels, out_height, out_width))\n",
|
||
"\n",
|
||
" # Explicit nested loop max pooling\n",
|
||
" for b in range(batch_size):\n",
|
||
" for c in range(channels):\n",
|
||
" for out_h in range(out_height):\n",
|
||
" for out_w in range(out_width):\n",
|
||
" # Calculate input region for this output position\n",
|
||
" in_h_start = out_h * self.stride\n",
|
||
" in_w_start = out_w * self.stride\n",
|
||
"\n",
|
||
" # Find maximum in window\n",
|
||
" max_val = -np.inf\n",
|
||
" for k_h in range(kernel_h):\n",
|
||
" for k_w in range(kernel_w):\n",
|
||
" input_val = padded_input[b, c,\n",
|
||
" in_h_start + k_h,\n",
|
||
" in_w_start + k_w]\n",
|
||
" max_val = max(max_val, input_val)\n",
|
||
"\n",
|
||
" # Store result\n",
|
||
" output[b, c, out_h, out_w] = max_val\n",
|
||
"\n",
|
||
" return Tensor(output)\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
" def parameters(self):\n",
|
||
" \"\"\"Return empty list (pooling has no parameters).\"\"\"\n",
|
||
" return []\n",
|
||
"\n",
|
||
" def __call__(self, x):\n",
|
||
" \"\"\"Enable model(x) syntax.\"\"\"\n",
|
||
" return self.forward(x)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8f993dc1",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### AvgPool2d Implementation - Smoothing and Generalizing Features\n",
|
||
"\n",
|
||
"AvgPool2d computes the average of each spatial window, creating smoother features that are less sensitive to noise and exact pixel positions.\n",
|
||
"\n",
|
||
"#### MaxPool vs AvgPool: Different Philosophies\n",
|
||
"\n",
|
||
"```\n",
|
||
"Same Input Window (2×2): MaxPool Output: AvgPool Output:\n",
|
||
"┌─────┬─────┐\n",
|
"│ 0.1 │ 0.9 │                    0.9              0.4\n",
|
||
"├─────┼─────┤ (max) (mean)\n",
|
||
"│ 0.3 │ 0.3 │\n",
|
||
"└─────┴─────┘\n",
|
||
"\n",
|
||
"Interpretation:\n",
|
||
"MaxPool: \"What's the strongest feature here?\"\n",
|
||
"AvgPool: \"What's the general feature level here?\"\n",
|
||
"```\n",
|
||
"\n",
|
||
"#### When to Use Average Pooling\n",
|
||
"\n",
|
||
"```\n",
|
||
"Use Cases:\n",
|
||
"✓ Global Average Pooling (GAP) for classification\n",
|
||
"✓ When you want smoother, less noisy features\n",
|
||
"✓ When exact feature location doesn't matter\n",
|
||
"✓ In shallower networks where sharp features aren't critical\n",
|
||
"\n",
|
||
"Typical Pattern:\n",
|
||
"Feature Maps → Global Average Pool → Dense → Classification\n",
|
||
"(256×7×7) → (256×1×1) → FC → (10)\n",
|
||
" Replaces flatten+dense with parameter reduction\n",
|
||
"```\n",
|
||
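"\n",
"As a sketch, Global Average Pooling can be expressed with the AvgPool2d class implemented in the next cell by making the window cover the whole feature map (the 256×7×7 shape is just the example from the diagram above):\n",
"\n",
"```python\n",
"# Assumes the setup cell's imports (numpy as np, Tensor).\n",
"x = Tensor(np.random.randn(1, 256, 7, 7))    # feature maps from a conv stack\n",
"gap = AvgPool2d(kernel_size=7, stride=7)     # window spans the full 7x7 map\n",
"pooled = gap(x)\n",
"print(pooled.shape)                          # (1, 256, 1, 1) -> 256 features\n",
"```\n",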
"\n",
|
||
"#### Mathematical Implementation\n",
|
||
"\n",
|
||
"```\n",
|
||
"Average Pooling Computation:\n",
|
||
"Window: [a, b] Result = (a + b + c + d) / 4\n",
|
||
" [c, d]\n",
|
||
"\n",
|
||
"For efficiency, we:\n",
|
||
"1. Sum all values in window: window_sum = a + b + c + d\n",
|
||
"2. Divide by window area: result = window_sum / (kernel_h × kernel_w)\n",
|
||
"3. Store result at output position\n",
|
||
"\n",
|
||
"Memory access pattern identical to MaxPool, just different aggregation!\n",
|
||
"```\n",
|
||
"\n",
|
||
"#### Practical Considerations\n",
|
||
"\n",
|
||
"- **Memory**: Same 4× reduction as MaxPool\n",
|
||
"- **Computation**: Slightly more expensive (sum + divide vs max)\n",
|
||
"- **Features**: Smoother, more generalized than MaxPool\n",
|
||
"- **Use**: Often in final layers (Global Average Pooling) to reduce parameters"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "5514114f",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "avgpool2d-class",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"\n",
|
||
"#| export\n",
|
||
"\n",
|
||
"class AvgPool2d:\n",
|
||
" \"\"\"\n",
|
||
" 2D Average Pooling layer for spatial dimension reduction.\n",
|
||
"\n",
|
||
" Applies average operation over spatial windows, smoothing\n",
|
||
" features while reducing computational load.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" kernel_size: Size of pooling window (int or tuple)\n",
|
||
" stride: Stride of pooling operation (default: same as kernel_size)\n",
|
||
" padding: Zero-padding added to input (default: 0)\n",
|
||
" \"\"\"\n",
|
||
"\n",
|
||
" def __init__(self, kernel_size, stride=None, padding=0):\n",
|
||
" \"\"\"\n",
|
||
" Initialize AvgPool2d layer.\n",
|
||
"\n",
|
||
" TODO: Store pooling parameters (same as MaxPool2d)\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Convert kernel_size to tuple if needed\n",
|
||
" 2. Set stride to kernel_size if not provided\n",
|
||
" 3. Store padding parameter\n",
|
||
" \"\"\"\n",
|
||
" super().__init__()\n",
|
||
"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" # Handle kernel_size as int or tuple\n",
|
||
" if isinstance(kernel_size, int):\n",
|
||
" self.kernel_size = (kernel_size, kernel_size)\n",
|
||
" else:\n",
|
||
" self.kernel_size = kernel_size\n",
|
||
"\n",
|
||
" # Default stride equals kernel_size (non-overlapping)\n",
|
||
" if stride is None:\n",
|
||
" self.stride = self.kernel_size[0]\n",
|
||
" else:\n",
|
||
" self.stride = stride\n",
|
||
"\n",
|
||
" self.padding = padding\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
" def forward(self, x):\n",
|
||
" \"\"\"\n",
|
||
" Forward pass through AvgPool2d layer.\n",
|
||
"\n",
|
||
" TODO: Implement average pooling with explicit loops\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Similar structure to MaxPool2d\n",
|
||
" 2. Instead of max, compute average of window\n",
|
||
" 3. Divide sum by window area for true average\n",
|
||
"\n",
|
||
" LOOP STRUCTURE:\n",
|
||
" for batch in range(batch_size):\n",
|
||
" for channel in range(channels):\n",
|
||
" for out_h in range(out_height):\n",
|
||
" for out_w in range(out_width):\n",
|
||
" # Compute average in window\n",
|
||
" window_sum = 0\n",
|
||
" for k_h in range(kernel_height):\n",
|
||
" for k_w in range(kernel_width):\n",
|
||
" window_sum += input[...]\n",
|
||
" avg_val = window_sum / (kernel_height * kernel_width)\n",
|
||
"\n",
|
||
" HINT: Remember to divide by window area to get true average\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" # Input validation and shape extraction\n",
|
||
" if len(x.shape) != 4:\n",
|
||
" raise ValueError(f\"Expected 4D input (batch, channels, height, width), got {x.shape}\")\n",
|
||
"\n",
|
||
" batch_size, channels, in_height, in_width = x.shape\n",
|
||
" kernel_h, kernel_w = self.kernel_size\n",
|
||
"\n",
|
||
" # Calculate output dimensions\n",
|
||
" out_height = (in_height + 2 * self.padding - kernel_h) // self.stride + 1\n",
|
||
" out_width = (in_width + 2 * self.padding - kernel_w) // self.stride + 1\n",
|
||
"\n",
|
||
" # Apply padding if needed\n",
|
||
" if self.padding > 0:\n",
|
||
" padded_input = np.pad(x.data,\n",
|
||
" ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),\n",
|
||
" mode='constant', constant_values=0)\n",
|
||
" else:\n",
|
||
" padded_input = x.data\n",
|
||
"\n",
|
||
" # Initialize output\n",
|
||
" output = np.zeros((batch_size, channels, out_height, out_width))\n",
|
||
"\n",
|
||
" # Explicit nested loop average pooling\n",
|
||
" for b in range(batch_size):\n",
|
||
" for c in range(channels):\n",
|
||
" for out_h in range(out_height):\n",
|
||
" for out_w in range(out_width):\n",
|
||
" # Calculate input region for this output position\n",
|
||
" in_h_start = out_h * self.stride\n",
|
||
" in_w_start = out_w * self.stride\n",
|
||
"\n",
|
||
" # Compute sum in window\n",
|
||
" window_sum = 0.0\n",
|
||
" for k_h in range(kernel_h):\n",
|
||
" for k_w in range(kernel_w):\n",
|
||
" input_val = padded_input[b, c,\n",
|
||
" in_h_start + k_h,\n",
|
||
" in_w_start + k_w]\n",
|
||
" window_sum += input_val\n",
|
||
"\n",
|
||
" # Compute average\n",
|
||
" avg_val = window_sum / (kernel_h * kernel_w)\n",
|
||
"\n",
|
||
" # Store result\n",
|
||
" output[b, c, out_h, out_w] = avg_val\n",
|
||
"\n",
|
||
" return Tensor(output)\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
" def parameters(self):\n",
|
||
" \"\"\"Return empty list (pooling has no parameters).\"\"\"\n",
|
||
" return []\n",
|
||
"\n",
|
||
" def __call__(self, x):\n",
|
||
" \"\"\"Enable model(x) syntax.\"\"\"\n",
|
||
" return self.forward(x)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c69ed499",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: Pooling Operations\n",
|
||
"This test validates both max and average pooling implementations.\n",
|
||
"**What we're testing**: Dimension reduction, aggregation correctness\n",
|
||
"**Why it matters**: Pooling is essential for computational efficiency in CNNs\n",
|
||
"**Expected**: Correct output shapes and proper value aggregation"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "3a9e7e1a",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-pooling",
|
||
"locked": true,
|
||
"points": 10
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"\n",
|
||
"\n",
|
||
"def test_unit_pooling():\n",
|
||
" \"\"\"🔬 Test MaxPool2d and AvgPool2d implementations.\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Pooling Operations...\")\n",
|
||
"\n",
|
||
" # Test 1: MaxPool2d basic functionality\n",
|
||
" print(\" Testing MaxPool2d...\")\n",
|
||
" maxpool = MaxPool2d(kernel_size=2, stride=2)\n",
|
||
" x1 = Tensor(np.random.randn(1, 3, 8, 8))\n",
|
||
" out1 = maxpool(x1)\n",
|
||
"\n",
|
||
" expected_shape = (1, 3, 4, 4) # 8/2 = 4\n",
|
||
" assert out1.shape == expected_shape, f\"MaxPool expected {expected_shape}, got {out1.shape}\"\n",
|
||
"\n",
|
||
" # Test 2: AvgPool2d basic functionality\n",
|
||
" print(\" Testing AvgPool2d...\")\n",
|
||
" avgpool = AvgPool2d(kernel_size=2, stride=2)\n",
|
||
" x2 = Tensor(np.random.randn(2, 16, 16, 16))\n",
|
||
" out2 = avgpool(x2)\n",
|
||
"\n",
|
||
" expected_shape = (2, 16, 8, 8) # 16/2 = 8\n",
|
||
" assert out2.shape == expected_shape, f\"AvgPool expected {expected_shape}, got {out2.shape}\"\n",
|
||
"\n",
|
||
" # Test 3: MaxPool vs AvgPool on known data\n",
|
||
" print(\" Testing max vs avg behavior...\")\n",
|
||
" # Create simple test case with known values\n",
|
||
" test_data = np.array([[[[1, 2, 3, 4],\n",
|
||
" [5, 6, 7, 8],\n",
|
||
" [9, 10, 11, 12],\n",
|
||
" [13, 14, 15, 16]]]], dtype=np.float32)\n",
|
||
" x3 = Tensor(test_data)\n",
|
||
"\n",
|
||
" maxpool_test = MaxPool2d(kernel_size=2, stride=2)\n",
|
||
" avgpool_test = AvgPool2d(kernel_size=2, stride=2)\n",
|
||
"\n",
|
||
" max_out = maxpool_test(x3)\n",
|
||
" avg_out = avgpool_test(x3)\n",
|
||
"\n",
|
||
" # For 2x2 windows:\n",
|
||
" # Top-left: max([1,2,5,6]) = 6, avg = 3.5\n",
|
||
" # Top-right: max([3,4,7,8]) = 8, avg = 5.5\n",
|
||
" # Bottom-left: max([9,10,13,14]) = 14, avg = 11.5\n",
|
||
" # Bottom-right: max([11,12,15,16]) = 16, avg = 13.5\n",
|
||
"\n",
|
||
" expected_max = np.array([[[[6, 8], [14, 16]]]])\n",
|
||
" expected_avg = np.array([[[[3.5, 5.5], [11.5, 13.5]]]])\n",
|
||
"\n",
|
||
" assert np.allclose(max_out.data, expected_max), f\"MaxPool values incorrect: {max_out.data} vs {expected_max}\"\n",
|
||
" assert np.allclose(avg_out.data, expected_avg), f\"AvgPool values incorrect: {avg_out.data} vs {expected_avg}\"\n",
|
||
"\n",
|
||
" # Test 4: Overlapping pooling (stride < kernel_size)\n",
|
||
" print(\" Testing overlapping pooling...\")\n",
|
||
" overlap_pool = MaxPool2d(kernel_size=3, stride=1)\n",
|
||
" x4 = Tensor(np.random.randn(1, 1, 5, 5))\n",
|
||
" out4 = overlap_pool(x4)\n",
|
||
"\n",
|
||
" # Output: (5-3)/1 + 1 = 3\n",
|
||
" expected_shape = (1, 1, 3, 3)\n",
|
||
" assert out4.shape == expected_shape, f\"Overlapping pool expected {expected_shape}, got {out4.shape}\"\n",
|
||
"\n",
|
||
" # Test 5: No parameters in pooling layers\n",
|
||
" print(\" Testing parameter counts...\")\n",
|
||
" assert len(maxpool.parameters()) == 0, \"MaxPool should have no parameters\"\n",
|
||
" assert len(avgpool.parameters()) == 0, \"AvgPool should have no parameters\"\n",
|
||
"\n",
|
||
" print(\"✅ Pooling operations work correctly!\")\n",
|
||
"\n",
|
||
"if __name__ == \"__main__\":\n",
|
||
" test_unit_pooling()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "32650529",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 5. Systems Analysis - Understanding Spatial Operation Performance\n",
|
||
"\n",
|
||
"Now let's analyze the computational complexity and memory trade-offs of spatial operations. This analysis reveals why certain design choices matter for real-world performance.\n",
|
||
"\n",
|
||
"### Key Questions We'll Answer:\n",
|
||
"1. How does convolution complexity scale with input size and kernel size?\n",
|
||
"2. What's the memory vs computation trade-off in different approaches?\n",
|
||
"3. How do modern optimizations (like im2col) change the performance characteristics?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "c534d20c",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "spatial-analysis",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
"import time\n",
|
||
"\n",
|
||
"def analyze_convolution_complexity():\n",
|
||
" \"\"\"📊 Analyze convolution computational complexity across different configurations.\"\"\"\n",
|
||
" print(\"📊 Analyzing Convolution Complexity...\")\n",
|
||
"\n",
|
||
" # Test configurations optimized for educational demonstration (smaller sizes)\n",
|
||
" configs = [\n",
|
||
" {\"input\": (1, 3, 16, 16), \"conv\": (8, 3, 3), \"name\": \"Small (16×16)\"},\n",
|
||
" {\"input\": (1, 3, 24, 24), \"conv\": (12, 3, 3), \"name\": \"Medium (24×24)\"},\n",
|
||
" {\"input\": (1, 3, 32, 32), \"conv\": (16, 3, 3), \"name\": \"Large (32×32)\"},\n",
|
||
" {\"input\": (1, 3, 16, 16), \"conv\": (8, 3, 5), \"name\": \"Large Kernel (5×5)\"},\n",
|
||
" ]\n",
|
||
"\n",
|
||
" print(f\"{'Configuration':<20} {'FLOPs':<15} {'Memory (MB)':<12} {'Time (ms)':<10}\")\n",
|
||
" print(\"-\" * 70)\n",
|
||
"\n",
|
||
" for config in configs:\n",
|
||
" # Create convolution layer\n",
|
||
" in_ch = config[\"input\"][1]\n",
|
" out_ch, k_size = config[\"conv\"][0], config[\"conv\"][2]  # kernel size is the tuple's last entry\n",
|
||
" conv = Conv2d(in_ch, out_ch, kernel_size=k_size, padding=k_size//2)\n",
|
||
"\n",
|
||
" # Create input tensor\n",
|
||
" x = Tensor(np.random.randn(*config[\"input\"]))\n",
|
||
"\n",
|
||
" # Calculate theoretical FLOPs\n",
|
||
" batch, in_channels, h, w = config[\"input\"]\n",
|
" out_channels, kernel_size = config[\"conv\"][0], config[\"conv\"][2]\n",
|
||
"\n",
|
||
" # Each output element requires in_channels * kernel_size² multiply-adds\n",
|
||
" flops_per_output = in_channels * kernel_size * kernel_size * 2 # 2 for MAC\n",
|
||
" total_outputs = batch * out_channels * h * w # Assuming same size with padding\n",
|
||
" total_flops = flops_per_output * total_outputs\n",
|
||
"\n",
|
||
" # Measure memory usage\n",
|
||
" input_memory = np.prod(config[\"input\"]) * 4 # float32 = 4 bytes\n",
|
||
" weight_memory = out_channels * in_channels * kernel_size * kernel_size * 4\n",
|
||
" output_memory = batch * out_channels * h * w * 4\n",
|
||
" total_memory = (input_memory + weight_memory + output_memory) / (1024 * 1024) # MB\n",
|
||
"\n",
|
||
" # Measure execution time\n",
|
||
" start_time = time.time()\n",
|
||
" _ = conv(x)\n",
|
||
" end_time = time.time()\n",
|
||
" exec_time = (end_time - start_time) * 1000 # ms\n",
|
||
"\n",
|
||
" print(f\"{config['name']:<20} {total_flops:<15,} {total_memory:<12.2f} {exec_time:<10.2f}\")\n",
|
||
"\n",
|
||
" print(\"\\n💡 Key Insights:\")\n",
|
||
" print(\"🔸 FLOPs scale as O(H×W×C_in×C_out×K²) - quadratic in spatial and kernel size\")\n",
|
||
" print(\"🔸 Memory scales linearly with spatial dimensions and channels\")\n",
|
||
" print(\"🔸 Large kernels dramatically increase computational cost\")\n",
|
||
" print(\"🚀 This motivates depthwise separable convolutions and attention mechanisms\")\n",
|
||
"\n",
|
||
"# Analysis will be called in main execution"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "acccb231",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "pooling-analysis",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"\n",
|
||
"\n",
|
||
"def analyze_pooling_effects():\n",
|
||
" \"\"\"📊 Analyze pooling's impact on spatial dimensions and features.\"\"\"\n",
|
||
" print(\"\\n📊 Analyzing Pooling Effects...\")\n",
|
||
"\n",
|
||
" # Create sample input with spatial structure\n",
|
||
" # Simple edge pattern that pooling should preserve differently\n",
|
||
" pattern = np.zeros((1, 1, 8, 8))\n",
|
||
" pattern[0, 0, :, 3:5] = 1.0 # Vertical edge\n",
|
||
" pattern[0, 0, 3:5, :] = 1.0 # Horizontal edge\n",
|
||
" x = Tensor(pattern)\n",
|
||
"\n",
|
||
" print(\"Original 8×8 pattern:\")\n",
|
||
" print(x.data[0, 0])\n",
|
||
"\n",
|
||
" # Test different pooling strategies\n",
|
||
" pools = [\n",
|
||
" (MaxPool2d(2, stride=2), \"MaxPool 2×2\"),\n",
|
||
" (AvgPool2d(2, stride=2), \"AvgPool 2×2\"),\n",
|
||
" (MaxPool2d(4, stride=4), \"MaxPool 4×4\"),\n",
|
||
" (AvgPool2d(4, stride=4), \"AvgPool 4×4\"),\n",
|
||
" ]\n",
|
||
"\n",
|
||
" print(f\"\\n{'Operation':<15} {'Output Shape':<15} {'Feature Preservation'}\")\n",
|
||
" print(\"-\" * 60)\n",
|
||
"\n",
|
||
" for pool_op, name in pools:\n",
|
||
" result = pool_op(x)\n",
|
||
" # Measure how much of the original pattern is preserved\n",
|
||
" preservation = np.sum(result.data > 0.1) / np.prod(result.shape)\n",
|
||
" print(f\"{name:<15} {str(result.shape):<15} {preservation:<.2%}\")\n",
|
||
"\n",
|
||
" print(f\" Output:\")\n",
|
||
" print(f\" {result.data[0, 0]}\")\n",
|
||
" print()\n",
|
||
"\n",
|
||
" print(\"💡 Key Insights:\")\n",
|
||
" print(\"🔸 MaxPool preserves sharp features better (edge detection)\")\n",
|
||
" print(\"🔸 AvgPool smooths features (noise reduction)\")\n",
|
||
" print(\"🔸 Larger pooling windows lose more spatial detail\")\n",
|
||
" print(\"🚀 Choice depends on task: classification vs detection vs segmentation\")\n",
|
||
"\n",
|
||
"# Analysis will be called in main execution"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "62685a86",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 6. Integration - Building a Complete CNN\n",
|
||
"\n",
|
||
"Now let's combine convolution and pooling into a complete CNN architecture. You'll see how spatial operations work together to transform raw pixels into meaningful features.\n",
|
||
"\n",
|
||
"### CNN Architecture: From Pixels to Predictions\n",
|
||
"\n",
|
||
"A CNN processes images through alternating convolution and pooling layers, gradually extracting higher-level features:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Complete CNN Pipeline:\n",
|
||
"\n",
|
||
"Input Image (32×32×3) Raw RGB pixels\n",
|
||
" ↓\n",
|
||
"Conv2d(3→16, 3×3) Detect edges, textures\n",
|
||
" ↓\n",
|
||
"ReLU Activation Remove negative values\n",
|
||
" ↓\n",
|
||
"MaxPool(2×2) Reduce to (16×16×16)\n",
|
||
" ↓\n",
|
||
"Conv2d(16→32, 3×3) Detect shapes, patterns\n",
|
||
" ↓\n",
|
||
"ReLU Activation Remove negative values\n",
|
||
" ↓\n",
|
||
"MaxPool(2×2) Reduce to (8×8×32)\n",
|
||
" ↓\n",
|
||
"Flatten Reshape to vector (2048,)\n",
|
||
" ↓\n",
|
||
"Linear(2048→10) Final classification\n",
|
||
" ↓\n",
|
||
"Softmax Probability distribution\n",
|
||
"```\n",
|
||
"\n",
|
||
"### The Parameter Efficiency Story\n",
|
||
"\n",
|
||
"```\n",
|
||
"CNN vs Dense Network Comparison:\n",
|
||
"\n",
|
||
"CNN Approach: Dense Approach:\n",
|
||
"┌─────────────────┐ ┌─────────────────┐\n",
|
||
"│ Conv1: 3→16 │ │ Input: 32×32×3 │\n",
|
||
"│ Params: 448 │ │ = 3,072 values │\n",
|
||
"├─────────────────┤ ├─────────────────┤\n",
|
||
"│ Conv2: 16→32 │ │ Hidden: 1,000 │\n",
|
||
"│ Params: 4,640 │ │ Params: 3M+ │\n",
|
||
"├─────────────────┤ ├─────────────────┤\n",
|
||
"│ Linear: 2048→10 │ │ Output: 10 │\n",
|
||
"│ Params: 20,490 │ │ Params: 10K │\n",
|
||
"└─────────────────┘ └─────────────────┘\n",
|
||
"Total: ~25K params Total: ~3M params\n",
|
||
"\n",
|
||
"CNN wins with 120× fewer parameters!\n",
|
||
"```\n",
|
||
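"\n",
"The counts in this comparison are easy to recompute by hand (a rough sketch; the dense hidden layer size of 1,000 follows the diagram above):\n",
"\n",
"```python\n",
"# CNN side: weights + biases per layer\n",
"conv1  = 16 * 3 * 3 * 3 + 16        # 448\n",
"conv2  = 32 * 16 * 3 * 3 + 32       # 4,640\n",
"linear = 2048 * 10 + 10             # 20,490\n",
"cnn_total = conv1 + conv2 + linear  # 25,578\n",
"\n",
"# Dense side\n",
"dense_total = (3072 * 1000 + 1000) + (1000 * 10 + 10)           # ~3.08M\n",
"\n",
"print(cnn_total, dense_total, round(dense_total / cnn_total))   # ~120x fewer\n",
"```\n",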
"\n",
|
||
"### Spatial Hierarchy: Why This Architecture Works\n",
|
||
"\n",
|
||
"```\n",
|
||
"Layer-by-Layer Feature Evolution:\n",
|
||
"\n",
|
||
"Layer 1 (Conv 3→16): Layer 2 (Conv 16→32):\n",
|
||
"┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐\n",
|
||
"│Edge │ │Edge │ │Edge │ │Shape│ │Corner│ │Texture│\n",
|
||
"│ \\\\ /│ │ | │ │ / \\\\│ │ ◇ │ │ L │ │ ≈≈≈ │\n",
|
||
"└─────┘ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘\n",
|
||
"Simple features Complex combinations\n",
|
||
"\n",
|
||
"Why pooling between layers:\n",
|
||
"✓ Reduces computation for next layer\n",
|
||
"✓ Increases receptive field (each conv sees larger input area)\n",
|
||
"✓ Provides translation invariance (cat moved 1 pixel still detected)\n",
|
||
"```\n",
|
||
"\n",
|
||
"This hierarchical approach mirrors human vision: we first detect edges, then shapes, then objects!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a13a91ca",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### SimpleCNN Implementation - Putting It All Together\n",
|
||
"\n",
|
||
"Now we'll build a complete CNN that demonstrates how convolution and pooling work together. This is your first step from processing individual tensors to understanding complete images!\n",
|
||
"\n",
|
||
"#### The CNN Architecture Pattern\n",
|
||
"\n",
|
||
"```\n",
|
||
"SimpleCNN Architecture Visualization:\n",
|
||
"\n",
|
||
"Input: (batch, 3, 32, 32) ← RGB images (CIFAR-10 size)\n",
|
||
" ↓\n",
|
||
"┌─────────────────────────┐\n",
|
||
"│ Conv2d(3→16, 3×3, p=1) │ ← Detect edges, textures\n",
|
||
"│ ReLU() │ ← Remove negative values\n",
|
||
"│ MaxPool(2×2) │ ← Reduce to (batch, 16, 16, 16)\n",
|
||
"└─────────────────────────┘\n",
|
||
" ↓\n",
|
||
"┌─────────────────────────┐\n",
|
||
"│ Conv2d(16→32, 3×3, p=1) │ ← Detect shapes, patterns\n",
|
||
"│ ReLU() │ ← Remove negative values\n",
|
||
"│ MaxPool(2×2) │ ← Reduce to (batch, 32, 8, 8)\n",
|
||
"└─────────────────────────┘\n",
|
||
" ↓\n",
|
||
"┌─────────────────────────┐\n",
|
||
"│ Flatten() │ ← Reshape to (batch, 2048)\n",
|
||
"│ Linear(2048→10) │ ← Final classification\n",
|
||
"└─────────────────────────┘\n",
|
||
" ↓\n",
|
||
"Output: (batch, 10) ← Class probabilities\n",
|
||
"```\n",
|
||
"\n",
|
"#### Why This Architecture Works\n",
"\n",
"```\n",
"Feature Hierarchy Development:\n",
"\n",
"Layer 1 Features (3→16):        Layer 2 Features (16→32):\n",
"┌─────┬─────┬─────┬─────┐       ┌─────┬──────┬────┬─────┐\n",
"│Edge │Edge │Edge │Blob │       │Shape│Corner│Tex-│Pat- │\n",
"│ \\   │  |  │  /  │  ○  │       │  ◇  │  L   │ture│tern │\n",
"└─────┴─────┴─────┴─────┘       └─────┴──────┴────┴─────┘\n",
"Simple features                 Complex combinations\n",
"\n",
"Spatial Dimension Reduction:\n",
"32×32 → 16×16 → 8×8\n",
" 1024     256    64  (per channel)\n",
"\n",
"Channel Expansion:\n",
"3 → 16 → 32\n",
"More feature types at each level\n",
"```\n",
"\n",
"#### Parameter Efficiency Demonstration\n",
"\n",
"```\n",
"CNN vs Dense Comparison for 32×32×3 → 10 classes:\n",
"\n",
"CNN Approach:               Dense Approach:\n",
"┌────────────────────┐      ┌────────────────────┐\n",
"│ Conv1: 3→16, 3×3   │      │ Input: 3072 values │\n",
"│ Params: 448        │      │        ↓           │\n",
"├────────────────────┤      │ Dense: 3072→512    │\n",
"│ Conv2: 16→32, 3×3  │      │ Params: 1.57M      │\n",
"│ Params: 4,640      │      ├────────────────────┤\n",
"├────────────────────┤      │ Dense: 512→10      │\n",
"│ Dense: 2048→10     │      │ Params: 5,130      │\n",
"│ Params: 20,490     │      └────────────────────┘\n",
"└────────────────────┘      Total: 1.58M params\n",
"Total: 25,578 params\n",
"\n",
"CNN has 62× fewer parameters while preserving spatial structure!\n",
"```\n",
"\n",
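"The totals above can be reproduced with the same kind of quick parameter arithmetic used earlier (again just a sketch in plain Python; the layer shapes are the ones assumed in this comparison):\n",
"\n",
"```python\n",
"conv = lambda cin, cout, k: cout * cin * k * k + cout   # weights + biases\n",
"dense = lambda fin, fout: fin * fout + fout             # weights + biases\n",
"\n",
"cnn_total = conv(3, 16, 3) + conv(16, 32, 3) + dense(2048, 10)\n",
"mlp_total = dense(32 * 32 * 3, 512) + dense(512, 10)\n",
"print(cnn_total)                 # 25,578\n",
"print(mlp_total)                 # 1,578,506  (~1.58M)\n",
"print(mlp_total // cnn_total)    # ~61, i.e. roughly 62× fewer parameters\n",
"```\n",
"\n",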
"#### Receptive Field Growth\n",
"\n",
"```\n",
"How each layer sees progressively larger input regions:\n",
"\n",
"Layer 1 Conv (3×3):            Layer 2 Conv (3×3):\n",
"Each output pixel sees         Each output pixel sees\n",
"3×3 = 9 input pixels           8×8 = 64 input pixels\n",
"                               (due to the pooling + conv in between)\n",
"\n",
"Final Result: Layer 2 can detect complex patterns\n",
"spanning 8×8 regions of the original image!\n",
"```\n",
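"\n",
"If you want to verify that number, the receptive field can be computed layer by layer with the usual recurrence (a small sketch in plain Python, not part of the module's exports; the kernel/stride values are SimpleCNN's):\n",
"\n",
"```python\n",
"layers = [(3, 1), (2, 2), (3, 1)]   # (kernel, stride) for conv1, pool1, conv2\n",
"rf, jump = 1, 1                     # receptive field and input-pixel spacing\n",
"for kernel, stride in layers:\n",
"    rf += (kernel - 1) * jump       # each layer widens the field by (k-1)*jump\n",
"    jump *= stride                  # striding spreads later pixels further apart\n",
"print(rf)  # 8  → an 8×8 patch of the input feeds each conv2 output pixel\n",
"```"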
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aada7027",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "simple-cnn",
"solution": true
}
},
"outputs": [],
"source": [
"\n",
"#| export\n",
"\n",
"class SimpleCNN:\n",
"    \"\"\"\n",
"    Simple CNN demonstrating spatial operations integration.\n",
"\n",
"    Architecture:\n",
"    - Conv2d(3→16, 3×3) + ReLU + MaxPool(2×2)\n",
"    - Conv2d(16→32, 3×3) + ReLU + MaxPool(2×2)\n",
"    - Flatten + Linear(features→num_classes)\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, num_classes=10):\n",
"        \"\"\"\n",
"        Initialize SimpleCNN.\n",
"\n",
"        TODO: Build CNN architecture with spatial and dense layers\n",
"\n",
"        APPROACH:\n",
"        1. Conv layer 1: 3 → 16 channels, 3×3 kernel, padding=1\n",
"        2. Pool layer 1: 2×2 max pooling\n",
"        3. Conv layer 2: 16 → 32 channels, 3×3 kernel, padding=1\n",
"        4. Pool layer 2: 2×2 max pooling\n",
"        5. Calculate flattened size and add final linear layer\n",
"\n",
"        HINT: For 32×32 input → 32→16→8 spatial reduction\n",
"              Final feature size: 32 channels × 8×8 = 2048 features\n",
"        \"\"\"\n",
"        super().__init__()\n",
"\n",
"        ### BEGIN SOLUTION\n",
"        # Convolutional layers\n",
"        self.conv1 = Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)\n",
"        self.pool1 = MaxPool2d(kernel_size=2, stride=2)\n",
"\n",
"        self.conv2 = Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)\n",
"        self.pool2 = MaxPool2d(kernel_size=2, stride=2)\n",
"\n",
"        # Calculate flattened size:\n",
"        # Input: 32×32 → Pool1: 16×16 → Pool2: 8×8\n",
"        # Final: 32 channels × 8×8 = 2048 features\n",
"        self.flattened_size = 32 * 8 * 8\n",
"\n",
"        # The final Linear(flattened_size → num_classes) classifier will be\n",
"        # added once the Linear layer from the layers module is wired in;\n",
"        # for now we just record the sizes it will need.\n",
"        self.num_classes = num_classes\n",
"        ### END SOLUTION\n",
"\n",
"    def forward(self, x):\n",
"        \"\"\"\n",
"        Forward pass through SimpleCNN.\n",
"\n",
"        TODO: Implement CNN forward pass\n",
"\n",
"        APPROACH:\n",
"        1. Apply conv1 → ReLU → pool1\n",
"        2. Apply conv2 → ReLU → pool2\n",
"        3. Flatten spatial dimensions\n",
"        4. Apply final linear layer (when available)\n",
"\n",
"        For now, return features before final linear layer\n",
"        since we haven't imported Linear from layers module yet.\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # First conv block\n",
"        x = self.conv1(x)\n",
"        x = self.relu(x)  # ReLU activation\n",
"        x = self.pool1(x)\n",
"\n",
"        # Second conv block\n",
"        x = self.conv2(x)\n",
"        x = self.relu(x)  # ReLU activation\n",
"        x = self.pool2(x)\n",
"\n",
"        # Flatten for classification (reshape to 2D)\n",
"        batch_size = x.shape[0]\n",
"        x_flat = x.data.reshape(batch_size, -1)\n",
"\n",
"        # Return flattened features\n",
"        # In a complete implementation, this would go through a Linear layer\n",
"        return Tensor(x_flat)\n",
"        ### END SOLUTION\n",
"\n",
"    def relu(self, x):\n",
"        \"\"\"Simple ReLU implementation for CNN.\"\"\"\n",
"        return Tensor(np.maximum(0, x.data))\n",
"\n",
"    def parameters(self):\n",
"        \"\"\"Return all trainable parameters.\"\"\"\n",
"        params = []\n",
"        params.extend(self.conv1.parameters())\n",
"        params.extend(self.conv2.parameters())\n",
"        # Linear layer parameters would be added here\n",
"        return params\n",
"\n",
"    def __call__(self, x):\n",
"        \"\"\"Enable model(x) syntax.\"\"\"\n",
"        return self.forward(x)"
]
},
{
"cell_type": "markdown",
"id": "d75c9ea6",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🧪 Unit Test: SimpleCNN Integration\n",
"This test validates that spatial operations work together in a complete CNN architecture.\n",
"**What we're testing**: End-to-end spatial processing pipeline\n",
"**Why it matters**: Spatial operations must compose correctly for real CNNs\n",
"**Expected**: Proper dimension reduction and feature extraction"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f466cde",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-simple-cnn",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"\n",
"\n",
"def test_unit_simple_cnn():\n",
"    \"\"\"🔬 Test SimpleCNN integration with spatial operations.\"\"\"\n",
"    print(\"🔬 Unit Test: SimpleCNN Integration...\")\n",
"\n",
"    # Test 1: Forward pass with CIFAR-10 sized input\n",
"    print(\"  Testing forward pass...\")\n",
"    model = SimpleCNN(num_classes=10)\n",
"    x = Tensor(np.random.randn(2, 3, 32, 32))  # Batch of 2, RGB, 32×32\n",
"\n",
"    features = model(x)\n",
"\n",
"    # Expected: 2 samples, 32 channels × 8×8 spatial = 2048 features\n",
"    expected_shape = (2, 2048)\n",
"    assert features.shape == expected_shape, f\"Expected {expected_shape}, got {features.shape}\"\n",
"\n",
"    # Test 2: Parameter counting\n",
"    print(\"  Testing parameter counting...\")\n",
"    params = model.parameters()\n",
"\n",
"    # Conv1: (16, 3, 3, 3) + bias (16,) = 432 + 16 = 448\n",
"    # Conv2: (32, 16, 3, 3) + bias (32,) = 4608 + 32 = 4640\n",
"    # Total: 448 + 4640 = 5088 parameters\n",
"\n",
"    conv1_params = 16 * 3 * 3 * 3 + 16   # weights + bias\n",
"    conv2_params = 32 * 16 * 3 * 3 + 32  # weights + bias\n",
"    expected_total = conv1_params + conv2_params\n",
"\n",
"    actual_total = sum(np.prod(p.shape) for p in params)\n",
"    assert actual_total == expected_total, f\"Expected {expected_total} parameters, got {actual_total}\"\n",
"\n",
"    # Test 3: Different input sizes\n",
"    print(\"  Testing different input sizes...\")\n",
"\n",
"    # Test with different spatial dimensions\n",
"    x_small = Tensor(np.random.randn(1, 3, 16, 16))\n",
"    features_small = model(x_small)\n",
"\n",
"    # 16×16 → 8×8 → 4×4, so 32 × 4×4 = 512 features\n",
"    expected_small = (1, 512)\n",
"    assert features_small.shape == expected_small, f\"Expected {expected_small}, got {features_small.shape}\"\n",
"\n",
"    # Test 4: Batch processing\n",
"    print(\"  Testing batch processing...\")\n",
"    x_batch = Tensor(np.random.randn(8, 3, 32, 32))\n",
"    features_batch = model(x_batch)\n",
"\n",
"    expected_batch = (8, 2048)\n",
"    assert features_batch.shape == expected_batch, f\"Expected {expected_batch}, got {features_batch.shape}\"\n",
"\n",
"    print(\"✅ SimpleCNN integration works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_simple_cnn()"
]
},
{
"cell_type": "markdown",
"id": "0ce293e3",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 7. Module Integration Test\n",
"\n",
"Final validation that everything works together correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d373eecf",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": true,
"grade_id": "module-integration",
"locked": true,
"points": 15
}
},
"outputs": [],
"source": [
"\n",
"\n",
"def test_module():\n",
"    \"\"\"\n",
"    Comprehensive test of entire spatial module functionality.\n",
"\n",
"    This final test runs before module summary to ensure:\n",
"    - All unit tests pass\n",
"    - Functions work together correctly\n",
"    - Module is ready for integration with TinyTorch\n",
"    \"\"\"\n",
"    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
"    print(\"=\" * 50)\n",
"\n",
"    # Run all unit tests\n",
"    print(\"Running unit tests...\")\n",
"    test_unit_conv2d()\n",
"    test_unit_pooling()\n",
"    test_unit_simple_cnn()\n",
"\n",
"    print(\"\\nRunning integration scenarios...\")\n",
"\n",
"    # Test realistic CNN workflow\n",
"    print(\"🔬 Integration Test: Complete CNN pipeline...\")\n",
"\n",
"    # Create a mini CNN for CIFAR-10\n",
"    conv1 = Conv2d(3, 8, kernel_size=3, padding=1)\n",
"    pool1 = MaxPool2d(2, stride=2)\n",
"    conv2 = Conv2d(8, 16, kernel_size=3, padding=1)\n",
"    pool2 = AvgPool2d(2, stride=2)\n",
"\n",
"    # Process batch of images\n",
"    batch_images = Tensor(np.random.randn(4, 3, 32, 32))\n",
"\n",
"    # Forward pass through spatial layers\n",
"    x = conv1(batch_images)   # (4, 8, 32, 32)\n",
"    x = pool1(x)              # (4, 8, 16, 16)\n",
"    x = conv2(x)              # (4, 16, 16, 16)\n",
"    features = pool2(x)       # (4, 16, 8, 8)\n",
"\n",
"    # Validate shapes at each step\n",
"    assert x.shape[0] == 4, f\"Batch size should be preserved, got {x.shape[0]}\"\n",
"    assert features.shape == (4, 16, 8, 8), f\"Final features shape incorrect: {features.shape}\"\n",
"\n",
"    # Test parameter collection across all layers\n",
"    all_params = []\n",
"    all_params.extend(conv1.parameters())\n",
"    all_params.extend(conv2.parameters())\n",
"    # Pooling has no parameters\n",
"    assert len(pool1.parameters()) == 0\n",
"    assert len(pool2.parameters()) == 0\n",
"\n",
"    # Verify we have the right number of parameter tensors\n",
"    assert len(all_params) == 4, f\"Expected 4 parameter tensors (2 conv × 2 each), got {len(all_params)}\"\n",
"\n",
"    print(\"✅ Complete CNN pipeline works!\")\n",
"\n",
"    # Test memory efficiency comparison\n",
"    print(\"🔬 Integration Test: Memory efficiency analysis...\")\n",
"\n",
"    # Compare different pooling strategies (reduced size for faster execution)\n",
"    input_data = Tensor(np.random.randn(1, 16, 32, 32))\n",
"\n",
"    # No pooling: maintain spatial size\n",
"    conv_only = Conv2d(16, 32, kernel_size=3, padding=1)\n",
"    no_pool_out = conv_only(input_data)\n",
"    no_pool_size = np.prod(no_pool_out.shape) * 4  # float32 bytes\n",
"\n",
"    # With pooling: reduce spatial size\n",
"    conv_with_pool = Conv2d(16, 32, kernel_size=3, padding=1)\n",
"    pool = MaxPool2d(2, stride=2)\n",
"    pool_out = pool(conv_with_pool(input_data))\n",
"    pool_size = np.prod(pool_out.shape) * 4  # float32 bytes\n",
"\n",
"    memory_reduction = no_pool_size / pool_size\n",
"    assert memory_reduction == 4.0, f\"2×2 pooling should give 4× memory reduction, got {memory_reduction:.1f}×\"\n",
"\n",
"    print(f\"  Memory reduction with pooling: {memory_reduction:.1f}×\")\n",
"    print(\"✅ Memory efficiency analysis complete!\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 50)\n",
"    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
"    print(\"Run: tito module complete 09\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "102d7cd4",
"metadata": {
"lines_to_next_cell": 2,
"nbgrader": {
"grade": false,
"grade_id": "main-execution",
"solution": true
}
},
"outputs": [],
"source": [
"# Run comprehensive module test\n",
"if __name__ == \"__main__\":\n",
"    test_module()"
]
},
{
"cell_type": "markdown",
"id": "9c435d5e",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🎯 MODULE SUMMARY: Spatial Operations\n",
"\n",
"Congratulations! You've built the spatial processing foundation that powers computer vision!\n",
"\n",
"### Key Accomplishments\n",
"- Built Conv2d with explicit loops showing O(N²M²K²) complexity ✅\n",
"- Implemented MaxPool2d and AvgPool2d for spatial dimension reduction ✅\n",
"- Created SimpleCNN demonstrating spatial operation integration ✅\n",
"- Analyzed computational complexity and memory trade-offs in spatial processing ✅\n",
"- All tests pass, including complete CNN pipeline validation ✅\n",
"\n",
"### Systems Insights Discovered\n",
"- **Convolution Complexity**: Cost scales quadratically with spatial size, and kernel size multiplies it further\n",
"- **Memory Patterns**: 2×2 pooling gives a 4× reduction in activation memory while preserving the important features\n",
"- **Architecture Design**: Strategic spatial reduction enables parameter-efficient feature extraction\n",
"- **Cache Performance**: Convolution's spatial locality rewards contiguous, cache-friendly memory access\n",
"\n",
"### Ready for Next Steps\n",
"Your spatial operations enable building complete CNNs for computer vision tasks!\n",
"Export with: `tito module complete 09`\n",
"\n",
"**Next**: Milestone 03 will combine your spatial operations with the training pipeline to build a CNN for CIFAR-10!\n",
"\n",
"Your implementation shows why:\n",
"- Modern CNNs use small kernels (3×3) instead of large ones (computational efficiency)\n",
"- Pooling layers are crucial for managing memory in deep networks (4× reduction per layer)\n",
"- Explicit loops reveal the true computational cost hidden by optimized implementations\n",
"- Spatial operations unlock computer vision: from MLPs processing vectors to CNNs understanding images!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}