| title | description | difficulty | time_estimate | prerequisites | next_steps | learning_objectives |
|---|---|---|---|---|---|---|
| Convolutional Networks | Build CNNs from scratch for computer vision and spatial pattern recognition | 3 | 6-8 hours | | | |
09. Convolutional Networks
🏛️ ARCHITECTURE TIER | Difficulty: ⭐⭐⭐ (3/4) | Time: 6-8 hours
Overview
Implement convolutional neural networks (CNNs) from scratch. This module teaches you how convolution transforms computer vision from hand-crafted features to learned hierarchical representations that power everything from image classification to autonomous driving.
Learning Objectives
By completing this module, you will be able to:
- Implement convolution as sliding window operations with explicit loops, understanding weight sharing and local connectivity
- Design CNN architectures by composing convolutional, pooling, and dense layers for image classification
- Explain translation invariance and why CNNs outperform dense networks on spatial data
- Build pooling operations (MaxPool, AvgPool) for spatial downsampling and feature invariance
- Apply computer vision principles to achieve >75% accuracy on CIFAR-10 image classification
Why This Matters
Production Context
CNNs are the backbone of modern computer vision systems:
- Meta's Vision AI uses CNN architectures to tag 2 billion photos daily across Facebook and Instagram
- Tesla Autopilot processes camera feeds through CNN backbones for object detection and lane recognition
- Google Photos built a CNN-based system that automatically organizes billions of images
- Medical Imaging systems use CNNs to detect cancer in X-rays and MRIs with superhuman accuracy
Historical Context
The convolution revolution transformed computer vision:
- LeNet (1998): Yann LeCun's CNN read zip codes on mail; convolution proved viable but limited by compute
- AlexNet (2012): Won ImageNet with 15.3% top-5 error (vs 26.2% for the runner-up); GPUs + convolution = computer vision revolution
- ResNet (2015): 152-layer CNN achieved 3.57% top-5 error (better than the ~5% human baseline); proved depth matters
- Modern Era (2020+): CNNs power production vision systems processing trillions of images daily
The patterns you're implementing revolutionized how machines see.
Pedagogical Pattern: Build → Use → Analyze
1. Build
Implement from first principles:
- Convolution as explicit sliding window operation
- Conv2D layer with learnable filters and weight sharing
- MaxPool2D and AvgPool2D for spatial downsampling
- Flatten layer to connect spatial and dense layers
- Complete CNN architecture with feature extraction and classification
2. Use
Apply to real problems:
- Build CNN for CIFAR-10 image classification
- Extract and visualize learned feature maps
- Compare CNN vs MLP performance on spatial data
- Achieve >75% accuracy with proper architecture
- Understand impact of kernel size, stride, and padding
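One way to internalize the kernel size, stride, and padding trade-offs is to compute output sizes directly. A minimal standalone sketch (plain Python, no TinyTorch dependencies assumed):

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# 32x32 CIFAR-10 input, 3x3 kernel:
print(conv_output_size(32, 3))              # no padding: shrinks to 30
print(conv_output_size(32, 3, padding=1))   # padding=1 preserves 32
print(conv_output_size(32, 3, stride=2))    # stride=2 downsamples to 15
```

The same formula governs pooling layers, with the pool window playing the role of the kernel.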
3. Analyze
Deep-dive into architectural choices:
- Why does weight sharing reduce parameters dramatically?
- How do early vs late layers learn different features?
- What's the trade-off between depth and width in CNNs?
- Why are pooling operations crucial for translation invariance?
- How does spatial structure preservation improve learning?
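The weight-sharing question above has a concrete numerical answer. A quick back-of-the-envelope check in plain Python, assuming a 3x32x32 CIFAR-10 input:

```python
# Conv2D(3, 32, kernel_size=3): 32 filters, each 3x3x3, plus one bias each.
conv_params = 32 * 3 * 3 * 3 + 32
print(conv_params)  # 896

# A dense layer mapping the flattened 3x32x32 image to just 32 outputs:
dense_params = (3 * 32 * 32) * 32 + 32
print(dense_params)  # 98336

# Reproducing the conv layer's full 32x32x32 output densely would need:
dense_full = (3 * 32 * 32) * (32 * 32 * 32)
print(dense_full)  # 100663296 weights -- a >100,000x gap vs. weight sharing
```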
Implementation Guide
Core Components
Conv2D Layer - The Heart of Computer Vision
```python
class Conv2D:
    """2D convolutional layer with learnable filters.

    Implements sliding-window convolution:
    - Applies the same filter across all spatial positions (weight sharing)
    - Each filter learns to detect a different feature (edges, textures, objects)
    - Output is a feature map showing where the filter activates strongly

    Args:
        in_channels: Number of input channels (3 for RGB, 16 for feature maps)
        out_channels: Number of learned filters (feature detectors)
        kernel_size: Size of the sliding window (typically 3 or 5)
        stride: Step size when sliding (1 = no downsampling)
        padding: Border padding to preserve spatial dimensions
    """

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0):
        # Store hyperparameters (forward() relies on these)
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        # Initialize learnable filters and biases
        self.weight = Tensor(shape=(out_channels, in_channels, kernel_size, kernel_size))
        self.bias = Tensor(shape=(out_channels,))

    def forward(self, x):
        # x shape: (batch, in_channels, height, width)
        # NOTE: when self.padding > 0, x must be zero-padded on its height and
        # width borders before the loop below; the output-size formula already
        # accounts for that padding.
        batch, _, H, W = x.shape
        kh, kw = self.kernel_size, self.kernel_size

        # Calculate output dimensions
        out_h = (H + 2 * self.padding - kh) // self.stride + 1
        out_w = (W + 2 * self.padding - kw) // self.stride + 1

        # Sliding-window convolution
        output = Tensor(shape=(batch, self.out_channels, out_h, out_w))
        for b in range(batch):
            for oc in range(self.out_channels):
                for i in range(out_h):
                    for j in range(out_w):
                        # Extract the local patch under the window
                        i_start = i * self.stride
                        j_start = j * self.stride
                        patch = x[b, :, i_start:i_start + kh, j_start:j_start + kw]
                        # Convolution: element-wise multiply and sum
                        output[b, oc, i, j] = (patch * self.weight[oc]).sum() + self.bias[oc]
        return output
```
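The sliding-window logic is easy to sanity-check outside the module on an input small enough to verify by hand. A sketch with NumPy arrays standing in for the module's Tensor class (no padding, for simplicity):

```python
import numpy as np

def conv2d_naive(x, weight, bias, stride=1):
    """NumPy version of the same sliding-window loop (no padding)."""
    batch, _, H, W = x.shape
    out_c, _, kh, kw = weight.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((batch, out_c, out_h, out_w))
    for b in range(batch):
        for oc in range(out_c):
            for i in range(out_h):
                for j in range(out_w):
                    patch = x[b, :, i*stride:i*stride+kh, j*stride:j*stride+kw]
                    out[b, oc, i, j] = (patch * weight[oc]).sum() + bias[oc]
    return out

# A 1x1x3x3 input of ones and a single 2x2 filter of ones:
x = np.ones((1, 1, 3, 3))
w = np.ones((1, 1, 2, 2))
b = np.zeros(1)
out = conv2d_naive(x, w, b)
print(out.shape)   # (1, 1, 2, 2)
print(out[0, 0])   # every 2x2 patch of ones sums to 4.0
```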
Pooling Layers - Spatial Downsampling
```python
class MaxPool2D:
    """Max pooling for spatial downsampling and translation invariance.

    Takes the maximum value in each local region:
    - Reduces spatial dimensions while preserving important features
    - Provides invariance to small translations
    - Reduces computation in later layers
    """

    def __init__(self, kernel_size=2, stride=None):
        self.kernel_size = kernel_size
        self.stride = stride or kernel_size

    def forward(self, x):
        batch, channels, H, W = x.shape
        kh, kw = self.kernel_size, self.kernel_size
        out_h = (H - kh) // self.stride + 1
        out_w = (W - kw) // self.stride + 1

        output = Tensor(shape=(batch, channels, out_h, out_w))
        for b in range(batch):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        i_start = i * self.stride
                        j_start = j * self.stride
                        patch = x[b, c, i_start:i_start + kh, j_start:j_start + kw]
                        output[b, c, i, j] = patch.max()
        return output
```
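For non-overlapping 2x2 windows, the loop's result can be reproduced with a NumPy reshape trick on a hand-checkable input. This is a sketch for building intuition, not the module's implementation:

```python
import numpy as np

# MaxPool2D(kernel_size=2) on a single-channel 4x4 input, worked by hand.
x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 3],
              [4, 6, 5, 7]], dtype=float)

# reshape(2, 2, 2, 2): element (a, b, c, d) = x[2a + b, 2c + d],
# so axes 1 and 3 index positions inside each 2x2 block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[7. 8.]
#  [9. 7.]]
```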
Complete CNN Architecture
```python
class SimpleCNN:
    """CNN for CIFAR-10 classification.

    Architecture:
        Conv(3→32, 3x3) → ReLU → MaxPool(2x2)    # 32x32 → 16x16
        Conv(32→64, 3x3) → ReLU → MaxPool(2x2)   # 16x16 → 8x8
        Flatten → Dense(64*8*8 → 128) → ReLU
        Dense(128 → 10) → Softmax
    """

    def __init__(self):
        self.conv1 = Conv2D(3, 32, kernel_size=3, padding=1)
        self.relu1 = ReLU()
        self.pool1 = MaxPool2D(kernel_size=2)
        self.conv2 = Conv2D(32, 64, kernel_size=3, padding=1)
        self.relu2 = ReLU()
        self.pool2 = MaxPool2D(kernel_size=2)
        self.flatten = Flatten()
        self.fc1 = Linear(64 * 8 * 8, 128)
        self.relu3 = ReLU()
        self.fc2 = Linear(128, 10)

    def forward(self, x):
        # Feature extraction
        x = self.pool1(self.relu1(self.conv1(x)))   # (B, 32, 16, 16)
        x = self.pool2(self.relu2(self.conv2(x)))   # (B, 64, 8, 8)
        # Classification
        x = self.flatten(x)                         # (B, 4096)
        x = self.relu3(self.fc1(x))                 # (B, 128)
        x = self.fc2(x)                             # (B, 10)
        return x
```
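The shape comments in forward(), and in particular why fc1 takes 64*8*8 inputs, can be verified by tracing the output-size formula through each layer. A quick standalone check:

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial size after a conv or pool layer, along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

h = 32                                 # CIFAR-10 images are 3x32x32
h = out_size(h, kernel=3, padding=1)   # conv1: padding=1 keeps 32
h = out_size(h, kernel=2, stride=2)    # pool1: halves to 16
h = out_size(h, kernel=3, padding=1)   # conv2: keeps 16
h = out_size(h, kernel=2, stride=2)    # pool2: halves to 8
print(h, 64 * h * h)                   # 8 4096 -- matches fc1's input size
```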
Step-by-Step Implementation
1. **Implement Conv2D Forward Pass**
   - Create the sliding-window iteration over spatial dimensions
   - Apply weight sharing: the same filter at all positions
   - Handle batch processing efficiently
   - Verify the output shape calculation
2. **Build Pooling Operations**
   - Implement MaxPool2D with maximum extraction
   - Add AvgPool2D for average pooling
   - Handle stride and kernel size correctly
   - Test spatial dimension reduction
3. **Create Flatten Layer**
   - Convert (B, C, H, W) to (B, C*H*W)
   - Prepare spatial features for dense layers
   - Preserve the batch dimension
   - Enable gradient flow backward
4. **Design the Complete CNN**
   - Stack Conv → ReLU → Pool blocks for feature extraction
   - Add Flatten → Dense for classification
   - Calculate dimensions at each layer
   - Test the end-to-end forward pass
5. **Train on CIFAR-10**
   - Load CIFAR-10 using Module 08's DataLoader
   - Train with cross-entropy loss and SGD
   - Track accuracy on the test set
   - Achieve >75% accuracy
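The Flatten shape contract from step 3 is easy to check in NumPy. This is a sketch of the expected behavior, not the module's Tensor-based implementation:

```python
import numpy as np

# Flatten: (B, C, H, W) → (B, C*H*W), batch dimension preserved.
x = np.arange(2 * 64 * 8 * 8).reshape(2, 64, 8, 8)
flat = x.reshape(x.shape[0], -1)
print(flat.shape)  # (2, 4096)
```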
Testing
Inline Tests (During Development)
Run inline tests while building:
```bash
cd modules/09_spatial
python spatial_dev.py
```
Expected output:

```text
Unit Test: Conv2D implementation...
✅ Sliding window convolution works correctly
✅ Weight sharing applied at all positions
✅ Output shapes match expected dimensions
Progress: Conv2D ✓

Unit Test: MaxPool2D implementation...
✅ Maximum extraction works correctly
✅ Spatial dimensions reduced properly
✅ Translation invariance verified
Progress: Pooling ✓

Unit Test: Complete CNN architecture...
✅ Forward pass through all layers successful
✅ Output shape: (32, 10) for 10 classes
✅ Parameter count reasonable: ~500K parameters
Progress: CNN Architecture ✓
```
Export and Validate
After completing the module:
```bash
# Export to tinytorch package
tito export 09_spatial

# Run integration tests
tito test 09_spatial
```
CIFAR-10 Training Test
```bash
# Train simple CNN on CIFAR-10
python tests/integration/test_cnn_cifar10.py
```
Expected results:
- Epoch 1: 35% accuracy
- Epoch 5: 60% accuracy
- Epoch 10: 75% accuracy
Where This Code Lives
```text
tinytorch/
├── nn/
│   └── spatial.py     # Conv2D, MaxPool2D, etc.
└── __init__.py        # Exposes CNN components
```
Usage in other modules:
```python
>>> from tinytorch.nn import Conv2D, MaxPool2D
>>> conv = Conv2D(3, 32, kernel_size=3)
>>> pool = MaxPool2D(kernel_size=2)
```
Systems Thinking Questions
1. **Parameter Efficiency**: A Conv2D(3, 32, 3) has ~900 parameters. How many parameters would a Dense layer need to connect a 32x32 image to 32 outputs? Why is this difference critical for scaling?
2. **Translation Invariance**: Why does a CNN detect a cat regardless of whether it's in the top-left or bottom-right of an image? How does weight sharing enable this property?
3. **Hierarchical Features**: Early CNN layers detect edges and textures; later layers detect objects and faces. How does this emerge from stacking convolutions? Why doesn't this happen in dense networks?
4. **Receptive Field Growth**: A single Conv2D(kernel=3) sees a 3x3 region. After two Conv2D layers, what region does each output see? How do deep CNNs build global context from local operations?
5. **Compute vs Memory Trade-offs**: A single large kernel (7x7) reaches a wide receptive field in one layer; a stack of small 3x3 kernels reaches the same field with fewer parameters but more layers and more activation memory. Which is better and why?
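The receptive-field question above can be answered numerically with the standard receptive-field recurrence. A small sketch (`receptive_field` is an illustrative helper, not part of TinyTorch):

```python
def receptive_field(kernels, strides=None):
    """Receptive field of one output unit after a stack of conv layers."""
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k-1)*jump
        jump *= s              # strides multiply the effective step size
    return rf

print(receptive_field([3]))        # 3: one 3x3 conv sees a 3x3 region
print(receptive_field([3, 3]))     # 5: two stacked 3x3 convs see 5x5
print(receptive_field([3, 3, 3]))  # 7
print(receptive_field([7]))        # 7: one 7x7 conv, same field, more params
```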
Real-World Connections
Industry Applications
Autonomous Vehicles (Tesla, Waymo)
- Multi-camera CNN systems process 30 FPS at 1920x1200 resolution
- Feature maps from CNNs feed into object detection and segmentation
- Real-time requirements demand efficient Conv2D implementations
Medical Imaging (PathAI, Zebra Medical)
- CNNs analyze X-rays and CT scans for diagnostic assistance
- Achieve superhuman performance on specific tasks (diabetic retinopathy detection)
- Architecture design critical for accuracy-interpretability trade-off
Face Recognition (Apple Face ID, Facebook DeepFace)
- CNN embeddings enable accurate face matching at billion-user scale
- Lightweight CNN architectures run on mobile devices in real-time
- Privacy concerns drive on-device processing
Research Impact
This module implements patterns from:
- LeNet-5 (1998): First successful CNN for digit recognition
- AlexNet (2012): Sparked deep learning revolution with CNNs + GPUs
- VGG (2014): Showed deeper is better with simple 3x3 convolutions
- ResNet (2015): Enabled training 152-layer CNNs with skip connections
What's Next?
In Module 10: Tokenization, you'll shift from processing images to processing text:
- Learn how to convert text into numerical representations
- Implement tokenization strategies (character, word, subword)
- Build vocabulary management systems
- Prepare text data for transformers in Module 13
This completes the vision half of the Architecture Tier. Next, you'll tackle language!
Ready to build CNNs from scratch? Open modules/09_spatial/spatial_dev.py and start implementing.