| title | description | difficulty | time_estimate | prerequisites | next_steps | learning_objectives |
|---|---|---|---|---|---|---|
| Convolutional Networks | Build CNNs from scratch for computer vision and spatial pattern recognition | 3 | 6-8 hours | | | |
09. Convolutional Networks
🏛️ ARCHITECTURE TIER | Difficulty: ⭐⭐⭐ (3/4) | Time: 6-8 hours
Overview
Implement convolutional neural networks (CNNs) from scratch. This module teaches you how convolution transforms computer vision from hand-crafted features to learned hierarchical representations that power everything from image classification to autonomous driving.
Learning Objectives
By completing this module, you will be able to:
- Implement convolution as sliding window operations with explicit loops, understanding weight sharing and local connectivity
- Design CNN architectures by composing convolutional, pooling, and dense layers for image classification
- Explain translation invariance and why CNNs outperform dense networks on spatial data
- Build pooling operations (MaxPool, AvgPool) for spatial downsampling and feature invariance
- Apply computer vision principles to achieve >75% accuracy on CIFAR-10 image classification
Why This Matters
Production Context
CNNs are the backbone of modern computer vision systems:
- Meta's Vision AI uses CNN architectures to tag 2 billion photos daily across Facebook and Instagram
- Tesla Autopilot processes camera feeds through CNN backbones for object detection and lane recognition
- Google Photos built a CNN-based system that automatically organizes billions of images
- Medical Imaging systems use CNNs to detect cancer in X-rays and MRIs with superhuman accuracy
Historical Context
The convolution revolution transformed computer vision:
- LeNet (1998): Yann LeCun's CNN read zip codes on mail; convolution proved viable but limited by compute
- AlexNet (2012): Won ImageNet with 15.3% top-5 error (vs 26.2% for the runner-up); GPUs + convolution = computer vision revolution
- ResNet (2015): 152-layer CNN achieved 3.57% top-5 error (better than the ~5% human baseline); proved depth matters
- Modern Era (2020+): CNNs power production vision systems processing trillions of images daily
The patterns you're implementing revolutionized how machines see.
Pedagogical Pattern: Build → Use → Analyze
1. Build
Implement from first principles:
- Convolution as explicit sliding window operation
- Conv2D layer with learnable filters and weight sharing
- MaxPool2D and AvgPool2D for spatial downsampling
- Flatten layer to connect spatial and dense layers
- Complete CNN architecture with feature extraction and classification
2. Use
Apply to real problems:
- Build CNN for CIFAR-10 image classification
- Extract and visualize learned feature maps
- Compare CNN vs MLP performance on spatial data
- Achieve >75% accuracy with proper architecture
- Understand impact of kernel size, stride, and padding
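One way to internalize the kernel size, stride, and padding trade-offs is to compute output sizes directly. A minimal standalone sketch (plain Python, no TinyTorch dependencies assumed):

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# 32x32 CIFAR-10 input, 3x3 kernel:
print(conv_output_size(32, 3))              # no padding: shrinks to 30
print(conv_output_size(32, 3, padding=1))   # padding=1 preserves 32
print(conv_output_size(32, 3, stride=2))    # stride=2 downsamples to 15
```

The same formula governs pooling layers, with the pool window playing the role of the kernel.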
3. Analyze
Deep-dive into architectural choices:
- Why does weight sharing reduce parameters dramatically?
- How do early vs late layers learn different features?
- What's the trade-off between depth and width in CNNs?
- Why are pooling operations crucial for translation invariance?
- How does spatial structure preservation improve learning?
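The weight-sharing question above has a concrete numerical answer. A quick back-of-the-envelope check in plain Python, assuming a 3x32x32 CIFAR-10 input:

```python
# Conv2D(3, 32, kernel_size=3): 32 filters, each 3x3x3, plus one bias each.
conv_params = 32 * 3 * 3 * 3 + 32
print(conv_params)  # 896

# A dense layer mapping the flattened 3x32x32 image to just 32 outputs:
dense_params = (3 * 32 * 32) * 32 + 32
print(dense_params)  # 98336

# Reproducing the conv layer's full 32x32x32 output densely would need:
dense_full = (3 * 32 * 32) * (32 * 32 * 32)
print(dense_full)  # 100663296 weights -- a >100,000x gap vs. weight sharing
```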
Implementation Guide
Core Components
Conv2D Layer - The Heart of Computer Vision
```python
class Conv2D:
    """2D convolutional layer with learnable filters.

    Implements sliding-window convolution:
    - Applies the same filter across all spatial positions (weight sharing)
    - Each filter learns to detect a different feature (edges, textures, objects)
    - Output is a feature map showing where the filter activates strongly

    Args:
        in_channels: Number of input channels (3 for RGB, 16 for feature maps)
        out_channels: Number of learned filters (feature detectors)
        kernel_size: Size of the sliding window (typically 3 or 5)
        stride: Step size when sliding (1 = no downsampling)
        padding: Border padding to preserve spatial dimensions
    """

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0):
        # Store hyperparameters (forward() relies on these)
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        # Initialize learnable filters and biases
        self.weight = Tensor(shape=(out_channels, in_channels, kernel_size, kernel_size))
        self.bias = Tensor(shape=(out_channels,))

    def forward(self, x):
        # x shape: (batch, in_channels, height, width)
        # NOTE: when self.padding > 0, x must be zero-padded on its height and
        # width borders before the loop below; the output-size formula already
        # accounts for that padding.
        batch, _, H, W = x.shape
        kh, kw = self.kernel_size, self.kernel_size

        # Calculate output dimensions
        out_h = (H + 2 * self.padding - kh) // self.stride + 1
        out_w = (W + 2 * self.padding - kw) // self.stride + 1

        # Sliding-window convolution
        output = Tensor(shape=(batch, self.out_channels, out_h, out_w))
        for b in range(batch):
            for oc in range(self.out_channels):
                for i in range(out_h):
                    for j in range(out_w):
                        # Extract the local patch under the window
                        i_start = i * self.stride
                        j_start = j * self.stride
                        patch = x[b, :, i_start:i_start + kh, j_start:j_start + kw]
                        # Convolution: element-wise multiply and sum
                        output[b, oc, i, j] = (patch * self.weight[oc]).sum() + self.bias[oc]
        return output
```
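The sliding-window logic is easy to sanity-check outside the module on an input small enough to verify by hand. A sketch with NumPy arrays standing in for the module's Tensor class (no padding, for simplicity):

```python
import numpy as np

def conv2d_naive(x, weight, bias, stride=1):
    """NumPy version of the same sliding-window loop (no padding)."""
    batch, _, H, W = x.shape
    out_c, _, kh, kw = weight.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((batch, out_c, out_h, out_w))
    for b in range(batch):
        for oc in range(out_c):
            for i in range(out_h):
                for j in range(out_w):
                    patch = x[b, :, i*stride:i*stride+kh, j*stride:j*stride+kw]
                    out[b, oc, i, j] = (patch * weight[oc]).sum() + bias[oc]
    return out

# A 1x1x3x3 input of ones and a single 2x2 filter of ones:
x = np.ones((1, 1, 3, 3))
w = np.ones((1, 1, 2, 2))
b = np.zeros(1)
out = conv2d_naive(x, w, b)
print(out.shape)   # (1, 1, 2, 2)
print(out[0, 0])   # every 2x2 patch of ones sums to 4.0
```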
Pooling Layers - Spatial Downsampling
```python
class MaxPool2D:
    """Max pooling for spatial downsampling and translation invariance.

    Takes the maximum value in each local region:
    - Reduces spatial dimensions while preserving important features
    - Provides invariance to small translations
    - Reduces computation in later layers
    """

    def __init__(self, kernel_size=2, stride=None):
        self.kernel_size = kernel_size
        self.stride = stride or kernel_size

    def forward(self, x):
        batch, channels, H, W = x.shape
        kh, kw = self.kernel_size, self.kernel_size
        out_h = (H - kh) // self.stride + 1
        out_w = (W - kw) // self.stride + 1

        output = Tensor(shape=(batch, channels, out_h, out_w))
        for b in range(batch):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        i_start = i * self.stride
                        j_start = j * self.stride
                        patch = x[b, c, i_start:i_start + kh, j_start:j_start + kw]
                        output[b, c, i, j] = patch.max()
        return output
```
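For non-overlapping 2x2 windows, the loop's result can be reproduced with a NumPy reshape trick on a hand-checkable input. This is a sketch for building intuition, not the module's implementation:

```python
import numpy as np

# MaxPool2D(kernel_size=2) on a single-channel 4x4 input, worked by hand.
x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 3],
              [4, 6, 5, 7]], dtype=float)

# reshape(2, 2, 2, 2): element (a, b, c, d) = x[2a + b, 2c + d],
# so axes 1 and 3 index positions inside each 2x2 block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[7. 8.]
#  [9. 7.]]
```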
Complete CNN Architecture
```python
class SimpleCNN:
    """CNN for CIFAR-10 classification.

    Architecture:
        Conv(3→32, 3x3) → ReLU → MaxPool(2x2)    # 32x32 → 16x16
        Conv(32→64, 3x3) → ReLU → MaxPool(2x2)   # 16x16 → 8x8
        Flatten → Dense(64*8*8 → 128) → ReLU
        Dense(128 → 10) → Softmax
    """

    def __init__(self):
        self.conv1 = Conv2D(3, 32, kernel_size=3, padding=1)
        self.relu1 = ReLU()
        self.pool1 = MaxPool2D(kernel_size=2)
        self.conv2 = Conv2D(32, 64, kernel_size=3, padding=1)
        self.relu2 = ReLU()
        self.pool2 = MaxPool2D(kernel_size=2)
        self.flatten = Flatten()
        self.fc1 = Linear(64 * 8 * 8, 128)
        self.relu3 = ReLU()
        self.fc2 = Linear(128, 10)

    def forward(self, x):
        # Feature extraction
        x = self.pool1(self.relu1(self.conv1(x)))   # (B, 32, 16, 16)
        x = self.pool2(self.relu2(self.conv2(x)))   # (B, 64, 8, 8)
        # Classification
        x = self.flatten(x)                         # (B, 4096)
        x = self.relu3(self.fc1(x))                 # (B, 128)
        x = self.fc2(x)                             # (B, 10)
        return x
```
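The shape comments in forward(), and in particular why fc1 takes 64*8*8 inputs, can be verified by tracing the output-size formula through each layer. A quick standalone check:

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial size after a conv or pool layer, along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

h = 32                                 # CIFAR-10 images are 3x32x32
h = out_size(h, kernel=3, padding=1)   # conv1: padding=1 keeps 32
h = out_size(h, kernel=2, stride=2)    # pool1: halves to 16
h = out_size(h, kernel=3, padding=1)   # conv2: keeps 16
h = out_size(h, kernel=2, stride=2)    # pool2: halves to 8
print(h, 64 * h * h)                   # 8 4096 -- matches fc1's input size
```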
Step-by-Step Implementation
1. **Implement Conv2D Forward Pass**
   - Create the sliding-window iteration over spatial dimensions
   - Apply weight sharing: the same filter at all positions
   - Handle batch processing efficiently
   - Verify the output shape calculation
2. **Build Pooling Operations**
   - Implement MaxPool2D with maximum extraction
   - Add AvgPool2D for average pooling
   - Handle stride and kernel size correctly
   - Test spatial dimension reduction
3. **Create Flatten Layer**
   - Convert (B, C, H, W) to (B, C*H*W)
   - Prepare spatial features for dense layers
   - Preserve the batch dimension
   - Enable gradient flow backward
4. **Design the Complete CNN**
   - Stack Conv → ReLU → Pool blocks for feature extraction
   - Add Flatten → Dense for classification
   - Calculate dimensions at each layer
   - Test the end-to-end forward pass
5. **Train on CIFAR-10**
   - Load CIFAR-10 using Module 08's DataLoader
   - Train with cross-entropy loss and SGD
   - Track accuracy on the test set
   - Achieve >75% accuracy
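The Flatten shape contract from step 3 is easy to check in NumPy. This is a sketch of the expected behavior, not the module's Tensor-based implementation:

```python
import numpy as np

# Flatten: (B, C, H, W) → (B, C*H*W), batch dimension preserved.
x = np.arange(2 * 64 * 8 * 8).reshape(2, 64, 8, 8)
flat = x.reshape(x.shape[0], -1)
print(flat.shape)  # (2, 4096)
```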
Testing
Inline Tests (During Development)
Run inline tests while building:
```bash
cd modules/09_spatial
python spatial_dev.py
```
Expected output:

```text
Unit Test: Conv2D implementation...
✅ Sliding window convolution works correctly
✅ Weight sharing applied at all positions
✅ Output shapes match expected dimensions
Progress: Conv2D ✓

Unit Test: MaxPool2D implementation...
✅ Maximum extraction works correctly
✅ Spatial dimensions reduced properly
✅ Translation invariance verified
Progress: Pooling ✓

Unit Test: Complete CNN architecture...
✅ Forward pass through all layers successful
✅ Output shape: (32, 10) for 10 classes
✅ Parameter count reasonable: ~500K parameters
Progress: CNN Architecture ✓
```
Export and Validate
After completing the module:
```bash
# Export to tinytorch package
tito export 09_spatial

# Run integration tests
tito test 09_spatial
```
CIFAR-10 Training Test
```bash
# Train simple CNN on CIFAR-10
python tests/integration/test_cnn_cifar10.py
```
Expected results:
- Epoch 1: 35% accuracy
- Epoch 5: 60% accuracy
- Epoch 10: 75% accuracy
Where This Code Lives
```text
tinytorch/
├── nn/
│   └── spatial.py     # Conv2D, MaxPool2D, etc.
└── __init__.py        # Exposes CNN components
```
Usage in other modules:
```python
>>> from tinytorch.nn import Conv2D, MaxPool2D
>>> conv = Conv2D(3, 32, kernel_size=3)
>>> pool = MaxPool2D(kernel_size=2)
```
Systems Thinking Questions
1. **Parameter Efficiency**: A Conv2D(3, 32, 3) has ~900 parameters. How many parameters would a Dense layer need to connect a 32x32 image to 32 outputs? Why is this difference critical for scaling?
2. **Translation Invariance**: Why does a CNN detect a cat regardless of whether it's in the top-left or bottom-right of an image? How does weight sharing enable this property?
3. **Hierarchical Features**: Early CNN layers detect edges and textures; later layers detect objects and faces. How does this emerge from stacking convolutions? Why doesn't this happen in dense networks?
4. **Receptive Field Growth**: A single Conv2D(kernel=3) sees a 3x3 region. After two Conv2D layers, what region does each output see? How do deep CNNs build global context from local operations?
5. **Compute vs Memory Trade-offs**: A single large kernel (7x7) reaches a wide receptive field in one layer; a stack of small 3x3 kernels reaches the same field with fewer parameters but more layers and more activation memory. Which is better and why?
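The receptive-field question above can be answered numerically with the standard receptive-field recurrence. A small sketch (`receptive_field` is an illustrative helper, not part of TinyTorch):

```python
def receptive_field(kernels, strides=None):
    """Receptive field of one output unit after a stack of conv layers."""
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k-1)*jump
        jump *= s              # strides multiply the effective step size
    return rf

print(receptive_field([3]))        # 3: one 3x3 conv sees a 3x3 region
print(receptive_field([3, 3]))     # 5: two stacked 3x3 convs see 5x5
print(receptive_field([3, 3, 3]))  # 7
print(receptive_field([7]))        # 7: one 7x7 conv, same field, more params
```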
Real-World Connections
Industry Applications
Autonomous Vehicles (Tesla, Waymo)
- Multi-camera CNN systems process 30 FPS at 1920x1200 resolution
- Feature maps from CNNs feed into object detection and segmentation
- Real-time requirements demand efficient Conv2D implementations
Medical Imaging (PathAI, Zebra Medical)
- CNNs analyze X-rays and CT scans for diagnostic assistance
- Achieve superhuman performance on specific tasks (diabetic retinopathy detection)
- Architecture design critical for accuracy-interpretability trade-off
Face Recognition (Apple Face ID, Facebook DeepFace)
- CNN embeddings enable accurate face matching at billion-user scale
- Lightweight CNN architectures run on mobile devices in real-time
- Privacy concerns drive on-device processing
Research Impact
This module implements patterns from:
- LeNet-5 (1998): First successful CNN for digit recognition
- AlexNet (2012): Sparked deep learning revolution with CNNs + GPUs
- VGG (2014): Showed deeper is better with simple 3x3 convolutions
- ResNet (2015): Enabled training 152-layer CNNs with skip connections
What's Next?
In Module 10: Tokenization, you'll shift from processing images to processing text:
- Learn how to convert text into numerical representations
- Implement tokenization strategies (character, word, subword)
- Build vocabulary management systems
- Prepare text data for transformers in Module 13
This completes the vision half of the Architecture Tier. Next, you'll tackle language!
Ready to build CNNs from scratch? Open modules/09_spatial/spatial_dev.py and start implementing.