Module 09: Spatial Operations - CNNs for Vision

Overview

Time: 3-4 hours | Difficulty: 4/5

Build convolutional neural networks (CNNs) - the foundation of computer vision. Learn how spatial operations enable pattern recognition in images through local connectivity and parameter sharing.

Prerequisites

Required Modules: 01-08 must be completed and tested

  • Module 01 (Tensor): Data structures
  • Module 02 (Activations): ReLU for feature detection
  • Module 03 (Layers): Linear layers foundation
  • Module 04 (Losses): CrossEntropy for classification
  • Module 05 (Autograd): Gradient computation
  • Module 06 (Optimizers): SGD/Adam for training
  • Module 07 (Training): Training loop patterns
  • Module 08 (Data): Efficient data loading

Before starting, verify prerequisites:

pytest modules/01_tensor/test_tensor.py
pytest modules/02_activations/test_activations.py
# ... test all modules 01-08

Learning Objectives

By the end of this module, you will:

Core Concepts

  1. Understand Convolutional Operations

    • Sliding window computation over spatial dimensions
    • Filter/kernel mathematics (cross-correlation)
    • Output size calculations: floor((H - K + 2P) / S) + 1, for input size H, kernel size K, padding P, and stride S
    • Why convolution works for spatial data
  2. Implement Conv2d Layers

    • Forward pass: applying filters to extract features
    • Backward pass: gradients for filters, inputs, and biases
    • Parameter sharing reduces model size vs fully-connected
    • Local connectivity captures spatial patterns
  3. Master Pooling Operations

    • MaxPool2d: dimensionality reduction while preserving features
    • Stride and kernel size trade-offs
    • Approximate invariance to small translations for robust recognition
    • When to pool vs when to use strided convolution
  4. Build Spatial Hierarchies

    • Early layers: edges and textures (local patterns)
    • Middle layers: parts and shapes (combinations)
    • Deep layers: objects and scenes (high-level concepts)
    • How receptive fields grow with depth
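
The sliding-window computation and the output-size formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the module's reference implementation: `conv_output_size` and `naive_conv2d` are hypothetical names, plain NumPy arrays stand in for the Tensor class from Module 01, and the batch axis is omitted to keep the loops readable.

```python
import numpy as np


def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one axis: floor((size - kernel + 2*padding) / stride) + 1."""
    return (size - kernel + 2 * padding) // stride + 1


def naive_conv2d(x, weight, bias=None, stride=1, padding=0):
    """Cross-correlation with explicit loops (hypothetical helper for illustration).

    x:      (C_in, H, W) input, a single image without a batch axis
    weight: (C_out, C_in, K, K) filters
    bias:   (C_out,) or None
    """
    c_out, c_in, k, _ = weight.shape
    _, h, w = x.shape
    h_out = conv_output_size(h, k, stride, padding)
    w_out = conv_output_size(w, k, stride, padding)

    # Zero-pad only the two spatial axes, never the channel axis.
    x_pad = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    out = np.zeros((c_out, h_out, w_out), dtype=float)

    for oc in range(c_out):          # each filter
        for i in range(h_out):       # each output row
            for j in range(w_out):   # each output column
                top, left = i * stride, j * stride
                window = x_pad[:, top:top + k, left:left + k]
                # np.sum hides the three innermost loops (C_in, K, K).
                out[oc, i, j] = np.sum(window * weight[oc])
        if bias is not None:
            out[oc] += bias[oc]
    return out
```

For a 28×28 input and a 5×5 kernel with stride 1 and no padding, `conv_output_size(28, 5)` gives 24. The kernel is applied without flipping, which is why the operation is cross-correlation rather than convolution in the strict mathematical sense.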

Systems Understanding

  1. Computational Complexity

    • FLOPs analysis: O(N²M²K²) for naive convolution
    • Why convolution is expensive (6 nested loops)
    • Memory bottlenecks in spatial operations
    • Cache efficiency and data locality
  2. Optimization Techniques

    • Im2col algorithm: trade memory for speed
    • Vectorization strategies for convolution
    • Why GPUs excel at convolutional operations
    • Batch processing for throughput
  3. Production Considerations

    • Parameter efficiency: CNNs vs MLPs for images
    • Mobile deployment: depthwise-separable convolutions
    • Memory footprint during training (activations + gradients)
    • Inference optimization patterns
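
To make the im2col trade-off concrete, here is a rough NumPy sketch under simplifying assumptions (single image, no padding). The names `im2col`, `conv2d_im2col`, and `conv2d_macs` are made up for this illustration and are not part of the module's API. Each K×K window is copied into one column, so the convolution becomes a single matrix multiply. The cost is memory: at stride 1 the unrolled matrix approaches K² times the size of the input.

```python
import numpy as np


def im2col(x, k, stride=1):
    """Unroll every K x K window of x (C, H, W) into one column.

    Returns (cols, h_out, w_out) where cols has shape (C*K*K, H_out*W_out).
    """
    c, h, w = x.shape
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    cols = np.empty((c * k * k, h_out * w_out), dtype=x.dtype)
    col = 0
    for i in range(h_out):
        for j in range(w_out):
            top, left = i * stride, j * stride
            # Flatten in (channel, row, column) order to match weight.reshape below.
            cols[:, col] = x[:, top:top + k, left:left + k].reshape(-1)
            col += 1
    return cols, h_out, w_out


def conv2d_im2col(x, weight, stride=1):
    """Convolution as one matmul: (C_out, C_in*K*K) @ (C_in*K*K, H_out*W_out)."""
    c_out, c_in, k, _ = weight.shape
    cols, h_out, w_out = im2col(x, k, stride)
    out = weight.reshape(c_out, -1) @ cols
    return out.reshape(c_out, h_out, w_out)


def conv2d_macs(c_in, c_out, h_out, w_out, k, batch=1):
    """Multiply-accumulates of a direct convolution: one per innermost loop iteration."""
    return batch * c_out * h_out * w_out * c_in * k * k
```

The copy loop in `im2col` is still written in Python here for clarity. The speedup comes from handing the arithmetic to an optimized matrix multiply, which is also the shape of work that GPUs handle well.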

ML Engineering Skills

  1. Architecture Design

    • Choosing filter sizes (1×1, 3×3, 5×5)
    • Balancing depth vs width
    • When to pool and when to stride
    • Building feature extraction pipelines
  2. Debugging Spatial Layers

    • Shape tracking through conv and pool layers
    • Gradient flow verification in deep networks
    • Common errors: dimension mismatches
    • Validating learned filters visually
  3. Performance Profiling

    • Measuring convolution speed vs input size
    • Memory usage scaling with batch size
    • Comparing naive vs optimized implementations
    • Bottleneck identification in CNN pipelines
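
One common way to verify gradients in a spatial layer is to compare the analytic backward pass against finite differences. The sketch below does this for a single-channel, stride-1 cross-correlation. It is an illustration under those simplifying assumptions, using hypothetical helper names (`corr2d`, `corr2d_backward`, `numerical_grad`) rather than the autograd machinery from Module 05.

```python
import numpy as np


def corr2d(x, w):
    """Valid cross-correlation of a 2-D input with a 2-D kernel (stride 1)."""
    k_h, k_w = w.shape
    h_out, w_out = x.shape[0] - k_h + 1, x.shape[1] - k_w + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = np.sum(x[i:i + k_h, j:j + k_w] * w)
    return out


def corr2d_backward(x, w, grad_out):
    """Analytic gradients of sum(grad_out * corr2d(x, w)) with respect to w and x."""
    k_h, k_w = w.shape
    grad_w = np.zeros_like(w)
    grad_x = np.zeros_like(x)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            # Each output pixel read one window of x and every kernel weight.
            grad_w += grad_out[i, j] * x[i:i + k_h, j:j + k_w]
            grad_x[i:i + k_h, j:j + k_w] += grad_out[i, j] * w
    return grad_w, grad_x


def numerical_grad(f, arr, eps=1e-6):
    """Central finite differences of the scalar f() with respect to each entry of arr."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = f()
        arr[idx] = old - eps
        minus = f()
        arr[idx] = old  # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad
```

If the layer has a bias, its gradient is simply the sum of `grad_out` over all output positions. A mismatch between `corr2d_backward` and `numerical_grad` usually points to an indexing error in the window slices.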

What You'll Build

Core Components

  1. Conv2d: Convolutional layer with learnable filters
  2. MaxPool2d: Max pooling for dimensionality reduction
  3. Flatten: Reshape spatial features for classification
  4. Helper functions: Shape calculation utilities
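
As a rough picture of what the pooling and flatten components do, here is a minimal NumPy sketch. The function names `max_pool2d` and `flatten` are assumptions for this example. The module itself builds `MaxPool2d` and `Flatten` as layers, and their exact interfaces may differ.

```python
import numpy as np


def max_pool2d(x, kernel=2, stride=None):
    """Maximum over each kernel x kernel window of x with shape (C, H, W)."""
    stride = kernel if stride is None else stride  # non-overlapping windows by default
    c, h, w = x.shape
    h_out = (h - kernel) // stride + 1
    w_out = (w - kernel) // stride + 1
    out = np.empty((c, h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            top, left = i * stride, j * stride
            window = x[:, top:top + kernel, left:left + kernel]
            out[:, i, j] = window.max(axis=(1, 2))  # per-channel maximum
    return out


def flatten(x):
    """Collapse (N, C, H, W) features into (N, C*H*W) rows for a Linear layer."""
    return x.reshape(x.shape[0], -1)


x = np.arange(16.0).reshape(1, 4, 4)
print(max_pool2d(x))                            # 2x2 map holding 5, 7, 13, 15
print(flatten(np.zeros((2, 16, 5, 5))).shape)   # (2, 400)
```

Pooling has no learnable parameters. It halves each spatial dimension here while keeping the strongest response in every 2×2 window.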

Complete CNN System

By module end, you'll have all components to build:

  • LeNet-style architectures (1998 - digit recognition)
  • Feature extraction pipelines
  • Spatial hierarchy networks
  • Everything needed for Milestone 04: LeNet CNN

Module Structure

modules/09_spatial/
├── README.md                 ← You are here
├── spatial_dev.py            ← Main implementation file
├── spatial_dev.ipynb         ← Jupyter notebook version
└── test_spatial.py           ← Validation tests

After This Module

Immediate Next Step

→ Milestone 04: LeNet CNN (1998)

Build Yann LeCun's historic convolutional network that revolutionized digit recognition. You now have all components: Conv2d, MaxPool2d, ReLU, and training loops.

Future Modules Will Add

  • Module 10: Normalization (BatchNorm, LayerNorm)
  • Module 11: Modern architectures (ResNets, skip connections)
  • Module 12: Attention mechanisms (transformers)

What Becomes Possible

  • Image classification (MNIST, CIFAR-10)
  • Feature extraction for transfer learning
  • Spatial pattern recognition
  • Building blocks for modern vision models

Key Insights You'll Discover

Why CNNs Work

  1. Parameter Sharing: Same filter applied everywhere → fewer parameters, translation equivariance
  2. Local Connectivity: Neurons see small regions → efficient detection of local patterns
  3. Hierarchical Features: Stack layers → learn complex patterns
  4. Spatial Structure: Preserve 2D topology → better for images
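
A quick parameter count shows how much parameter sharing saves. The layer sizes below are illustrative assumptions (a 28×28 grayscale image mapped to six 24×24 feature maps), and `conv2d_params` and `linear_params` are hypothetical helpers, not functions from the module.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """One K x K filter per (input channel, output channel) pair, plus one bias per filter."""
    return c_out * c_in * k * k + (c_out if bias else 0)


def linear_params(n_in, n_out, bias=True):
    """A dense layer connects every input value to every output value."""
    return n_in * n_out + (n_out if bias else 0)


# Convolution: 1 input channel, 6 filters of size 5x5.
conv = conv2d_params(1, 6, 5)

# Fully-connected layer producing the same number of outputs (6 * 24 * 24)
# from the same 28 * 28 inputs.
dense = linear_params(28 * 28, 6 * 24 * 24)

print(conv)           # 156
print(dense)          # 2712960
print(dense // conv)  # 17390
```

The convolution reuses the same 156 values at every image position, while the dense layer learns a separate weight for every input-output pair.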

Performance Realities

  1. Convolution is Expensive: O(N²M²K²) complexity → GPUs essential
  2. Memory Scales Quadratically: Doubling image height and width → 4× the activations
  3. Im2col Trade-off: roughly K²× input memory (about 9× for 3×3 kernels at stride 1) → 100× speed possible over naive loops
  4. Batch Processing: Amortize overhead → better throughput
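
The memory claim is easy to check with back-of-the-envelope arithmetic. The helper below is a hypothetical sketch that assumes 4-byte float32 values and counts only a single layer's output activations. During training, the gradient for that activation has the same shape, so the real footprint is larger.

```python
def activation_bytes(batch, channels, height, width, bytes_per_value=4):
    """Bytes needed to store one activation tensor of shape (N, C, H, W)."""
    return batch * channels * height * width * bytes_per_value


small = activation_bytes(batch=32, channels=64, height=224, width=224)
large = activation_bytes(batch=32, channels=64, height=448, width=448)

print(small)              # 411041792 bytes
print(small / 2**20)      # 392.0 MiB
print(large // small)     # 4: doubling both sides quadruples the memory
```

Memory also grows linearly with batch size, which is why large images often force smaller batches.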

Architectural Patterns

  1. Gradual Downsampling: Increase channels, decrease spatial size
  2. 3×3 Dominance: Best balance of expressiveness and efficiency
  3. Pooling Alternatives: Strided conv can replace pooling
  4. Depth Matters: More layers → better hierarchies
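
One way to see why small kernels are popular is to compare receptive fields and parameter counts. The sketch below uses the standard recurrence for receptive-field size. The function names are made up for this example, and biases are ignored in the parameter count.

```python
def receptive_field(layers):
    """Receptive field of one output value along one axis.

    layers: list of (kernel, stride) pairs, first layer first.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def stacked_conv_weights(channels, kernel, n_layers):
    """Weights in n_layers convolutions that each keep `channels` channels."""
    return n_layers * channels * channels * kernel * kernel


print(receptive_field([(3, 1), (3, 1)]))          # 5: two 3x3 layers see a 5x5 region
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: conv, 2x2 pool, conv
print(stacked_conv_weights(64, 3, 2))             # 73728 weights for two 3x3 layers
print(stacked_conv_weights(64, 5, 1))             # 102400 weights for one 5x5 layer
```

Two stacked 3×3 layers cover the same 5×5 region as a single 5×5 layer with fewer weights, and they add an extra nonlinearity in between.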

Tips for Success

Implementation Strategy

  1. Start Simple: Get 3×3 convolution working first
  2. Test Incrementally: Verify shapes at each step
  3. Profile Early: Measure performance to understand complexity
  4. Visualize Outputs: Check feature maps make sense

Common Pitfalls

  • ⚠️ Shape Mismatches: Track dimensions carefully through conv/pool
  • ⚠️ Memory Errors: Batch size × spatial size can be huge
  • ⚠️ Gradient Issues: Deep networks need careful initialization
  • ⚠️ Performance: Naive implementation will be slow (that's the point!)

Debugging Techniques

# Always print shapes during development
print(f"Input: {x.shape}")
x = conv1(x)
print(f"After conv1: {x.shape}")
x = pool1(x)
print(f"After pool1: {x.shape}")
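
Shapes can also be checked on paper before any layer runs. The following sketch traces (C, H, W) through a LeNet-style stack using only the output-size formula. The dictionary-based layer description and the names `conv_out` and `trace_shapes` are invented for this example and are not part of the module.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """floor((size - kernel + 2*padding) / stride) + 1 along one axis."""
    return (size - kernel + 2 * padding) // stride + 1


def trace_shapes(input_shape, layers):
    """Return the (C, H, W) shape after each layer.

    layers is a list of dicts (a made-up format for this sketch), for example
      {"type": "conv", "out_channels": 6, "kernel": 5, "stride": 1, "padding": 0}
      {"type": "pool", "kernel": 2, "stride": 2}
    """
    c, h, w = input_shape
    shapes = []
    for n, layer in enumerate(layers):
        k = layer["kernel"]
        # Convolutions default to stride 1, pooling to stride = kernel.
        s = layer.get("stride", 1 if layer["type"] == "conv" else k)
        p = layer.get("padding", 0)
        h, w = conv_out(h, k, s, p), conv_out(w, k, s, p)
        if layer["type"] == "conv":
            c = layer["out_channels"]
        if h < 1 or w < 1:
            raise ValueError(f"layer {n} ({layer['type']}) shrinks the input to {h}x{w}")
        shapes.append((c, h, w))
    return shapes


lenet_like = [
    {"type": "conv", "out_channels": 6, "kernel": 5},
    {"type": "pool", "kernel": 2},
    {"type": "conv", "out_channels": 16, "kernel": 5},
    {"type": "pool", "kernel": 2},
]
for shape in trace_shapes((1, 28, 28), lenet_like):
    print(shape)  # (6, 24, 24), (6, 12, 12), (16, 8, 8), (16, 4, 4)
```

The final shape tells you the input size of the first Linear layer after flattening: 16 × 4 × 4 = 256 features for this configuration.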

Estimated Timeline

  • Parts 1-2: Introduction & Math (30 minutes)
  • Part 3: Conv2d Implementation (90 minutes)
  • Part 4: MaxPool2d & Flatten (45 minutes)
  • Part 5: Systems Analysis (30 minutes)
  • Part 6: Integration & Testing (30 minutes)
  • Total: 3-4 hours with breaks

Learning Approach

This is a Core Module (complexity level 4/5):

  • Full implementation with explicit loops (see the complexity!)
  • Systems analysis reveals performance characteristics
  • Connection to production patterns (im2col, GPU kernels)
  • Immediate testing after each component

Don't rush - understanding spatial operations deeply is crucial for modern ML.

Getting Started

Open spatial_dev.py and begin with Part 1: Introduction to Spatial Operations.

Remember: You're building the foundation of computer vision. Take time to understand how these operations enable hierarchical feature learning in images.


Ready? Let's build CNNs! 🏗️