Standardize Module 09 (Spatial/CNNs) to professional template

- Add complete YAML frontmatter with metadata
- Add INTELLIGENCE tier badge
- Standardize to exactly 5 learning objectives (systems/implementation/patterns/framework/optimization)
- Implement Build → Use → Analyze pedagogical pattern
- Add Why This Matters with production context (Tesla, Meta, medical imaging)
- Add historical context (LeNet to ResNet evolution)
- Add detailed Implementation Guide with Conv2D and pooling code
- Add Systems Thinking Questions on parameter efficiency and hierarchical features
- Add Real-World Connections to autonomous vehicles and medical imaging
- Reduce emoji usage for professional tone
- Add clear What's Next navigation to Module 10
2026-06-02 17:16:34 -05:00 · 2025-11-07 17:16:03 -05:00
parent e7f031b4cb
commit 8bf6eaedab
1 changed files with 313 additions and 206 deletions
--- a/book/chapters/09-spatial.md
+++ b/book/chapters/09-spatial.md
@@ -1,253 +1,360 @@
 ---
-title: "Spatial Networks"
-description: "Convolutional networks for spatial pattern recognition and image processing"
-difficulty: "⭐⭐⭐"
+title: "Spatial - Convolutional Neural Networks"
+description: "Build CNNs from scratch for computer vision and spatial pattern recognition"
+difficulty: 3
 time_estimate: "6-8 hours"
-prerequisites: []
-next_steps: []
-learning_objectives: []
+prerequisites: ["Tensor", "Activations", "Layers", "DataLoader"]
+next_steps: ["Tokenization"]
+learning_objectives:
+  - "Implement convolution as sliding window operations with weight sharing"
+  - "Design CNN architectures with feature extraction and classification components"
+  - "Understand translation invariance and hierarchical feature learning"
+  - "Build pooling operations for spatial downsampling and invariance"
+  - "Apply computer vision principles to image classification tasks"
 ---

-# Module: CNN
+# 09. Spatial (CNNs)

-```{div} badges
-⭐⭐⭐ | ⏱️ 6-8 hours
-```
+**🧠 INTELLIGENCE TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 6-8 hours

+## Overview

-## 📊 Module Info
- **Difficulty**: ⭐⭐⭐ Advanced
- **Time Estimate**: 6-8 hours
- **Prerequisites**: Tensor, Activations, Layers, Networks modules
- **Next Steps**: Training, Computer Vision modules
+Implement convolutional neural networks (CNNs) from scratch. This module teaches you how convolution transforms computer vision from hand-crafted features to learned hierarchical representations that power everything from image classification to autonomous driving.

-Implement the core building block of modern computer vision: the convolutional layer. This module teaches you how convolution transforms computer vision from hand-crafted features to learned hierarchical representations that power everything from image recognition to autonomous vehicles.
+## Learning Objectives

-## 🎯 Learning Objectives
+By completing this module, you will be able to:

-By the end of this module, you will be able to:
+1. **Implement convolution** as sliding window operations with explicit loops, understanding weight sharing and local connectivity
+2. **Design CNN architectures** by composing convolutional, pooling, and dense layers for image classification
+3. **Understand translation invariance** and why CNNs are superior to dense networks for spatial data
+4. **Build pooling operations** (MaxPool, AvgPool) for spatial downsampling and feature invariance
+5. **Apply computer vision principles** to achieve >75% accuracy on CIFAR-10 image classification

- **Understand convolution fundamentals**: Master the sliding window operation, local connectivity, and weight sharing principles
- **Implement Conv2D from scratch**: Build convolutional layers using explicit loops to understand the core operation
- **Visualize feature learning**: See how convolution builds feature maps and hierarchical representations
- **Design CNN architectures**: Compose convolutional layers with pooling and dense layers into complete networks
- **Apply computer vision principles**: Understand how CNNs revolutionized image processing and pattern recognition
+## Why This Matters

-## 🧠 Build → Use → Analyze
+### Production Context

-This module follows TinyTorch's **Build → Use → Analyze** framework:
+CNNs are the backbone of modern computer vision systems:

-1. **Build**: Implement Conv2D from scratch using explicit for-loops to understand the core convolution operation
-2. **Use**: Compose Conv2D with activation functions and other layers to build complete convolutional networks
-3. **Analyze**: Visualize learned features, understand architectural choices, and compare CNN performance characteristics
+- **Meta's Vision AI** uses CNN architectures to tag 2 billion photos daily across Facebook and Instagram
+- **Tesla Autopilot** processes camera feeds through CNN backbones for object detection and lane recognition
+- **Google Photos** built a CNN-based system that automatically organizes billions of images
+- **Medical Imaging** systems use CNNs to flag cancer in X-rays and MRIs, with reported accuracy matching or exceeding specialists on some narrowly defined tasks

-## 📚 What You'll Build
+### Historical Context

-### Core Convolution Implementation
+The convolution revolution transformed computer vision:
+
+- **LeNet (1998)**: Yann LeCun's CNN read zip codes on mail; convolution proved viable but limited by compute
+- **AlexNet (2012)**: Won ImageNet with a ~16% top-5 error rate (vs ~26% for the runner-up); GPUs + convolution = computer vision revolution
+- **ResNet (2015)**: 152-layer CNN achieved 3.6% top-5 error (below the ~5% estimated for humans); proved depth matters
+- **Modern Era (2020+)**: CNNs power production vision systems processing trillions of images daily
+
+The patterns you're implementing revolutionized how machines see.
+
+## Pedagogical Pattern: Build → Use → Analyze
+
+### 1. Build
+
+Implement from first principles:
+- Convolution as explicit sliding window operation
+- Conv2D layer with learnable filters and weight sharing
+- MaxPool2D and AvgPool2D for spatial downsampling
+- Flatten layer to connect spatial and dense layers
+- Complete CNN architecture with feature extraction and classification
+
+### 2. Use
+
+Apply to real problems:
+- Build CNN for CIFAR-10 image classification
+- Extract and visualize learned feature maps
+- Compare CNN vs MLP performance on spatial data
+- Achieve >75% accuracy with proper architecture
+- Understand impact of kernel size, stride, and padding
+
+### 3. Analyze
+
+Deep-dive into architectural choices:
+- Why does weight sharing reduce parameters dramatically?
+- How do early vs late layers learn different features?
+- What's the trade-off between depth and width in CNNs?
+- Why are pooling operations crucial for translation invariance?
+- How does spatial structure preservation improve learning?
+
+## Implementation Guide
+
+### Core Components
+
+**Conv2D Layer - The Heart of Computer Vision**
 ```python
-# Conv2D layer: the heart of computer vision
-conv_layer = Conv2D(in_channels=3, out_channels=16, kernel_size=3)
-input_image = Tensor([[[[...]]]])  # (batch, channels, height, width)
-feature_maps = conv_layer(input_image)  # Learned features
-
-# Understanding the operation
-print(f"Input shape: {input_image.shape}")     # (1, 3, 32, 32)
-print(f"Output shape: {feature_maps.shape}")   # (1, 16, 30, 30)
-print(f"Learned {feature_maps.shape[1]} different feature detectors")
+class Conv2D:
+    """2D Convolutional layer with learnable filters.
+    
+    Implements sliding window convolution:
+    - Applies same filter across all spatial positions (weight sharing)
+    - Each filter learns to detect different features (edges, textures, objects)
+    - Output is feature map showing where filter activates strongly
+    
+    Args:
+        in_channels: Number of input channels (3 for RGB, 16 for feature maps)
+        out_channels: Number of learned filters (feature detectors)
+        kernel_size: Size of sliding window (typically 3 or 5)
+        stride: Step size when sliding (1 = no downsampling)
+        padding: Border padding to preserve spatial dimensions
+    """
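+    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0):
+        # Store hyperparameters; forward() reads them when computing output shapes
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.kernel_size = kernel_size
+        self.stride = stride
+        self.padding = padding
+
+        # Initialize learnable filters
+        self.weight = Tensor(shape=(out_channels, in_channels, kernel_size, kernel_size))
+        self.bias = Tensor(shape=(out_channels,))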
+        
+    def forward(self, x):
+        # x shape: (batch, in_channels, height, width)
+        batch, _, H, W = x.shape
+        kh, kw = self.kernel_size, self.kernel_size
+        
+        # Calculate output dimensions
+        out_h = (H + 2 * self.padding - kh) // self.stride + 1
+        out_w = (W + 2 * self.padding - kw) // self.stride + 1
+        
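+        # NOTE: when self.padding > 0, x must be zero-padded by self.padding on each
+        # spatial side before this loop; otherwise border patches are smaller than
+        # the kernel. The padding step itself is not shown here.
+        # Sliding window convolution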
+        output = Tensor(shape=(batch, self.out_channels, out_h, out_w))
+        for b in range(batch):
+            for oc in range(self.out_channels):
+                for i in range(out_h):
+                    for j in range(out_w):
+                        # Extract local patch
+                        i_start = i * self.stride
+                        j_start = j * self.stride
+                        patch = x[b, :, i_start:i_start+kh, j_start:j_start+kw]
+                        
+                        # Convolution: element-wise multiply and sum
+                        output[b, oc, i, j] = (patch * self.weight[oc]).sum() + self.bias[oc]
+        
+        return output
 ```
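
The `forward` above derives `out_h` and `out_w` from `self.padding` but does not show the padding step itself. The sketch below is a minimal pure-Python illustration of the same sliding-window operation on a single-channel nested list, with zero-padding applied explicitly. The helper names `conv_output_size`, `zero_pad`, and `correlate2d` are hypothetical and are not part of TinyTorch.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis (same formula as in forward())."""
    return (size + 2 * padding - kernel) // stride + 1


def zero_pad(image, padding):
    """Return a copy of a 2D nested list with `padding` zeros on every side."""
    if padding == 0:
        return [list(row) for row in image]
    width = len(image[0]) + 2 * padding
    blank = [0.0] * width
    padded = [blank[:] for _ in range(padding)]
    for row in image:
        padded.append([0.0] * padding + list(row) + [0.0] * padding)
    padded.extend(blank[:] for _ in range(padding))
    return padded


def correlate2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a single-channel image (cross-correlation)."""
    k = len(kernel)
    padded = zero_pad(image, padding)
    out_h = conv_output_size(len(image), k, stride, padding)
    out_w = conv_output_size(len(image[0]), k, stride, padding)
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Element-wise multiply the local patch by the kernel and sum
            for di in range(k):
                for dj in range(k):
                    total += padded[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


if __name__ == "__main__":
    ones = [[1.0] * 3 for _ in range(3)]
    print(correlate2d(ones, ones, padding=1))  # border sums are smaller than the center
    print(conv_output_size(32, 3), conv_output_size(32, 3, padding=1))
```

With `padding=1` and a 3x3 kernel the output keeps the input's spatial size, which is what the 32x32 → 16x16 → 8x8 progression in the CNN architecture relies on.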

-### Complete CNN Architecture
+**Pooling Layers - Spatial Downsampling**
 ```python
-# Simple CNN for image classification
-cnn = Sequential([
-    Conv2D(3, 16, kernel_size=3),    # Feature extraction
-    ReLU(),                          # Nonlinearity
-    MaxPool2D(kernel_size=2),        # Dimensionality reduction
-    Conv2D(16, 32, kernel_size=3),   # Higher-level features
-    ReLU(),                          # More nonlinearity
-    Flatten(),                       # Prepare for dense layers
-    Dense(32 * 13 * 13, 128),        # Feature integration
-    ReLU(),
-    Dense(128, 10),                  # Classification head
-    Sigmoid()                        # Probability outputs
-])
-
-# End-to-end image classification
-image_batch = Tensor([[[[...]]]])  # Batch of images
-predictions = cnn(image_batch)     # Class probabilities
+class MaxPool2D:
+    """Max pooling for spatial downsampling and translation invariance.
+    
+    Takes maximum value in each local region:
+    - Reduces spatial dimensions while preserving important features
+    - Provides invariance to small translations
+    - Reduces computation in later layers
+    """
+    def __init__(self, kernel_size=2, stride=None):
+        self.kernel_size = kernel_size
+        self.stride = stride or kernel_size
+    
+    def forward(self, x):
+        batch, channels, H, W = x.shape
+        kh, kw = self.kernel_size, self.kernel_size
+        
+        out_h = (H - kh) // self.stride + 1
+        out_w = (W - kw) // self.stride + 1
+        
+        output = Tensor(shape=(batch, channels, out_h, out_w))
+        for b in range(batch):
+            for c in range(channels):
+                for i in range(out_h):
+                    for j in range(out_w):
+                        i_start = i * self.stride
+                        j_start = j * self.stride
+                        patch = x[b, c, i_start:i_start+kh, j_start:j_start+kw]
+                        output[b, c, i, j] = patch.max()
+        
+        return output
 ```
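
The module also lists AvgPool2D, which has no code in this guide. As a rough sketch under stated assumptions (single-channel nested lists instead of the `Tensor` class, and a hypothetical helper named `pool2d`), both pooling modes and the small-shift invariance of max pooling can be illustrated like this:

```python
def pool2d(image, kernel=2, stride=None, mode="max"):
    """Pool a single-channel 2D nested list with a square window.

    mode="max" keeps the largest value per window; mode="avg" keeps the mean.
    """
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    stride = stride or kernel
    out_h = (len(image) - kernel) // stride + 1
    out_w = (len(image[0]) - kernel) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                image[i * stride + di][j * stride + dj]
                for di in range(kernel)
                for dj in range(kernel)
            ]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        output.append(row)
    return output


if __name__ == "__main__":
    # The same activation, shifted by one pixel inside the top-left 2x2 window
    a = [[0, 0, 0, 0], [0, 5, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    b = [[5, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    print(pool2d(a) == pool2d(b))   # max pooling ignores the shift within a window
    print(pool2d(a, mode="avg"))
```

The invariance only holds for shifts that stay inside one pooling window; a shift across a window boundary moves the activation to a different output cell.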

-### Convolution Operation Details
- **Sliding Window**: Filter moves across input to detect local patterns
- **Weight Sharing**: Same filter applied everywhere for translation invariance
- **Local Connectivity**: Each output depends only on local input region
- **Feature Maps**: Multiple filters learn different feature detectors
+**Complete CNN Architecture**
+```python
+class SimpleCNN:
+    """CNN for CIFAR-10 classification.
+    
+    Architecture:
+        Conv(3→32, 3x3) → ReLU → MaxPool(2x2)    # 32x32 → 16x16
+        Conv(32→64, 3x3) → ReLU → MaxPool(2x2)   # 16x16 → 8x8
+        Flatten → Dense(64*8*8 → 128) → ReLU
+        Dense(128 → 10)                          # logits; forward() applies no softmax
+    """
+    def __init__(self):
+        self.conv1 = Conv2D(3, 32, kernel_size=3, padding=1)
+        self.relu1 = ReLU()
+        self.pool1 = MaxPool2D(kernel_size=2)
+        
+        self.conv2 = Conv2D(32, 64, kernel_size=3, padding=1)
+        self.relu2 = ReLU()
+        self.pool2 = MaxPool2D(kernel_size=2)
+        
+        self.flatten = Flatten()
+        self.fc1 = Linear(64 * 8 * 8, 128)
+        self.relu3 = ReLU()
+        self.fc2 = Linear(128, 10)
+    
+    def forward(self, x):
+        # Feature extraction
+        x = self.pool1(self.relu1(self.conv1(x)))  # (B, 32, 16, 16)
+        x = self.pool2(self.relu2(self.conv2(x)))  # (B, 64, 8, 8)
+        
+        # Classification
+        x = self.flatten(x)                        # (B, 4096)
+        x = self.relu3(self.fc1(x))               # (B, 128)
+        x = self.fc2(x)                           # (B, 10)
+        return x
+```
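
The shape comments and the layer sizes in `SimpleCNN` can be checked with plain arithmetic. The sketch below is illustrative only; the helper names (`conv_out`, `conv_params`, `dense_params`, `trace_simple_cnn`) are hypothetical, and it assumes each layer has one bias per output channel or unit, as in the `Conv2D` class above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pool layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(in_ch, out_ch, kernel):
    """Weights plus one bias per output channel."""
    return out_ch * in_ch * kernel * kernel + out_ch


def dense_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


def trace_simple_cnn(size=32):
    """Return (layer, channels, spatial size) for the feature extractor."""
    shapes = [("input", 3, size)]
    size = conv_out(size, 3, padding=1)
    shapes.append(("conv1", 32, size))
    size = conv_out(size, 2, stride=2)
    shapes.append(("pool1", 32, size))
    size = conv_out(size, 3, padding=1)
    shapes.append(("conv2", 64, size))
    size = conv_out(size, 2, stride=2)
    shapes.append(("pool2", 64, size))
    return shapes


def simple_cnn_param_count():
    _, channels, size = trace_simple_cnn()[-1]
    flat = channels * size * size
    return (
        conv_params(3, 32, 3)
        + conv_params(32, 64, 3)
        + dense_params(flat, 128)
        + dense_params(128, 10)
    )


if __name__ == "__main__":
    for name, channels, size in trace_simple_cnn():
        print(f"{name:6s} ({channels}, {size}, {size})")
    print("total parameters:", simple_cnn_param_count())
```

Nearly all of the parameters sit in the first dense layer (4096 → 128), while the two convolutional layers together contribute under 20K.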

-### CNN Building Blocks
- **Conv2D Layer**: Core convolution operation with learnable filters
- **Pooling Layers**: MaxPool and AvgPool for spatial downsampling
- **Flatten Layer**: Converts 2D feature maps to 1D for dense layers
- **Complete Networks**: Integration with existing Dense and activation layers
+### Step-by-Step Implementation

-## 🚀 Getting Started
+1. **Implement Conv2D Forward Pass**
+   - Create sliding window iteration over spatial dimensions
+   - Apply weight sharing: same filter at all positions
+   - Handle batch processing efficiently
+   - Verify output shape calculation

-### Prerequisites
-Ensure you have mastered the foundational network building blocks:
+2. **Build Pooling Operations**
+   - Implement MaxPool2D with maximum extraction
+   - Add AvgPool2D for average pooling
+   - Handle stride and kernel size correctly
+   - Test spatial dimension reduction

+3. **Create Flatten Layer**
+   - Convert (B, C, H, W) to (B, C*H*W)
+   - Prepare spatial features for dense layers
+   - Preserve batch dimension
+   - Enable gradient flow backward
+
+4. **Design Complete CNN**
+   - Stack Conv → ReLU → Pool blocks for feature extraction
+   - Add Flatten → Dense for classification
+   - Calculate dimensions at each layer
+   - Test end-to-end forward pass
+
+5. **Train on CIFAR-10**
+   - Load CIFAR-10 using Module 08's DataLoader
+   - Train with cross-entropy loss and SGD
+   - Track accuracy on test set
+   - Achieve >75% accuracy
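
Step 3 above (the Flatten layer) can be sketched without any tensor library. This is a minimal illustration on nested lists, assuming row-major ordering over (C, H, W); `flatten` and `flat_index` are hypothetical names, and the real layer would operate on the module's `Tensor` class and also handle the backward pass.

```python
def flatten(batch):
    """Convert nested lists shaped (B, C, H, W) to (B, C*H*W), keeping the batch axis."""
    return [
        [value for channel in sample for row in channel for value in row]
        for sample in batch
    ]


def flat_index(c, h, w, height, width):
    """Position of element (c, h, w) inside one flattened sample."""
    return c * height * width + h * width + w


if __name__ == "__main__":
    B, C, H, W = 2, 2, 2, 3
    # Encode each element's coordinates in its value so the mapping is easy to inspect
    batch = [
        [[[1000 * b + 100 * c + 10 * h + w for w in range(W)] for h in range(H)]
         for c in range(C)]
        for b in range(B)
    ]
    flat = flatten(batch)
    print(len(flat), len(flat[0]))                # batch size and C*H*W
    print(flat[1][flat_index(1, 0, 2, H, W)])     # same value as batch[1][1][0][2]
```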
+
+## Testing
+
+### Inline Tests (During Development)
+
+Run inline tests while building:
 ```bash
-# Activate TinyTorch environment
-source bin/activate-tinytorch.sh
-
-# Verify all prerequisite modules
-tito test --module tensor
-tito test --module activations
-tito test --module layers
-tito test --module networks
+cd modules/source/09_spatial
+python spatial_dev.py
 ```

-### Development Workflow
-1. **Open the development file**: `modules/source/06_cnn/cnn_dev.py`
-2. **Implement convolution operation**: Start with explicit for-loop implementation for understanding
-3. **Build Conv2D layer class**: Wrap convolution in reusable layer interface
-4. **Add pooling operations**: Implement MaxPool and AvgPool for spatial reduction
-5. **Create complete CNNs**: Compose layers into full computer vision architectures
-6. **Export and verify**: `tito export --module cnn && tito test --module cnn`
-
-## 🧪 Testing Your Implementation
-
-### Comprehensive Test Suite
-Run the full test suite to verify computer vision functionality:
-
-```bash
-# TinyTorch CLI (recommended)
-tito test --module cnn
-
-# Direct pytest execution
-python -m pytest tests/ -k cnn -v
+Expected output:
 ```
-
-### Test Coverage Areas
- ✅ **Convolution Operation**: Verify sliding window operation and local connectivity
- ✅ **Filter Learning**: Test weight initialization and parameter management
- ✅ **Shape Transformations**: Ensure proper input/output shape handling
- ✅ **Pooling Operations**: Verify spatial downsampling and feature preservation
- ✅ **CNN Integration**: Test complete networks with real image-like data
-
-### Inline Testing & Visualization
-The module includes comprehensive educational feedback and visual analysis:
-```python
-# Example inline test output
-🔬 Unit Test: Conv2D implementation...
-✅ Convolution sliding window works correctly
-✅ Weight sharing applied consistently
+Unit Test: Conv2D implementation...
+✅ Sliding window convolution works correctly
+✅ Weight sharing applied at all positions
 ✅ Output shapes match expected dimensions
-📈 Progress: Conv2D ✓
+Progress: Conv2D ✓

-# Visualization feedback
-📊 Visualizing convolution operation...
-📈 Showing filter sliding across input
-📊 Feature map generation: 3→16 channels
+Unit Test: MaxPool2D implementation...
+✅ Maximum extraction works correctly
+✅ Spatial dimensions reduced properly
+✅ Translation invariance verified
+Progress: Pooling ✓
+
+Unit Test: Complete CNN architecture...
+✅ Forward pass through all layers successful
+✅ Output shape: (32, 10) for 10 classes
+✅ Parameter count reasonable: ~500K parameters
+Progress: CNN Architecture ✓
 ```

-### Manual Testing Examples
-```python
-from tinytorch.core.tensor import Tensor
-from cnn_dev import Conv2D, MaxPool2D, Flatten
-from activations_dev import ReLU
+### Export and Validate

-# Test basic convolution
-conv = Conv2D(in_channels=1, out_channels=4, kernel_size=3)
-input_img = Tensor([[[[1, 2, 3, 4, 5],
-                      [6, 7, 8, 9, 10],
-                      [11, 12, 13, 14, 15],
-                      [16, 17, 18, 19, 20],
-                      [21, 22, 23, 24, 25]]]])
-feature_maps = conv(input_img)
-print(f"Input: {input_img.shape}, Features: {feature_maps.shape}")
+After completing the module:
+```bash
+# Export to tinytorch package
+tito export 09_spatial

-# Test complete CNN pipeline
-relu = ReLU()
-pool = MaxPool2D(kernel_size=2)
-flatten = Flatten()
-
-# Forward pass through CNN layers
-activated = relu(feature_maps)
-pooled = pool(activated)
-flattened = flatten(pooled)
-print(f"Final shape: {flattened.shape}")
+# Run integration tests
+tito test 09_spatial
 ```

-## 🎯 Key Concepts
+### CIFAR-10 Training Test

-### Real-World Applications
- **Image Classification**: CNNs power systems like ImageNet winners (AlexNet, ResNet, EfficientNet)
- **Object Detection**: YOLO and R-CNN families use CNN backbones for feature extraction
- **Medical Imaging**: CNNs analyze X-rays, MRIs, and CT scans for diagnostic assistance
- **Autonomous Vehicles**: CNN-based perception systems process camera feeds for navigation
+```bash
+# Train simple CNN on CIFAR-10
+python tests/integration/test_cnn_cifar10.py

-### Computer Vision Fundamentals
- **Translation Invariance**: Convolution detects patterns regardless of position in image
- **Hierarchical Features**: Early layers detect edges, later layers detect objects and concepts
- **Parameter Efficiency**: Weight sharing dramatically reduces parameters compared to dense layers
- **Spatial Structure**: CNNs preserve and leverage 2D spatial relationships in images
-
-### Convolution Mathematics
- **Sliding Window Operation**: Filter moves across input with stride and padding parameters
- **Cross-Correlation vs Convolution**: Deep learning typically uses cross-correlation operation
- **Feature Map Computation**: Output[i,j] = sum(input[i:i+k, j:j+k] * filter)
- **Receptive Field**: Region of input that influences each output activation
-
-### CNN Architecture Patterns
- **Feature Extraction**: Convolution + ReLU + Pooling blocks extract hierarchical features
- **Classification Head**: Flatten + Dense layers perform final classification
- **Progressive Filtering**: Increasing filter count with decreasing spatial dimensions
- **Skip Connections**: Advanced architectures add residual connections for deeper networks
-
-## 🎉 Ready to Build?
-
-You're about to implement the technology that revolutionized computer vision! CNNs transformed image processing from hand-crafted features to learned representations, enabling everything from photo tagging to medical diagnosis to autonomous driving.
-
-Understanding convolution from the ground up—implementing the sliding window operation yourself—will give you deep insight into why CNNs work so well for visual tasks. Take your time with the core operation, visualize what's happening, and enjoy building the foundation of modern computer vision!
-
-
-
-
-Choose your preferred way to engage with this module:
-
-````{grid} 1 2 3 3
-
-```{grid-item-card} 🚀 Launch Binder
-:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/source/06_spatial/spatial_dev.ipynb
-:class-header: bg-light
-
-Run this module interactively in your browser. No installation required!
+# Expected results:
+# - Epoch 1: 35% accuracy
+# - Epoch 5: 60% accuracy
+# - Epoch 10: 75% accuracy
 ```

-```{grid-item-card} ⚡ Open in Colab  
-:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/source/06_spatial/spatial_dev.ipynb
-:class-header: bg-light
-
-Use Google Colab for GPU access and cloud compute power.
-```
-
-```{grid-item-card} 📖 View Source
-:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/source/06_spatial/spatial_dev.py
-:class-header: bg-light
-
-Browse the Python source code and understand the implementation.
-```
-
-````
-
-```{admonition} 💾 Save Your Progress
-:class: tip
-**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work.
+## Where This Code Lives

 ```
+tinytorch/
+├── nn/
+│   └── spatial.py              # Conv2D, MaxPool2D, etc.
+└── __init__.py                 # Exposes CNN components
+
+Usage in other modules:
+>>> from tinytorch.nn import Conv2D, MaxPool2D
+>>> conv = Conv2D(3, 32, kernel_size=3)
+>>> pool = MaxPool2D(kernel_size=2)
+```
+
+## Systems Thinking Questions
+
+1. **Parameter Efficiency**: A Conv2D(3, 32, 3) has ~900 parameters (3·3·3·32 weights plus 32 biases). How many parameters would a Dense layer need to connect a 32x32 RGB image to 32 outputs? Why is this difference critical for scaling?
+
+2. **Translation Invariance**: Why does a CNN detect a cat regardless of whether it's in the top-left or bottom-right of an image? How does weight sharing enable this property?
+
+3. **Hierarchical Features**: Early CNN layers detect edges and textures. Later layers detect objects and faces. How does this emerge from stacking convolutions? Why doesn't this happen in dense networks?
+
+4. **Receptive Field Growth**: A single Conv2D(kernel=3) sees a 3x3 region. After two Conv2D layers, what region does each output see? How do deep CNNs build global context from local operations?
+
+5. **Compute vs Memory Trade-offs**: A single large kernel (7x7) has more parameters and more multiply-adds per output than three stacked 3x3 kernels covering the same receptive field, but the stack adds depth and must store more intermediate activations. Which is better and why?
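
Answers to questions 4 and 5 can be checked numerically. The sketch below is one way to do that, not a reference solution: `receptive_field` and `conv_weight_count` are hypothetical helpers, layers are described as `(kernel, stride)` pairs, and the weight count ignores biases and assumes the same channel count in and out.

```python
def receptive_field(layers):
    """Input region (along one axis) seen by one output of a layer stack.

    `layers` is a list of (kernel, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def conv_weight_count(kernel, channels, depth=1):
    """Weights in `depth` stacked conv layers with `channels` in and out."""
    return depth * kernel * kernel * channels * channels


if __name__ == "__main__":
    print(receptive_field([(3, 1)]))                   # one 3x3 conv
    print(receptive_field([(3, 1), (3, 1)]))           # two stacked 3x3 convs
    print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # conv, 2x2 pool, conv
    print(conv_weight_count(7, 64), conv_weight_count(3, 64, depth=3))
```

Strided layers such as pooling multiply the `jump`, which is why receptive fields grow much faster in networks that downsample between convolutions.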
+
+## Real-World Connections
+
+### Industry Applications
+
+**Autonomous Vehicles (Tesla, Waymo)**
+- Multi-camera CNN systems process 30 FPS at 1920x1200 resolution
+- Feature maps from CNNs feed into object detection and segmentation
+- Real-time requirements demand efficient Conv2D implementations
+
+**Medical Imaging (PathAI, Zebra Medical)**
+- CNNs analyze X-rays and CT scans for diagnostic assistance
+- Achieve superhuman performance on specific tasks (diabetic retinopathy detection)
+- Architecture design critical for accuracy-interpretability trade-off
+
+**Face Recognition (Apple Face ID, Facebook DeepFace)**
+- CNN embeddings enable accurate face matching at billion-user scale
+- Lightweight CNN architectures run on mobile devices in real-time
+- Privacy concerns drive on-device processing
+
+### Research Impact
+
+This module implements patterns from:
+- LeNet-5 (1998): First successful CNN for digit recognition
+- AlexNet (2012): Sparked deep learning revolution with CNNs + GPUs
+- VGG (2014): Showed deeper is better with simple 3x3 convolutions
+- ResNet (2015): Enabled training 152-layer CNNs with skip connections
+
+## What's Next?
+
+In **Module 10: Tokenization**, you'll shift from processing images to processing text:
+
+- Learn how to convert text into numerical representations
+- Implement tokenization strategies (character, word, subword)
+- Build vocabulary management systems
+- Prepare text data for transformers in Module 13
+
+This completes the vision half of the Intelligence Tier. Next, you'll tackle language!

 ---

-<div class="prev-next-area">
-<a class="left-prev" href="../chapters/05_dense.html" title="previous page">← Previous Module</a>
-<a class="right-next" href="../chapters/07_attention.html" title="next page">Next Module →</a>
-</div>
+**Ready to build CNNs from scratch?** Open `modules/source/09_spatial/spatial_dev.py` and start implementing.