# Milestone 04: The CNN Revolution (1998)
## Historical Context
After backpropagation revived neural networks (1986), researchers still struggled with image recognition. MLPs treat every pixel as an independent input, requiring millions of parameters and ignoring spatial structure.
Then in 1998, Yann LeCun's LeNet-5 revolutionized computer vision with Convolutional Neural Networks (CNNs). By using:
- Shared weights (convolution) → 100× fewer parameters
- Local connectivity → preserves spatial structure
- Pooling → translation invariance
LeNet achieved 99%+ accuracy on handwritten digits, launching the deep learning revolution that led to ImageNet (2012), object detection, and modern computer vision.
## What You're Building
CNNs that exploit spatial structure in images:
- TinyDigits - Prove convolution works on 8×8 digits
- CIFAR-10 - Scale to natural color images (32×32)
## Required Modules
Run after Module 09 (Spatial operations: Conv2d + Pooling)
| Module | Component | What It Provides |
|---|---|---|
| Module 01 | Tensor | YOUR data structure |
| Module 02 | Activations | YOUR ReLU activation |
| Module 03 | Layers | YOUR Linear layers |
| Module 04 | Losses | YOUR CrossEntropyLoss |
| Module 05 | Autograd | YOUR automatic differentiation |
| Module 06 | Optimizers | YOUR SGD/Adam optimizers |
| Module 07 | Training | YOUR end-to-end training loop |
| Module 08 | DataLoader | YOUR data batching |
| Module 09 | Spatial | YOUR Conv2d + MaxPool2d |
## Milestone Structure
This milestone follows a spatial architecture progression through two scripts:
### 01_lecun_tinydigits.py
Purpose: Prove CNNs > MLPs on the same data
- Dataset: TinyDigits (8×8 handwritten digits)
- Architecture: Conv(1→8) → Pool → Conv(8→16) → Pool → Linear(→10) (sketched below)
- Comparison: CNN ~90% vs MLP ~80% (Milestone 03)
- Key Learning: "Convolution preserves spatial structure!"
Why This Comparison Matters:
- Same dataset, different architecture
- Direct proof that spatial operations help
- ~10% accuracy gain from exploiting locality
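Here is a minimal sketch of that architecture. It assumes PyTorch-style constructor signatures for the Conv2d, MaxPool2d, ReLU, and Linear modules YOU built in Modules 02-09; the 3×3 kernels and padding of 1 are illustrative choices, so adapt everything to your actual API:

```python
# Sketch of the 01_lecun_tinydigits.py model. Signatures like
# Conv2d(in, out, kernel_size, padding) are assumptions; import the
# modules from wherever your framework lives.

class TinyDigitsCNN:
    def __init__(self):
        self.conv1 = Conv2d(1, 8, kernel_size=3, padding=1)   # (N,1,8,8) → (N,8,8,8)
        self.pool1 = MaxPool2d(2)                             # (N,8,8,8) → (N,8,4,4)
        self.conv2 = Conv2d(8, 16, kernel_size=3, padding=1)  # (N,8,4,4) → (N,16,4,4)
        self.pool2 = MaxPool2d(2)                             # (N,16,4,4) → (N,16,2,2)
        self.fc = Linear(16 * 2 * 2, 10)                      # 64 features → 10 logits
        self.relu = ReLU()

    def forward(self, x):
        x = self.pool1(self.relu(self.conv1(x)))
        x = self.pool2(self.relu(self.conv2(x)))
        x = x.reshape(x.shape[0], -1)  # flatten (N,16,2,2) → (N,64) for the classifier
        return self.fc(x)
```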
### 02_lecun_cifar10.py
Purpose: Scale to natural color images
- Dataset: CIFAR-10 (60K images, 32×32 RGB, 10 classes)
- Architecture: Deeper CNN with multiple conv blocks (see the shape trace below)
- Expected: 65-75% accuracy (decent for pure Python!)
- Key Learning: "CNNs scale to realistic vision tasks!"
Historical Note: CIFAR-10 (2009) became the benchmark for evaluating CNN architectures before ImageNet.
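To get a feel for how "deeper" plays out on 32×32 inputs, here is a back-of-the-envelope shape and parameter trace. The block layout (3×3 convs with padding 1, each followed by 2×2 pooling) and the channel counts are illustrative assumptions, not the script's exact configuration:

```python
# Trace feature-map sizes and parameter counts through three assumed
# conv blocks on 32×32 RGB input. Runs with plain Python.

blocks = [(3, 32), (32, 64), (64, 128)]  # (in_channels, out_channels) per block
size, params = 32, 0                     # CIFAR-10 images start at 32×32
for c_in, c_out in blocks:
    params += c_in * c_out * 3 * 3 + c_out   # 3×3 conv weights + biases
    size //= 2                               # 2×2 max-pool halves each spatial dim
    print(f"block {c_in}→{c_out}: feature map {c_out}×{size}×{size}")

params += blocks[-1][1] * size * size * 10 + 10  # final Linear(→10)
print(f"total parameters ≈ {params:,}")          # ≈ 114K (tiny by modern standards)
```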
## Expected Results
| Script | Train Set | Image Size | Architecture | Accuracy | Training Time | vs MLP |
|---|---|---|---|---|---|---|
| 01 (TinyDigits) | 1K images | 8×8 gray | Simple CNN | ~90% | 5-7 min | +10% improvement |
| 02 (CIFAR-10) | 50K images | 32×32 RGB | Deeper CNN | 65-75% | 30-60 min | MLPs struggle here |
## Key Learning: Why Convolution Dominates Vision
CNNs exploit three key principles:
### 1. Local Connectivity
- MLP: Every pixel connects to every neuron (millions of parameters)
- CNN: Only local regions connect (shared filters, ~100× fewer params)
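A concrete count makes the gap vivid. The sizes below (32×32 RGB input, a 256-unit hidden layer for the MLP, one conv layer of 32 3×3 filters) are illustrative assumptions; the exact ratio depends on the layer widths you choose:

```python
# First-layer parameter count: dense vs. convolutional.

pixels = 32 * 32 * 3                 # flattened CIFAR-10-sized input
mlp_params = pixels * 256 + 256      # every pixel connects to every hidden unit
cnn_params = 32 * (3 * 3 * 3) + 32   # 32 shared 3×3×3 filters + biases

print(f"MLP first layer : {mlp_params:,}")      # 786,688
print(f"Conv first layer: {cnn_params:,}")      # 896
print(f"ratio: ~{mlp_params // cnn_params}×")   # ~878× fewer parameters
```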
### 2. Translation Invariance
MLP: "Cat in top-left" ≠ "Cat in bottom-right" (different weights!) CNN: Same filter detects features anywhere (shared weights)
### 3. Hierarchical Features
- Layer 1: Edge detectors (vertical, horizontal, diagonal)
- Layer 2: Texture patterns (combinations of edges)
- Layer 3: Object parts (wheels, faces, legs)
- Output: Full objects (cars, cats, planes)
This is why CNNs remained state-of-the-art for vision until Vision Transformers (2020)!
## Running the Milestone
```bash
cd milestones/04_1998_cnn

# Step 1: Prove CNNs > MLPs (run after Module 09)
python 01_lecun_tinydigits.py

# Step 2: Scale to natural images (run after Module 09)
python 02_lecun_cifar10.py
```
## Further Reading
- LeNet-5 Paper: LeCun et al. (1998). "Gradient-based learning applied to document recognition"
- CIFAR-10: Krizhevsky (2009). "Learning Multiple Layers of Features from Tiny Images"
- ImageNet Moment: Krizhevsky et al. (2012). "ImageNet Classification with Deep CNNs" (AlexNet)
- Convolution Arithmetic: Dumoulin & Visin (2016). "A guide to convolution arithmetic for deep learning"
## Achievement Unlocked
After completing this milestone, you'll understand:
- Why convolution works better than dense layers for images
- How local connectivity + weight sharing reduce parameters
- What CNNs learn at each layer (edges → textures → parts → objects)
- Why spatial operations dominated vision until transformers
You've recreated the architecture that launched modern computer vision!
Note for Next Milestone: CNNs excel at vision, but what about sequences (text, audio, time series)? Milestone 05 introduces Transformers, the architecture that unified vision AND language!