mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-04-30 21:47:30 -05:00

Vijay Janapa Reddi a9fed98b66 Clean up repository: remove temp files, organize modules, prepare for PyPI publication

- Removed temporary test files and audit reports
- Deleted backup and temp_holding directories
- Reorganized module structure (07->09 spatial, 09->07 dataloader)
- Added new modules: 11-14 (tokenization, embeddings, attention, transformers)
- Updated examples with historical ML milestones
- Cleaned up documentation structure

2025-09-24 10:13:37 -04:00

🎯 TinyTorch Master Plan V2: Minimal Viable Learning

Build ML Systems Through Implementation, Not Over-Engineering

Core Philosophy

Build JUST ENOUGH to understand WHY PyTorch works the way it does.

Students implement minimal but complete systems that demonstrate core algorithmic and engineering concepts underlying modern AI frameworks.

📚 15-Module Curriculum: From Tensors to Transformers

PHASE 1: MINIMAL WORKING NETWORK (Modules 1-4)

Milestone: XOR network inference in 4 modules

Module	Name	What Students Build (Just Enough)	Engineering Concepts to Emphasize
1	Setup	• Virtual environment setup • Basic memory profiler (tracemalloc) • Simple test runner	• Development environment = foundation • Measure before optimizing • Reproducible environments
2	Tensor	• Basic Tensor class with .data • Shape, dtype properties • Essential ops: +, -, *, / • Basic indexing [i, j]	• Memory layout (row vs column major) • Views vs copies demonstration • NumPy vectorization = 10-100x speedup • O(N) memory scaling
3	Activations	• ReLU, Sigmoid (forward only) • Broadcasting for element-wise ops • XOR impossibility proof	• Nonlinearity = expressive power beyond linear maps • Broadcasting memory implications • Numerical stability (sigmoid overflow) • Why linear networks can't learn XOR
4	Layers	• Parameter class (tensor + grad flag) • Linear layer (W·x + b) • Sequential container • Forward pass only	• Matrix multiplication O(N³) • Parameter memory quadratic scaling • Composition enables depth • Memory per layer analysis
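The "sigmoid overflow" item in Module 3 can be illustrated with a small sketch. The function names `naive_sigmoid` and `stable_sigmoid` are hypothetical and not taken from the TinyTorch codebase; the example uses plain Python floats rather than a Tensor class.

```python
import math

def naive_sigmoid(x: float) -> float:
    # Direct formula: math.exp(-x) overflows when x is a large negative number.
    return 1.0 / (1.0 + math.exp(-x))

def stable_sigmoid(x: float) -> float:
    # Choose the algebraically equivalent form whose exponent is never positive,
    # so exp() can underflow to 0.0 but cannot overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

if __name__ == "__main__":
    print("stable:", stable_sigmoid(-1000.0))
    try:
        naive_sigmoid(-1000.0)
    except OverflowError as err:
        print("naive version failed:", err)
```

Both functions agree for moderate inputs; they differ only in how they behave at the extremes.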

🎯 Phase 1 Milestone: Run XOR network inference

# Students can execute:
net = Sequential([Linear(2,4), ReLU(), Linear(4,1)])
output = net(xor_input)  # Forward pass runs with untrained weights (no training yet)
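A forward-only version of this milestone might look like the sketch below. It is an assumption-laden illustration, not the TinyTorch implementation: it works on plain Python lists, and this `Linear` takes explicit weights and biases (a hypothetical signature) instead of `Linear(2, 4)`, so that hand-picked weights can show a 2-4-1 ReLU network representing XOR.

```python
class Linear:
    """Forward-only affine layer y = W·x + b on plain Python lists."""
    def __init__(self, weight, bias):
        self.weight = weight  # shape (out_features, in_features)
        self.bias = bias      # shape (out_features,)

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

class ReLU:
    def __call__(self, x):
        return [max(0.0, v) for v in x]

class Sequential:
    """Composition: feed each layer's output into the next layer."""
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Hand-picked weights (illustrative): h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1),
# output = h1 - 2*h2. The last two hidden units are unused and set to zero.
net = Sequential([
    Linear([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
           [0.0, -1.0, 0.0, 0.0]),
    ReLU(),
    Linear([[1.0, -2.0, 0.0, 0.0]], [0.0]),
])

if __name__ == "__main__":
    for a in (0.0, 1.0):
        for b in (0.0, 1.0):
            print(a, b, net([a, b])[0])
```

Removing the `ReLU()` layer collapses the two `Linear` layers into a single affine map, which is the reason a purely linear network cannot represent XOR.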

PHASE 2: INTELLIGENT LEARNING (Modules 5-8)

Milestone: Self-training XOR network with 100% accuracy

Module	Name	What Students Build (Just Enough)	Engineering Concepts to Emphasize
5	Autograd	• Computational graph nodes • Chain rule implementation • Backward for +, *, Linear • Gradient accumulation	• Memory explosion during backprop • Reverse-mode AD efficiency • Graph retention = memory cost • O(N) memory for gradients
6	Losses	• MSE Loss (for XOR) • CrossEntropy (preview) • loss.backward() integration	• Scalar loss enables backprop • Loss choice affects convergence • Gradient magnitude analysis
7	Optimizers	• SGD only (w = w - lr*grad) • Parameter update loop • Gradient zeroing	• Learning rate = critical hyperparameter • Why zero gradients (accumulation bug) • O(parameters) update cost
8	Training	• Basic train() function • Forward→loss→backward→step • Simple validation loop	• Training memory = activations + gradients • Train vs eval modes • Gradient accumulation for memory
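The Module 5 items (graph nodes, chain rule, backward for + and *, gradient accumulation) can be sketched on scalars. The `Value` class below is a hypothetical illustration, not the TinyTorch autograd design; it assumes scalar-only operands and omits the Linear backward.

```python
class Value:
    """Scalar graph node: holds data, grad, parents, and a local backward rule."""
    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = 1 and d(a+b)/db = 1; accumulate with +=
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b and d(a*b)/db = a; accumulate with +=
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Reverse-mode AD: topologically sort the graph, then apply the
        # chain rule from the output back toward the leaves.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# w is used twice, so its gradient is the sum of two paths (accumulation).
w, x, b = Value(3.0), Value(2.0), Value(1.0)
loss = (w * x + b) * w
loss.backward()
```

Every intermediate `Value` stays alive until `backward()` finishes, which is a small-scale view of the "graph retention = memory cost" point.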

🎯 Phase 2 Milestone: Train XOR to convergence

# Students watch learning happen:
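for epoch in range(100):
    optimizer.zero_grad()  # Clear gradients left over from the previous step
    pred = net(X)
    loss = mse_loss(pred, y)
    loss.backward()  # Autograd computes a gradient for every parameter
    optimizer.step()  # SGD update: w = w - lr * grad
    print(f"Epoch {epoch}: Loss = {loss.data}")
# Loss: 1.0 → 0.01 (network learned!)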
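Module 7 lists gradient zeroing and the "accumulation bug" it prevents. A minimal sketch of that bug follows, assuming a single scalar parameter and a hand-written gradient; `Param`, `backward`, and `train` are hypothetical names for illustration only.

```python
class Param:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

def backward(param, x, y):
    # For loss = (param*x - y)**2, dloss/dparam = 2*(param*x - y)*x.
    # Like an autograd engine, this ADDS into .grad instead of overwriting it.
    param.grad += 2.0 * (param.value * x - y) * x

def train(steps, lr, zero_grad):
    w = Param(0.0)
    for _ in range(steps):
        if zero_grad:
            w.grad = 0.0            # the step the buggy loop forgets
        backward(w, x=1.0, y=2.0)   # target solution is w = 2.0
        w.value -= lr * w.grad      # SGD update: w = w - lr * grad
    return w.value

if __name__ == "__main__":
    print("with zeroing:   ", train(50, 0.1, zero_grad=True))
    print("without zeroing:", train(50, 0.1, zero_grad=False))
```

With zeroing, the parameter settles near 2.0. Without it, stale gradients from earlier steps keep being re-applied, so the parameter keeps overshooting the target instead of settling.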

PHASE 3: REAL DATA MASTERY (Modules 9-12)

Milestone: MNIST CNN with >95% accuracy

Module	Name	What Students Build (Just Enough)	Engineering Concepts to Emphasize
9	Spatial	• Conv2d (simple, unoptimized) • MaxPool2d • Flatten layer • Basic CNN architecture	• Conv memory O(batch×C×H×W×K²) • Pooling shrinks activations (2×2 pool = 4× fewer values) and adds no parameters • Receptive field growth • Why CNNs for images
10	DataLoader	• Dataset class for MNIST • Basic batch iteration • Simple preprocessing	• I/O bottlenecks from disk • Batch size vs memory tradeoff • Why preprocessing matters • Data pipeline optimization
11	Advanced Opt	• Adam optimizer • CrossEntropy loss • Image training loop • Validation metrics	• Adam = 3× parameter memory (weights + two moment buffers) • Adaptive learning rates • Momentum accumulation cost • Validation prevents overfitting
12	Production	• Model checkpointing • Early stopping • Learning rate decay • Accuracy tracking	• Checkpoint size = model params • Early stopping as regularization • LR scheduling for convergence • Metric computation cost

🎯 Phase 3 Milestone: MNIST digit recognition

# Real computer vision:
cnn = Sequential([
    Conv2d(1, 16, 3), ReLU(), MaxPool2d(2),   # 1×28×28 → 16×26×26 → 16×13×13
    Conv2d(16, 32, 3), ReLU(), MaxPool2d(2),  # → 32×11×11 → 32×5×5
    Flatten(), Linear(32*5*5, 10)             # 800 features → 10 digit classes
])
trainer.fit(mnist_train, epochs=5)
accuracy = evaluate(mnist_test)  # >95%!
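The `32*5*5` input size of the final `Linear` layer follows from shape arithmetic. The sketch below assumes 28×28 MNIST inputs, stride 1 and no padding for the convolutions, and a pooling stride equal to the pooling window (the PyTorch defaults); the helper names are hypothetical.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    # Standard output-size formula for one spatial dimension.
    return (size + 2 * padding - kernel) // stride + 1

def maxpool2d_out(size, kernel):
    # Assumes stride == kernel, so the output is floor(size / kernel).
    return size // kernel

def mnist_flatten_features(side=28, channels=32):
    side = conv2d_out(side, 3)      # 28 -> 26
    side = maxpool2d_out(side, 2)   # 26 -> 13
    side = conv2d_out(side, 3)      # 13 -> 11
    side = maxpool2d_out(side, 2)   # 11 -> 5
    return channels * side * side   # 32 * 5 * 5

if __name__ == "__main__":
    print(mnist_flatten_features())
```

If padding or stride were changed in either `Conv2d`, the `Linear` input size would have to be recomputed the same way.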

PHASE 4: MODERN AI (Modules 13-15)

Milestone: TinyGPT text generation

Module	Name	What Students Build (Just Enough)	Engineering Concepts to Emphasize
13	Attention	• Scaled dot-product attention • Single-head Q,K,V • Causal masking • Position encoding	• O(N²) memory scaling • Sequence length bottlenecks • Causal masks prevent leakage • Why attention > recurrence
14	Transformers	• Multi-head attention • LayerNorm • Transformer block • GPT architecture	• Multi-head = parallel attention • LayerNorm vs BatchNorm • Residuals mitigate vanishing gradients • Layer memory accumulation
15	Generation	• Character tokenization • Embedding layers • Autoregressive generation • Temperature sampling	• Sequential inference cost • Embedding lookup efficiency • Generation memory patterns • Temperature controls diversity
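Module 13's single-head scaled dot-product attention with causal masking can be sketched in pure Python. This is an illustration under stated assumptions, not the TinyTorch API: `Q`, `K`, and `V` are lists of equal-length vectors, and the function name `causal_attention` is hypothetical.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Return (outputs, weights) for single-head causal attention."""
    n, d = len(Q), len(Q[0])
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            if j > i:
                # Causal mask: position i may not look at future positions.
                scores.append(float("-inf"))
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d))
        weights.append(softmax(scores))
    # The n-by-n weights matrix is the O(N²) memory term.
    outputs = [[sum(w[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
               for w in weights]
    return outputs, weights

if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    out, attn = causal_attention(X, X, X)
    for row in attn:
        print([round(p, 3) for p in row])
```

Each printed row has zeros to the right of the diagonal, which is what keeps future tokens from leaking into earlier predictions.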

🎯 Phase 4 Milestone: Generate text with TinyGPT

# Modern AI from scratch:
model = TinyGPT(vocab_size=1000, layers=6, heads=8)
train_on_shakespeare(model)
generated = model.generate("To be or not to be")
print(generated)  # Coherent continuation!
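The temperature sampling step inside `generate` might be sketched as follows. The names `temperature_probs` and `sample_token` are hypothetical, and the example assumes the model has already produced a list of raw logits for the next token.

```python
import math
import random

def temperature_probs(logits, temperature=1.0):
    # Divide logits by the temperature, then apply a stable softmax.
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [value / temperature for value in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    # Draw one token index according to the temperature-adjusted probabilities.
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in temperature_probs(logits, t)])
    print("sampled index:", sample_token(logits, temperature=0.8))
```

Autoregressive generation would call `sample_token` once per new token, append the result to the context, and run the model again, which is where the sequential inference cost comes from.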

🎯 What Students DON'T Build (But Understand)

Deferred Complexity

GPU/CUDA: Understand device abstraction, implement CPU-only
Optimized kernels: Use NumPy, understand why optimization matters
Dynamic graphs: Simple static graphs, understand flexibility tradeoff
Production features: Focus on algorithms, not deployment

Integrated Simplifications

Memory profiling: Built into every module with tracemalloc
Performance timing: Simple time.time(), not complex profiling
Batch normalization: Mentioned but not implemented (complexity)
Dropout: Brief mention in CNNs, not full implementation
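A per-module measurement helper along the lines described above (tracemalloc for memory, `time.time()` for timing) could look like this sketch. The `profile` function is a hypothetical helper, not part of TinyTorch.

```python
import time
import tracemalloc

def profile(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.time()
    result = fn(*args, **kwargs)
    elapsed = time.time() - start
    # get_traced_memory() returns (current, peak) in bytes since start().
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

if __name__ == "__main__":
    _, seconds, peak = profile(lambda n: [float(i) for i in range(n)], 100_000)
    print(f"elapsed: {seconds:.4f} s, peak: {peak / 1024:.1f} KiB")
```

tracemalloc only sees allocations made through Python's memory allocator, and tracing slows the measured code, so the numbers are best read as relative comparisons between implementations.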

📊 Learning Validation Metrics

Concrete Success Criteria

Phase	Module	Success Metric	Systems Understanding
1	4	XOR inference runs	Memory layout, matrix ops
2	8	XOR trains to <0.01 loss	Gradient flow, optimization
3	12	MNIST >95% accuracy	CNN efficiency, data pipelines
4	15	Coherent text generation	Attention scaling, generation

Time Investment

Per module: 3-4 hours (read, implement, test)
Per phase: 12-16 hours (Phases 1-3, four modules each); 9-12 hours (Phase 4, three modules)
Total: 45-60 hours for all 15 modules (realistic semester)
Complexity curve: ▁▂▃▄ ▅▅▆▆ ▇▇██ ███ (gradual increase)

🔬 Systems Engineering Thread

Every Module Teaches

Memory patterns: Where does memory go? When are copies made?
Computational complexity: O(N), O(N²), O(N³) analysis
Performance bottlenecks: What breaks first at scale?
PyTorch comparison: How does real PyTorch handle this?

Key Systems Insights Students Gain

Why matrix multiplication dominates neural network compute
Why autograd requires retaining intermediate activations
Why convolution can become memory-bandwidth limited
Why attention creates quadratic scaling challenges
Why batch size affects GPU utilization
Why data loading becomes the bottleneck at scale
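The quadratic attention cost in the list above can be made concrete with a short calculation. The sketch assumes the full N×N score matrix is materialized in float32 (4 bytes per value); `attention_score_bytes` is a hypothetical helper name.

```python
def attention_score_bytes(seq_len, heads=1, batch=1, bytes_per_value=4):
    # One seq_len x seq_len score matrix per head per batch element.
    return batch * heads * seq_len * seq_len * bytes_per_value

if __name__ == "__main__":
    for n in (256, 512, 1024, 2048):
        mib = attention_score_bytes(n, heads=8) / (1024 ** 2)
        print(f"N={n:5d}: {mib:8.1f} MiB for 8 heads")
```

Doubling the sequence length quadruples this term, while parameter memory stays fixed.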

🚀 Why This Structure Works

Pedagogical Advantages

Immediate validation: Every phase produces working code
Progressive complexity: Each phase builds on the last
Benchmark relevance: Uses classic teaching benchmarks (XOR, MNIST)
Modern relevance: Ends with transformer architecture

Engineering Focus

Just enough implementation: Learn concepts without over-engineering
Memory-first thinking: Understand resource constraints
Production awareness: Know how real systems differ
Debugging skills: Build systems that can be understood

Student Outcomes

After completing TinyTorch, students can:

Read and understand PyTorch source code
Debug training failures in production ML systems
Make informed architecture decisions based on resource constraints
Understand the engineering tradeoffs in modern AI systems

📝 Implementation Notes

Module Structure

Each module follows a consistent pattern:

Minimal implementation of core concepts
Unit tests validating functionality
Memory/performance analysis section
PyTorch comparison showing production version
Systems thinking questions for reflection

Code Philosophy

Readable > Optimized: Clear code that teaches
Explicit > Magic: Show how things work
Working > Complete: Just enough to achieve milestone
Tested > Assumed: Validate everything works

✅ Success Metrics

Students successfully complete TinyTorch when they can:

Explain why neural networks need nonlinear activations (Phase 1)
Debug gradient flow problems in training (Phase 2)
Choose appropriate architectures for data types (Phase 3)
Understand transformer memory scaling (Phase 4)
Read PyTorch source with comprehension (Overall)

The Ultimate Test: Can students build and train a working model from scratch that achieves meaningful results on a real dataset?

This plan eliminates over-engineering while maintaining the core insight: students learn ML systems by building minimal but complete implementations that demonstrate the key algorithmic and systems concepts underlying modern AI frameworks.
