mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-01 14:22:32 -05:00
TinyTorch Optimization Module Architecture
PyTorch Expert Review and Design Recommendations
Current Architecture Analysis
Strengths:
- Clean module progression (tensor → layers → networks → training)
- Solid pedagogical foundation with NBGrader integration
- Export system preserves student learning journey
- Real systems focus with memory profiling
Challenge: Need to add competition-ready optimizations without breaking existing learning progression or export system.
Recommended Architecture: Backend Dispatch System
1. Backend Interface Design
# New: tinytorch/backends/__init__.py
from abc import ABC, abstractmethod

import numpy as np


class ComputeBackend(ABC):
    """Abstract base class for computational backends"""

    @abstractmethod
    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Matrix multiplication implementation"""

    @abstractmethod
    def conv2d(self, input: np.ndarray, kernel: np.ndarray,
               stride: int = 1, padding: int = 0) -> np.ndarray:
        """2D convolution implementation"""


class NaiveBackend(ComputeBackend):
    """Pedagogical reference implementation"""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Triple-loop O(n³) implementation for learning
        m, k = a.shape
        k2, n = b.shape
        assert k == k2, "inner dimensions must match"
        result = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                for l in range(k):
                    result[i, j] += a[i, l] * b[l, j]
        return result

    def conv2d(self, input, kernel, stride=1, padding=0):
        # Naive sliding window implementation
        # (naive_conv2d is a helper that must be defined elsewhere)
        return naive_conv2d(input, kernel, stride, padding)


class OptimizedBackend(ComputeBackend):
    """Competition-ready optimized implementation"""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Cache-friendly blocked matrix multiplication
        # (optimized_blocked_matmul is a helper that must be defined elsewhere)
        return optimized_blocked_matmul(a, b)

    def conv2d(self, input, kernel, stride=1, padding=0):
        # im2col + GEMM optimization
        # (optimized_conv2d is a helper that must be defined elsewhere)
        return optimized_conv2d(input, kernel, stride, padding)
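The backend code above names a "naive sliding window" and an "im2col + GEMM" convolution without showing either. The following is a minimal sketch of both, assuming a single-channel 2D input, a single 2D kernel, and cross-correlation semantics (no kernel flip). The bodies of `naive_conv2d` and `im2col_conv2d` are illustrative guesses, not TinyTorch's actual helpers.

```python
import numpy as np


def naive_conv2d(x, kernel, stride=1, padding=0):
    """Sliding-window cross-correlation for a single-channel 2D input."""
    x = np.pad(x, padding)  # zero-pad every side by `padding`
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(x[r:r + kh, c:c + kw] * kernel)
    return out


def im2col_conv2d(x, kernel, stride=1, padding=0):
    """Same result: unfold each window into a row, then do one matrix product."""
    x = np.pad(x, padding)
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            cols[i * out_w + j] = x[r:r + kh, c:c + kw].ravel()
    # One (windows x kernel_size) @ (kernel_size,) product replaces the inner sums
    return (cols @ kernel.ravel()).reshape(out_h, out_w)


x = np.arange(25, dtype=float).reshape(5, 5)
k = np.array([[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]])
ref = naive_conv2d(x, k, stride=2, padding=1)
fast = im2col_conv2d(x, k, stride=2, padding=1)
print(np.allclose(ref, fast))
```

The unfolding loop here is still plain Python; the sketch only shows how the arithmetic collapses into a single matrix product, which is the part an optimized backend would hand to a fast GEMM.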
2. Configuration System
# New: tinytorch/config.py
from tinytorch.backends import ComputeBackend, NaiveBackend, OptimizedBackend

_backend = None


def set_backend(backend_name: str):
    """Switch computational backend globally"""
    global _backend
    if backend_name == 'naive':
        _backend = NaiveBackend()
    elif backend_name == 'optimized':
        _backend = OptimizedBackend()
    else:
        raise ValueError(f"Unknown backend: {backend_name}")


def get_backend() -> ComputeBackend:
    """Get current backend, defaulting to naive"""
    global _backend
    if _backend is None:
        _backend = NaiveBackend()  # Default to learning mode
    return _backend
3. Existing API Modifications (Minimal Changes)
# Modified: tinytorch/core/layers.py (line ~112)
def matmul(a: Tensor, b: Tensor) -> Tensor:
    """Matrix multiplication with backend dispatch"""
    from tinytorch.config import get_backend
    backend = get_backend()
    result_data = backend.matmul(a.data, b.data)
    return Tensor(result_data)

# The Dense layer automatically gets the optimization!
# No changes needed to Dense.forward() method
Module Progression Strategy
Modules 1-10: Pure Learning Mode
- Always use NaiveBackend (hardcoded)
- Focus on understanding algorithms
- No mention of optimization
Modules 11-12: Introduce Backend Concept
- Explain why optimizations matter
- Show backend switching API
- Compare naive vs optimized performance
Module 13: Performance Kernels (NEW)
- Implement optimized backends
- Cache-friendly algorithms
- Memory access pattern optimization
- SIMD/vectorization techniques
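As an illustration of the cache-friendly item above, the following sketch shows the loop structure of a blocked (tiled) matrix multiplication. The function name `blocked_matmul` and the default block size of 64 are assumptions for this example. Each tile product is delegated to NumPy, so in practice this will typically not beat a single `a @ b`; the point is the tiling pattern that students would reproduce in a lower-level kernel.

```python
import numpy as np


def blocked_matmul(a, b, block=64):
    """Multiply a (m x k) by b (k x n) one tile at a time."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError(f"shape mismatch: {a.shape} @ {b.shape}")
    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for i0 in range(0, m, block):          # tile rows of the output
        for j0 in range(0, n, block):      # tile columns of the output
            for l0 in range(0, k, block):  # walk the shared dimension in tiles
                # Slices past the array edge are clipped, so ragged tiles work
                out[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, l0:l0 + block]
                    @ b[l0:l0 + block, j0:j0 + block]
                )
    return out


rng = np.random.default_rng(0)
a = rng.normal(size=(70, 50))
b = rng.normal(size=(50, 33))
print(np.allclose(blocked_matmul(a, b, block=16), a @ b))
```

Each output tile is updated from small sub-blocks of `a` and `b` that can stay in cache while they are reused, which is the memory-access benefit the module is meant to teach.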
Module 14: Benchmarking & Competition (MODIFIED)
- Comprehensive performance measurement
- Memory profiling tools
- Competition leaderboard system
- Head-to-head performance comparisons
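One possible shape for the leaderboard and head-to-head items is sketched below. It assumes each entry is a zero-argument callable and ranks entries by median wall-clock time over a few repeats. The names `median_time_ms` and `leaderboard` are hypothetical, not part of TinyTorch.

```python
import statistics
import time

import numpy as np


def median_time_ms(fn, repeats=5):
    """Run fn several times and return the median duration in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def leaderboard(entries, repeats=5):
    """entries maps name -> zero-arg callable; returns [(name, ms)], fastest first."""
    timed = [(name, median_time_ms(fn, repeats)) for name, fn in entries.items()]
    return sorted(timed, key=lambda item: item[1])


rng = np.random.default_rng(0)
a = rng.normal(size=(30, 30))
b = rng.normal(size=(30, 30))


def triple_loop():
    out = np.zeros((30, 30))
    for i in range(30):
        for j in range(30):
            for l in range(30):
                out[i, j] += a[i, l] * b[l, j]
    return out


board = leaderboard({"triple_loop": triple_loop, "numpy_matmul": lambda: a @ b})
for rank, (name, ms) in enumerate(board, start=1):
    print(f"{rank}. {name}: {ms:.3f} ms")
```

Using the median rather than a single run reduces the effect of one-off stalls, which matters when submissions are close.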
Competition Framework Design
Benchmark Context Manager
# New: tinytorch/benchmark.py
import time
import tracemalloc
from contextlib import contextmanager


class BenchmarkResult:
    def __init__(self):
        self.time_ms = 0
        self.peak_memory_mb = 0
        self.current_memory_mb = 0


@contextmanager
def benchmark():
    """Context manager for performance measurement"""
    # Create the result up front so the caller and the finally block share it
    result = BenchmarkResult()
    tracemalloc.start()
    start_time = time.perf_counter()
    try:
        yield result
    finally:
        end_time = time.perf_counter()
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        # Store results in the object the caller already holds
        result.time_ms = (end_time - start_time) * 1000
        result.peak_memory_mb = peak / 1024 / 1024
        result.current_memory_mb = current / 1024 / 1024
Competition API
# Student competition usage
import tinytorch

# Learning phase
tinytorch.set_backend('naive')
with tinytorch.benchmark() as bench:
    output = model(input)
naive_time = bench.time_ms  # read after the block, once timing is recorded
print(f"Naive: {bench.time_ms:.1f}ms, {bench.peak_memory_mb:.1f}MB")

# Competition phase
tinytorch.set_backend('optimized')
with tinytorch.benchmark() as bench:
    output = model(input)
optimized_time = bench.time_ms
print(f"Optimized: {bench.time_ms:.1f}ms, {bench.peak_memory_mb:.1f}MB")

# Speedup calculation
speedup = naive_time / optimized_time
print(f"Speedup: {speedup:.1f}x faster!")
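The usage above depends on a `model`, an `input`, and a `tinytorch` package that are not defined here. The sketch below is a self-contained version of the same measurement pattern, with a triple-loop matmul and NumPy's `@` standing in for the two backends. Every name in it is illustrative.

```python
import time
import tracemalloc
from contextlib import contextmanager

import numpy as np


class BenchmarkResult:
    def __init__(self):
        self.time_ms = 0.0
        self.peak_memory_mb = 0.0


@contextmanager
def benchmark():
    """Time the with-block and record its peak traced memory."""
    result = BenchmarkResult()
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield result
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        result.time_ms = elapsed * 1000
        result.peak_memory_mb = peak / 1024 / 1024


def loop_matmul(a, b):
    """Stand-in for the naive backend."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                out[i, j] += a[i, l] * b[l, j]
    return out


rng = np.random.default_rng(0)
a = rng.normal(size=(40, 40))
b = rng.normal(size=(40, 40))

with benchmark() as naive:
    out_naive = loop_matmul(a, b)
with benchmark() as fast:
    out_fast = a @ b  # stand-in for the optimized backend

# Guard against a zero denominator on very coarse timers
speedup = naive.time_ms / max(fast.time_ms, 1e-9)
print(f"Naive: {naive.time_ms:.2f}ms, optimized: {fast.time_ms:.2f}ms")
print(f"Speedup: {speedup:.1f}x")
```

Giving each `with` block its own result name (`naive`, `fast`) avoids overwriting the first measurement before the speedup is computed.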
Implementation Benefits
1. Zero Breaking Changes
- Existing student code works unchanged
- Export system remains intact
- Learning progression preserved
2. Easy Competition Setup
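# Same model, same data, dramatic performance difference
model = build_resnet()
data = load_cifar10()

# Students compete on who can optimize best.
# Note: this assumes set_backend() accepts student-registered names;
# the set_backend() shown earlier only knows 'naive' and 'optimized'.
tinytorch.set_backend('student_submission_1')
tinytorch.set_backend('student_submission_2')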
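Selecting a backend by a name such as 'student_submission_1' needs some way to register new backends. One way this could work is a small name-to-class registry, sketched below. `register_backend` and the `StudentSubmission1` class are hypothetical additions, not an existing TinyTorch API.

```python
import numpy as np

_registry = {}  # backend name -> backend class
_active = None  # currently selected backend instance


def register_backend(name, backend_cls):
    """Make a backend class selectable by name."""
    if not callable(getattr(backend_cls, "matmul", None)):
        raise TypeError(f"{backend_cls!r} does not define matmul()")
    _registry[name] = backend_cls


def set_backend(name):
    """Instantiate and activate a registered backend."""
    global _active
    if name not in _registry:
        raise ValueError(f"Unknown backend: {name}")
    _active = _registry[name]()


def get_backend():
    if _active is None:
        raise RuntimeError("No backend selected")
    return _active


class StudentSubmission1:
    """Hypothetical student entry; a real one would carry its own kernel."""
    def matmul(self, a, b):
        return a @ b


register_backend("student_submission_1", StudentSubmission1)
set_backend("student_submission_1")
result = get_backend().matmul(np.eye(2), np.array([[1.0, 2.0], [3.0, 4.0]]))
print(result)
```

A registry keeps `set_backend` free of hard-coded names, so adding a submission does not require editing the configuration module.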
3. Realistic Performance Differences
- Naive matmul: O(n³) with poor cache behavior
- Optimized matmul: Blocked + SIMD → 10-100x speedup
- Students see why optimization matters!
4. Clean Separation of Concerns
- Modules 1-10: Pure learning (algorithms)
- Modules 11-14: Systems engineering (optimization)
- Competition: Best of both worlds
PyTorch Design Lessons Applied
This architecture mirrors how PyTorch actually works:
- Dispatcher Pattern: PyTorch uses dispatching to different backends (CPU/CUDA/XLA)
- Kernel Dispatch: High-level operations dispatch to optimized kernels (this is distinct from operator fusion, which merges several operations into one kernel)
- Backward Compatibility: Old code works unchanged when optimizations are added
- Performance Isolation: Learning code doesn't need to know about optimizations
Next Steps Recommendation
- Start small: Implement backend system for just matmul first
- Prove the pattern: Show 10x+ speedup possible with same API
- Expand gradually: Add conv2d, attention, etc.
- Build competition tools: Leaderboards, automated benchmarking
- Create optimization modules: Let students implement their own backends
This architecture gives you the best of both worlds: clean learning progression AND competition-ready performance, using the same patterns that make PyTorch successful in production.