MAJOR: Implement beautiful module progression through strategic reordering
This commit implements the pedagogically optimal "inevitable discovery" module progression based on expert validation and educational design principles.

## Module Reordering Summary

**Previous Order (Problems)**:
- 05_losses → 06_autograd → 07_dataloader → 08_optimizers → 09_spatial → 10_training
- Issues: Autograd before optimizers, DataLoader before training, scattered dependencies

**New Order (Beautiful Progression)**:
- 05_losses → 06_optimizers → 07_autograd → 08_training → 09_spatial → 10_dataloader
- Benefits: Each module creates inevitable need for the next

## Pedagogical Flow Achieved

- **05_losses** → "Need systematic weight updates" → **06_optimizers**
- **06_optimizers** → "Need automatic gradients" → **07_autograd**
- **07_autograd** → "Need systematic training" → **08_training**
- **08_training** → "MLPs hit limits on images" → **09_spatial**
- **09_spatial** → "Training is too slow" → **10_dataloader**

## Technical Changes

### Module Directory Renaming
- `06_autograd` → `07_autograd`
- `07_dataloader` → `10_dataloader`
- `08_optimizers` → `06_optimizers`
- `10_training` → `08_training`
- `09_spatial` → `09_spatial` (no change)

### System Integration Updates
- **MODULE_TO_CHECKPOINT mapping**: Updated in tito/commands/export.py
- **Test directories**: Renamed module_XX directories to match new numbers
- **Documentation**: Updated all references in MD files and agent configurations
- **CLI integration**: Updated next-steps suggestions for proper flow

### Agent Configuration Updates
- **Quality Assurance**: Updated module audit status with new numbers
- **Module Developer**: Updated work tracking with new sequence
- **Documentation**: Updated MASTER_PLAN_OF_RECORD.md with beautiful progression

## Educational Benefits

1. **Inevitable Discovery**: Each module naturally leads to the next
2. **Cognitive Load**: Concepts introduced exactly when needed
3. **Motivation**: Students understand WHY each tool is necessary
4. **Synthesis**: Everything flows toward complete ML systems understanding
5. **Professional Alignment**: Matches real ML engineering workflows

## Quality Assurance

- ✅ All CLI commands still function
- ✅ Checkpoint system mappings updated
- ✅ Documentation consistency maintained
- ✅ Test directory structure aligned
- ✅ Agent configurations synchronized

**Impact**: This reordering transforms TinyTorch from a collection of modules into a coherent educational journey where each step naturally motivates the next, creating optimal conditions for deep learning systems understanding.
New file: OPTIMIZATION_MODULE_ARCHITECTURE.md (235 lines)
# TinyTorch Optimization Module Architecture

## PyTorch Expert Review and Design Recommendations

### Current Architecture Analysis

**Strengths:**
- Clean module progression (tensor → layers → networks → training)
- Solid pedagogical foundation with NBGrader integration
- Export system preserves the student learning journey
- Real systems focus with memory profiling

**Challenge:**
Add competition-ready optimizations without breaking the existing learning progression or the export system.

### Recommended Architecture: Backend Dispatch System

#### 1. Backend Interface Design
```python
# New: tinytorch/backends/__init__.py
import numpy as np
from abc import ABC, abstractmethod


class ComputeBackend(ABC):
    """Abstract base class for computational backends"""

    @abstractmethod
    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Matrix multiplication implementation"""
        pass

    @abstractmethod
    def conv2d(self, input: np.ndarray, kernel: np.ndarray,
               stride: int = 1, padding: int = 0) -> np.ndarray:
        """2D convolution implementation"""
        pass


class NaiveBackend(ComputeBackend):
    """Pedagogical reference implementation"""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Triple-loop O(n³) implementation for learning
        m, k = a.shape
        k2, n = b.shape
        assert k == k2

        result = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                for l in range(k):
                    result[i, j] += a[i, l] * b[l, j]
        return result

    def conv2d(self, input, kernel, stride=1, padding=0):
        # Naive sliding-window implementation
        return naive_conv2d(input, kernel, stride, padding)


class OptimizedBackend(ComputeBackend):
    """Competition-ready optimized implementation"""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Cache-friendly blocked matrix multiplication
        return optimized_blocked_matmul(a, b)

    def conv2d(self, input, kernel, stride=1, padding=0):
        # im2col + GEMM optimization
        return optimized_conv2d(input, kernel, stride, padding)
```
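The `OptimizedBackend` above leans on an `optimized_blocked_matmul` helper that this document never defines. A minimal sketch of what it could look like, assuming plain 2D NumPy arrays and a fixed tile size (the blocking strategy here is illustrative, not the required implementation):

```python
import numpy as np

def optimized_blocked_matmul(a: np.ndarray, b: np.ndarray, block: int = 64) -> np.ndarray:
    """Cache-blocked matmul: work on block x block tiles so each tile is
    reused while it is still hot in cache, and let np.dot handle the
    vectorized inner product for every tile pair."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"

    result = np.zeros((m, n), dtype=np.result_type(a, b))
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for l0 in range(0, k, block):
                # Multiply one tile pair and accumulate into the output tile.
                result[i0:i0 + block, j0:j0 + block] += np.dot(
                    a[i0:i0 + block, l0:l0 + block],
                    b[l0:l0 + block, j0:j0 + block],
                )
    return result
```

Even this small change over the triple loop in `NaiveBackend` is usually enough to make the naive-vs-optimized gap visible in the benchmarks later in this document.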

#### 2. Configuration System

```python
# New: tinytorch/config.py
from tinytorch.backends import ComputeBackend, NaiveBackend, OptimizedBackend

_backend = None


def set_backend(backend_name: str):
    """Switch computational backend globally"""
    global _backend
    if backend_name == 'naive':
        _backend = NaiveBackend()
    elif backend_name == 'optimized':
        _backend = OptimizedBackend()
    else:
        raise ValueError(f"Unknown backend: {backend_name}")


def get_backend() -> ComputeBackend:
    """Get current backend, defaulting to naive"""
    global _backend
    if _backend is None:
        _backend = NaiveBackend()  # Default to learning mode
    return _backend
```
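A short usage sketch of the switch, assuming `NaiveBackend` and `OptimizedBackend` are importable from `tinytorch.backends` as shown above:

```python
from tinytorch.config import set_backend, get_backend

set_backend('naive')
print(type(get_backend()).__name__)   # NaiveBackend

set_backend('optimized')
print(type(get_backend()).__name__)   # OptimizedBackend

try:
    set_backend('gpu')                # not a registered backend name
except ValueError as err:
    print(err)                        # Unknown backend: gpu
```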

#### 3. Existing API Modifications (Minimal Changes)

```python
# Modified: tinytorch/core/layers.py (line ~112)
def matmul(a: Tensor, b: Tensor) -> Tensor:
    """Matrix multiplication with backend dispatch"""
    from tinytorch.config import get_backend
    backend = get_backend()
    result_data = backend.matmul(a.data, b.data)
    return Tensor(result_data)

# The Dense layer automatically gets the optimization!
# No changes needed to the Dense.forward() method
```
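One way to sanity-check that claim is to run the same layer on both backends and compare outputs. The import paths and constructor signatures below are assumptions about the existing TinyTorch API rather than something this document specifies:

```python
import numpy as np
from tinytorch.core.layers import Dense      # assumed import path
from tinytorch.core.tensor import Tensor     # assumed import path
from tinytorch.config import set_backend

layer = Dense(64, 32)                        # hypothetical constructor arguments
x = Tensor(np.random.randn(8, 64))

set_backend('naive')
y_naive = layer(x)

set_backend('optimized')
y_optimized = layer(x)

# Different kernels, same math (up to floating-point rounding).
assert np.allclose(y_naive.data, y_optimized.data, atol=1e-6)
```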

### Module Progression Strategy

#### Modules 1-10: Pure Learning Mode
- Always use `NaiveBackend` (hardcoded)
- Focus on understanding algorithms
- No mention of optimization

#### Modules 11-12: Introduce the Backend Concept
- Explain why optimizations matter
- Show the backend-switching API
- Compare naive vs. optimized performance

#### Module 13: Performance Kernels (NEW)
- Implement optimized backends (see the im2col sketch below)
- Cache-friendly algorithms
- Memory access pattern optimization
- SIMD/vectorization techniques
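As a taste of what Module 13 asks for, here is a minimal im2col + GEMM sketch of the `optimized_conv2d` referenced by `OptimizedBackend`, assuming a single-channel 2D input and kernel with the same stride/padding semantics as the naive version (illustrative only):

```python
import numpy as np

def optimized_conv2d(input: np.ndarray, kernel: np.ndarray,
                     stride: int = 1, padding: int = 0) -> np.ndarray:
    """im2col + GEMM: unroll every receptive field into a row, then replace
    the sliding-window loop with one large matrix multiplication."""
    if padding > 0:
        input = np.pad(input, padding, mode='constant')
    h, w = input.shape
    kh, kw = kernel.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1

    # One row per output position, one column per kernel element.
    cols = np.empty((out_h * out_w, kh * kw), dtype=input.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = input[i * stride:i * stride + kh, j * stride:j * stride + kw]
            cols[i * out_w + j] = patch.ravel()

    # Single GEMM against the flattened kernel, reshaped back to a feature map.
    return (cols @ kernel.ravel()).reshape(out_h, out_w)
```

Students can check this against the naive sliding-window version, then benchmark both to see where the GEMM pays off.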

#### Module 14: Benchmarking & Competition (MODIFIED)
- Comprehensive performance measurement
- Memory profiling tools
- Competition leaderboard system (sketched below)
- Head-to-head performance comparisons
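The leaderboard mentioned above could be as small as a sorted list of benchmark results; a hypothetical sketch (class and field names are made up for illustration, not part of the existing tooling):

```python
from dataclasses import dataclass, field

@dataclass
class Submission:
    student: str
    backend: str
    time_ms: float
    peak_memory_mb: float

@dataclass
class Leaderboard:
    entries: list = field(default_factory=list)

    def submit(self, entry: Submission) -> None:
        self.entries.append(entry)

    def ranked(self) -> list:
        # Fastest first; memory use breaks ties.
        return sorted(self.entries, key=lambda e: (e.time_ms, e.peak_memory_mb))
```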
### Competition Framework Design

#### Benchmark Context Manager

```python
# New: tinytorch/benchmark.py
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def benchmark():
    """Context manager for performance measurement"""
    tracemalloc.start()
    start_time = time.perf_counter()
    result = BenchmarkResult()

    try:
        yield result
    finally:
        end_time = time.perf_counter()
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()

        # Store results on the yielded object
        result.time_ms = (end_time - start_time) * 1000
        result.peak_memory_mb = peak / 1024 / 1024
        result.current_memory_mb = current / 1024 / 1024


class BenchmarkResult:
    def __init__(self):
        self.time_ms = 0
        self.peak_memory_mb = 0
        self.current_memory_mb = 0
```

#### Competition API

```python
# Student competition usage
import tinytorch

# Learning phase
tinytorch.set_backend('naive')
with tinytorch.benchmark() as bench:
    output = model(input)
naive_time = bench.time_ms
print(f"Naive: {bench.time_ms:.1f}ms, {bench.peak_memory_mb:.1f}MB")

# Competition phase
tinytorch.set_backend('optimized')
with tinytorch.benchmark() as bench:
    output = model(input)
optimized_time = bench.time_ms
print(f"Optimized: {bench.time_ms:.1f}ms, {bench.peak_memory_mb:.1f}MB")

# Speedup calculation
speedup = naive_time / optimized_time
print(f"Speedup: {speedup:.1f}x faster!")
```

### Implementation Benefits

#### 1. **Zero Breaking Changes**
- Existing student code works unchanged
- Export system remains intact
- Learning progression preserved

#### 2. **Easy Competition Setup**

```python
# Same model, same data, dramatic performance difference
model = build_resnet()
data = load_cifar10()

# Students compete on who can optimize best
tinytorch.set_backend('student_submission_1')
tinytorch.set_backend('student_submission_2')
```
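Note that `set_backend` as sketched in the configuration section only knows `'naive'` and `'optimized'`, so accepting `'student_submission_1'` would require a small registry. A hedged extension of `tinytorch/config.py` (the `register_backend` name and the example backend class are assumptions, not an existing API):

```python
from tinytorch.backends import ComputeBackend, NaiveBackend, OptimizedBackend

# Name -> backend-class registry so student backends can plug in by name.
_BACKENDS = {'naive': NaiveBackend, 'optimized': OptimizedBackend}
_backend = None

def register_backend(name: str, backend_cls: type) -> None:
    """Let a student submission register its own ComputeBackend subclass."""
    _BACKENDS[name] = backend_cls

def set_backend(backend_name: str) -> None:
    global _backend
    if backend_name not in _BACKENDS:
        raise ValueError(f"Unknown backend: {backend_name}")
    _backend = _BACKENDS[backend_name]()

# e.g. register_backend('student_submission_1', MyBlockedBackend)  # hypothetical class
```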

#### 3. **Realistic Performance Differences**
- Naive matmul: O(n³) with poor cache behavior
- Optimized matmul: blocked + SIMD → 10-100x speedup
- Students see why optimization matters!

#### 4. **Clean Separation of Concerns**
- Modules 1-10: pure learning (algorithms)
- Modules 11-14: systems engineering (optimization)
- Competition: best of both worlds
### PyTorch Design Lessons Applied

This architecture mirrors how PyTorch actually works:

1. **Dispatcher Pattern**: PyTorch dispatches operations to different backends (CPU/CUDA/XLA)
2. **Operator Fusion**: High-level operations dispatch to optimized kernels
3. **Backward Compatibility**: Old code works unchanged when optimizations are added
4. **Performance Isolation**: Learning code doesn't need to know about optimizations

### Next Steps Recommendation

1. **Start small**: Implement the backend system for just `matmul` first
2. **Prove the pattern**: Show that a 10x+ speedup is possible with the same API
3. **Expand gradually**: Add conv2d, attention, etc.
4. **Build competition tools**: Leaderboards, automated benchmarking
5. **Create optimization modules**: Let students implement their own backends

This architecture gives you the best of both worlds: a clean learning progression AND competition-ready performance, using the same patterns that make PyTorch successful in production.