# TinyTorch Optimization Module Architecture

## PyTorch Expert Review and Design Recommendations

### Current Architecture Analysis

**Strengths:**
- Clean module progression (tensor → layers → networks → training)
- Solid pedagogical foundation with NBGrader integration
- Export system preserves student learning journey
- Real systems focus with memory profiling

**Challenge:** Add competition-ready optimizations without breaking the existing learning progression or the export system.

### Recommended Architecture: Backend Dispatch System

#### 1. Backend Interface Design

```python
# New: tinytorch/backends/__init__.py
from abc import ABC, abstractmethod

import numpy as np


class ComputeBackend(ABC):
    """Abstract base class for computational backends."""

    @abstractmethod
    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Matrix multiplication implementation."""
        pass

    @abstractmethod
    def conv2d(self, input: np.ndarray, kernel: np.ndarray,
               stride: int = 1, padding: int = 0) -> np.ndarray:
        """2D convolution implementation."""
        pass


class NaiveBackend(ComputeBackend):
    """Pedagogical reference implementation."""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Triple-loop O(n³) implementation for learning
        m, k = a.shape
        k2, n = b.shape
        assert k == k2
        result = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                for l in range(k):
                    result[i, j] += a[i, l] * b[l, j]
        return result

    def conv2d(self, input, kernel, stride=1, padding=0):
        # Naive sliding-window implementation
        return naive_conv2d(input, kernel, stride, padding)


class OptimizedBackend(ComputeBackend):
    """Competition-ready optimized implementation."""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Cache-friendly blocked matrix multiplication
        return optimized_blocked_matmul(a, b)

    def conv2d(self, input, kernel, stride=1, padding=0):
        # im2col + GEMM optimization
        return optimized_conv2d(input, kernel, stride, padding)
```
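
`OptimizedBackend.matmul` delegates to an `optimized_blocked_matmul` helper that students would build in Module 13. A minimal sketch of what it could look like, assuming simple square tiling with a fixed block size (both are illustrative choices, not a prescribed implementation):

```python
import numpy as np

BLOCK = 64  # tile edge length; in practice, tuned so tiles fit in cache


def optimized_blocked_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cache-friendly blocked matmul: work on BLOCK x BLOCK tiles so the
    operands of each partial product stay cache-resident."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    result = np.zeros((m, n), dtype=np.result_type(a, b))
    for i0 in range(0, m, BLOCK):
        for l0 in range(0, k, BLOCK):
            for j0 in range(0, n, BLOCK):
                # NumPy slicing clamps at array bounds, so ragged edge
                # tiles need no special casing; each per-tile product is
                # vectorized rather than looped in Python.
                result[i0:i0 + BLOCK, j0:j0 + BLOCK] += (
                    a[i0:i0 + BLOCK, l0:l0 + BLOCK]
                    @ b[l0:l0 + BLOCK, j0:j0 + BLOCK]
                )
    return result
```

Even this simple tiling, combined with vectorized per-tile products, runs orders of magnitude faster than the pure-Python triple loop, which is what makes the naive-vs-optimized comparison in Module 14 so dramatic.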
#### 2. Configuration System

```python
# New: tinytorch/config.py
from tinytorch.backends import ComputeBackend, NaiveBackend, OptimizedBackend

_backend = None


def set_backend(backend_name: str):
    """Switch the computational backend globally."""
    global _backend
    if backend_name == 'naive':
        _backend = NaiveBackend()
    elif backend_name == 'optimized':
        _backend = OptimizedBackend()
    else:
        raise ValueError(f"Unknown backend: {backend_name}")


def get_backend() -> ComputeBackend:
    """Get the current backend, defaulting to naive."""
    global _backend
    if _backend is None:
        _backend = NaiveBackend()  # Default to learning mode
    return _backend
```

#### 3. Existing API Modifications (Minimal Changes)

```python
# Modified: tinytorch/core/layers.py (line ~112)
def matmul(a: Tensor, b: Tensor) -> Tensor:
    """Matrix multiplication with backend dispatch."""
    from tinytorch.config import get_backend

    backend = get_backend()
    result_data = backend.matmul(a.data, b.data)
    return Tensor(result_data)

# The Dense layer automatically gets the optimization!
# No changes needed to the Dense.forward() method
```

### Module Progression Strategy

#### Modules 1-10: Pure Learning Mode
- Always use `NaiveBackend` (hardcoded)
- Focus on understanding algorithms
- No mention of optimization

#### Modules 11-12: Introduce Backend Concept
- Explain why optimizations matter
- Show the backend-switching API
- Compare naive vs. optimized performance

#### Module 13: Performance Kernels (NEW)
- Implement optimized backends
- Cache-friendly algorithms
- Memory access pattern optimization
- SIMD/vectorization techniques

#### Module 14: Benchmarking & Competition (MODIFIED)
- Comprehensive performance measurement
- Memory profiling tools
- Competition leaderboard system
- Head-to-head performance comparisons

### Competition Framework Design

#### Benchmark Context Manager

```python
# New: tinytorch/benchmark.py
import time
import tracemalloc
from contextlib import contextmanager


class BenchmarkResult:
    def __init__(self):
        self.time_ms = 0.0
        self.peak_memory_mb = 0.0
        self.current_memory_mb = 0.0


@contextmanager
def benchmark():
    """Context manager for performance measurement."""
    result = BenchmarkResult()
    tracemalloc.start()
    start_time = time.perf_counter()
    try:
        yield result
    finally:
        end_time = time.perf_counter()
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        # Populate the object yielded to the `with` block; callers read
        # the fields after the block exits.
        result.time_ms = (end_time - start_time) * 1000
        result.peak_memory_mb = peak / 1024 / 1024
        result.current_memory_mb = current / 1024 / 1024
```

#### Competition API

```python
# Student competition usage
import tinytorch

# Learning phase
tinytorch.set_backend('naive')
with tinytorch.benchmark() as bench_naive:
    output = model(input)
print(f"Naive: {bench_naive.time_ms:.1f}ms, {bench_naive.peak_memory_mb:.1f}MB")

# Competition phase
tinytorch.set_backend('optimized')
with tinytorch.benchmark() as bench_opt:
    output = model(input)
print(f"Optimized: {bench_opt.time_ms:.1f}ms, {bench_opt.peak_memory_mb:.1f}MB")

# Speedup calculation
speedup = bench_naive.time_ms / bench_opt.time_ms
print(f"Speedup: {speedup:.1f}x faster!")
```

### Implementation Benefits

#### 1. **Zero Breaking Changes**
- Existing student code works unchanged
- Export system remains intact
- Learning progression preserved

#### 2. **Easy Competition Setup**

```python
# Same model, same data, dramatic performance difference
model = build_resnet()
data = load_cifar10()

# Students compete on who can optimize best
tinytorch.set_backend('student_submission_1')
tinytorch.set_backend('student_submission_2')
```
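
Dispatching on arbitrary keys like `'student_submission_1'` implies a backend registry rather than the hardcoded `if`/`elif` in `config.py` above. A minimal sketch of that extension (the `register_backend` name and the dict-based registry are assumptions, not part of the current design):

```python
# Hypothetical extension to tinytorch/config.py: replace the hardcoded
# if/elif dispatch with a string-keyed registry of backend classes.
from tinytorch.backends import NaiveBackend, OptimizedBackend

_backend = None
_registry = {'naive': NaiveBackend, 'optimized': OptimizedBackend}


def register_backend(name: str, backend_cls):
    """Let students plug in their own ComputeBackend subclass."""
    _registry[name] = backend_cls


def set_backend(backend_name: str):
    """Switch the computational backend globally."""
    global _backend
    if backend_name not in _registry:
        raise ValueError(f"Unknown backend: {backend_name}")
    _backend = _registry[backend_name]()
```

Each submission is then a `ComputeBackend` subclass registered under the student's own key before the benchmark run.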
#### 3. **Realistic Performance Differences**
- Naive matmul: O(n³) with poor cache behavior
- Optimized matmul: blocked + SIMD → 10-100x speedup
- Students see why optimization matters!

#### 4. **Clean Separation of Concerns**
- Modules 1-10: pure learning (algorithms)
- Modules 11-14: systems engineering (optimization)
- Competition: best of both worlds

### PyTorch Design Lessons Applied

This architecture mirrors how PyTorch actually works:

1. **Dispatcher Pattern**: PyTorch dispatches operations to different backends (CPU/CUDA/XLA)
2. **Operator Fusion**: High-level operations dispatch to optimized kernels
3. **Backward Compatibility**: Old code works unchanged when optimizations are added
4. **Performance Isolation**: Learning code doesn't need to know about optimizations

### Next Steps Recommendation

1. **Start small**: Implement the backend system for just `matmul` first
2. **Prove the pattern**: Show that a 10x+ speedup is possible with the same API
3. **Expand gradually**: Add conv2d, attention, etc.
4. **Build competition tools**: Leaderboards and automated benchmarking (see the sketch below)
5. **Create optimization modules**: Let students implement their own backends

This architecture gives you the best of both worlds: a clean learning progression AND competition-ready performance, built on the same patterns that make PyTorch successful in production.
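
To make step 4 concrete, here is a minimal competition-harness sketch built on the `set_backend` and `benchmark` APIs above. The `Leaderboard` class and its methods are hypothetical names introduced for illustration, not existing TinyTorch API:

```python
# Hypothetical competition harness; Leaderboard, submit, and top are
# illustrative names, not part of TinyTorch today.
import tinytorch


class Leaderboard:
    def __init__(self):
        self.entries = []  # (student, time_ms, peak_memory_mb)

    def submit(self, student: str, backend_name: str, model, data):
        """Benchmark one student's backend on the shared model and data."""
        tinytorch.set_backend(backend_name)
        with tinytorch.benchmark() as bench:
            model(data)
        self.entries.append((student, bench.time_ms, bench.peak_memory_mb))

    def top(self, n: int = 10):
        """Fastest submissions first."""
        return sorted(self.entries, key=lambda entry: entry[1])[:n]
```

Because every submission runs the same model on the same data, ranking by wall-clock time (with peak memory as a possible tiebreaker) keeps the competition fair.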