TinyTorch/docs/archive/OPTIMIZATION_MODULE_ARCHITECTURE.md
Vijay Janapa Reddi, 2025-09-25

TinyTorch Optimization Module Architecture

PyTorch Expert Review and Design Recommendations

Current Architecture Analysis

Strengths:

  • Clean module progression (tensor → layers → networks → training)
  • Solid pedagogical foundation with NBGrader integration
  • Export system preserves student learning journey
  • Real systems focus with memory profiling

Challenge: add competition-ready optimizations without breaking the existing learning progression or the export system.

1. Backend Interface Design

```python
# New: tinytorch/backends/__init__.py
import numpy as np
from abc import ABC, abstractmethod

class ComputeBackend(ABC):
    """Abstract base class for computational backends"""

    @abstractmethod
    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Matrix multiplication implementation"""
        pass

    @abstractmethod
    def conv2d(self, input: np.ndarray, kernel: np.ndarray,
               stride: int = 1, padding: int = 0) -> np.ndarray:
        """2D convolution implementation"""
        pass

class NaiveBackend(ComputeBackend):
    """Pedagogical reference implementation"""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Triple-loop O(n³) implementation for learning
        m, k = a.shape
        k2, n = b.shape
        assert k == k2, "inner dimensions must match"

        result = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                for l in range(k):
                    result[i, j] += a[i, l] * b[l, j]
        return result

    def conv2d(self, input, kernel, stride=1, padding=0):
        # Naive sliding-window implementation
        return naive_conv2d(input, kernel, stride, padding)

class OptimizedBackend(ComputeBackend):
    """Competition-ready optimized implementation"""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Cache-friendly blocked matrix multiplication
        return optimized_blocked_matmul(a, b)

    def conv2d(self, input, kernel, stride=1, padding=0):
        # im2col + GEMM optimization
        return optimized_conv2d(input, kernel, stride, padding)
```
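`OptimizedBackend` delegates to helpers (`optimized_blocked_matmul`, `optimized_conv2d`) that this document never defines. A minimal sketch of what the blocked strategy could look like, using NumPy tile products as a stand-in for hand-written SIMD:

```python
import numpy as np

def optimized_blocked_matmul(a: np.ndarray, b: np.ndarray,
                             block: int = 64) -> np.ndarray:
    """Cache-blocked matmul: work on block x block tiles so each
    tile's working set fits in cache; NumPy slicing clamps at the
    edges, so dimensions need not be multiples of the block size."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"

    result = np.zeros((m, n))
    for i0 in range(0, m, block):
        for l0 in range(0, k, block):
            for j0 in range(0, n, block):
                # Accumulate one tile's contribution with a small GEMM
                result[i0:i0+block, j0:j0+block] += (
                    a[i0:i0+block, l0:l0+block] @ b[l0:l0+block, j0:j0+block]
                )
    return result
```

In a real kernel the block size would be tuned to the L1/L2 cache of the target machine; 64 here is an arbitrary illustrative default.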

2. Configuration System

```python
# New: tinytorch/config.py
from tinytorch.backends import ComputeBackend, NaiveBackend, OptimizedBackend

_backend = None

def set_backend(backend_name: str):
    """Switch computational backend globally"""
    global _backend
    if backend_name == 'naive':
        _backend = NaiveBackend()
    elif backend_name == 'optimized':
        _backend = OptimizedBackend()
    else:
        raise ValueError(f"Unknown backend: {backend_name}")

def get_backend() -> ComputeBackend:
    """Get current backend, defaulting to naive"""
    global _backend
    if _backend is None:
        _backend = NaiveBackend()  # Default to learning mode
    return _backend
```

3. Existing API Modifications (Minimal Changes)

```python
# Modified: tinytorch/core/layers.py (line ~112)
def matmul(a: Tensor, b: Tensor) -> Tensor:
    """Matrix multiplication with backend dispatch"""
    from tinytorch.config import get_backend  # local import avoids an import cycle
    backend = get_backend()
    result_data = backend.matmul(a.data, b.data)
    return Tensor(result_data)

# The Dense layer automatically gets the optimization!
# No changes needed to Dense.forward()
```
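The "no changes to Dense" property can be seen in a self-contained toy; the class and function names below are illustrative, not TinyTorch's actual API:

```python
import numpy as np

class _Loop:
    """Stand-in for NaiveBackend: Python triple loop."""
    def matmul(self, a, b):
        m, k = a.shape
        _, n = b.shape
        out = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                for l in range(k):
                    out[i, j] += a[i, l] * b[l, j]
        return out

class _Fast:
    """Stand-in for OptimizedBackend: vectorized path."""
    def matmul(self, a, b):
        return a @ b

_backend = _Loop()

def matmul(a, b):
    # The call site never changes; only the module-level backend does
    return _backend.matmul(a, b)

x = np.random.rand(4, 3)
w = np.random.rand(3, 2)
slow = matmul(x, w)   # dispatches to _Loop

_backend = _Fast()    # swap backends; matmul's callers are untouched
fast = matmul(x, w)   # same API, optimized path
```

Both calls produce the same values; only the execution path differs, which is exactly what lets a `Dense` layer speed up without edits.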

Module Progression Strategy

Modules 1-10: Pure Learning Mode

  • Always use NaiveBackend (hardcoded)
  • Focus on understanding algorithms
  • No mention of optimization

Modules 11-12: Introduce Backend Concept

  • Explain why optimizations matter
  • Show backend switching API
  • Compare naive vs optimized performance

Module 13: Performance Kernels (NEW)

  • Implement optimized backends
  • Cache-friendly algorithms
  • Memory access pattern optimization
  • SIMD/vectorization techniques
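The im2col + GEMM idea referenced in `OptimizedBackend.conv2d` can be sketched as follows; this single-channel version is an illustrative assumption, not TinyTorch's actual kernel:

```python
import numpy as np

def im2col_conv2d(x: np.ndarray, kernel: np.ndarray,
                  stride: int = 1, padding: int = 0) -> np.ndarray:
    """2D convolution (cross-correlation, as in DL frameworks) as
    im2col + one GEMM.  x: (H, W), kernel: (KH, KW)."""
    if padding:
        x = np.pad(x, padding)
    kh, kw = kernel.shape
    h_out = (x.shape[0] - kh) // stride + 1
    w_out = (x.shape[1] - kw) // stride + 1

    # im2col: copy every receptive field into one row of a matrix
    cols = np.empty((h_out * w_out, kh * kw))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[i * w_out + j] = patch.ravel()

    # One big matrix product replaces the sliding-window loops,
    # so the heavy lifting lands in the (optimized) GEMM
    return (cols @ kernel.ravel()).reshape(h_out, w_out)
```

The trade-off students should discover: im2col spends extra memory (each input pixel is duplicated up to KH*KW times) to buy a single cache-friendly GEMM.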

Module 14: Benchmarking & Competition (MODIFIED)

  • Comprehensive performance measurement
  • Memory profiling tools
  • Competition leaderboard system
  • Head-to-head performance comparisons

Competition Framework Design

Benchmark Context Manager

```python
# New: tinytorch/benchmark.py
import time
import tracemalloc
from contextlib import contextmanager

class BenchmarkResult:
    def __init__(self):
        self.time_ms = 0.0
        self.peak_memory_mb = 0.0
        self.current_memory_mb = 0.0

@contextmanager
def benchmark():
    """Context manager for performance measurement"""
    result = BenchmarkResult()
    tracemalloc.start()
    start_time = time.perf_counter()

    try:
        yield result
    finally:
        end_time = time.perf_counter()
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()

        # Populate the yielded object so callers can read the
        # results after the with-block exits
        result.time_ms = (end_time - start_time) * 1000
        result.peak_memory_mb = peak / 1024 / 1024
        result.current_memory_mb = current / 1024 / 1024
```

Competition API

```python
# Student competition usage
import tinytorch

# Learning phase
tinytorch.set_backend('naive')
with tinytorch.benchmark() as bench:
    output = model(input)
naive_time = bench.time_ms
print(f"Naive: {bench.time_ms:.1f}ms, {bench.peak_memory_mb:.1f}MB")

# Competition phase
tinytorch.set_backend('optimized')
with tinytorch.benchmark() as bench:
    output = model(input)
optimized_time = bench.time_ms
print(f"Optimized: {bench.time_ms:.1f}ms, {bench.peak_memory_mb:.1f}MB")

# Speedup calculation
speedup = naive_time / optimized_time
print(f"Speedup: {speedup:.1f}x faster!")
```

Implementation Benefits

1. Zero Breaking Changes

  • Existing student code works unchanged
  • Export system remains intact
  • Learning progression preserved

2. Easy Competition Setup

```python
# Same model, same data, dramatic performance difference
model = build_resnet()
data = load_cifar10()

# Students compete on who can optimize best
tinytorch.set_backend('student_submission_1')
tinytorch.set_backend('student_submission_2')
```

3. Realistic Performance Differences

  • Naive matmul: O(n³) with poor cache behavior
  • Optimized matmul: Blocked + SIMD → 10-100x speedup
  • Students see why optimization matters!
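The order-of-magnitude claim is easy to demonstrate; a quick micro-benchmark comparing a Python triple loop against NumPy's BLAS-backed `@` (exact numbers are machine-dependent):

```python
import time
import numpy as np

def naive_matmul(a, b):
    """O(n³) Python loops: correct but cache- and interpreter-bound."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                out[i, j] += a[i, l] * b[l, j]
    return out

a = np.random.rand(120, 120)
b = np.random.rand(120, 120)

t0 = time.perf_counter()
naive = naive_matmul(a, b)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
fast = a @ b  # BLAS: blocked, vectorized, multi-threaded
t_fast = time.perf_counter() - t0

assert np.allclose(naive, fast)
print(f"Speedup: {t_naive / t_fast:.0f}x")
```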

4. Clean Separation of Concerns

  • Modules 1-10: Pure learning (algorithms)
  • Modules 11-14: Systems engineering (optimization)
  • Competition: Best of both worlds

PyTorch Design Lessons Applied

This architecture mirrors how PyTorch actually works:

  1. Dispatcher Pattern: PyTorch uses dispatching to different backends (CPU/CUDA/XLA)
  2. Kernel Dispatch: High-level operations lower to optimized kernels (with operator fusion where profitable)
  3. Backward Compatibility: Old code works unchanged when optimizations are added
  4. Performance Isolation: Learning code doesn't need to know about optimizations

Next Steps Recommendation

  1. Start small: Implement backend system for just matmul first
  2. Prove the pattern: Show 10x+ speedup possible with same API
  3. Expand gradually: Add conv2d, attention, etc.
  4. Build competition tools: Leaderboards, automated benchmarking
  5. Create optimization modules: Let students implement their own backends

This architecture gives you the best of both worlds: clean learning progression AND competition-ready performance, using the same patterns that make PyTorch successful in production.