# TinyTorch/modules/15_acceleration/module.yaml

assessment:
- Understand why naive loops have poor cache performance
- Implement cache-friendly blocked matrix multiplication and measure its 10-50x speedup over naive loops
- Recognize why NumPy provides 100x+ speedups over hand-written Python loops
- Build a backend system that automatically chooses the optimal implementation
- 'Apply the ''free speedup'' principle: use better tools, don''t write faster code'
description: 'Master the easiest optimization: using better backends! Learn why naive
  loops are slow, how cache-friendly blocking helps, and why NumPy provides 100x+
  speedups.'
difficulty: Advanced
estimated_time: 3-4 hours
exports:
- matmul_naive
- matmul_blocked
- matmul_numpy
- OptimizedBackend
- matmul
- set_backend
learning_objectives:
- Understand CPU cache hierarchy and memory access performance bottlenecks
- Implement cache-friendly blocked matrix multiplication algorithms
- Build vectorized operations with optimized memory access patterns
- Design transparent backend systems for automatic optimization selection
- Measure and quantify real performance improvements with repeatable benchmarks
- Apply systems thinking to optimization decisions in ML workflows
name: acceleration
prerequisites:
- 'Module 2: Tensor operations and NumPy fundamentals'
- 'Module 4: Linear layers and matrix multiplication'
- Understanding of basic algorithmic complexity (Big-O notation)
tags:
- performance
- optimization
- systems
- hardware
- acceleration
- cache
- vectorization
- backends
title: Hardware Acceleration - The Simplest Optimization