🚀 Module 11: Kernels - Hardware-Aware Optimization

📊 Module Info

  • Difficulty: Expert
  • Time Estimate: 8-10 hours
  • Prerequisites: All previous modules (01-10), especially Compression
  • Next Steps: Benchmarking, MLOps modules

Bridge the gap between algorithmic optimization and hardware-level performance engineering

🎯 Learning Objectives

After completing this module, you will:

  • Understand how to implement custom ML operations beyond NumPy
  • Apply SIMD vectorization and CPU optimization techniques
  • Optimize memory layout and access patterns for cache efficiency
  • Implement GPU-style parallel computing concepts
  • Build comprehensive performance profiling and benchmarking tools
  • Create hardware-optimized operations for quantized and pruned models

🔗 Connection to Previous Modules

What You Already Know

  • Compression (Module 10): What to optimize (model size, computation)
  • Layers (Module 03): Basic matrix multiplication with matmul()
  • Training (Module 09): High-level optimization workflows
  • Networks (Module 04): How operations compose into architectures

The Performance Gap

Students understand algorithmic optimization but not hardware optimization:

  • Algorithmic: Pruning, quantization, knowledge distillation
  • Hardware: Memory layout, vectorization, parallel processing

🧠 Build → Use → Optimize

This module follows the "Build → Use → Optimize" pedagogical framework:

1. Build: Custom Operations

  • Move beyond NumPy's black box implementations
  • Implement specialized matrix multiplication and activations (see the sketch below)
  • Understand the computational patterns underlying ML
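
As a concrete starting point, here is a minimal sketch of a naive baseline in the spirit of matmul_baseline(); the function name and shapes are carried over from this README, and the module's actual implementation may differ:

import numpy as np

def matmul_baseline(a, b):
    """Naive triple-loop matrix multiply: O(n*m*k) scalar operations."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=a.dtype)
    for i in range(n):            # each row of a
        for j in range(m):        # each column of b
            acc = 0.0
            for p in range(k):    # dot product along the shared dimension
                acc += a[i, p] * b[p, j]
            out[i, j] = acc
    return out

Every arithmetic step is explicit here, which is exactly what makes the later optimizations visible.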

2. Use: Performance Optimization

  • Apply SIMD vectorization for CPU optimization (see the sketch below)
  • Implement cache-friendly memory layouts
  • Explore GPU-style parallel computing concepts
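
In pure Python, SIMD is best approximated by handing whole arrays to NumPy's compiled ufunc loops, which the C compiler can auto-vectorize. A minimal sketch contrasting the two styles (illustrative; the module's vectorized_relu() may differ):

import numpy as np

def relu_scalar(x):
    # One Python-level branch per element: interpreter overhead, no SIMD.
    out = np.empty_like(x)
    for idx in np.ndindex(x.shape):
        v = x[idx]
        out[idx] = v if v > 0 else 0
    return out

def vectorized_relu(x):
    # One ufunc call: NumPy's compiled loop handles many elements per instruction.
    return np.maximum(x, 0)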

3. Optimize: Real-World Integration

  • Profile and benchmark performance improvements (see the timing sketch below)
  • Integrate with compressed models from Module 10
  • Apply systematic evaluation to validate optimizations
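
Putting the pieces together, a first validation might time the naive baseline against NumPy's BLAS-backed matmul. This reuses the hypothetical matmul_baseline() sketched earlier; sizes are illustrative:

import timeit
import numpy as np

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

t_naive = timeit.timeit(lambda: matmul_baseline(a, b), number=1)  # from the sketch above
t_blas = timeit.timeit(lambda: a @ b, number=10) / 10
print(f"naive: {t_naive:.3f}s  optimized: {t_blas:.5f}s  speedup: {t_naive / t_blas:,.0f}x")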

📚 What You'll Build

Core Operations

  • Matrix Multiplication: Custom matmul_baseline() and cache_friendly_matmul() (tiling sketched below)
  • Activation Functions: vectorized_relu() and parallel_relu()
  • Batch Processing: parallel_batch_processing() with multiprocessing
  • Quantized Operations: quantized_matmul() and quantized_relu()
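
The core idea behind cache_friendly_matmul() is tiling: process the matrices in blocks small enough that each tile stays resident in cache while it is reused. A hedged sketch, where the block size and the NumPy-based inner tiles are illustrative choices, not necessarily the module's:

import numpy as np

def cache_friendly_matmul(a, b, block=64):
    """Blocked (tiled) matmul: each tile is reused many times before eviction."""
    n, k = a.shape
    _, m = b.shape
    out = np.zeros((n, m), dtype=np.result_type(a, b))
    for i0 in range(0, n, block):
        for k0 in range(0, k, block):
            a_tile = a[i0:i0 + block, k0:k0 + block]  # stays hot in cache
            for j0 in range(0, m, block):
                out[i0:i0 + block, j0:j0 + block] += a_tile @ b[k0:k0 + block, j0:j0 + block]
    return out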

Performance Tools

  • Profiling: profile_operation() for detailed timing analysis
  • Benchmarking: benchmark_operation() for statistical validation (sketched below)
  • Memory Analysis: Cache-friendly data layout optimization
  • Parallel Computing: Multi-core processing patterns
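
Single timings are noisy; robust validation needs warm-up runs and summary statistics. A hypothetical helper in the spirit of benchmark_operation() (the module's real signature may differ):

import statistics
import time

def benchmark_operation(fn, *args, repeats=20, warmup=3):
    """Time fn(*args) repeatedly and report robust summary statistics."""
    for _ in range(warmup):                # prime caches and allocators
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),  # robust central tendency
        "min_s": min(samples),                   # best case, least noise
        "stdev_s": statistics.stdev(samples),    # run-to-run variation
    }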

🛠️ Key Components

Hardware-Optimized Operations

  • Purpose: Implement custom ML operations with hardware awareness
  • Methods: matmul_baseline(), vectorized_relu(), cache_friendly_matmul()
  • Learning: Understanding computational patterns beyond NumPy

Parallel Processing Framework

  • Purpose: Multi-core optimization for batch operations
  • Methods: parallel_batch_processing(), parallel_relu() (multiprocessing sketch below)
  • Learning: GPU-style parallel computing concepts
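
A minimal sketch of the multiprocessing pattern, assuming a simple split-apply-concatenate design; worker count and chunking are illustrative, and for an operation this cheap, inter-process copying often outweighs the parallel gain:

from multiprocessing import Pool

import numpy as np

def _relu_chunk(chunk):
    return np.maximum(chunk, 0)      # work done independently per chunk

def parallel_batch_processing(batch, workers=4):
    """Split a batch across processes, GPU-style: many independent workers."""
    chunks = np.array_split(batch, workers, axis=0)
    with Pool(processes=workers) as pool:
        results = pool.map(_relu_chunk, chunks)
    return np.concatenate(results, axis=0)

if __name__ == "__main__":           # guard required for multiprocessing
    x = np.random.randn(1024, 512)
    assert np.array_equal(parallel_batch_processing(x), np.maximum(x, 0))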

Quantization Integration

  • Purpose: Hardware-optimized operations for compressed models
  • Methods: quantized_matmul(), quantized_relu() (int8 sketch below)
  • Learning: Bridging compression and performance optimization
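
The key trick is doing the multiply-accumulate in low-precision integers and deferring the scale factors to the end. A hedged int8 sketch; symmetric per-tensor scales are an assumption here, and the module's quantized_matmul() may use a different scheme:

import numpy as np

def quantize_int8(x):
    """Symmetric quantization: x ~ scale * q with q in [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantized_matmul(qa, scale_a, qb, scale_b):
    # Accumulate in int32 to avoid int8 overflow; rescale to float once at the end.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc * (scale_a * scale_b)

a, b = np.random.randn(64, 64), np.random.randn(64, 64)
(qa, sa), (qb, sb) = quantize_int8(a), quantize_int8(b)
print(np.abs(quantized_matmul(qa, sa, qb, sb) - a @ b).max())  # small error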

Performance Profiling

  • Purpose: Systematic measurement and validation of optimizations
  • Methods: profile_operation(), benchmark_operation()
  • Learning: Evidence-based performance engineering

🌟 Real-World Applications

Industry Examples

  • Google TPUs: Custom hardware for ML operations
  • Intel MKL: Optimized math libraries for CPU performance
  • NVIDIA cuDNN: GPU-accelerated neural network operations
  • Apple Neural Engine: Hardware-specific ML acceleration

Performance Patterns

  • Memory Layout: Row-major vs column-major access patterns (demonstrated below)
  • Vectorization: SIMD instructions for parallel computation
  • Cache Optimization: Data locality for memory hierarchy
  • Parallel Processing: Multi-core utilization strategies
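
The memory-layout effect is easy to measure: copying a C-order (row-major) array in storage order streams through memory, while copying its transpose forces strided reads that waste most of each cache line. On most machines the gap is severalfold:

import timeit
import numpy as np

x = np.random.rand(4096, 4096)  # C order: each row is contiguous in memory

sequential = timeit.timeit(lambda: x.copy(), number=10)  # streams cache lines
strided = timeit.timeit(lambda: x.T.copy(), number=10)   # strided reads
print(f"sequential copy: {sequential:.3f}s  transposed copy: {strided:.3f}s")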

🚀 Getting Started

Prerequisites Check

tito test --module compression  # Should pass
tito status --module 11_kernels  # Check module status

Development Workflow

cd modules/source/11_kernels
jupyter notebook kernels_dev.py  # or edit directly

Testing Your Implementation

# Run the inline tests within the notebook first, then the full suite:
tito test --module kernels

📖 Module Structure

modules/source/11_kernels/
├── kernels_dev.py      # Main development file
├── README.md           # This overview
└── module.yaml         # Module configuration

🔗 Integration Points

Input from Previous Modules

  • Tensor operations → Custom implementations
  • Compressed models → Hardware-optimized execution
  • Training workflows → Performance-critical operations

Output to Next Modules

  • Benchmarking → Operations to evaluate systematically
  • MLOps → Production-ready optimized operations
  • Complete system → End-to-end performance optimization

🎓 Educational Philosophy

This module bridges the gap between algorithmic understanding and systems performance. Students learn that optimization isn't just about better algorithms; it's about understanding how algorithms interact with hardware to achieve real-world performance gains.

By the end, you'll think like a performance engineer, not just a machine learning practitioner.