🚀 Module 11: Kernels - Hardware-Aware Optimization

📊 Module Info

  • Difficulty: Expert
  • Time Estimate: 8-10 hours
  • Prerequisites: All previous modules (01-10), especially Compression
  • Next Steps: Benchmarking, MLOps modules

Bridge the gap between algorithmic optimization and hardware-level performance engineering

🎯 Learning Objectives

After completing this module, you will:

  • Understand how to implement custom ML operations beyond NumPy
  • Apply SIMD vectorization and CPU optimization techniques
  • Optimize memory layout and access patterns for cache efficiency
  • Implement GPU-style parallel computing concepts
  • Build comprehensive performance profiling and benchmarking tools
  • Create hardware-optimized operations for quantized and pruned models

🔗 Connection to Previous Modules

What You Already Know

  • Compression (Module 10): What to optimize (model size, computation)
  • Layers (Module 03): Basic matrix multiplication with matmul()
  • Training (Module 09): High-level optimization workflows
  • Networks (Module 04): How operations compose into architectures

The Performance Gap

Students understand algorithmic optimization but not hardware optimization:

  • Algorithmic: Pruning, quantization, knowledge distillation
  • Hardware: Memory layout, vectorization, parallel processing

🧠 Build → Use → Optimize

This module follows the "Build → Use → Optimize" pedagogical framework:

1. Build: Custom Operations

  • Move beyond NumPy's black box implementations
  • Implement specialized matrix multiplication and activations (see the sketch below)
  • Understand the computational patterns underlying ML
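
As a concrete starting point, here is a minimal sketch of a naive baseline in the spirit of matmul_baseline(); the function name and shapes are carried over from this README, and the module's actual implementation may differ:

import numpy as np

def matmul_baseline(a, b):
    """Naive triple-loop matrix multiply: O(n*m*k) scalar operations."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=a.dtype)
    for i in range(n):            # each row of a
        for j in range(m):        # each column of b
            acc = 0.0
            for p in range(k):    # dot product along the shared dimension
                acc += a[i, p] * b[p, j]
            out[i, j] = acc
    return out

Every arithmetic step is explicit here, which is exactly what makes the later optimizations visible.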

2. Use: Performance Optimization

  • Apply SIMD vectorization for CPU optimization (see the sketch below)
  • Implement cache-friendly memory layouts
  • Explore GPU-style parallel computing concepts
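
In pure Python, SIMD is best approximated by handing whole arrays to NumPy's compiled ufunc loops, which the C compiler can auto-vectorize. A minimal sketch contrasting the two styles (illustrative; the module's vectorized_relu() may differ):

import numpy as np

def relu_scalar(x):
    # One Python-level branch per element: interpreter overhead, no SIMD.
    out = np.empty_like(x)
    for idx in np.ndindex(x.shape):
        v = x[idx]
        out[idx] = v if v > 0 else 0
    return out

def vectorized_relu(x):
    # One ufunc call: NumPy's compiled loop handles many elements per instruction.
    return np.maximum(x, 0)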

3. Optimize: Real-World Integration

  • Profile and benchmark performance improvements (see the timing sketch below)
  • Integrate with compressed models from Module 10
  • Apply systematic evaluation to validate optimizations
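
Putting the pieces together, a first validation might time the naive baseline against NumPy's BLAS-backed matmul. This reuses the hypothetical matmul_baseline() sketched earlier; sizes are illustrative:

import timeit
import numpy as np

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

t_naive = timeit.timeit(lambda: matmul_baseline(a, b), number=1)  # from the sketch above
t_blas = timeit.timeit(lambda: a @ b, number=10) / 10
print(f"naive: {t_naive:.3f}s  optimized: {t_blas:.5f}s  speedup: {t_naive / t_blas:,.0f}x")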

📚 What You'll Build

Core Operations

  • Matrix Multiplication: Custom matmul_baseline() and cache_friendly_matmul() (tiling sketched below)
  • Activation Functions: vectorized_relu() and parallel_relu()
  • Batch Processing: parallel_batch_processing() with multiprocessing
  • Quantized Operations: quantized_matmul() and quantized_relu()
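
The core idea behind cache_friendly_matmul() is tiling: process the matrices in blocks small enough that each tile stays resident in cache while it is reused. A hedged sketch, where the block size and the NumPy-based inner tiles are illustrative choices, not necessarily the module's:

import numpy as np

def cache_friendly_matmul(a, b, block=64):
    """Blocked (tiled) matmul: each tile is reused many times before eviction."""
    n, k = a.shape
    _, m = b.shape
    out = np.zeros((n, m), dtype=np.result_type(a, b))
    for i0 in range(0, n, block):
        for k0 in range(0, k, block):
            a_tile = a[i0:i0 + block, k0:k0 + block]  # stays hot in cache
            for j0 in range(0, m, block):
                out[i0:i0 + block, j0:j0 + block] += a_tile @ b[k0:k0 + block, j0:j0 + block]
    return out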

Performance Tools

  • Profiling: profile_operation() for detailed timing analysis
  • Benchmarking: benchmark_operation() for statistical validation (sketched below)
  • Memory Analysis: Cache-friendly data layout optimization
  • Parallel Computing: Multi-core processing patterns
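
Single timings are noisy; robust validation needs warm-up runs and summary statistics. A hypothetical helper in the spirit of benchmark_operation() (the module's real signature may differ):

import statistics
import time

def benchmark_operation(fn, *args, repeats=20, warmup=3):
    """Time fn(*args) repeatedly and report robust summary statistics."""
    for _ in range(warmup):                # prime caches and allocators
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),  # robust central tendency
        "min_s": min(samples),                   # best case, least noise
        "stdev_s": statistics.stdev(samples),    # run-to-run variation
    }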

🛠️ Key Components

Hardware-Optimized Operations

  • Purpose: Implement custom ML operations with hardware awareness
  • Methods: matmul_baseline(), vectorized_relu(), cache_friendly_matmul()
  • Learning: Understanding computational patterns beyond NumPy

Parallel Processing Framework

  • Purpose: Multi-core optimization for batch operations
  • Methods: parallel_batch_processing(), parallel_relu() (multiprocessing sketch below)
  • Learning: GPU-style parallel computing concepts
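
A minimal sketch of the multiprocessing pattern, assuming a simple split-apply-concatenate design; worker count and chunking are illustrative, and for an operation this cheap, inter-process copying often outweighs the parallel gain:

from multiprocessing import Pool

import numpy as np

def _relu_chunk(chunk):
    return np.maximum(chunk, 0)      # work done independently per chunk

def parallel_batch_processing(batch, workers=4):
    """Split a batch across processes, GPU-style: many independent workers."""
    chunks = np.array_split(batch, workers, axis=0)
    with Pool(processes=workers) as pool:
        results = pool.map(_relu_chunk, chunks)
    return np.concatenate(results, axis=0)

if __name__ == "__main__":           # guard required for multiprocessing
    x = np.random.randn(1024, 512)
    assert np.array_equal(parallel_batch_processing(x), np.maximum(x, 0))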

Quantization Integration

  • Purpose: Hardware-optimized operations for compressed models
  • Methods: quantized_matmul(), quantized_relu() (int8 sketch below)
  • Learning: Bridging compression and performance optimization
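
The key trick is doing the multiply-accumulate in low-precision integers and deferring the scale factors to the end. A hedged int8 sketch; symmetric per-tensor scales are an assumption here, and the module's quantized_matmul() may use a different scheme:

import numpy as np

def quantize_int8(x):
    """Symmetric quantization: x ~ scale * q with q in [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantized_matmul(qa, scale_a, qb, scale_b):
    # Accumulate in int32 to avoid int8 overflow; rescale to float once at the end.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc * (scale_a * scale_b)

a, b = np.random.randn(64, 64), np.random.randn(64, 64)
(qa, sa), (qb, sb) = quantize_int8(a), quantize_int8(b)
print(np.abs(quantized_matmul(qa, sa, qb, sb) - a @ b).max())  # small error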

Performance Profiling

  • Purpose: Systematic measurement and validation of optimizations
  • Methods: profile_operation(), benchmark_operation()
  • Learning: Evidence-based performance engineering

🌟 Real-World Applications

Industry Examples

  • Google TPUs: Custom hardware for ML operations
  • Intel MKL: Optimized math libraries for CPU performance
  • NVIDIA cuDNN: GPU-accelerated neural network operations
  • Apple Neural Engine: Hardware-specific ML acceleration

Performance Patterns

  • Memory Layout: Row-major vs column-major access patterns (demonstrated below)
  • Vectorization: SIMD instructions for parallel computation
  • Cache Optimization: Data locality for memory hierarchy
  • Parallel Processing: Multi-core utilization strategies
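
The memory-layout effect is easy to measure: copying a C-order (row-major) array in storage order streams through memory, while copying its transpose forces strided reads that waste most of each cache line. On most machines the gap is severalfold:

import timeit
import numpy as np

x = np.random.rand(4096, 4096)  # C order: each row is contiguous in memory

sequential = timeit.timeit(lambda: x.copy(), number=10)  # streams cache lines
strided = timeit.timeit(lambda: x.T.copy(), number=10)   # strided reads
print(f"sequential copy: {sequential:.3f}s  transposed copy: {strided:.3f}s")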

🚀 Getting Started

Prerequisites Check

tito test --module compression  # Should pass
tito status --module 11_kernels  # Check module status

Development Workflow

cd modules/source/11_kernels
jupyter notebook kernels_dev.py  # or edit directly

Testing Your Implementation

# Run the inline tests within the notebook first, then the full suite:
tito test --module kernels

📖 Module Structure

modules/source/11_kernels/
├── kernels_dev.py      # Main development file
├── README.md           # This overview
└── module.yaml         # Module configuration

🔗 Integration Points

Input from Previous Modules

  • Tensor operations → Custom implementations
  • Compressed models → Hardware-optimized execution
  • Training workflows → Performance-critical operations

Output to Next Modules

  • Benchmarking → Operations to evaluate systematically
  • MLOps → Production-ready optimized operations
  • Complete system → End-to-end performance optimization

🎓 Educational Philosophy

This module bridges the gap between algorithmic understanding and systems performance. Students learn that optimization isn't just about better algorithms; it's about understanding how algorithms interact with hardware to achieve real-world performance gains.

By the end, you'll think like a performance engineer, not just a machine learning practitioner.