🚀 Module 11: Kernels - Hardware-Aware Optimization
📊 Module Info
- Difficulty: ⭐⭐⭐⭐⭐ Expert
- Time Estimate: 8-10 hours
- Prerequisites: All previous modules (00-10), especially Compression
- Next Steps: Benchmarking, MLOps modules
Bridge the gap between algorithmic optimization and hardware-level performance engineering
🎯 Learning Objectives
After completing this module, you will:
- Understand how to implement custom ML operations beyond NumPy
- Apply SIMD vectorization and CPU optimization techniques
- Optimize memory layout and access patterns for cache efficiency
- Implement GPU-style parallel computing concepts
- Build comprehensive performance profiling and benchmarking tools
- Create hardware-optimized operations for quantized and pruned models
🔗 Connection to Previous Modules
What You Already Know
- Compression (Module 10): What to optimize (model size, computation)
- Layers (Module 03): Basic matrix multiplication with `matmul()`
- Training (Module 09): High-level optimization workflows
- Networks (Module 04): How operations compose into architectures
The Performance Gap
At this point you understand algorithmic optimization, but not yet hardware optimization:
- ✅ Algorithmic: Pruning, quantization, knowledge distillation
- ❌ Hardware: Memory layout, vectorization, parallel processing
🧠 Build → Use → Optimize
This module follows the "Build → Use → Optimize" pedagogical framework:
1. Build: Custom Operations
- Move beyond NumPy's black-box implementations
- Implement specialized matrix multiplication and activations
- Understand the computational patterns underlying ML
2. Use: Performance Optimization
- Apply SIMD vectorization for CPU optimization
- Implement cache-friendly memory layouts
- Build GPU-style parallel computing concepts
3. Optimize: Real-World Integration
- Profile and benchmark performance improvements
- Integrate with compressed models from Module 10
- Bridge to production deployment considerations
📚 What You'll Build
Step 1: Understanding Custom Operations
```python
# Build on TinyTorch's proven implementations
def matmul_baseline(A, B):
    # Use TinyTorch's reliable matmul as the correctness baseline
    return matmul(A.data, B.data)
```
Step 2: SIMD Vectorization
```python
import numpy as np

# CPU optimization with vector operations
def vectorized_relu(x):
    # SIMD-optimized activation via NumPy's vectorized maximum
    return np.maximum(0, x)

def vectorized_operations(x, y):
    # Element-wise operations that map onto SIMD instructions
    return {
        'multiply': x * y,
        'add': x + y,
        'squared_diff': (x - y) ** 2,
    }
```
Step 3: Memory Layout Optimization
```python
# Cache-friendly data structures
def cache_friendly_matmul(A, B, block_size=32):
    # Blocked matrix multiplication for better cache utilization
    ...  # tile-by-tile accumulation, sketched below
```
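A minimal runnable sketch of the blocked approach (one standard way to implement it; the version you build in `kernels_dev.py` may differ):

```python
import numpy as np

def cache_friendly_matmul(A, B, block_size=32):
    # Accumulate C tile by tile; each tile of A, B, and C is small
    # enough to stay resident in cache while it is being reused.
    n, k, m = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, m))
    for i in range(0, n, block_size):
        for p in range(0, k, block_size):
            for j in range(0, m, block_size):
                C[i:i+block_size, j:j+block_size] += (
                    A[i:i+block_size, p:p+block_size]
                    @ B[p:p+block_size, j:j+block_size]
                )
    return C
```

NumPy slicing clips at array bounds, so shapes that don't divide evenly by `block_size` need no special cases; `block_size` is a tunable you pick to roughly match your L1/L2 cache.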
Step 4: GPU-Style Parallel Computing
```python
from concurrent.futures import ThreadPoolExecutor

# Parallel processing patterns
def parallel_relu(x, num_workers=4):
    # Multi-core CPU utilization: split into chunks, apply ReLU per chunk
    chunks = np.array_split(x, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(lambda chunk: np.maximum(0, chunk), chunks))
    return np.concatenate(results)

def parallel_batch_processing(batch_data, operation, num_workers=4):
    # Process multiple tensors simultaneously
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(operation, batch_data))
```
Step 5: Performance Profiling
```python
# Measure and optimize performance
profiler = SimpleProfiler()
result, metrics = profiler.profile(kernel_function, *args)
print(f"Wall time: {metrics['wall_time']:.4f}s")
```
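As a mental model, a minimal `SimpleProfiler` might look like the following sketch (an assumption for illustration; the class you build in `kernels_dev.py` may track more metrics):

```python
import time
import tracemalloc

class SimpleProfiler:
    # Minimal profiler sketch: wall time and peak memory for one call.
    def profile(self, fn, *args, **kwargs):
        tracemalloc.start()
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        wall_time = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return result, {'wall_time': wall_time, 'peak_memory': peak}
```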
Step 6: Compressed Model Kernels
```python
# Hardware-optimized operations for compressed models
def quantized_matmul(A, B, scale_A=1.0, scale_B=1.0):
    # INT8 matrix multiplication: accumulate in int32, then rescale to float
    acc = A.astype(np.int32) @ B.astype(np.int32)
    return acc * (scale_A * scale_B)

def quantized_relu(x, scale=1.0):
    # Integer-domain ReLU: clamping at zero commutes with dequantization
    return np.maximum(0, x)
```
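A quick end-to-end sanity check (illustrative; the `quantize` helper below is hypothetical, not part of the module's API):

```python
import numpy as np

def quantize(x, scale):
    # Hypothetical helper: symmetric int8 quantization
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

A, B = np.random.randn(16, 16), np.random.randn(16, 16)
scale_A = np.abs(A).max() / 127
scale_B = np.abs(B).max() / 127

approx = quantized_matmul(quantize(A, scale_A), quantize(B, scale_B),
                          scale_A=scale_A, scale_B=scale_B)
print(np.abs(approx - A @ B).max())  # small quantization error
```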
🎓 Learning Path
Foundation Level: Understanding Implementation
- See what happens inside NumPy operations
- Build on TinyTorch's proven components
- Debug performance bottlenecks
Intermediate Level: CPU Optimization
- Apply vectorization techniques
- Optimize memory access patterns
- Understand cache behavior
Advanced Level: Parallel Computing
- Implement GPU-style parallel algorithms
- Profile and benchmark performance
- Integrate with compressed models
Expert Level: Production Integration
- Build kernels for real deployment scenarios
- Optimize for specific hardware targets
- Connect to MLOps and monitoring systems
🔧 Technical Skills Developed
Low-Level Programming
- Manual memory management
- Understanding of CPU architecture
- Assembly-level optimization concepts
Performance Engineering
- Profiling and benchmarking with SimpleProfiler
- Bottleneck identification
- Performance optimization strategies
Parallel Computing
- Thread-level parallelism with ThreadPoolExecutor
- SIMD vectorization principles
- GPU computing concepts
Systems Integration
- Hardware-software co-design
- Production deployment considerations
- Real-world performance constraints
🎯 Real-World Applications
Production ML Systems
- Custom kernels for edge deployment
- Hardware-specific optimizations
- Real-time inference requirements
Research and Development
- Prototype new operations
- Benchmark algorithm improvements
- Understand performance trade-offs
MLOps and Deployment
- Optimize for specific hardware
- Monitor performance in production
- Scale to distributed systems
🚀 Getting Started
Prerequisites Check
- ✅ Complete all previous modules (00-10)
- ✅ Understand compression techniques
- ✅ Familiar with NumPy operations
- ✅ Basic understanding of computer architecture
Development Setup
```bash
# Navigate to the kernels module
cd modules/source/11_kernels

# Work in the development file
code kernels_dev.py

# Or work in the Jupyter notebook
jupyter notebook kernels_dev.ipynb
```
📖 Module Structure
```
modules/source/11_kernels/
├── kernels_dev.py      # Main development file (work here!)
├── kernels_dev.ipynb   # Jupyter notebook version
├── README.md           # This file
└── module.yaml         # Module metadata
```
🧪 Testing Your Implementation
Inline Testing
```python
# All tests are inline within kernels_dev.py
def test_matmul_baseline():
    # Test baseline matrix multiplication
    pass

def test_vectorized_operations():
    # Test SIMD vectorization
    pass

def test_cache_friendly_matmul():
    # Test cache optimization
    pass

def test_parallel_processing():
    # Test parallel computing
    pass

def test_performance_profiling():
    # Test profiling tools
    pass

def test_compressed_kernels():
    # Test quantized operations
    pass

def final_performance_test():
    # Comprehensive performance comparison
    pass
```
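As a concrete example, one of these tests might assert exact agreement with the sequential reference (a sketch; the actual inline tests live in `kernels_dev.py`):

```python
import numpy as np

def test_parallel_processing():
    # Parallel ReLU must agree exactly with the single-threaded version
    x = np.random.randn(10_000)
    expected = np.maximum(0, x)
    np.testing.assert_array_equal(parallel_relu(x, num_workers=4), expected)
```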
Performance Benchmarking
```python
# Run comprehensive performance tests
final_performance_test()
```
🎯 Success Criteria
You've mastered hardware-aware optimization when:
- ✅ Can implement custom ML operations building on TinyTorch components
- ✅ Understand CPU optimization techniques (SIMD, caching)
- ✅ Can profile and benchmark performance improvements
- ✅ Successfully integrate with compressed models
- ✅ Bridge algorithmic and hardware optimization
🔍 Common Challenges
Performance Debugging
- Use SimpleProfiler to identify bottlenecks
- Understand the difference between algorithmic and implementation efficiency
- Learn to read performance metrics
Hardware Complexity
- Start with CPU optimization before GPU concepts
- Focus on understanding principles, not memorizing details
- Use abstractions to manage complexity
Integration Complexity
- Test kernels independently before integration
- Verify correctness before optimizing for performance (see the sketch below)
- Maintain compatibility with existing TinyTorch components
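That verify-then-profile habit can be encoded directly, reusing names from the steps above (illustrative):

```python
import numpy as np

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)

# 1. Correctness first: the blocked kernel must match the reference result
np.testing.assert_allclose(cache_friendly_matmul(A, B), A @ B, rtol=1e-10)

# 2. Only then measure performance
profiler = SimpleProfiler()
_, metrics = profiler.profile(cache_friendly_matmul, A, B)
print(f"Wall time: {metrics['wall_time']:.4f}s")
```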
🚀 What's Next
After completing this module, you're ready for:
- Module 12: Benchmarking - Systematic performance measurement
- Module 13: MLOps - Production deployment and monitoring
📊 Performance Insights
Performance Hierarchy
```
Python loops:       1x speed     (baseline, interpreted)
NumPy operations:   ~10x speed   (vectorized)
Optimized kernels:  ~100x speed  (hardware-aware)
GPU kernels:        ~1000x speed (massive parallelism)
```
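You can verify the first two rungs yourself (a sketch; exact ratios depend on your machine):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

start = time.perf_counter()
total = 0.0
for v in x:            # interpreted loop: per-element bytecode dispatch
    total += v
t_loop = time.perf_counter() - start

start = time.perf_counter()
total = x.sum()        # one vectorized C call over contiguous memory
t_numpy = time.perf_counter() - start

print(f"loop: {t_loop:.3f}s  numpy: {t_numpy:.4f}s  speedup: {t_loop / t_numpy:.0f}x")
```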
Memory Hierarchy
```
CPU registers:  ~1 cycle     (fastest, tiny)
L1 cache:       ~3 cycles    (fast, small)
L2 cache:       ~10 cycles   (medium, medium)
L3 cache:       ~40 cycles   (slow, large)
Main memory:    ~200+ cycles (slowest, huge)
```
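The hierarchy is visible even from Python: gathering the same elements in random order defeats the prefetcher and turns most reads into cache misses (a sketch; the gap depends on your cache sizes):

```python
import time
import numpy as np

n = 10_000_000
data = np.random.rand(n)        # ~80 MB, far larger than L3
seq_idx = np.arange(n)
rand_idx = np.random.permutation(n)

start = time.perf_counter()
_ = data[seq_idx].sum()         # sequential gather: streams through memory
t_seq = time.perf_counter() - start

start = time.perf_counter()
_ = data[rand_idx].sum()        # random gather: a cache miss per element
t_rand = time.perf_counter() - start

print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")
```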
Real-World Impact
Hardware-aware optimization can yield order-of-magnitude wins; illustrative examples:
- Training time: 10 hours → 1 hour
- Inference cost: $1000/month → $100/month
- Energy use: up to 90% reduction
🏆 What Students Build
By the end of this module, students have implemented:
- `matmul_baseline()` - Reliable matrix multiplication built on TinyTorch
- `vectorized_relu()` - SIMD-optimized ReLU activation
- `vectorized_operations()` - Vectorized element-wise operations
- `cache_friendly_matmul()` - Blocked matrix multiplication
- `parallel_relu()` - Multi-core CPU utilization
- `parallel_batch_processing()` - Batch processing with worker threads
- `quantized_matmul()` - INT8 matrix multiplication
- `quantized_relu()` - Integer-domain ReLU
- Performance profiling - Benchmarking with `SimpleProfiler`
- Final performance test - Comprehensive comparison of all implementations
Students understand how modern ML frameworks like PyTorch (2000+ CUDA kernels) and TensorFlow (XLA compiler) achieve their performance through hardware-aware optimization.