# πŸ”₯ Module: Kernels ## πŸ“Š Module Info - **Difficulty**: ⭐⭐⭐⭐ Expert - **Time Estimate**: 8-10 hours - **Prerequisites**: All previous modules (01-11), especially Compression - **Next Steps**: Benchmarking, MLOps modules Bridge the gap between algorithmic optimization and hardware-level performance engineering. This module teaches the systems programming skills that make ML frameworks fastβ€”moving beyond NumPy's black box to understand how computation really works on modern hardware. ## 🎯 Learning Objectives By the end of this module, you will be able to: - **Master hardware-aware programming**: Understand CPU cache hierarchies, SIMD vectorization, and memory layout optimization - **Implement custom ML operations**: Build matrix multiplication, activations, and batch processing from scratch with performance awareness - **Apply parallel computing principles**: Use multi-core processing and GPU-style parallelism for ML workloads - **Optimize compressed models**: Create hardware-efficient operations for quantized and pruned neural networks - **Build performance engineering workflows**: Develop profiling, benchmarking, and optimization methodologies ## 🧠 Build β†’ Use β†’ Optimize This module follows TinyTorch's **Build β†’ Use β†’ Optimize** framework: 1. **Build**: Implement custom ML operations with hardware awareness, moving beyond NumPy to understand computational patterns 2. **Use**: Apply SIMD vectorization, cache optimization, and parallel processing to real ML workloads 3. **Optimize**: Profile performance systematically, integrate with compressed models, and achieve measurable speedups ## πŸ“š What You'll Build ### Hardware-Optimized Core Operations ```python # Custom matrix multiplication with cache awareness import numba from multiprocessing import Pool # Baseline implementation for understanding def matmul_baseline(A, B): """Reference implementation showing the core computation""" rows_A, cols_A = A.shape rows_B, cols_B = B.shape C = np.zeros((rows_A, cols_B)) for i in range(rows_A): for j in range(cols_B): for k in range(cols_A): C[i, j] += A[i, k] * B[k, j] return C # Cache-friendly optimized version def cache_friendly_matmul(A, B): """Optimized for memory access patterns and cache efficiency""" # Implementation with blocked matrix multiplication # and memory-friendly access patterns pass # Performance comparison baseline_time = profile_operation(matmul_baseline, A, B) optimized_time = profile_operation(cache_friendly_matmul, A, B) speedup = baseline_time / optimized_time print(f"Speedup: {speedup:.2f}x") ``` ### SIMD Vectorized Operations ```python # Vectorized activation functions @numba.jit(nopython=True) def vectorized_relu(x): """SIMD-optimized ReLU using numba compilation""" return np.maximum(0, x) # Parallel batch processing def parallel_batch_processing(batch_data, operation, num_workers=4): """Multi-core processing for batch operations""" with Pool(num_workers) as pool: results = pool.map(operation, batch_data) return np.array(results) # Compare single-threaded vs parallel single_time = profile_operation(sequential_relu, large_batch) parallel_time = profile_operation(parallel_relu, large_batch) efficiency = single_time / (parallel_time * num_cores) print(f"Parallel efficiency: {efficiency:.2f}") ``` ### Quantized Operation Optimization ```python # Hardware-optimized quantized operations def quantized_matmul(A_int8, B_int8, scale_A, scale_B, zero_point_A, zero_point_B): """INT8 matrix multiplication with proper scaling""" # Use INT32 accumulator to prevent overflow C_int32 = np.dot(A_int8.astype(np.int32), B_int8.astype(np.int32)) # Apply scaling and zero-point corrections scale_C = scale_A * scale_B C_float = scale_C * (C_int32 - zero_point_corrections) return C_float # Measure memory and compute benefits fp32_memory = measure_memory_usage(model_fp32) int8_memory = measure_memory_usage(model_int8) memory_reduction = fp32_memory / int8_memory print(f"Memory reduction: {memory_reduction:.1f}x") ``` ### Performance Profiling Framework ```python # Comprehensive operation profiling class PerformanceProfiler: def __init__(self): self.results = {} def profile_operation(self, operation, *args, num_runs=100): """Statistical profiling with multiple runs""" times = [] for _ in range(num_runs): start = time.perf_counter() result = operation(*args) end = time.perf_counter() times.append(end - start) return { 'mean_time': np.mean(times), 'std_time': np.std(times), 'min_time': np.min(times), 'max_time': np.max(times) } def compare_operations(self, baseline_op, optimized_op, *args): """Compare two implementations statistically""" baseline_stats = self.profile_operation(baseline_op, *args) optimized_stats = self.profile_operation(optimized_op, *args) speedup = baseline_stats['mean_time'] / optimized_stats['mean_time'] significance = self.statistical_significance(baseline_stats, optimized_stats) return {'speedup': speedup, 'significant': significance} ``` ## πŸš€ Getting Started ### Prerequisites Ensure you have mastered the optimization foundation: ```bash # Activate TinyTorch environment source bin/activate-tinytorch.sh # Verify all prerequisite modules tito test --module compression # Essential for integration tito test --module training # Understanding of ML workflows ``` ### Development Workflow 1. **Open the development file**: `modules/source/12_kernels/kernels_dev.py` 2. **Implement baseline operations**: Build reference implementations for understanding 3. **Add hardware optimizations**: Apply SIMD, cache optimization, and parallelization 4. **Create quantized operations**: Build INT8 and hardware-efficient variants 5. **Build profiling tools**: Develop systematic performance measurement 6. **Export and verify**: `tito export --module kernels && tito test --module kernels` ## πŸ§ͺ Testing Your Implementation ### Comprehensive Test Suite Run the full test suite to verify performance optimization functionality: ```bash # TinyTorch CLI (recommended) tito test --module kernels # Direct pytest execution python -m pytest tests/ -k kernels -v ``` ### Test Coverage Areas - βœ… **Operation Correctness**: Verify optimized operations produce identical results to baselines - βœ… **Performance Improvements**: Measure and validate actual speedups from optimizations - βœ… **Hardware Utilization**: Test SIMD usage, cache efficiency, and parallel scaling - βœ… **Quantization Integration**: Verify INT8 operations maintain accuracy while improving performance - βœ… **Profiling Accuracy**: Ensure performance measurement tools provide reliable statistics ### Inline Testing & Performance Analysis The module includes comprehensive performance validation and optimization verification: ```python # Example inline test output πŸ”¬ Unit Test: Cache-friendly matrix multiplication... βœ… Correctness: Results match NumPy reference βœ… Performance: 2.3x speedup over baseline βœ… Memory efficiency: 40% reduction in cache misses πŸ“ˆ Progress: Optimized Matrix Operations βœ“ # SIMD vectorization testing πŸ”¬ Unit Test: Vectorized ReLU implementation... βœ… SIMD utilization: 8-wide vectors detected βœ… Throughput: 4.1x improvement over scalar code βœ… Batch scaling: Linear performance with data size πŸ“ˆ Progress: Vectorized Operations βœ“ # Quantization optimization πŸ”¬ Unit Test: INT8 quantized operations... βœ… Accuracy preservation: <0.1% degradation βœ… Memory reduction: 4x smaller model size βœ… Compute speedup: 2.8x faster inference πŸ“ˆ Progress: Quantized Kernels βœ“ ``` ### Manual Testing Examples ```python from kernels_dev import matmul_baseline, cache_friendly_matmul, PerformanceProfiler import numpy as np # Create test matrices A = np.random.randn(1000, 500).astype(np.float32) B = np.random.randn(500, 800).astype(np.float32) # Compare implementations profiler = PerformanceProfiler() baseline_result = matmul_baseline(A, B) optimized_result = cache_friendly_matmul(A, B) # Verify correctness np.testing.assert_allclose(baseline_result, optimized_result, rtol=1e-5) print("βœ… Optimized implementation matches baseline") # Measure performance comparison = profiler.compare_operations(matmul_baseline, cache_friendly_matmul, A, B) print(f"Speedup: {comparison['speedup']:.2f}x") print(f"Statistically significant: {comparison['significant']}") ``` ## 🎯 Key Concepts ### Real-World Applications - **PyTorch/TensorFlow**: Production ML frameworks use similar kernel optimization techniques - **Intel MKL/OpenBLAS**: Optimized math libraries employ cache-friendly algorithms and SIMD instructions - **NVIDIA cuDNN**: GPU libraries optimize memory access patterns and parallel computation - **Google TPUs**: Custom hardware accelerators use similar quantization and optimization principles ### Hardware Performance Fundamentals - **CPU Cache Hierarchy**: L1/L2/L3 cache optimization through memory access pattern design - **SIMD Vectorization**: Single Instruction Multiple Data processing for parallel computation - **Memory Layout**: Row-major vs column-major access patterns and cache line utilization - **Parallel Computing**: Multi-core CPU utilization and GPU-style parallel programming patterns ### Optimization Techniques - **Algorithmic Optimization**: Choosing algorithms that match hardware characteristics - **Memory Optimization**: Cache-friendly data structures and access patterns - **Vectorization**: SIMD instruction utilization for parallel arithmetic operations - **Quantization Integration**: Hardware-efficient low-precision computation ### Performance Engineering Methodology - **Profiling-Driven Optimization**: Measure first, optimize second, validate third - **Statistical Validation**: Ensuring performance improvements are statistically significant - **Bottleneck Analysis**: Identifying and addressing the most impactful performance constraints - **Hardware-Software Co-design**: Understanding hardware capabilities and designing software accordingly ## πŸŽ‰ Ready to Build? You're about to learn the systems programming skills that make modern ML frameworks fast! This is where computer science meets practical engineeringβ€”understanding how algorithms interact with hardware to achieve real performance gains. From smartphone AI to data center training, all efficient ML systems depend on the optimization techniques you're about to master. You'll think like a performance engineer, understanding not just what to compute but how to compute it efficiently. Take your time, profile everything, and enjoy building systems that are both intelligent and fast! ```{grid} 3 :gutter: 3 :margin: 2 {grid-item-card} πŸš€ Launch Builder :link: https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/12_kernels/kernels_dev.py :class-title: text-center :class-body: text-center Interactive development environment {grid-item-card} πŸ““ Open in Colab :link: https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/12_kernels/kernels_dev.ipynb :class-title: text-center :class-body: text-center Google Colab notebook {grid-item-card} πŸ‘€ View Source :link: https://github.com/VJProductions/TinyTorch/blob/main/modules/source/12_kernels/kernels_dev.py :class-title: text-center :class-body: text-center Browse the code on GitHub ```