Add TinyTorch Profiler Utility

- Add tinytorch.utils.profiler following PyTorch's utils pattern
- Includes SimpleProfiler class for educational performance measurement
- Provides timing, memory usage, and system metrics
- Follows PyTorch's torch.utils.* organizational pattern
- Module 11: Kernels uses profiler for performance demonstrations

Features:
- Wall time and CPU time measurement
- Memory usage tracking (peak, delta, percentages)
- Array information (shape, size, dtype)
- CPU and system metrics
- Clean educational interface for ML performance learning

Import pattern:
  from tinytorch.utils.profiler import SimpleProfiler
Vijay Janapa Reddi
2025-07-14 13:04:44 -04:00
parent 100f0cc3fa
commit 4ea5a4e024
7 changed files with 2919 additions and 0 deletions


@@ -0,0 +1,264 @@
# 🚀 Module 11: Kernels - Hardware-Aware Optimization
## 📊 Module Info
- **Difficulty**: ⭐⭐⭐⭐⭐ Expert
- **Time Estimate**: 8-10 hours
- **Prerequisites**: All previous modules (00-10), especially Compression
- **Next Steps**: Benchmarking, MLOps modules
**Bridge the gap between algorithmic optimization and hardware-level performance engineering**
## 🎯 Learning Objectives
After completing this module, you will:
- Understand how to implement custom ML operations beyond NumPy
- Apply SIMD vectorization and CPU optimization techniques
- Optimize memory layout and access patterns for cache efficiency
- Implement GPU-style parallel computing concepts
- Build comprehensive performance profiling and benchmarking tools
- Create hardware-optimized operations for quantized and pruned models
## 🔗 Connection to Previous Modules
### What You Already Know
- **Compression (Module 10)**: *What* to optimize (model size, computation)
- **Layers (Module 03)**: Basic matrix multiplication with `matmul()`
- **Training (Module 09)**: High-level optimization workflows
- **Networks (Module 04)**: How operations compose into architectures
### The Performance Gap
Students understand **algorithmic optimization** but not **hardware optimization**:
- **Algorithmic**: Pruning, quantization, knowledge distillation
- **Hardware**: Memory layout, vectorization, parallel processing
## 🧠 Build → Use → Optimize
This module follows the **"Build → Use → Optimize"** pedagogical framework:
### 1. **Build**: Custom Operations
- Move beyond NumPy's black box implementations
- Implement matrix multiplication, convolution, and activations from scratch
- Understand the computational patterns underlying ML
### 2. **Use**: Performance Optimization
- Apply SIMD vectorization for CPU optimization
- Implement cache-friendly memory layouts
- Build GPU-style parallel computing concepts
### 3. **Optimize**: Real-World Integration
- Profile and benchmark performance improvements
- Integrate with compressed models from Module 10
- Bridge to production deployment considerations
## 📚 What You'll Build
### **Step 1: Understanding Custom Operations**
```python
# Move beyond NumPy to custom implementations
def matmul_custom(A, B):
    # Your low-level implementation
    return result

def relu_custom(x):
    # Understanding what happens inside activation functions
    return np.maximum(0, x)
```
### **Step 2: SIMD Vectorization**
```python
# CPU optimization with vector operations
def matmul_vectorized(A, B):
    # Use SIMD instructions for parallel computation
    return optimized_result
```
### **Step 3: Memory Layout Optimization**
```python
# Cache-friendly data structures
def matmul_cache_optimized(A, B):
    # Optimize memory access patterns
    return cache_friendly_result
```
### **Step 4: GPU-Style Parallel Computing**
```python
# Understand parallel computing concepts
def matmul_parallel(A, B):
    # Parallel processing patterns
    return parallel_result
```
### **Step 5: Performance Profiling**
```python
# Measure and optimize performance
profiler = SimpleProfiler(track_memory=True, track_cpu=True)
profiler.profile(matmul_custom, A, B, name="Custom")
profiler.profile(matmul_vectorized, A, B, name="Vectorized")
profiler.profile(matmul_parallel, A, B, name="Parallel")
```
### **Step 6: Compressed Model Kernels**
```python
# Hardware-optimized operations for compressed models
def quantized_matmul(A_int8, B_int8):
    # Optimized kernels for quantized models
    return result

def sparse_matmul(A_sparse, B):
    # Efficient sparse matrix operations
    return result
```
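To make Step 6 concrete, here is a minimal sketch of a quantized kernel, assuming simple symmetric int8 quantization. The name `quantized_matmul_sketch` and its scale handling are illustrative only, not the module's final API:

```python
import numpy as np

np.random.seed(0)  # reproducible demo

def quantized_matmul_sketch(A_int8, B_int8, scale_a, scale_b):
    # Accumulate in int32 (int8 products overflow int8), then dequantize
    acc = A_int8.astype(np.int32) @ B_int8.astype(np.int32)
    return acc.astype(np.float32) * (scale_a * scale_b)

# Symmetric int8 quantization of float inputs for the demo
A = np.random.randn(4, 4).astype(np.float32)
B = np.random.randn(4, 4).astype(np.float32)
scale_a = float(np.abs(A).max()) / 127.0
scale_b = float(np.abs(B).max()) / 127.0
A_q = np.round(A / scale_a).astype(np.int8)
B_q = np.round(B / scale_b).astype(np.int8)

approx = quantized_matmul_sketch(A_q, B_q, scale_a, scale_b)
print("max abs error vs float matmul:", np.max(np.abs(approx - A @ B)))
```

The integer matmul is cheap on real hardware; the error you print is the price of quantization, and it stays small because the accumulation happens in int32.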
## 🎓 Learning Path
### **Foundation Level**: Understanding Implementation
- See what happens inside NumPy operations
- Implement basic kernels with explicit loops
- Debug performance bottlenecks
### **Intermediate Level**: CPU Optimization
- Apply vectorization techniques
- Optimize memory access patterns
- Understand cache behavior
### **Advanced Level**: Parallel Computing
- Implement GPU-style parallel algorithms
- Profile and benchmark performance
- Integrate with compressed models
### **Expert Level**: Production Integration
- Build kernels for real deployment scenarios
- Optimize for specific hardware targets
- Connect to MLOps and monitoring systems
## 🔧 Technical Skills Developed
### **Low-Level Programming**
- Manual memory management
- Understanding of CPU architecture
- Assembly-level optimization concepts
### **Performance Engineering**
- Profiling and benchmarking
- Bottleneck identification
- Performance optimization strategies
### **Parallel Computing**
- Thread-level parallelism
- SIMD vectorization
- GPU computing concepts
### **Systems Integration**
- Hardware-software co-design
- Production deployment considerations
- Real-world performance constraints
## 🎯 Real-World Applications
### **Production ML Systems**
- Custom kernels for edge deployment
- Hardware-specific optimizations
- Real-time inference requirements
### **Research and Development**
- Prototype new operations
- Benchmark algorithm improvements
- Understand performance trade-offs
### **MLOps and Deployment**
- Optimize for specific hardware
- Monitor performance in production
- Scale to distributed systems
## 🚀 Getting Started
### Prerequisites Check
- ✅ Complete all previous modules (00-10)
- ✅ Understand compression techniques
- ✅ Familiar with NumPy operations
- ✅ Basic understanding of computer architecture
### Development Setup
```bash
# Navigate to the kernels module
cd modules/source/11_kernels
# Work in the development notebook
jupyter notebook kernels_dev.ipynb
# Or work in the Python file
code kernels_dev.py
```
## 📖 Module Structure
```
modules/source/11_kernels/
├── kernels_dev.py       # Main development file (work here!)
├── kernels_dev.ipynb    # Jupyter notebook version
├── tests/
│   └── test_kernels.py  # Performance and correctness tests
├── README.md            # This file
└── benchmarks/          # Performance benchmarking tools
```
## 🧪 Testing Your Implementation
### Performance Testing
```bash
# Run performance benchmarks
python -m pytest tests/test_kernels.py -v --benchmark
# Profile specific operations
python -c "from kernels_dev import benchmark_kernels; benchmark_kernels()"
```
### Integration Testing
```bash
# Test with compressed models
python -c "from kernels_dev import test_compressed_kernels; test_compressed_kernels()"
```
## 🎯 Success Criteria
You've mastered hardware-aware optimization when:
- ✅ You can implement custom ML operations from scratch
- ✅ You understand CPU optimization techniques (SIMD, caching)
- ✅ You can profile and benchmark performance improvements
- ✅ You can integrate kernels with compressed models
- ✅ You can bridge algorithmic and hardware optimization
## 🔍 Common Challenges
### **Performance Debugging**
- Use profiling tools to identify bottlenecks
- Understand the difference between algorithmic and implementation efficiency
- Learn to read performance metrics
### **Hardware Complexity**
- Start with CPU optimization before GPU concepts
- Focus on understanding principles, not memorizing details
- Use abstractions to manage complexity
### **Integration Complexity**
- Test kernels independently before integration
- Verify correctness before optimizing for performance
- Maintain compatibility with existing TinyTorch components
## 🚀 What's Next
After completing this module, you're ready for:
- **Module 12: Benchmarking** - Systematic performance measurement
- **Module 13: MLOps** - Production deployment and monitoring
- **Real-world applications** - Apply optimization skills to production systems
## 🤝 Getting Help
- Focus on understanding principles over memorizing techniques
- Use profiling tools to guide optimization decisions
- Connect optimization choices to real-world constraints
- Remember: **Build → Use → Optimize!**
---
**Ready to optimize ML systems for real-world performance?** 🚀
*This module bridges the gap between algorithmic optimization and hardware-level performance engineering, preparing you for production ML systems deployment.*

File diff suppressed because it is too large


@@ -0,0 +1,733 @@
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.17.1
# ---
# %% [markdown]
"""
# Module 11: Kernels - Hardware-Aware Optimization
Welcome to the Kernels module! This is where you'll learn to optimize ML operations for real hardware performance.
## Learning Goals
- Understand hardware optimization principles for ML systems
- Implement vectorized operations using SIMD capabilities
- Build cache-friendly algorithms with memory hierarchy awareness
- Create parallel processing implementations for multi-core systems
- Connect basic algorithms to production-level performance optimization
## Build → Use → Understand
1. **Build**: Implement hardware-aware optimization techniques from scratch
2. **Use**: Compare performance differences between basic and optimized implementations
3. **Understand**: How hardware characteristics drive optimization strategies
"""
# %% nbgrader={"grade": false, "grade_id": "kernels-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| default_exp core.kernels
import numpy as np
import time
import multiprocessing as mp
import sys
import os
from typing import Callable, Dict, Any, Optional, List, Tuple
# Import the basic matrix multiplication from Module 03
# This is the triple-loop implementation students built earlier
try:
    from tinytorch.core.layers import matmul_naive as matmul_from_module03
except ImportError:
    # For development, import from local modules
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers'))
    try:
        from layers_dev import matmul as matmul_from_module03
    except ImportError:
        # Fallback if we can't import, define it directly
        def matmul_from_module03(A: np.ndarray, B: np.ndarray) -> np.ndarray:
            rows_a, cols_a = A.shape
            rows_b, cols_b = B.shape
            if cols_a != rows_b:
                raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")
            result = np.zeros((rows_a, cols_b))
            for i in range(rows_a):
                for j in range(cols_b):
                    for k in range(cols_a):
                        result[i, j] += A[i, k] * B[k, j]
            return result

# Import shared profiling utilities
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'utils'))
from profiler import SimpleProfiler, profile_function
print("🚀 TinyTorch Kernels Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version.split()[0]}")
print(f"CPU count: {mp.cpu_count()}")
print("Ready to optimize ML systems for hardware!")
# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package
**Learning Side:** You work in `modules/source/11_kernels/kernels_dev.py`
**Building Side:** Code exports to `tinytorch.core.kernels`
```python
# Final package structure:
from tinytorch.core.kernels import matmul_vectorized, matmul_cache_optimized
from tinytorch.core.kernels import matmul_parallel # Hardware-optimized operations
```
**Why this matters:**
- **Learning:** Understand optimization from first principles
- **Production:** Real ML systems need hardware-aware implementations
- **Performance:** Bridge between theory and practical efficiency
- **Foundation:** Optimization skills transfer to all ML system components
"""
# %% [markdown]
"""
## Step 1: Understanding Our Baseline - The Module 03 Implementation
In Module 03, you implemented matrix multiplication using three nested loops. Let's use that **exact same function** as our baseline for optimization:
```python
# From Module 03 (your original implementation):
def matmul(A, B):
    rows_a, cols_a = A.shape
    rows_b, cols_b = B.shape
    result = np.zeros((rows_a, cols_b))
    for i in range(rows_a):
        for j in range(cols_b):
            for k in range(cols_a):
                result[i, j] += A[i, k] * B[k, j]
    return result
```
This is our **baseline** - the implementation we'll optimize in this module.
### Why This Baseline Matters
- **Your own code**: You built this in Module 03, so you understand it completely
- **Clear performance reference**: See exactly how much faster optimized versions are
- **Real improvement**: Measure actual performance gains of your optimizations
- **Educational value**: Connect basic concepts to hardware-aware programming
### Performance Characteristics of the Baseline
- **Time complexity**: O(n³) for n×n matrices
- **Memory access**: Not cache-friendly - jumps around memory
- **Parallelization**: No parallel execution
- **Vectorization**: No SIMD optimizations
**Goal**: Take this basic implementation and make it hardware-aware!
"""
# %%
#| export
def matmul_custom(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Baseline matrix multiplication using the exact same implementation from Module 03.

    This directly calls the matmul function you built in Module 03,
    showing the clear connection between modules and avoiding code duplication.
    """
    return matmul_from_module03(A, B)
# %% [markdown]
"""
## Step 2: Vectorized Operations - Leveraging SIMD
### What is Vectorization?
**Vectorization** means using Single Instruction, Multiple Data (SIMD) operations to process multiple data elements simultaneously. Modern CPUs can perform the same operation on multiple numbers at once.
### Why Vectorization Matters
- **SIMD Instructions**: Modern CPUs have 128-bit, 256-bit, or 512-bit registers
- **Parallel Arithmetic**: One instruction can operate on 4-16 numbers simultaneously
- **Automatic Optimization**: NumPy uses highly optimized BLAS libraries
- **Massive Speedup**: Often 10-100x faster than basic loops
### The NumPy Advantage
NumPy operations like `@` (matrix multiplication) automatically use:
- **Intel MKL**: Math Kernel Library with hand-optimized assembly
- **OpenBLAS**: Open-source optimized BLAS implementation
- **BLAS/LAPACK**: Industry-standard linear algebra routines
### Learning Connection
This is why production ML frameworks (PyTorch, TensorFlow) are built on optimized libraries rather than pure Python loops.
"""
# %% nbgrader={"grade": false, "grade_id": "matmul-vectorized", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_vectorized(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Vectorized matrix multiplication using NumPy's optimized operations.

    TODO: Implement vectorized matrix multiplication using NumPy's @ operator.

    STEP-BY-STEP IMPLEMENTATION:
    1. Validate that the matrices can be multiplied (A.shape[1] == B.shape[0])
    2. Use NumPy's @ operator for optimized matrix multiplication
    3. Return the result directly

    EXAMPLE:
        A = [[1, 2],   B = [[5, 6],   Result = [[19, 22],
             [3, 4]]        [7, 8]]             [43, 50]]

    IMPLEMENTATION HINTS:
    - Check shape compatibility with A.shape[1] == B.shape[0]
    - Use A @ B for the actual multiplication
    - NumPy handles all the SIMD optimization automatically
    - This should be much faster than the triple-loop version

    LEARNING CONNECTIONS:
    - This uses the same optimized libraries as PyTorch and TensorFlow
    - SIMD operations process multiple numbers simultaneously
    - This is why you should use library functions instead of writing loops
    """
    ### BEGIN SOLUTION
    # Validate matrix dimensions
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")
    # Use NumPy's optimized matrix multiplication
    return A @ B
    ### END SOLUTION
# %% [markdown]
"""
### 🧪 Test Your Vectorized Implementation
Once you implement the `matmul_vectorized` function above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-vectorized-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
def test_vectorized_matrix_multiplication():
    """Test vectorized matrix multiplication implementation"""
    print("🔬 Unit Test: Vectorized Matrix Multiplication...")
    # Test simple 2x2 case
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)
    result = matmul_vectorized(A, B)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
    assert np.allclose(result, expected), f"Vectorized multiplication failed: expected {expected}, got {result}"
    # Compare with baseline
    baseline_result = matmul_custom(A, B)
    assert np.allclose(result, baseline_result), f"Doesn't match baseline: got {result}, expected {baseline_result}"
    # Test different shapes
    A2 = np.array([[1, 2, 3]], dtype=np.float32)      # 1x3
    B2 = np.array([[4], [5], [6]], dtype=np.float32)  # 3x1
    result2 = matmul_vectorized(A2, B2)
    expected2 = np.array([[32]], dtype=np.float32)    # 1*4 + 2*5 + 3*6 = 32
    assert np.allclose(result2, expected2), f"Different shapes failed: got {result2}, expected {expected2}"
    print("✅ Vectorized matrix multiplication works correctly!")

# Run the test
test_vectorized_matrix_multiplication()
# %% [markdown]
"""
## Step 3: Cache-Optimized Implementation - Memory Hierarchy Awareness
### What is Cache Optimization?
**Cache optimization** means organizing memory accesses to work efficiently with the CPU's memory hierarchy. Modern processors have multiple levels of cache that are much faster than main memory.
### Memory Hierarchy
- **L1 Cache**: ~1 cycle, ~32KB, per-core
- **L2 Cache**: ~10 cycles, ~256KB, per-core
- **L3 Cache**: ~50 cycles, ~8MB, shared
- **Main Memory**: ~200 cycles, GB-scale
### Cache-Friendly Strategy: Blocking
**Blocking** (or tiling) divides large matrices into smaller blocks that fit in cache:
- Process one block at a time
- Reuse data while it's still in cache
- Reduce expensive memory fetches
- Better performance for large matrices
### Why This Matters
- **Cache misses are expensive**: 200x slower than cache hits
- **Locality of reference**: Access nearby data together
- **Real systems**: Production ML uses cache-aware algorithms
"""
# %% nbgrader={"grade": false, "grade_id": "matmul-cache", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_cache_optimized(A: np.ndarray, B: np.ndarray, block_size: int = 64) -> np.ndarray:
    """
    Cache-optimized matrix multiplication using a blocked algorithm.

    TODO: Implement cache-friendly matrix multiplication with blocking.

    STEP-BY-STEP IMPLEMENTATION:
    1. Validate matrix dimensions for multiplication
    2. Get matrix dimensions: m, n, p
    3. Initialize result matrix with zeros
    4. Use three nested loops over blocks (not elements):
       - i_block: iterate through row blocks of A
       - j_block: iterate through column blocks of B
       - k_block: iterate through shared dimension blocks
    5. For each block combination, extract submatrices and multiply
    6. Add the block result to the appropriate section of the output matrix

    EXAMPLE WALKTHROUGH:
    For 4x4 matrices with block_size=2:
    - Divide into 2x2 blocks
    - Multiply corresponding blocks
    - Accumulate results in the output matrix

    IMPLEMENTATION HINTS:
    - Use range(0, dimension, block_size) for block iteration
    - Extract blocks: A_block = A[i:i_end, k:k_end]
    - Use the @ operator for block multiplication
    - Accumulate: result[i:i_end, j:j_end] += A_block @ B_block
    - Handle edge cases where blocks don't divide evenly

    LEARNING CONNECTIONS:
    - This is how BLAS libraries optimize for the cache hierarchy
    - Block size should match cache size for optimal performance
    - Cache-aware algorithms are essential for large-scale ML
    """
    ### BEGIN SOLUTION
    # Validate matrix dimensions
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")
    m, n = A.shape
    n2, p = B.shape
    # Initialize result matrix
    result = np.zeros((m, p))
    # Blocked matrix multiplication
    for i in range(0, m, block_size):
        for j in range(0, p, block_size):
            for k in range(0, n, block_size):
                # Calculate block boundaries
                i_end = min(i + block_size, m)
                j_end = min(j + block_size, p)
                k_end = min(k + block_size, n)
                # Extract blocks
                A_block = A[i:i_end, k:k_end]
                B_block = B[k:k_end, j:j_end]
                # Multiply blocks and accumulate
                result[i:i_end, j:j_end] += A_block @ B_block
    return result
    ### END SOLUTION
# %% [markdown]
"""
### 🧪 Test Your Cache-Optimized Implementation
Once you implement the `matmul_cache_optimized` function above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-cache-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_cache_optimized_matrix_multiplication():
    """Test cache-optimized matrix multiplication implementation"""
    print("🔬 Unit Test: Cache-Optimized Matrix Multiplication...")
    # Test simple 2x2 case
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)
    result = matmul_cache_optimized(A, B, block_size=2)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
    assert np.allclose(result, expected), f"Cache-optimized multiplication failed: expected {expected}, got {result}"
    # Compare with baseline
    baseline_result = matmul_custom(A, B)
    assert np.allclose(result, baseline_result), f"Doesn't match baseline: got {result}, expected {baseline_result}"
    # Test larger matrix with different block sizes
    A2 = np.random.randn(8, 6).astype(np.float32)
    B2 = np.random.randn(6, 10).astype(np.float32)
    result_block2 = matmul_cache_optimized(A2, B2, block_size=2)
    result_block4 = matmul_cache_optimized(A2, B2, block_size=4)
    expected_large = A2 @ B2
    assert np.allclose(result_block2, expected_large), "Block size 2 failed on larger matrix"
    assert np.allclose(result_block4, expected_large), "Block size 4 failed on larger matrix"
    print("✅ Cache-optimized matrix multiplication works correctly!")

# Run the test
test_cache_optimized_matrix_multiplication()
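# %% [markdown]
"""
### 🔍 Exploring Block Size

The best `block_size` depends on how much of each block fits in cache, so it is worth measuring rather than guessing. This sketch (illustrative; it restates a minimal blocked multiply so it runs standalone, and the fastest size will vary by machine) shows how to gather the data yourself:

```python
import time
import numpy as np

def matmul_blocked(A, B, block_size):
    # Minimal blocked multiply, same scheme as matmul_cache_optimized above.
    # NumPy slicing clamps at the array edge, so uneven blocks are handled for free.
    m, n = A.shape
    _, p = B.shape
    out = np.zeros((m, p), dtype=np.float32)
    for i in range(0, m, block_size):
        for j in range(0, p, block_size):
            for k in range(0, n, block_size):
                out[i:i + block_size, j:j + block_size] += (
                    A[i:i + block_size, k:k + block_size]
                    @ B[k:k + block_size, j:j + block_size]
                )
    return out

A = np.random.randn(256, 256).astype(np.float32)
B = np.random.randn(256, 256).astype(np.float32)
for bs in (16, 32, 64, 128):
    start = time.perf_counter()
    out = matmul_blocked(A, B, bs)
    elapsed = time.perf_counter() - start
    print(f"block_size={bs:4d}: {elapsed:.4f}s")
```

Collect a few runs per size and compare: too small a block wastes time on loop overhead, too large a block spills out of cache.
"""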
# %% [markdown]
"""
## Step 4: Parallel Processing - Multi-Core Utilization
### What is Parallel Processing?
**Parallel processing** means distributing computation across multiple CPU cores to reduce overall execution time. Modern processors have multiple cores that can work simultaneously.
### Why Parallelization Matters
- **Multi-core CPUs**: Modern systems have 4-16+ cores
- **Independent operations**: Matrix rows can be computed independently
- **Linear speedup potential**: 4 cores → ~4x faster (ideally)
- **Real-world necessity**: Production systems must use all available cores
### Parallelization Strategy: Row-wise Distribution
- Divide matrix rows among available cores
- Each core computes its assigned rows independently
- No communication needed between cores during computation
- Simple and effective for matrix multiplication
### Learning Connection
This demonstrates the foundations of distributed computing and parallel algorithms used in modern ML training.
"""
# %% nbgrader={"grade": false, "grade_id": "matmul-parallel", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_parallel(A: np.ndarray, B: np.ndarray, num_processes: Optional[int] = None) -> np.ndarray:
    """
    Parallel matrix multiplication using multiple CPU cores.

    TODO: Implement parallel matrix multiplication with row-wise distribution.

    STEP-BY-STEP IMPLEMENTATION:
    1. Validate matrix dimensions for multiplication
    2. Set default number of processes to CPU count if not specified
    3. For very small matrices, use a single-threaded approach (overhead is not worth it)
    4. Calculate chunk size: divide rows among processes
    5. Process chunks sequentially (simulating parallel processing)
    6. Combine results using np.vstack()

    EXAMPLE WALKTHROUGH:
    For an 8x8 matrix with 4 cores:
    - Core 1: processes rows 0-1
    - Core 2: processes rows 2-3
    - Core 3: processes rows 4-5
    - Core 4: processes rows 6-7

    IMPLEMENTATION HINTS:
    - If A.shape[0] < 20, use A @ B directly
    - Calculate chunk_size = max(1, A.shape[0] // num_processes)
    - Use range(0, A.shape[0], chunk_size) to iterate through chunks
    - For each chunk: A_chunk = A[i:end_i], result_chunk = A_chunk @ B
    - Collect all chunks in a list, then np.vstack(results)

    LEARNING CONNECTIONS:
    - This is how distributed training works across multiple GPUs/machines
    - Row-wise parallelization is embarrassingly parallel
    - Real parallel processing would use multiprocessing.Pool
    """
    ### BEGIN SOLUTION
    # Validate dimensions
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Matrix dimensions incompatible: A is {A.shape}, B is {B.shape}")
    # Default to number of CPU cores
    if num_processes is None:
        num_processes = mp.cpu_count()
    # For very small matrices, use single-threaded approach
    if A.shape[0] < 20:
        return A @ B
    # Simple row-wise parallel processing simulation
    # (a real implementation would use multiprocessing.Pool)
    chunk_size = max(1, A.shape[0] // num_processes)
    results = []
    for i in range(0, A.shape[0], chunk_size):
        end_i = min(i + chunk_size, A.shape[0])
        # Process this chunk of rows
        A_chunk = A[i:end_i]
        chunk_result = A_chunk @ B
        results.append(chunk_result)
    # Combine results
    return np.vstack(results)
    ### END SOLUTION
# %% [markdown]
"""
### 🧪 Test Your Parallel Implementation
Once you implement the `matmul_parallel` function above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-parallel-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_parallel_matrix_multiplication():
    """Test parallel matrix multiplication implementation"""
    print("🔬 Unit Test: Parallel Matrix Multiplication...")
    # Test simple 2x2 case
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)
    result = matmul_parallel(A, B, num_processes=2)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
    assert np.allclose(result, expected), f"Parallel multiplication failed: expected {expected}, got {result}"
    # Compare with baseline
    baseline_result = matmul_custom(A, B)
    assert np.allclose(result, baseline_result), f"Doesn't match baseline: got {result}, expected {baseline_result}"
    # Test larger matrix
    A2 = np.random.randn(24, 16).astype(np.float32)
    B2 = np.random.randn(16, 20).astype(np.float32)
    result_parallel = matmul_parallel(A2, B2, num_processes=4)
    expected_large = A2 @ B2
    assert np.allclose(result_parallel, expected_large), "Parallel processing failed on larger matrix"
    # Test different number of processes
    result_parallel2 = matmul_parallel(A2, B2, num_processes=2)
    assert np.allclose(result_parallel2, expected_large), "Different process count failed"
    print("✅ Parallel matrix multiplication works correctly!")

# Run the test
test_parallel_matrix_multiplication()
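# %% [markdown]
"""
### 🔍 Going Beyond the Simulation: multiprocessing.Pool

The solution above only *simulates* parallelism by processing chunks sequentially. A true multi-process version can be sketched with Python's `multiprocessing.Pool`; the name `matmul_pool` and its helpers are illustrative, and process startup adds real overhead, so this only pays off for large matrices:

```python
import numpy as np
from multiprocessing import Pool

_B = None  # per-process copy of B, set once by the initializer

def _init_worker(B):
    # Runs once in each worker process, avoiding re-pickling B per task
    global _B
    _B = B

def _multiply_chunk(A_chunk):
    return A_chunk @ _B

def matmul_pool(A, B, num_processes=2):
    # Row-wise parallel matmul across real worker processes
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")
    chunks = np.array_split(A, num_processes, axis=0)
    with Pool(processes=num_processes, initializer=_init_worker, initargs=(B,)) as pool:
        results = pool.map(_multiply_chunk, chunks)
    return np.vstack(results)

A_demo = np.random.randn(64, 32).astype(np.float32)
B_demo = np.random.randn(32, 16).astype(np.float32)

if __name__ == "__main__":
    print(np.allclose(matmul_pool(A_demo, B_demo), A_demo @ B_demo))
```

The row-wise split is the same as in `matmul_parallel`; the difference is that each chunk now runs in its own OS process, so the work genuinely uses multiple cores.
"""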
# %% [markdown]
"""
## 🧪 Unit Test: All Matrix Multiplication Implementations
**This is a unit test** - it tests all our matrix multiplication implementations for correctness.
"""
# %%
print("### 🧪 Unit Test: Matrix Multiplication Implementations")
print("**This is a unit test** - it tests our matrix multiplication implementations.")
# Test basic functionality
A = np.array([[1, 2], [3, 4]], dtype=np.float32)
B = np.array([[5, 6], [7, 8]], dtype=np.float32)
print("🔬 Unit Test: Matrix Multiplication...")
# Test our implementations
result_custom = matmul_custom(A, B)
result_vectorized = matmul_vectorized(A, B)
result_cache_optimized = matmul_cache_optimized(A, B)
result_parallel = matmul_parallel(A, B)
# Expected result
expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
print(f"✅ Custom result correct: {np.allclose(result_custom, expected)}")
print(f"✅ Vectorized result correct: {np.allclose(result_vectorized, expected)}")
print(f"✅ Cache-optimized result correct: {np.allclose(result_cache_optimized, expected)}")
print(f"✅ Parallel result correct: {np.allclose(result_parallel, expected)}")
print("📈 Progress: All implementations work correctly ✓")
# %% [markdown]
"""
## Performance Comparison: Your Optimizations in Action
Now let's see how much faster your optimized implementations are compared to the Module 03 baseline:
"""
# %%
print("🔬 Testing performance comparison...")
print("Students can collect their own performance data:")
# Create profiler for measuring performance
profiler = SimpleProfiler(track_memory=True, track_cpu=True)
# Test matrices
A = np.random.randn(50, 50).astype(np.float32)
B = np.random.randn(50, 50).astype(np.float32)
# Profile each implementation
basic_result = profiler.profile(matmul_custom, A, B, name="Module 03 Baseline")
vectorized_result = profiler.profile(matmul_vectorized, A, B, name="Vectorized")
cache_result = profiler.profile(matmul_cache_optimized, A, B, name="Cache-Optimized")
parallel_result = profiler.profile(matmul_parallel, A, B, name="Parallel")
# Students analyze results themselves
print(f"✓ Module 03 Baseline: {basic_result['wall_time']:.4f}s")
print(f"✓ Vectorized: {vectorized_result['wall_time']:.4f}s")
print(f"✓ Cache-Optimized: {cache_result['wall_time']:.4f}s")
print(f"✓ Parallel: {parallel_result['wall_time']:.4f}s")
# Calculate speedups (students learn to do this themselves)
if basic_result['wall_time'] > 0:
    vec_speedup = basic_result['wall_time'] / vectorized_result['wall_time']
    cache_speedup = basic_result['wall_time'] / cache_result['wall_time']
    parallel_speedup = basic_result['wall_time'] / parallel_result['wall_time']
    print(f"\n📊 Performance Summary:")
    print(f"🏆 Speedups vs Module 03 Baseline:")
    print(f"   Vectorized:      {vec_speedup:.1f}x faster")
    print(f"   Cache-Optimized: {cache_speedup:.1f}x faster")
    print(f"   Parallel:        {parallel_speedup:.1f}x faster")
print("✅ Performance comparison works")
print("📈 Progress: Kernels Module ✓")
print("\n🎯 Module 11: Kernels Summary:")
print("- Built hardware-aware optimizations from scratch")
print("- Implemented vectorized, cache-optimized, and parallel algorithms")
print("- Learned to profile and compare performance systematically")
print("- Connected basic algorithms to production optimization strategies")
print("- Ready for comprehensive benchmarking and real-world optimization!")
print("\n🎉 Module 11: Kernels - Complete!")
print("Ready for Module 12: Benchmarking!")
# %% [markdown]
"""
## 🔍 Profiling and Analysis
You've learned to use the shared profiler utility to measure individual functions. Here are examples of how to collect data and make your own comparisons:
### Basic Performance Analysis
```python
from utils.profiler import SimpleProfiler, profile_function
# Create profiler
profiler = SimpleProfiler(track_memory=True, track_cpu=True)
# Profile individual functions
A = np.random.randn(100, 100)
B = np.random.randn(100, 100)
basic_result = profiler.profile(matmul_custom, A, B, name="Basic")
optimized_result = profiler.profile(matmul_vectorized, A, B, name="Optimized")
# Students analyze the results themselves
print(f"Basic: {basic_result['wall_time']:.4f}s")
print(f"Optimized: {optimized_result['wall_time']:.4f}s")
speedup = basic_result['wall_time'] / optimized_result['wall_time']
print(f"Speedup: {speedup:.1f}x")
```
### Memory and CPU Analysis
```python
# Profile with detailed output
result = profiler.profile(matmul_cache_optimized, A, B, name="Cache-Optimized")
profiler.print_result(result, show_details=True)
# Access specific metrics
wall_time = result['wall_time']
memory_used = result['memory_delta_mb']
cpu_efficiency = result['cpu_efficiency']
```
### Key Optimization Insights
- **Vectorization**: Massive speedups from SIMD operations
- **Cache optimization**: Better memory access patterns for large matrices
- **Parallelization**: Utilizing multiple CPU cores effectively
- **Hardware awareness**: Understanding system architecture drives optimization
### Educational Approach
Students learn to:
1. **Measure**: Profile individual functions with comprehensive metrics
2. **Collect**: Gather data from multiple implementations
3. **Compare**: Calculate speedups and efficiency differences themselves
4. **Analyze**: Understand what the metrics mean for optimization
This teaches proper benchmarking methodology and critical thinking about performance!
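The measure → collect → compare → analyze workflow can be sketched end-to-end with the standard library alone (the `measure` helper below is an illustrative stand-in for `SimpleProfiler.profile`, using best-of-N timing to reduce noise):

```python
import time

def measure(func, *args, name=None, repeats=5, **kwargs):
    """Step 1: measure — best-of-N wall time for one function."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return {"name": name or func.__name__, "wall_time": best}

def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(100_000))

# Step 2: collect — one record per implementation
records = [measure(sum_loop, data), measure(sum, data, name="builtin_sum")]

# Step 3: compare — speedups relative to the slowest implementation
baseline = max(r["wall_time"] for r in records)
for r in records:
    print(f"{r['name']}: {r['wall_time']:.6f}s ({baseline / r['wall_time']:.1f}x)")
```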
"""
# %% [markdown]
"""
## 🧪 Module Testing
Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.
**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified.
"""
# %% nbgrader={"grade": false, "grade_id": "standardized-testing", "locked": true, "schema_version": 3, "solution": false, "task": false}
# =============================================================================
# STANDARDIZED MODULE TESTING - DO NOT MODIFY
# This cell is locked to ensure consistent testing across all TinyTorch modules
# =============================================================================
if __name__ == "__main__":
from tito.tools.testing import run_module_tests_auto
# Automatically discover and run all tests in this module
success = run_module_tests_auto("Kernels")
# %% [markdown]
"""
## 🎯 Module Summary: Hardware-Aware Optimization Mastery!
Congratulations! You've successfully implemented hardware-aware optimization techniques for ML systems:
### ✅ What You've Built
- **Vectorized Operations**: Leveraging SIMD instructions for massive speedups
- **Cache-Optimized Algorithms**: Memory hierarchy-aware blocked implementations
- **Parallel Processing**: Multi-core utilization with row-wise distribution
- **Performance Profiling**: Systematic measurement and analysis techniques
### ✅ Key Learning Outcomes
- **Hardware Understanding**: How CPU architecture drives optimization strategies
- **Implementation Skills**: Built optimizations from scratch with detailed guidance
- **Performance Analysis**: Learned to measure and compare implementations systematically
- **Real-world Connection**: Connected basic algorithms to production optimization
### ✅ Optimization Mastery
- **SIMD Vectorization**: Using hardware parallel arithmetic units
- **Memory Hierarchy**: Organizing computation for cache efficiency
- **Parallel Computing**: Distributing work across multiple cores
- **Profiling Methodology**: Measuring performance correctly and fairly
### ✅ Professional Skills Developed
- **Systems thinking**: Understanding hardware-software interaction
- **Optimization mindset**: Identifying performance bottlenecks and solutions
- **Benchmarking skills**: Fair comparison and analysis techniques
- **Production awareness**: How real ML systems achieve high performance
### ✅ Ready for Next Steps
Your optimization skills are now ready for:
- **Module 12**: Comprehensive benchmarking and performance analysis
- **Production Systems**: Understanding how PyTorch, TensorFlow optimize operations
- **Custom Kernels**: Writing specialized operations for specific hardware
- **Distributed Computing**: Scaling optimizations across multiple machines
### 🔗 Connection to Real ML Systems
Your implementations demonstrate the foundations of:
- **BLAS Libraries**: Intel MKL, OpenBLAS, cuBLAS optimization strategies
- **Framework Internals**: How PyTorch and TensorFlow achieve performance
- **Hardware Acceleration**: GPU kernels, TPU operations, specialized chips
- **Production Deployment**: Optimizing ML inference for real-world constraints
### 🎯 The Power of Hardware-Aware Programming
You've learned the essential mindset for high-performance computing:
- **Know your hardware**: Understanding system architecture guides optimization
- **Profile everything**: Measurement drives optimization decisions
- **Optimize systematically**: Vectorization → Memory → Parallelization
- **Think like production**: Real systems demand hardware-aware implementations
### 🧠 Optimization Insights
- **Why optimization matters**: Performance gaps can be 1000x+ between naive and optimized code
- **Hardware evolution**: Modern ML requires understanding of specialized accelerators
- **System design**: Optimization considerations influence software architecture
- **Continuous learning**: Hardware advances require constant learning and adaptation
**Next Module**: Benchmarking - Comprehensive performance analysis and systematic optimization!
You've built the optimizations. Now let's learn to analyze and benchmark them like production ML engineers!
"""


@@ -0,0 +1,77 @@
# TinyTorch Module Metadata
# Essential system information for CLI tools and build systems
name: "11_kernels"
title: "Kernels - Hardware-Aware Optimization"
description: "Custom operations, performance optimization, and hardware-aware computing for ML systems"
version: "1.0.0"
author: "TinyTorch Team"
# Dependencies - Used by CLI for module ordering and prerequisites
dependencies:
prerequisites: [
"00_setup", "01_tensor", "02_activations", "03_layers",
"04_networks", "05_cnn", "06_dataloader", "07_autograd",
"08_optimizers", "09_training", "10_compression"
]
enables: ["12_benchmarking", "13_mlops"]
# Package Export - What gets built into tinytorch package
exports_to: "tinytorch.core.kernels"
# File Structure - What files exist in this module
files:
dev_file: "kernels_dev.py"
test_file: "tests/test_kernels.py"
readme: "README.md"
benchmark_dir: "benchmarks/"
# Components - What's implemented in this module
components:
# Custom Operations
- "matmul_custom"
- "relu_custom"
- "conv2d_custom"
# Optimized Implementations
- "matmul_vectorized"
- "matmul_cache_optimized"
- "matmul_parallel"
# Compressed Model Kernels
- "quantized_matmul"
- "sparse_matmul"
- "pruned_conv2d"
# Performance Tools
- "KernelProfiler"
- "PerformanceBenchmark"
- "HardwareProfiler"
# Learning Objectives - What students will achieve
learning_objectives:
- "Implement custom ML operations beyond NumPy"
- "Apply SIMD vectorization and CPU optimization"
- "Optimize memory layout and cache efficiency"
- "Understand GPU-style parallel computing"
- "Build performance profiling tools"
- "Create hardware-optimized compressed model operations"
# Educational Approach
pedagogy:
framework: "Build → Use → Optimize"
difficulty: "Expert"
time_estimate: "8-10 hours"
# Integration Points - How this connects to other modules
integration:
builds_on: "10_compression" # Extends compression with hardware optimization
enables: "12_benchmarking" # Provides optimized kernels for benchmarking
connects_to: "13_mlops" # Hardware optimization for production deployment
# Testing Strategy
testing:
inline_tests: true
performance_tests: true
integration_tests: true
benchmark_tests: true


@@ -0,0 +1,9 @@
"""
TinyTorch Utils Package
Shared utilities for TinyTorch modules.
"""
from .profiler import SimpleProfiler, profile_function
__all__ = ['SimpleProfiler', 'profile_function']


@@ -0,0 +1,226 @@
"""
TinyTorch Utils: Simple Educational Profiler
A lightweight profiling utility for measuring performance of ML operations.
Focused on measuring individual functions - students do their own comparisons.
"""
import time
import sys
import gc
import numpy as np
from typing import Callable, Dict, Any, Optional
try:
import psutil
HAS_PSUTIL = True
except ImportError:
HAS_PSUTIL = False
try:
import tracemalloc
HAS_TRACEMALLOC = True
except ImportError:
HAS_TRACEMALLOC = False
class SimpleProfiler:
"""
Simple profiler for measuring individual function performance.
Measures timing, memory usage, and other key metrics for a single function.
Students collect multiple measurements and compare results themselves.
"""
def __init__(self, track_memory: bool = True, track_cpu: bool = True):
self.track_memory = track_memory and HAS_TRACEMALLOC
self.track_cpu = track_cpu and HAS_PSUTIL
if self.track_memory:
tracemalloc.start()
def _get_memory_info(self) -> Dict[str, Any]:
"""Get current memory information."""
if not self.track_memory:
return {}
try:
current, peak = tracemalloc.get_traced_memory()
return {
'current_memory_mb': current / 1024 / 1024,
'peak_memory_mb': peak / 1024 / 1024
}
except Exception:
return {}
def _get_cpu_info(self) -> Dict[str, Any]:
"""Get current CPU information."""
if not self.track_cpu:
return {}
try:
process = psutil.Process()
return {
'cpu_percent': process.cpu_percent(),
'memory_percent': process.memory_percent(),
'num_threads': process.num_threads()
}
except Exception:
return {}
def _get_array_info(self, result: Any) -> Dict[str, Any]:
"""Get information about numpy arrays."""
if not isinstance(result, np.ndarray):
return {}
return {
'result_shape': result.shape,
'result_dtype': str(result.dtype),
'result_size_mb': result.nbytes / 1024 / 1024,
'result_elements': result.size
}
def profile(self, func: Callable, *args, name: Optional[str] = None, warmup: bool = True, **kwargs) -> Dict[str, Any]:
"""
Profile a single function execution with comprehensive metrics.
Args:
func: Function to profile
*args: Arguments to pass to function
name: Optional name for the function (defaults to func.__name__)
warmup: Whether to do a warmup run (recommended for fair timing)
**kwargs: Keyword arguments to pass to function
Returns:
Dictionary with comprehensive performance metrics
Example:
profiler = SimpleProfiler()
result = profiler.profile(my_function, arg1, arg2, name="My Function")
print(f"Time: {result['wall_time']:.4f}s")
print(f"Memory: {result['memory_delta_mb']:.2f}MB")
"""
func_name = name or func.__name__
# Reset memory tracking
if self.track_memory:
tracemalloc.clear_traces()
# Warm up (important for fair comparison)
if warmup:
try:
warmup_result = func(*args, **kwargs)
del warmup_result
except Exception:
pass
# Force garbage collection for clean measurement
gc.collect()
# Get baseline measurements
memory_before = self._get_memory_info()
cpu_before = self._get_cpu_info()
# Time the actual execution
start_time = time.perf_counter()  # monotonic, high-resolution wall clock
start_cpu_time = time.process_time()
result = func(*args, **kwargs)
end_time = time.perf_counter()
end_cpu_time = time.process_time()
# Get post-execution measurements
memory_after = self._get_memory_info()
cpu_after = self._get_cpu_info()
# Calculate metrics
wall_time = end_time - start_time
cpu_time = end_cpu_time - start_cpu_time
profile_result = {
'name': func_name,
'wall_time': wall_time,
'cpu_time': cpu_time,
'cpu_efficiency': (cpu_time / wall_time) if wall_time > 0 else 0,
'result': result
}
# Add memory metrics
if self.track_memory and memory_before and memory_after:
profile_result.update({
'memory_before_mb': memory_before.get('current_memory_mb', 0),
'memory_after_mb': memory_after.get('current_memory_mb', 0),
'peak_memory_mb': memory_after.get('peak_memory_mb', 0),
'memory_delta_mb': memory_after.get('current_memory_mb', 0) - memory_before.get('current_memory_mb', 0)
})
# Add CPU metrics
if self.track_cpu and cpu_after:
profile_result.update({
'cpu_percent': cpu_after.get('cpu_percent', 0),
'memory_percent': cpu_after.get('memory_percent', 0),
'num_threads': cpu_after.get('num_threads', 1)
})
# Add array information
profile_result.update(self._get_array_info(result))
return profile_result
def print_result(self, profile_result: Dict[str, Any], show_details: bool = False) -> None:
"""
Print profiling results in a readable format.
Args:
profile_result: Result from profile() method
show_details: Whether to show detailed metrics
"""
name = profile_result['name']
wall_time = profile_result['wall_time']
print(f"📊 {name}: {wall_time:.4f}s")
if show_details:
if 'memory_delta_mb' in profile_result:
print(f" 💾 Memory: {profile_result['memory_delta_mb']:.2f}MB delta, {profile_result['peak_memory_mb']:.2f}MB peak")
if 'result_size_mb' in profile_result:
print(f" 🔢 Output: {profile_result['result_shape']} ({profile_result['result_size_mb']:.2f}MB)")
if 'cpu_efficiency' in profile_result:
print(f" ⚡ CPU efficiency: {profile_result['cpu_efficiency']:.2f}")
def get_capabilities(self) -> Dict[str, bool]:
"""Get information about profiler capabilities."""
return {
'memory_tracking': self.track_memory,
'cpu_tracking': self.track_cpu,
'has_psutil': HAS_PSUTIL,
'has_tracemalloc': HAS_TRACEMALLOC
}
# Convenience function for quick profiling
def profile_function(func: Callable, *args, name: Optional[str] = None,
show_details: bool = False, **kwargs) -> Dict[str, Any]:
"""
Quick profiling of a single function.
Args:
func: Function to profile
*args: Arguments to pass to function
name: Optional name for the function
show_details: Whether to print detailed metrics
**kwargs: Keyword arguments to pass to function
Returns:
Dictionary with profiling results
Example:
result = profile_function(my_matmul, A, B, name="Custom MatMul", show_details=True)
print(f"Execution time: {result['wall_time']:.4f}s")
"""
profiler = SimpleProfiler(track_memory=True, track_cpu=True)
result = profiler.profile(func, *args, name=name, **kwargs)
if show_details:
profiler.print_result(result, show_details=True)
return result