Mirror of https://github.com/MLSysBook/TinyTorch.git (synced 2026-05-06 12:22:41 -05:00)
Add tinytorch.utils.profiler following PyTorch's utils pattern

- Includes SimpleProfiler class for educational performance measurement
- Provides timing, memory usage, and system metrics
- Follows PyTorch's torch.utils.* organizational pattern
- Module 11: Kernels uses the profiler for performance demonstrations

Features:
- Wall time and CPU time measurement
- Memory usage tracking (peak, delta, percentages)
- Array information (shape, size, dtype)
- CPU and system metrics
- Clean educational interface for ML performance learning

Import pattern: from tinytorch.utils.profiler import SimpleProfiler
733 lines
29 KiB
Python
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.17.1
# ---

# %% [markdown]
"""
# Module 11: Kernels - Hardware-Aware Optimization

Welcome to the Kernels module! This is where you'll learn to optimize ML operations for real hardware performance.

## Learning Goals
- Understand hardware optimization principles for ML systems
- Implement vectorized operations using SIMD capabilities
- Build cache-friendly algorithms with memory hierarchy awareness
- Create parallel processing implementations for multi-core systems
- Connect basic algorithms to production-level performance optimization

## Build → Use → Understand
1. **Build**: Implement hardware-aware optimization techniques from scratch
2. **Use**: Compare performance differences between basic and optimized implementations
3. **Understand**: How hardware characteristics drive optimization strategies
"""

# %% nbgrader={"grade": false, "grade_id": "kernels-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| default_exp core.kernels

import numpy as np
import time
import multiprocessing as mp
import sys
import os
from typing import Callable, Dict, Any, Optional, List, Tuple

# Import the basic matrix multiplication from Module 03
# This is the triple-loop implementation students built earlier
try:
    from tinytorch.core.layers import matmul_naive as matmul_from_module03
except ImportError:
    # For development, import from local modules
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers'))
    try:
        from layers_dev import matmul as matmul_from_module03
    except ImportError:
        # Fallback: if we can't import it, define the triple-loop version directly
        def matmul_from_module03(A: np.ndarray, B: np.ndarray) -> np.ndarray:
            rows_a, cols_a = A.shape
            rows_b, cols_b = B.shape

            if cols_a != rows_b:
                raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")

            result = np.zeros((rows_a, cols_b))

            for i in range(rows_a):
                for j in range(cols_b):
                    for k in range(cols_a):
                        result[i, j] += A[i, k] * B[k, j]

            return result

# Import shared profiling utilities
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'utils'))
from profiler import SimpleProfiler, profile_function

print("🚀 TinyTorch Kernels Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version.split()[0]}")
print(f"CPU count: {mp.cpu_count()}")
print("Ready to optimize ML systems for hardware!")

# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/source/11_kernels/kernels_dev.py`
**Building Side:** Code exports to `tinytorch.core.kernels`

```python
# Final package structure:
from tinytorch.core.kernels import matmul_vectorized, matmul_cache_optimized
from tinytorch.core.kernels import matmul_parallel  # Hardware-optimized operations
```

**Why this matters:**
- **Learning:** Understand optimization from first principles
- **Production:** Real ML systems need hardware-aware implementations
- **Performance:** Bridge between theory and practical efficiency
- **Foundation:** Optimization skills transfer to all ML system components
"""

# %% [markdown]
"""
## Step 1: Understanding Our Baseline - The Module 03 Implementation

In Module 03, you implemented matrix multiplication using three nested loops. Let's use that **exact same function** as our baseline for optimization:

```python
# From Module 03 (your original implementation):
def matmul(A, B):
    rows_a, cols_a = A.shape
    rows_b, cols_b = B.shape

    result = np.zeros((rows_a, cols_b))

    for i in range(rows_a):
        for j in range(cols_b):
            for k in range(cols_a):
                result[i, j] += A[i, k] * B[k, j]

    return result
```

This is our **baseline** - the implementation we'll optimize in this module.

### Why This Baseline Matters
- **Your own code**: You built this in Module 03, so you understand it completely
- **Clear performance reference**: See exactly how much faster optimized versions are
- **Real improvement**: Measure actual performance gains of your optimizations
- **Educational value**: Connect basic concepts to hardware-aware programming

### Performance Characteristics of the Baseline
- **Time complexity**: O(n³) for n×n matrices
- **Memory access**: Not cache-friendly - jumps around memory
- **Parallelization**: No parallel execution
- **Vectorization**: No SIMD optimizations

**Goal**: Take this basic implementation and make it hardware-aware!
"""

# %%
#| export
def matmul_custom(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Baseline matrix multiplication using the exact same implementation from Module 03.

    This directly calls the matmul function you built in Module 03,
    showing the clear connection between modules and avoiding code duplication.
    """
    return matmul_from_module03(A, B)

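# %% [markdown]
"""
### Optional: Measure the Baseline Yourself

Before optimizing anything, it helps to see the gap with your own numbers. The snippet below is a minimal, illustrative timing sketch: it uses only `time.perf_counter`, the `matmul_custom` baseline defined above, and NumPy's `@` as a reference. The 100×100 size is an arbitrary choice that keeps the triple-loop version tolerable.

```python
import time
import numpy as np

A = np.random.randn(100, 100).astype(np.float32)
B = np.random.randn(100, 100).astype(np.float32)

start = time.perf_counter()
baseline = matmul_custom(A, B)          # triple-loop baseline from Module 03
baseline_time = time.perf_counter() - start

start = time.perf_counter()
reference = A @ B                       # NumPy's BLAS-backed multiplication
numpy_time = time.perf_counter() - start

print(f"Baseline: {baseline_time:.4f}s   NumPy @: {numpy_time:.6f}s")
print(f"Results match: {np.allclose(baseline, reference, atol=1e-3)}")
```

Exact numbers depend on your machine, but the gap is typically several orders of magnitude even at this small size.
"""
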
# %% [markdown]
"""
## Step 2: Vectorized Operations - Leveraging SIMD

### What is Vectorization?
**Vectorization** means using Single Instruction, Multiple Data (SIMD) operations to process multiple data elements simultaneously. Modern CPUs can perform the same operation on multiple numbers at once.

### Why Vectorization Matters
- **SIMD Instructions**: Modern CPUs have 128-bit, 256-bit, or 512-bit registers
- **Parallel Arithmetic**: One instruction can operate on 4-16 numbers simultaneously
- **Automatic Optimization**: NumPy uses highly optimized BLAS libraries
- **Massive Speedup**: Often 10-100x faster than basic loops

### The NumPy Advantage
NumPy operations like `@` (matrix multiplication) automatically use whichever optimized backend your NumPy build links against:
- **Intel MKL**: Math Kernel Library with hand-optimized assembly
- **OpenBLAS**: Open-source optimized BLAS implementation
- **BLAS/LAPACK**: Industry-standard linear algebra routines

### Learning Connection
This is why production ML frameworks (PyTorch, TensorFlow) are built on optimized libraries rather than pure Python loops.
"""

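# %% [markdown]
"""
### Optional: See Vectorization in Action

The effect described above shows up even on a simple dot product. This is an illustrative sketch only; the exact speedup depends on your CPU, NumPy build, and BLAS backend.

```python
import time
import numpy as np

x = np.random.randn(1_000_000)
y = np.random.randn(1_000_000)

# Pure Python loop: one multiply-add per iteration, no SIMD
start = time.perf_counter()
total = 0.0
for a, b in zip(x, y):
    total += a * b
loop_time = time.perf_counter() - start

# Vectorized: NumPy hands the whole array to optimized, SIMD-capable code
start = time.perf_counter()
total_vec = float(np.dot(x, y))
vec_time = time.perf_counter() - start

print(f"Python loop: {loop_time:.3f}s   np.dot: {vec_time:.5f}s")
print(f"Same answer: {np.isclose(total, total_vec)}")
```
"""
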
# %% nbgrader={"grade": false, "grade_id": "matmul-vectorized", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_vectorized(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Vectorized matrix multiplication using NumPy's optimized operations.

    TODO: Implement vectorized matrix multiplication using NumPy's @ operator.

    STEP-BY-STEP IMPLEMENTATION:
    1. Validate that the matrices can be multiplied (A.shape[1] == B.shape[0])
    2. Use NumPy's @ operator for optimized matrix multiplication
    3. Return the result directly

    EXAMPLE:
    A = [[1, 2],    B = [[5, 6],    Result = [[19, 22],
         [3, 4]]         [7, 8]]              [43, 50]]

    IMPLEMENTATION HINTS:
    - Check shape compatibility with A.shape[1] == B.shape[0]
    - Use A @ B for the actual multiplication
    - NumPy handles all the SIMD optimization automatically
    - This should be much faster than the triple-loop version

    LEARNING CONNECTIONS:
    - This uses the same optimized libraries as PyTorch and TensorFlow
    - SIMD operations process multiple numbers simultaneously
    - Shows why you should call optimized library functions instead of writing your own loops
    """
    ### BEGIN SOLUTION
    # Validate matrix dimensions
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")

    # Use NumPy's optimized matrix multiplication
    return A @ B
    ### END SOLUTION

# %% [markdown]
"""
### 🧪 Test Your Vectorized Implementation

Once you implement the `matmul_vectorized` function above, run this cell to test it:
"""

# %% nbgrader={"grade": true, "grade_id": "test-vectorized-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
def test_vectorized_matrix_multiplication():
    """Test vectorized matrix multiplication implementation"""
    print("🔬 Unit Test: Vectorized Matrix Multiplication...")

    # Test simple 2x2 case
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)

    result = matmul_vectorized(A, B)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)

    assert np.allclose(result, expected), f"Vectorized multiplication failed: expected {expected}, got {result}"

    # Compare with baseline
    baseline_result = matmul_custom(A, B)
    assert np.allclose(result, baseline_result), f"Doesn't match baseline: got {result}, expected {baseline_result}"

    # Test different shapes
    A2 = np.array([[1, 2, 3]], dtype=np.float32)      # 1x3
    B2 = np.array([[4], [5], [6]], dtype=np.float32)  # 3x1
    result2 = matmul_vectorized(A2, B2)
    expected2 = np.array([[32]], dtype=np.float32)    # 1*4 + 2*5 + 3*6 = 32

    assert np.allclose(result2, expected2), f"Different shapes failed: got {result2}, expected {expected2}"

    print("✅ Vectorized matrix multiplication works correctly!")

# Run the test
test_vectorized_matrix_multiplication()

# %% [markdown]
"""
## Step 3: Cache-Optimized Implementation - Memory Hierarchy Awareness

### What is Cache Optimization?
**Cache optimization** means organizing memory accesses to work efficiently with the CPU's memory hierarchy. Modern processors have multiple levels of cache that are much faster than main memory.

### Memory Hierarchy (typical orders of magnitude)
- **L1 Cache**: ~4 cycles, ~32KB, per-core
- **L2 Cache**: ~10 cycles, ~256KB, per-core
- **L3 Cache**: ~50 cycles, ~8MB, shared
- **Main Memory**: ~200 cycles, GB-scale

### Cache-Friendly Strategy: Blocking
**Blocking** (or tiling) divides large matrices into smaller blocks that fit in cache:
- Process one block at a time
- Reuse data while it's still in cache
- Reduce expensive memory fetches
- Better performance for large matrices

### Why This Matters
- **Cache misses are expensive**: A trip to main memory can be ~100-200x slower than an L1 hit
- **Locality of reference**: Access nearby data together
- **Real systems**: Production ML uses cache-aware algorithms
"""

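# %% [markdown]
"""
### Optional: Blocking by Hand on a Tiny Example

Before implementing the blocked algorithm, it can help to verify the identity it relies on: each output block is a sum of products of corresponding input blocks. This sketch checks that for the top-left 2×2 block of a 4×4 product, using plain NumPy and nothing from this module.

```python
import numpy as np

A = np.arange(16, dtype=np.float64).reshape(4, 4)
B = np.arange(16, dtype=np.float64).reshape(4, 4)

# With block_size=2, the top-left block of C = A @ B is:
#   C[0:2, 0:2] = A[0:2, 0:2] @ B[0:2, 0:2] + A[0:2, 2:4] @ B[2:4, 0:2]
manual_block = A[0:2, 0:2] @ B[0:2, 0:2] + A[0:2, 2:4] @ B[2:4, 0:2]

print(np.allclose(manual_block, (A @ B)[0:2, 0:2]))  # True
```

The blocked implementation below simply runs this accumulation for every (row-block, column-block, k-block) combination.
"""
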
# %% nbgrader={"grade": false, "grade_id": "matmul-cache", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_cache_optimized(A: np.ndarray, B: np.ndarray, block_size: int = 64) -> np.ndarray:
    """
    Cache-optimized matrix multiplication using a blocked algorithm.

    TODO: Implement cache-friendly matrix multiplication with blocking.

    STEP-BY-STEP IMPLEMENTATION:
    1. Validate matrix dimensions for multiplication
    2. Get matrix dimensions: m, n, p
    3. Initialize result matrix with zeros
    4. Use three nested loops over blocks (not elements):
       - i_block: iterate through row blocks of A
       - j_block: iterate through column blocks of B
       - k_block: iterate through shared dimension blocks
    5. For each block combination, extract submatrices and multiply
    6. Add the block result to the appropriate section of the output matrix

    EXAMPLE WALKTHROUGH:
    For 4x4 matrices with block_size=2:
    - Divide into 2x2 blocks
    - Multiply corresponding blocks
    - Accumulate results in output matrix

    IMPLEMENTATION HINTS:
    - Use range(0, dimension, block_size) for block iteration
    - Extract blocks: A_block = A[i:i_end, k:k_end]
    - Use @ operator for block multiplication
    - Accumulate: result[i:i_end, j:j_end] += A_block @ B_block
    - Handle edge cases where blocks don't divide evenly

    LEARNING CONNECTIONS:
    - This is how BLAS libraries optimize for cache hierarchy
    - Block size should match cache size for optimal performance
    - Cache-aware algorithms are essential for large-scale ML
    """
    ### BEGIN SOLUTION
    # Validate matrix dimensions
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")

    m, n = A.shape
    n2, p = B.shape

    # Initialize result matrix
    result = np.zeros((m, p))

    # Blocked matrix multiplication
    for i in range(0, m, block_size):
        for j in range(0, p, block_size):
            for k in range(0, n, block_size):
                # Calculate block boundaries
                i_end = min(i + block_size, m)
                j_end = min(j + block_size, p)
                k_end = min(k + block_size, n)

                # Extract blocks
                A_block = A[i:i_end, k:k_end]
                B_block = B[k:k_end, j:j_end]

                # Multiply blocks and accumulate
                result[i:i_end, j:j_end] += A_block @ B_block

    return result
    ### END SOLUTION

# %% [markdown]
"""
### 🧪 Test Your Cache-Optimized Implementation

Once you implement the `matmul_cache_optimized` function above, run this cell to test it:
"""

# %% nbgrader={"grade": true, "grade_id": "test-cache-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_cache_optimized_matrix_multiplication():
    """Test cache-optimized matrix multiplication implementation"""
    print("🔬 Unit Test: Cache-Optimized Matrix Multiplication...")

    # Test simple 2x2 case
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)

    result = matmul_cache_optimized(A, B, block_size=2)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)

    assert np.allclose(result, expected), f"Cache-optimized multiplication failed: expected {expected}, got {result}"

    # Compare with baseline
    baseline_result = matmul_custom(A, B)
    assert np.allclose(result, baseline_result), f"Doesn't match baseline: got {result}, expected {baseline_result}"

    # Test larger matrix with different block sizes
    A2 = np.random.randn(8, 6).astype(np.float32)
    B2 = np.random.randn(6, 10).astype(np.float32)

    result_block2 = matmul_cache_optimized(A2, B2, block_size=2)
    result_block4 = matmul_cache_optimized(A2, B2, block_size=4)
    expected_large = A2 @ B2

    assert np.allclose(result_block2, expected_large), "Block size 2 failed on larger matrix"
    assert np.allclose(result_block4, expected_large), "Block size 4 failed on larger matrix"

    print("✅ Cache-optimized matrix multiplication works correctly!")

# Run the test
test_cache_optimized_matrix_multiplication()

# %% [markdown]
"""
## Step 4: Parallel Processing - Multi-Core Utilization

### What is Parallel Processing?
**Parallel processing** means distributing computation across multiple CPU cores to reduce overall execution time. Modern processors have multiple cores that can work simultaneously.

### Why Parallelization Matters
- **Multi-core CPUs**: Modern systems have 4-16+ cores
- **Independent operations**: Matrix rows can be computed independently
- **Linear speedup potential**: 4 cores → ~4x faster (ideally)
- **Real-world necessity**: Production systems must use all available cores

### Parallelization Strategy: Row-wise Distribution
- Divide matrix rows among available cores
- Each core computes its assigned rows independently
- No communication needed between cores during computation
- Simple and effective for matrix multiplication

Note: to keep the notebook simple and portable, the implementation below splits the rows into chunks but processes them sequentially; a sketch of a true `multiprocessing.Pool` version follows the implementation.

### Learning Connection
This demonstrates the foundations of distributed computing and parallel algorithms used in modern ML training.
"""

# %% nbgrader={"grade": false, "grade_id": "matmul-parallel", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_parallel(A: np.ndarray, B: np.ndarray, num_processes: Optional[int] = None) -> np.ndarray:
    """
    Parallel matrix multiplication using multiple CPU cores.

    TODO: Implement parallel matrix multiplication with row-wise distribution.

    STEP-BY-STEP IMPLEMENTATION:
    1. Validate matrix dimensions for multiplication
    2. Set default number of processes to CPU count if not specified
    3. For very small matrices, use single-threaded approach (overhead not worth it)
    4. Calculate chunk size: divide rows among processes
    5. Process chunks sequentially (simulating parallel processing)
    6. Combine results using np.vstack()

    EXAMPLE WALKTHROUGH:
    For an 8x8 matrix with 4 cores:
    - Core 1: processes rows 0-1
    - Core 2: processes rows 2-3
    - Core 3: processes rows 4-5
    - Core 4: processes rows 6-7

    IMPLEMENTATION HINTS:
    - If A.shape[0] < 20, use A @ B directly
    - Calculate chunk_size = max(1, A.shape[0] // num_processes)
    - Use range(0, A.shape[0], chunk_size) to iterate through chunks
    - For each chunk: A_chunk = A[i:end_i], result_chunk = A_chunk @ B
    - Collect all chunks in a list, then np.vstack(results)

    LEARNING CONNECTIONS:
    - This is how distributed training works across multiple GPUs/machines
    - Row-wise parallelization is embarrassingly parallel
    - Real parallel processing would use multiprocessing.Pool
    """
    ### BEGIN SOLUTION
    # Validate dimensions
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Matrix dimensions incompatible: A is {A.shape}, B is {B.shape}")

    # Default to number of CPU cores
    if num_processes is None:
        num_processes = mp.cpu_count()

    # For very small matrices, use single-threaded approach
    if A.shape[0] < 20:
        return A @ B

    # Simple row-wise parallel processing simulation
    # (a real implementation would use multiprocessing.Pool)
    chunk_size = max(1, A.shape[0] // num_processes)
    results = []

    for i in range(0, A.shape[0], chunk_size):
        end_i = min(i + chunk_size, A.shape[0])
        # Process this chunk of rows
        A_chunk = A[i:end_i]
        chunk_result = A_chunk @ B
        results.append(chunk_result)

    # Combine results
    return np.vstack(results)
    ### END SOLUTION

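# %% [markdown]
"""
### Optional: What a Real multiprocessing.Pool Version Looks Like

The function above simulates row-wise parallelism; as its docstring notes, a real version would hand the row chunks to worker processes. The sketch below shows one way to do that with the standard library. The names `matmul_pool` and `_rows_times_B` are illustrative, not part of TinyTorch. Caveats: process start-up and pickling overhead can outweigh the gains on small matrices, notebook-defined functions may not pickle on platforms that use the "spawn" start method, and many BLAS builds already multithread `@`, so this may not beat plain NumPy.

```python
import numpy as np
from functools import partial
from multiprocessing import Pool, cpu_count
from typing import Optional

def _rows_times_B(A_chunk: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Each worker multiplies its chunk of rows by the full B
    return A_chunk @ B

def matmul_pool(A: np.ndarray, B: np.ndarray, num_processes: Optional[int] = None) -> np.ndarray:
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")
    num_processes = num_processes or cpu_count()
    chunks = np.array_split(A, num_processes, axis=0)  # handles uneven splits
    with Pool(processes=num_processes) as pool:
        results = pool.map(partial(_rows_times_B, B=B), chunks)
    return np.vstack(results)

if __name__ == "__main__":
    A = np.random.randn(400, 300)
    B = np.random.randn(300, 200)
    print(np.allclose(matmul_pool(A, B), A @ B))  # True
```
"""
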
# %% [markdown]
"""
### 🧪 Test Your Parallel Implementation

Once you implement the `matmul_parallel` function above, run this cell to test it:
"""

# %% nbgrader={"grade": true, "grade_id": "test-parallel-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_parallel_matrix_multiplication():
    """Test parallel matrix multiplication implementation"""
    print("🔬 Unit Test: Parallel Matrix Multiplication...")

    # Test simple 2x2 case
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)

    result = matmul_parallel(A, B, num_processes=2)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)

    assert np.allclose(result, expected), f"Parallel multiplication failed: expected {expected}, got {result}"

    # Compare with baseline
    baseline_result = matmul_custom(A, B)
    assert np.allclose(result, baseline_result), f"Doesn't match baseline: got {result}, expected {baseline_result}"

    # Test larger matrix
    A2 = np.random.randn(24, 16).astype(np.float32)
    B2 = np.random.randn(16, 20).astype(np.float32)

    result_parallel = matmul_parallel(A2, B2, num_processes=4)
    expected_large = A2 @ B2

    assert np.allclose(result_parallel, expected_large), "Parallel processing failed on larger matrix"

    # Test different number of processes
    result_parallel2 = matmul_parallel(A2, B2, num_processes=2)
    assert np.allclose(result_parallel2, expected_large), "Different process count failed"

    print("✅ Parallel matrix multiplication works correctly!")

# Run the test
test_parallel_matrix_multiplication()

# %% [markdown]
"""
## 🧪 Unit Test: All Matrix Multiplication Implementations

**This is a unit test** - it tests all our matrix multiplication implementations for correctness.
"""

# %%
print("### 🧪 Unit Test: Matrix Multiplication Implementations")
print("**This is a unit test** - it tests our matrix multiplication implementations.")

# Test basic functionality
A = np.array([[1, 2], [3, 4]], dtype=np.float32)
B = np.array([[5, 6], [7, 8]], dtype=np.float32)

print("🔬 Unit Test: Matrix Multiplication...")

# Test our implementations
result_custom = matmul_custom(A, B)
result_vectorized = matmul_vectorized(A, B)
result_cache_optimized = matmul_cache_optimized(A, B)
result_parallel = matmul_parallel(A, B)

# Expected result
expected = np.array([[19, 22], [43, 50]], dtype=np.float32)

print(f"✅ Custom result correct: {np.allclose(result_custom, expected)}")
print(f"✅ Vectorized result correct: {np.allclose(result_vectorized, expected)}")
print(f"✅ Cache-optimized result correct: {np.allclose(result_cache_optimized, expected)}")
print(f"✅ Parallel result correct: {np.allclose(result_parallel, expected)}")

print("📈 Progress: All implementations work correctly ✓")

# %% [markdown]
"""
## Performance Comparison: Your Optimizations in Action

Now let's see how much faster your optimized implementations are compared to the Module 03 baseline:
"""

# %%
print("🔬 Testing performance comparison...")
print("Students can collect their own performance data:")

# Create profiler for measuring performance
profiler = SimpleProfiler(track_memory=True, track_cpu=True)

# Test matrices
A = np.random.randn(50, 50).astype(np.float32)
B = np.random.randn(50, 50).astype(np.float32)

# Profile each implementation
basic_result = profiler.profile(matmul_custom, A, B, name="Module 03 Baseline")
vectorized_result = profiler.profile(matmul_vectorized, A, B, name="Vectorized")
cache_result = profiler.profile(matmul_cache_optimized, A, B, name="Cache-Optimized")
parallel_result = profiler.profile(matmul_parallel, A, B, name="Parallel")

# Students analyze results themselves
print(f"✓ Module 03 Baseline: {basic_result['wall_time']:.4f}s")
print(f"✓ Vectorized: {vectorized_result['wall_time']:.4f}s")
print(f"✓ Cache-Optimized: {cache_result['wall_time']:.4f}s")
print(f"✓ Parallel: {parallel_result['wall_time']:.4f}s")

# Calculate speedups (students learn to do this themselves)
if basic_result['wall_time'] > 0:
    vec_speedup = basic_result['wall_time'] / vectorized_result['wall_time']
    cache_speedup = basic_result['wall_time'] / cache_result['wall_time']
    parallel_speedup = basic_result['wall_time'] / parallel_result['wall_time']

    print("\n📊 Performance Summary:")
    print("🏆 Speedups vs Module 03 Baseline:")
    print(f"   Vectorized: {vec_speedup:.1f}x faster")
    print(f"   Cache-Optimized: {cache_speedup:.1f}x faster")
    print(f"   Parallel: {parallel_speedup:.1f}x faster")

print("✅ Performance comparison works")
print("📈 Progress: Kernels Module ✓")

print("\n🎯 Module 11: Kernels Summary:")
print("- Built hardware-aware optimizations from scratch")
print("- Implemented vectorized, cache-optimized, and parallel algorithms")
print("- Learned to profile and compare performance systematically")
print("- Connected basic algorithms to production optimization strategies")
print("- Ready for comprehensive benchmarking and real-world optimization!")

print("\n🎉 Module 11: Kernels - Complete!")
print("Ready for Module 12: Benchmarking!")

# %% [markdown]
"""
## 🔍 Profiling and Analysis

You've learned to use the shared profiler utility to measure individual functions. Here are examples of how to collect data and make your own comparisons:

### Basic Performance Analysis
```python
from utils.profiler import SimpleProfiler, profile_function

# Create profiler
profiler = SimpleProfiler(track_memory=True, track_cpu=True)

# Profile individual functions
A = np.random.randn(100, 100)
B = np.random.randn(100, 100)

basic_result = profiler.profile(matmul_custom, A, B, name="Basic")
optimized_result = profiler.profile(matmul_vectorized, A, B, name="Optimized")

# Students analyze the results themselves
print(f"Basic: {basic_result['wall_time']:.4f}s")
print(f"Optimized: {optimized_result['wall_time']:.4f}s")
speedup = basic_result['wall_time'] / optimized_result['wall_time']
print(f"Speedup: {speedup:.1f}x")
```

### Memory and CPU Analysis
```python
# Profile with detailed output
result = profiler.profile(matmul_cache_optimized, A, B, name="Cache-Optimized")
profiler.print_result(result, show_details=True)

# Access specific metrics
wall_time = result['wall_time']
memory_used = result['memory_delta_mb']
cpu_efficiency = result['cpu_efficiency']
```

### Key Optimization Insights
- **Vectorization**: Massive speedups from SIMD operations
- **Cache optimization**: Better memory access patterns for large matrices (see the block-size sweep sketch below)
- **Parallelization**: Utilizing multiple CPU cores effectively
- **Hardware awareness**: Understanding system architecture drives optimization

### Educational Approach
Students learn to:
1. **Measure**: Profile individual functions with comprehensive metrics
2. **Collect**: Gather data from multiple implementations
3. **Compare**: Calculate speedups and efficiency differences themselves
4. **Analyze**: Understand what the metrics mean for optimization

This teaches proper benchmarking methodology and critical thinking about performance!
"""

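# %% [markdown]
"""
### Optional: Sweeping Block Sizes

One concrete way to apply this methodology is to sweep `block_size` in `matmul_cache_optimized` and watch how wall time changes as blocks start fitting in (or spilling out of) cache. This sketch assumes `profiler.profile` forwards its positional arguments to the function, exactly as in the examples above; the matrix size and block sizes are arbitrary choices.

```python
from functools import partial
import numpy as np

profiler = SimpleProfiler(track_memory=True, track_cpu=True)

A = np.random.randn(256, 256).astype(np.float32)
B = np.random.randn(256, 256).astype(np.float32)

for bs in (8, 16, 32, 64, 128):
    fn = partial(matmul_cache_optimized, block_size=bs)
    result = profiler.profile(fn, A, B, name=f"block_size={bs}")
    print(f"block_size={bs:>3}: {result['wall_time']:.4f}s")
```

The best block size depends on your cache sizes and NumPy build, so treat the output as data to interpret, not a universal answer.
"""
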
# %% [markdown]
"""
## 🧪 Module Testing

Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.

**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified.
"""

# %% nbgrader={"grade": false, "grade_id": "standardized-testing", "locked": true, "schema_version": 3, "solution": false, "task": false}
# =============================================================================
# STANDARDIZED MODULE TESTING - DO NOT MODIFY
# This cell is locked to ensure consistent testing across all TinyTorch modules
# =============================================================================

if __name__ == "__main__":
    from tito.tools.testing import run_module_tests_auto

    # Automatically discover and run all tests in this module
    success = run_module_tests_auto("Kernels")

# %% [markdown]
"""
## 🎯 Module Summary: Hardware-Aware Optimization Mastery!

Congratulations! You've successfully implemented hardware-aware optimization techniques for ML systems:

### ✅ What You've Built
- **Vectorized Operations**: Leveraging SIMD instructions for massive speedups
- **Cache-Optimized Algorithms**: Memory hierarchy-aware blocked implementations
- **Parallel Processing**: Multi-core utilization with row-wise distribution
- **Performance Profiling**: Systematic measurement and analysis techniques

### ✅ Key Learning Outcomes
- **Hardware Understanding**: How CPU architecture drives optimization strategies
- **Implementation Skills**: Built optimizations from scratch with detailed guidance
- **Performance Analysis**: Learned to measure and compare implementations systematically
- **Real-world Connection**: Connected basic algorithms to production optimization

### ✅ Optimization Mastery
- **SIMD Vectorization**: Using hardware parallel arithmetic units
- **Memory Hierarchy**: Organizing computation for cache efficiency
- **Parallel Computing**: Distributing work across multiple cores
- **Profiling Methodology**: Measuring performance correctly and fairly

### ✅ Professional Skills Developed
- **Systems thinking**: Understanding hardware-software interaction
- **Optimization mindset**: Identifying performance bottlenecks and solutions
- **Benchmarking skills**: Fair comparison and analysis techniques
- **Production awareness**: How real ML systems achieve high performance

### ✅ Ready for Next Steps
Your optimization skills are now ready for:
- **Module 12**: Comprehensive benchmarking and performance analysis
- **Production Systems**: Understanding how PyTorch and TensorFlow optimize operations
- **Custom Kernels**: Writing specialized operations for specific hardware
- **Distributed Computing**: Scaling optimizations across multiple machines

### 🔗 Connection to Real ML Systems
Your implementations demonstrate the foundations of:
- **BLAS Libraries**: Intel MKL, OpenBLAS, cuBLAS optimization strategies
- **Framework Internals**: How PyTorch and TensorFlow achieve performance
- **Hardware Acceleration**: GPU kernels, TPU operations, specialized chips
- **Production Deployment**: Optimizing ML inference for real-world constraints

### 🎯 The Power of Hardware-Aware Programming
You've learned the essential mindset for high-performance computing:
- **Know your hardware**: Understanding system architecture guides optimization
- **Profile everything**: Measurement drives optimization decisions
- **Optimize systematically**: Vectorization → Memory → Parallelization
- **Think like production**: Real systems demand hardware-aware implementations

### 🧠 Optimization Insights
- **Why optimization matters**: Performance gaps can be 1000x+ between naive and optimized code
- **Hardware evolution**: Modern ML requires understanding of specialized accelerators
- **System design**: Optimization considerations influence software architecture
- **Continuous learning**: Hardware advances require constant learning and adaptation

**Next Module**: Benchmarking - Comprehensive performance analysis and systematic optimization!

You've built the optimizations. Now let's learn to analyze and benchmark them like production ML engineers!
"""