Add TinyTorch Profiler Utility

- Add tinytorch.utils.profiler following PyTorch's utils pattern
- Includes SimpleProfiler class for educational performance measurement
- Provides timing, memory usage, and system metrics
- Follows PyTorch's torch.utils.* organizational pattern
- Module 11: Kernels uses profiler for performance demonstrations

Features:
- Wall time and CPU time measurement
- Memory usage tracking (peak, delta, percentages)
- Array information (shape, size, dtype)
- CPU and system metrics
- Clean educational interface for ML performance learning

Import pattern:
  from tinytorch.utils.profiler import SimpleProfiler
Vijay Janapa Reddi
2025-07-14 13:04:44 -04:00
parent 100f0cc3fa
commit 4ea5a4e024
7 changed files with 2919 additions and 0 deletions


@@ -0,0 +1,264 @@
# 🚀 Module 11: Kernels - Hardware-Aware Optimization
## 📊 Module Info
- **Difficulty**: ⭐⭐⭐⭐⭐ Expert
- **Time Estimate**: 8-10 hours
- **Prerequisites**: All previous modules (00-10), especially Compression
- **Next Steps**: Benchmarking, MLOps modules
**Bridge the gap between algorithmic optimization and hardware-level performance engineering**
## 🎯 Learning Objectives
After completing this module, you will:
- Understand how to implement custom ML operations beyond NumPy
- Apply SIMD vectorization and CPU optimization techniques
- Optimize memory layout and access patterns for cache efficiency
- Implement GPU-style parallel computing concepts
- Build comprehensive performance profiling and benchmarking tools
- Create hardware-optimized operations for quantized and pruned models
## 🔗 Connection to Previous Modules
### What You Already Know
- **Compression (Module 10)**: *What* to optimize (model size, computation)
- **Layers (Module 03)**: Basic matrix multiplication with `matmul()`
- **Training (Module 09)**: High-level optimization workflows
- **Networks (Module 04)**: How operations compose into architectures
### The Performance Gap
Students understand **algorithmic optimization** but not **hardware optimization**:
- **Algorithmic**: Pruning, quantization, knowledge distillation
- **Hardware**: Memory layout, vectorization, parallel processing
## 🧠 Build → Use → Optimize
This module follows the **"Build → Use → Optimize"** pedagogical framework:
### 1. **Build**: Custom Operations
- Move beyond NumPy's black box implementations
- Implement matrix multiplication, convolution, and activations from scratch
- Understand the computational patterns underlying ML
### 2. **Use**: Performance Optimization
- Apply SIMD vectorization for CPU optimization
- Implement cache-friendly memory layouts
- Build GPU-style parallel computing concepts
### 3. **Optimize**: Real-World Integration
- Profile and benchmark performance improvements
- Integrate with compressed models from Module 10
- Bridge to production deployment considerations
## 📚 What You'll Build
### **Step 1: Understanding Custom Operations**
```python
# Move beyond NumPy to custom implementations
def matmul_custom(A, B):
    # Your low-level implementation
    return result

def relu_custom(x):
    # Understanding what happens inside activation functions
    return np.maximum(0, x)
```
### **Step 2: SIMD Vectorization**
```python
# CPU optimization with vector operations
def matmul_vectorized(A, B):
    # Use SIMD instructions for parallel computation
    return optimized_result
```
### **Step 3: Memory Layout Optimization**
```python
# Cache-friendly data structures
def matmul_cache_optimized(A, B):
    # Optimize memory access patterns
    return cache_friendly_result
```
### **Step 4: GPU-Style Parallel Computing**
```python
# Understand parallel computing concepts
def matmul_parallel(A, B):
    # Parallel processing patterns
    return parallel_result
```
### **Step 5: Performance Profiling**
```python
# Measure and optimize performance
profiler = SimpleProfiler(track_memory=True, track_cpu=True)
profiler.profile(matmul_custom, A, B, name="Custom")
profiler.profile(matmul_vectorized, A, B, name="Vectorized")
profiler.profile(matmul_parallel, A, B, name="Parallel")
```
### **Step 6: Compressed Model Kernels**
```python
# Hardware-optimized operations for compressed models
def quantized_matmul(A_int8, B_int8):
    # Optimized kernels for quantized models
    return result

def sparse_matmul(A_sparse, B):
    # Efficient sparse matrix operations
    return result
```
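To make Step 6 concrete, here is a minimal sketch of a quantized kernel, assuming simple symmetric int8 quantization. The name `quantized_matmul_sketch` and its scale handling are illustrative only, not the module's final API:

```python
import numpy as np

np.random.seed(0)  # reproducible demo

def quantized_matmul_sketch(A_int8, B_int8, scale_a, scale_b):
    # Accumulate in int32 (int8 products overflow int8), then dequantize
    acc = A_int8.astype(np.int32) @ B_int8.astype(np.int32)
    return acc.astype(np.float32) * (scale_a * scale_b)

# Symmetric int8 quantization of float inputs for the demo
A = np.random.randn(4, 4).astype(np.float32)
B = np.random.randn(4, 4).astype(np.float32)
scale_a = float(np.abs(A).max()) / 127.0
scale_b = float(np.abs(B).max()) / 127.0
A_q = np.round(A / scale_a).astype(np.int8)
B_q = np.round(B / scale_b).astype(np.int8)

approx = quantized_matmul_sketch(A_q, B_q, scale_a, scale_b)
print("max abs error vs float matmul:", np.max(np.abs(approx - A @ B)))
```

The integer matmul is cheap on real hardware; the error you print is the price of quantization, and it stays small because the accumulation happens in int32.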
## 🎓 Learning Path
### **Foundation Level**: Understanding Implementation
- See what happens inside NumPy operations
- Implement basic kernels with explicit loops
- Debug performance bottlenecks
### **Intermediate Level**: CPU Optimization
- Apply vectorization techniques
- Optimize memory access patterns
- Understand cache behavior
### **Advanced Level**: Parallel Computing
- Implement GPU-style parallel algorithms
- Profile and benchmark performance
- Integrate with compressed models
### **Expert Level**: Production Integration
- Build kernels for real deployment scenarios
- Optimize for specific hardware targets
- Connect to MLOps and monitoring systems
## 🔧 Technical Skills Developed
### **Low-Level Programming**
- Manual memory management
- Understanding of CPU architecture
- Assembly-level optimization concepts
### **Performance Engineering**
- Profiling and benchmarking
- Bottleneck identification
- Performance optimization strategies
### **Parallel Computing**
- Thread-level parallelism
- SIMD vectorization
- GPU computing concepts
### **Systems Integration**
- Hardware-software co-design
- Production deployment considerations
- Real-world performance constraints
## 🎯 Real-World Applications
### **Production ML Systems**
- Custom kernels for edge deployment
- Hardware-specific optimizations
- Real-time inference requirements
### **Research and Development**
- Prototype new operations
- Benchmark algorithm improvements
- Understand performance trade-offs
### **MLOps and Deployment**
- Optimize for specific hardware
- Monitor performance in production
- Scale to distributed systems
## 🚀 Getting Started
### Prerequisites Check
- ✅ Complete all previous modules (00-10)
- ✅ Understand compression techniques
- ✅ Familiar with NumPy operations
- ✅ Basic understanding of computer architecture
### Development Setup
```bash
# Navigate to the kernels module
cd modules/source/11_kernels
# Work in the development notebook
jupyter notebook kernels_dev.ipynb
# Or work in the Python file
code kernels_dev.py
```
## 📖 Module Structure
```
modules/source/11_kernels/
├── kernels_dev.py       # Main development file (work here!)
├── kernels_dev.ipynb    # Jupyter notebook version
├── tests/
│   └── test_kernels.py  # Performance and correctness tests
├── README.md            # This file
└── benchmarks/          # Performance benchmarking tools
```
## 🧪 Testing Your Implementation
### Performance Testing
```bash
# Run performance benchmarks
python -m pytest tests/test_kernels.py -v --benchmark
# Profile specific operations
python -c "from kernels_dev import benchmark_kernels; benchmark_kernels()"
```
### Integration Testing
```bash
# Test with compressed models
python -c "from kernels_dev import test_compressed_kernels; test_compressed_kernels()"
```
## 🎯 Success Criteria
You've mastered hardware-aware optimization when:
- ✅ You can implement custom ML operations from scratch
- ✅ You understand CPU optimization techniques (SIMD, caching)
- ✅ You can profile and benchmark performance improvements
- ✅ You can integrate kernels with compressed models
- ✅ You can bridge algorithmic and hardware optimization
## 🔍 Common Challenges
### **Performance Debugging**
- Use profiling tools to identify bottlenecks
- Understand the difference between algorithmic and implementation efficiency
- Learn to read performance metrics
### **Hardware Complexity**
- Start with CPU optimization before GPU concepts
- Focus on understanding principles, not memorizing details
- Use abstractions to manage complexity
### **Integration Complexity**
- Test kernels independently before integration
- Verify correctness before optimizing for performance
- Maintain compatibility with existing TinyTorch components
## 🚀 What's Next
After completing this module, you're ready for:
- **Module 12: Benchmarking** - Systematic performance measurement
- **Module 13: MLOps** - Production deployment and monitoring
- **Real-world applications** - Apply optimization skills to production systems
## 🤝 Getting Help
- Focus on understanding principles over memorizing techniques
- Use profiling tools to guide optimization decisions
- Connect optimization choices to real-world constraints
- Remember: **Build → Use → Optimize!**
---
**Ready to optimize ML systems for real-world performance?** 🚀
*This module bridges the gap between algorithmic optimization and hardware-level performance engineering, preparing you for production ML systems deployment.*

File diff suppressed because it is too large


@@ -0,0 +1,733 @@
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.17.1
# ---
# %% [markdown]
"""
# Module 11: Kernels - Hardware-Aware Optimization
Welcome to the Kernels module! This is where you'll learn to optimize ML operations for real hardware performance.
## Learning Goals
- Understand hardware optimization principles for ML systems
- Implement vectorized operations using SIMD capabilities
- Build cache-friendly algorithms with memory hierarchy awareness
- Create parallel processing implementations for multi-core systems
- Connect basic algorithms to production-level performance optimization
## Build → Use → Understand
1. **Build**: Implement hardware-aware optimization techniques from scratch
2. **Use**: Compare performance differences between basic and optimized implementations
3. **Understand**: How hardware characteristics drive optimization strategies
"""
# %% nbgrader={"grade": false, "grade_id": "kernels-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| default_exp core.kernels
import numpy as np
import time
import multiprocessing as mp
import sys
import os
from typing import Callable, Dict, Any, Optional, List, Tuple
# Import the basic matrix multiplication from Module 03
# This is the triple-loop implementation students built earlier
try:
    from tinytorch.core.layers import matmul_naive as matmul_from_module03
except ImportError:
    # For development, import from local modules
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers'))
    try:
        from layers_dev import matmul as matmul_from_module03
    except ImportError:
        # Fallback if we can't import, define it directly
        def matmul_from_module03(A: np.ndarray, B: np.ndarray) -> np.ndarray:
            rows_a, cols_a = A.shape
            rows_b, cols_b = B.shape
            if cols_a != rows_b:
                raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")
            result = np.zeros((rows_a, cols_b))
            for i in range(rows_a):
                for j in range(cols_b):
                    for k in range(cols_a):
                        result[i, j] += A[i, k] * B[k, j]
            return result

# Import shared profiling utilities
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'utils'))
from profiler import SimpleProfiler, profile_function
print("🚀 TinyTorch Kernels Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version.split()[0]}")
print(f"CPU count: {mp.cpu_count()}")
print("Ready to optimize ML systems for hardware!")
# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package
**Learning Side:** You work in `modules/source/11_kernels/kernels_dev.py`
**Building Side:** Code exports to `tinytorch.core.kernels`
```python
# Final package structure:
from tinytorch.core.kernels import matmul_vectorized, matmul_cache_optimized
from tinytorch.core.kernels import matmul_parallel # Hardware-optimized operations
```
**Why this matters:**
- **Learning:** Understand optimization from first principles
- **Production:** Real ML systems need hardware-aware implementations
- **Performance:** Bridge between theory and practical efficiency
- **Foundation:** Optimization skills transfer to all ML system components
"""
# %% [markdown]
"""
## Step 1: Understanding Our Baseline - The Module 03 Implementation
In Module 03, you implemented matrix multiplication using three nested loops. Let's use that **exact same function** as our baseline for optimization:
```python
# From Module 03 (your original implementation):
def matmul(A, B):
    rows_a, cols_a = A.shape
    rows_b, cols_b = B.shape
    result = np.zeros((rows_a, cols_b))
    for i in range(rows_a):
        for j in range(cols_b):
            for k in range(cols_a):
                result[i, j] += A[i, k] * B[k, j]
    return result
```
This is our **baseline** - the implementation we'll optimize in this module.
### Why This Baseline Matters
- **Your own code**: You built this in Module 03, so you understand it completely
- **Clear performance reference**: See exactly how much faster optimized versions are
- **Real improvement**: Measure actual performance gains of your optimizations
- **Educational value**: Connect basic concepts to hardware-aware programming
### Performance Characteristics of the Baseline
- **Time complexity**: O(n³) for n×n matrices
- **Memory access**: Not cache-friendly - jumps around memory
- **Parallelization**: No parallel execution
- **Vectorization**: No SIMD optimizations
**Goal**: Take this basic implementation and make it hardware-aware!
"""
# %%
#| export
def matmul_custom(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Baseline matrix multiplication using the exact same implementation from Module 03.

    This directly calls the matmul function you built in Module 03,
    showing the clear connection between modules and avoiding code duplication.
    """
    return matmul_from_module03(A, B)
# %% [markdown]
"""
## Step 2: Vectorized Operations - Leveraging SIMD
### What is Vectorization?
**Vectorization** means using Single Instruction, Multiple Data (SIMD) operations to process multiple data elements simultaneously. Modern CPUs can perform the same operation on multiple numbers at once.
### Why Vectorization Matters
- **SIMD Instructions**: Modern CPUs have 128-bit, 256-bit, or 512-bit registers
- **Parallel Arithmetic**: One instruction can operate on 4-16 numbers simultaneously
- **Automatic Optimization**: NumPy uses highly optimized BLAS libraries
- **Massive Speedup**: Often 10-100x faster than basic loops
### The NumPy Advantage
NumPy operations like `@` (matrix multiplication) automatically use:
- **Intel MKL**: Math Kernel Library with hand-optimized assembly
- **OpenBLAS**: Open-source optimized BLAS implementation
- **BLAS/LAPACK**: Industry-standard linear algebra routines
### Learning Connection
This is why production ML frameworks (PyTorch, TensorFlow) are built on optimized libraries rather than pure Python loops.
"""
# %% nbgrader={"grade": false, "grade_id": "matmul-vectorized", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_vectorized(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Vectorized matrix multiplication using NumPy's optimized operations.

    TODO: Implement vectorized matrix multiplication using NumPy's @ operator.

    STEP-BY-STEP IMPLEMENTATION:
    1. Validate that the matrices can be multiplied (A.shape[1] == B.shape[0])
    2. Use NumPy's @ operator for optimized matrix multiplication
    3. Return the result directly

    EXAMPLE:
        A = [[1, 2],   B = [[5, 6],   Result = [[19, 22],
             [3, 4]]        [7, 8]]             [43, 50]]

    IMPLEMENTATION HINTS:
    - Check shape compatibility with A.shape[1] == B.shape[0]
    - Use A @ B for the actual multiplication
    - NumPy handles all the SIMD optimization automatically
    - This should be much faster than the triple-loop version

    LEARNING CONNECTIONS:
    - This uses the same optimized libraries as PyTorch and TensorFlow
    - SIMD operations process multiple numbers simultaneously
    - This is why you should use library functions instead of writing loops
    """
    ### BEGIN SOLUTION
    # Validate matrix dimensions
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")
    # Use NumPy's optimized matrix multiplication
    return A @ B
    ### END SOLUTION
# %% [markdown]
"""
### 🧪 Test Your Vectorized Implementation
Once you implement the `matmul_vectorized` function above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-vectorized-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
def test_vectorized_matrix_multiplication():
    """Test vectorized matrix multiplication implementation"""
    print("🔬 Unit Test: Vectorized Matrix Multiplication...")
    # Test simple 2x2 case
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)
    result = matmul_vectorized(A, B)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
    assert np.allclose(result, expected), f"Vectorized multiplication failed: expected {expected}, got {result}"
    # Compare with baseline
    baseline_result = matmul_custom(A, B)
    assert np.allclose(result, baseline_result), f"Doesn't match baseline: got {result}, expected {baseline_result}"
    # Test different shapes
    A2 = np.array([[1, 2, 3]], dtype=np.float32)      # 1x3
    B2 = np.array([[4], [5], [6]], dtype=np.float32)  # 3x1
    result2 = matmul_vectorized(A2, B2)
    expected2 = np.array([[32]], dtype=np.float32)    # 1*4 + 2*5 + 3*6 = 32
    assert np.allclose(result2, expected2), f"Different shapes failed: got {result2}, expected {expected2}"
    print("✅ Vectorized matrix multiplication works correctly!")

# Run the test
test_vectorized_matrix_multiplication()
# %% [markdown]
"""
## Step 3: Cache-Optimized Implementation - Memory Hierarchy Awareness
### What is Cache Optimization?
**Cache optimization** means organizing memory accesses to work efficiently with the CPU's memory hierarchy. Modern processors have multiple levels of cache that are much faster than main memory.
### Memory Hierarchy
- **L1 Cache**: ~1 cycle, ~32KB, per-core
- **L2 Cache**: ~10 cycles, ~256KB, per-core
- **L3 Cache**: ~50 cycles, ~8MB, shared
- **Main Memory**: ~200 cycles, GB-scale
### Cache-Friendly Strategy: Blocking
**Blocking** (or tiling) divides large matrices into smaller blocks that fit in cache:
- Process one block at a time
- Reuse data while it's still in cache
- Reduce expensive memory fetches
- Better performance for large matrices
### Why This Matters
- **Cache misses are expensive**: 200x slower than cache hits
- **Locality of reference**: Access nearby data together
- **Real systems**: Production ML uses cache-aware algorithms
"""
# %% nbgrader={"grade": false, "grade_id": "matmul-cache", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_cache_optimized(A: np.ndarray, B: np.ndarray, block_size: int = 64) -> np.ndarray:
    """
    Cache-optimized matrix multiplication using a blocked algorithm.

    TODO: Implement cache-friendly matrix multiplication with blocking.

    STEP-BY-STEP IMPLEMENTATION:
    1. Validate matrix dimensions for multiplication
    2. Get matrix dimensions: m, n, p
    3. Initialize result matrix with zeros
    4. Use three nested loops over blocks (not elements):
       - i_block: iterate through row blocks of A
       - j_block: iterate through column blocks of B
       - k_block: iterate through shared dimension blocks
    5. For each block combination, extract submatrices and multiply
    6. Add the block result to the appropriate section of the output matrix

    EXAMPLE WALKTHROUGH:
    For 4x4 matrices with block_size=2:
    - Divide into 2x2 blocks
    - Multiply corresponding blocks
    - Accumulate results in the output matrix

    IMPLEMENTATION HINTS:
    - Use range(0, dimension, block_size) for block iteration
    - Extract blocks: A_block = A[i:i_end, k:k_end]
    - Use the @ operator for block multiplication
    - Accumulate: result[i:i_end, j:j_end] += A_block @ B_block
    - Handle edge cases where blocks don't divide evenly

    LEARNING CONNECTIONS:
    - This is how BLAS libraries optimize for the cache hierarchy
    - Block size should match cache size for optimal performance
    - Cache-aware algorithms are essential for large-scale ML
    """
    ### BEGIN SOLUTION
    # Validate matrix dimensions
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")
    m, n = A.shape
    n2, p = B.shape
    # Initialize result matrix
    result = np.zeros((m, p))
    # Blocked matrix multiplication
    for i in range(0, m, block_size):
        for j in range(0, p, block_size):
            for k in range(0, n, block_size):
                # Calculate block boundaries
                i_end = min(i + block_size, m)
                j_end = min(j + block_size, p)
                k_end = min(k + block_size, n)
                # Extract blocks
                A_block = A[i:i_end, k:k_end]
                B_block = B[k:k_end, j:j_end]
                # Multiply blocks and accumulate
                result[i:i_end, j:j_end] += A_block @ B_block
    return result
    ### END SOLUTION
# %% [markdown]
"""
### 🧪 Test Your Cache-Optimized Implementation
Once you implement the `matmul_cache_optimized` function above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-cache-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_cache_optimized_matrix_multiplication():
    """Test cache-optimized matrix multiplication implementation"""
    print("🔬 Unit Test: Cache-Optimized Matrix Multiplication...")
    # Test simple 2x2 case
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)
    result = matmul_cache_optimized(A, B, block_size=2)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
    assert np.allclose(result, expected), f"Cache-optimized multiplication failed: expected {expected}, got {result}"
    # Compare with baseline
    baseline_result = matmul_custom(A, B)
    assert np.allclose(result, baseline_result), f"Doesn't match baseline: got {result}, expected {baseline_result}"
    # Test larger matrix with different block sizes
    A2 = np.random.randn(8, 6).astype(np.float32)
    B2 = np.random.randn(6, 10).astype(np.float32)
    result_block2 = matmul_cache_optimized(A2, B2, block_size=2)
    result_block4 = matmul_cache_optimized(A2, B2, block_size=4)
    expected_large = A2 @ B2
    assert np.allclose(result_block2, expected_large), "Block size 2 failed on larger matrix"
    assert np.allclose(result_block4, expected_large), "Block size 4 failed on larger matrix"
    print("✅ Cache-optimized matrix multiplication works correctly!")

# Run the test
test_cache_optimized_matrix_multiplication()
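# %% [markdown]
"""
### 🔍 Exploring Block Size

The best `block_size` depends on how much of each block fits in cache, so it is worth measuring rather than guessing. This sketch (illustrative; it restates a minimal blocked multiply so it runs standalone, and the fastest size will vary by machine) shows how to gather the data yourself:

```python
import time
import numpy as np

def matmul_blocked(A, B, block_size):
    # Minimal blocked multiply, same scheme as matmul_cache_optimized above.
    # NumPy slicing clamps at the array edge, so uneven blocks are handled for free.
    m, n = A.shape
    _, p = B.shape
    out = np.zeros((m, p), dtype=np.float32)
    for i in range(0, m, block_size):
        for j in range(0, p, block_size):
            for k in range(0, n, block_size):
                out[i:i + block_size, j:j + block_size] += (
                    A[i:i + block_size, k:k + block_size]
                    @ B[k:k + block_size, j:j + block_size]
                )
    return out

A = np.random.randn(256, 256).astype(np.float32)
B = np.random.randn(256, 256).astype(np.float32)
for bs in (16, 32, 64, 128):
    start = time.perf_counter()
    out = matmul_blocked(A, B, bs)
    elapsed = time.perf_counter() - start
    print(f"block_size={bs:4d}: {elapsed:.4f}s")
```

Collect a few runs per size and compare: too small a block wastes time on loop overhead, too large a block spills out of cache.
"""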
# %% [markdown]
"""
## Step 4: Parallel Processing - Multi-Core Utilization
### What is Parallel Processing?
**Parallel processing** means distributing computation across multiple CPU cores to reduce overall execution time. Modern processors have multiple cores that can work simultaneously.
### Why Parallelization Matters
- **Multi-core CPUs**: Modern systems have 4-16+ cores
- **Independent operations**: Matrix rows can be computed independently
- **Linear speedup potential**: 4 cores → ~4x faster (ideally)
- **Real-world necessity**: Production systems must use all available cores
### Parallelization Strategy: Row-wise Distribution
- Divide matrix rows among available cores
- Each core computes its assigned rows independently
- No communication needed between cores during computation
- Simple and effective for matrix multiplication
### Learning Connection
This demonstrates the foundations of distributed computing and parallel algorithms used in modern ML training.
"""
# %% nbgrader={"grade": false, "grade_id": "matmul-parallel", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_parallel(A: np.ndarray, B: np.ndarray, num_processes: Optional[int] = None) -> np.ndarray:
    """
    Parallel matrix multiplication using multiple CPU cores.

    TODO: Implement parallel matrix multiplication with row-wise distribution.

    STEP-BY-STEP IMPLEMENTATION:
    1. Validate matrix dimensions for multiplication
    2. Set default number of processes to CPU count if not specified
    3. For very small matrices, use a single-threaded approach (overhead is not worth it)
    4. Calculate chunk size: divide rows among processes
    5. Process chunks sequentially (simulating parallel processing)
    6. Combine results using np.vstack()

    EXAMPLE WALKTHROUGH:
    For an 8x8 matrix with 4 cores:
    - Core 1: processes rows 0-1
    - Core 2: processes rows 2-3
    - Core 3: processes rows 4-5
    - Core 4: processes rows 6-7

    IMPLEMENTATION HINTS:
    - If A.shape[0] < 20, use A @ B directly
    - Calculate chunk_size = max(1, A.shape[0] // num_processes)
    - Use range(0, A.shape[0], chunk_size) to iterate through chunks
    - For each chunk: A_chunk = A[i:end_i], result_chunk = A_chunk @ B
    - Collect all chunks in a list, then np.vstack(results)

    LEARNING CONNECTIONS:
    - This is how distributed training works across multiple GPUs/machines
    - Row-wise parallelization is embarrassingly parallel
    - Real parallel processing would use multiprocessing.Pool
    """
    ### BEGIN SOLUTION
    # Validate dimensions
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Matrix dimensions incompatible: A is {A.shape}, B is {B.shape}")
    # Default to number of CPU cores
    if num_processes is None:
        num_processes = mp.cpu_count()
    # For very small matrices, use single-threaded approach
    if A.shape[0] < 20:
        return A @ B
    # Simple row-wise parallel processing simulation
    # (a real implementation would use multiprocessing.Pool)
    chunk_size = max(1, A.shape[0] // num_processes)
    results = []
    for i in range(0, A.shape[0], chunk_size):
        end_i = min(i + chunk_size, A.shape[0])
        # Process this chunk of rows
        A_chunk = A[i:end_i]
        chunk_result = A_chunk @ B
        results.append(chunk_result)
    # Combine results
    return np.vstack(results)
    ### END SOLUTION
# %% [markdown]
"""
### 🧪 Test Your Parallel Implementation
Once you implement the `matmul_parallel` function above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-parallel-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_parallel_matrix_multiplication():
    """Test parallel matrix multiplication implementation"""
    print("🔬 Unit Test: Parallel Matrix Multiplication...")
    # Test simple 2x2 case
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)
    result = matmul_parallel(A, B, num_processes=2)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
    assert np.allclose(result, expected), f"Parallel multiplication failed: expected {expected}, got {result}"
    # Compare with baseline
    baseline_result = matmul_custom(A, B)
    assert np.allclose(result, baseline_result), f"Doesn't match baseline: got {result}, expected {baseline_result}"
    # Test larger matrix
    A2 = np.random.randn(24, 16).astype(np.float32)
    B2 = np.random.randn(16, 20).astype(np.float32)
    result_parallel = matmul_parallel(A2, B2, num_processes=4)
    expected_large = A2 @ B2
    assert np.allclose(result_parallel, expected_large), "Parallel processing failed on larger matrix"
    # Test different number of processes
    result_parallel2 = matmul_parallel(A2, B2, num_processes=2)
    assert np.allclose(result_parallel2, expected_large), "Different process count failed"
    print("✅ Parallel matrix multiplication works correctly!")

# Run the test
test_parallel_matrix_multiplication()
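# %% [markdown]
"""
### 🔍 Going Beyond the Simulation: multiprocessing.Pool

The solution above only *simulates* parallelism by processing chunks sequentially. A true multi-process version can be sketched with Python's `multiprocessing.Pool`; the name `matmul_pool` and its helpers are illustrative, and process startup adds real overhead, so this only pays off for large matrices:

```python
import numpy as np
from multiprocessing import Pool

_B = None  # per-process copy of B, set once by the initializer

def _init_worker(B):
    # Runs once in each worker process, avoiding re-pickling B per task
    global _B
    _B = B

def _multiply_chunk(A_chunk):
    return A_chunk @ _B

def matmul_pool(A, B, num_processes=2):
    # Row-wise parallel matmul across real worker processes
    if A.shape[1] != B.shape[0]:
        raise ValueError(f"Cannot multiply matrices with shapes {A.shape} and {B.shape}")
    chunks = np.array_split(A, num_processes, axis=0)
    with Pool(processes=num_processes, initializer=_init_worker, initargs=(B,)) as pool:
        results = pool.map(_multiply_chunk, chunks)
    return np.vstack(results)

A_demo = np.random.randn(64, 32).astype(np.float32)
B_demo = np.random.randn(32, 16).astype(np.float32)

if __name__ == "__main__":
    print(np.allclose(matmul_pool(A_demo, B_demo), A_demo @ B_demo))
```

The row-wise split is the same as in `matmul_parallel`; the difference is that each chunk now runs in its own OS process, so the work genuinely uses multiple cores.
"""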
# %% [markdown]
"""
## 🧪 Unit Test: All Matrix Multiplication Implementations
**This is a unit test** - it tests all our matrix multiplication implementations for correctness.
"""
# %%
print("### 🧪 Unit Test: Matrix Multiplication Implementations")
print("**This is a unit test** - it tests our matrix multiplication implementations.")
# Test basic functionality
A = np.array([[1, 2], [3, 4]], dtype=np.float32)
B = np.array([[5, 6], [7, 8]], dtype=np.float32)
print("🔬 Unit Test: Matrix Multiplication...")
# Test our implementations
result_custom = matmul_custom(A, B)
result_vectorized = matmul_vectorized(A, B)
result_cache_optimized = matmul_cache_optimized(A, B)
result_parallel = matmul_parallel(A, B)
# Expected result
expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
print(f"✅ Custom result correct: {np.allclose(result_custom, expected)}")
print(f"✅ Vectorized result correct: {np.allclose(result_vectorized, expected)}")
print(f"✅ Cache-optimized result correct: {np.allclose(result_cache_optimized, expected)}")
print(f"✅ Parallel result correct: {np.allclose(result_parallel, expected)}")
print("📈 Progress: All implementations work correctly ✓")
# %% [markdown]
"""
## Performance Comparison: Your Optimizations in Action
Now let's see how much faster your optimized implementations are compared to the Module 03 baseline:
"""
# %%
print("🔬 Testing performance comparison...")
print("Students can collect their own performance data:")
# Create profiler for measuring performance
profiler = SimpleProfiler(track_memory=True, track_cpu=True)
# Test matrices
A = np.random.randn(50, 50).astype(np.float32)
B = np.random.randn(50, 50).astype(np.float32)
# Profile each implementation
basic_result = profiler.profile(matmul_custom, A, B, name="Module 03 Baseline")
vectorized_result = profiler.profile(matmul_vectorized, A, B, name="Vectorized")
cache_result = profiler.profile(matmul_cache_optimized, A, B, name="Cache-Optimized")
parallel_result = profiler.profile(matmul_parallel, A, B, name="Parallel")
# Students analyze results themselves
print(f"✓ Module 03 Baseline: {basic_result['wall_time']:.4f}s")
print(f"✓ Vectorized: {vectorized_result['wall_time']:.4f}s")
print(f"✓ Cache-Optimized: {cache_result['wall_time']:.4f}s")
print(f"✓ Parallel: {parallel_result['wall_time']:.4f}s")
# Calculate speedups (students learn to do this themselves)
if basic_result['wall_time'] > 0:
    vec_speedup = basic_result['wall_time'] / vectorized_result['wall_time']
    cache_speedup = basic_result['wall_time'] / cache_result['wall_time']
    parallel_speedup = basic_result['wall_time'] / parallel_result['wall_time']
    print(f"\n📊 Performance Summary:")
    print(f"🏆 Speedups vs Module 03 Baseline:")
    print(f"   Vectorized:      {vec_speedup:.1f}x faster")
    print(f"   Cache-Optimized: {cache_speedup:.1f}x faster")
    print(f"   Parallel:        {parallel_speedup:.1f}x faster")
print("✅ Performance comparison works")
print("📈 Progress: Kernels Module ✓")
print("\n🎯 Module 11: Kernels Summary:")
print("- Built hardware-aware optimizations from scratch")
print("- Implemented vectorized, cache-optimized, and parallel algorithms")
print("- Learned to profile and compare performance systematically")
print("- Connected basic algorithms to production optimization strategies")
print("- Ready for comprehensive benchmarking and real-world optimization!")
print("\n🎉 Module 11: Kernels - Complete!")
print("Ready for Module 12: Benchmarking!")
# %% [markdown]
"""
## 🔍 Profiling and Analysis
You've learned to use the shared profiler utility to measure individual functions. Here are examples of how to collect data and make your own comparisons:
### Basic Performance Analysis
```python
from utils.profiler import SimpleProfiler, profile_function
# Create profiler
profiler = SimpleProfiler(track_memory=True, track_cpu=True)
# Profile individual functions
A = np.random.randn(100, 100)
B = np.random.randn(100, 100)
basic_result = profiler.profile(matmul_custom, A, B, name="Basic")
optimized_result = profiler.profile(matmul_vectorized, A, B, name="Optimized")
# Students analyze the results themselves
print(f"Basic: {basic_result['wall_time']:.4f}s")
print(f"Optimized: {optimized_result['wall_time']:.4f}s")
speedup = basic_result['wall_time'] / optimized_result['wall_time']
print(f"Speedup: {speedup:.1f}x")
```
### Memory and CPU Analysis
```python
# Profile with detailed output
result = profiler.profile(matmul_cache_optimized, A, B, name="Cache-Optimized")
profiler.print_result(result, show_details=True)
# Access specific metrics
wall_time = result['wall_time']
memory_used = result['memory_delta_mb']
cpu_efficiency = result['cpu_efficiency']
```
### Key Optimization Insights
- **Vectorization**: Massive speedups from SIMD operations
- **Cache optimization**: Better memory access patterns for large matrices
- **Parallelization**: Utilizing multiple CPU cores effectively
- **Hardware awareness**: Understanding system architecture drives optimization
### Educational Approach
Students learn to:
1. **Measure**: Profile individual functions with comprehensive metrics
2. **Collect**: Gather data from multiple implementations
3. **Compare**: Calculate speedups and efficiency differences themselves
4. **Analyze**: Understand what the metrics mean for optimization
This teaches proper benchmarking methodology and critical thinking about performance!
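The measure → collect → compare → analyze workflow can be sketched end-to-end with the standard library alone (the `measure` helper below is an illustrative stand-in for `SimpleProfiler.profile`, using best-of-N timing to reduce noise):

```python
import time

def measure(func, *args, name=None, repeats=5, **kwargs):
    """Step 1: measure — best-of-N wall time for one function."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return {"name": name or func.__name__, "wall_time": best}

def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(100_000))

# Step 2: collect — one record per implementation
records = [measure(sum_loop, data), measure(sum, data, name="builtin_sum")]

# Step 3: compare — speedups relative to the slowest implementation
baseline = max(r["wall_time"] for r in records)
for r in records:
    print(f"{r['name']}: {r['wall_time']:.6f}s ({baseline / r['wall_time']:.1f}x)")
```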
"""
# %% [markdown]
"""
## 🧪 Module Testing
Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.
**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified.
"""
# %% nbgrader={"grade": false, "grade_id": "standardized-testing", "locked": true, "schema_version": 3, "solution": false, "task": false}
# =============================================================================
# STANDARDIZED MODULE TESTING - DO NOT MODIFY
# This cell is locked to ensure consistent testing across all TinyTorch modules
# =============================================================================
if __name__ == "__main__":
from tito.tools.testing import run_module_tests_auto
# Automatically discover and run all tests in this module
success = run_module_tests_auto("Kernels")
# %% [markdown]
"""
## 🎯 Module Summary: Hardware-Aware Optimization Mastery!
Congratulations! You've successfully implemented hardware-aware optimization techniques for ML systems:
### ✅ What You've Built
- **Vectorized Operations**: Leveraging SIMD instructions for massive speedups
- **Cache-Optimized Algorithms**: Memory hierarchy-aware blocked implementations
- **Parallel Processing**: Multi-core utilization with row-wise distribution
- **Performance Profiling**: Systematic measurement and analysis techniques
### ✅ Key Learning Outcomes
- **Hardware Understanding**: How CPU architecture drives optimization strategies
- **Implementation Skills**: Built optimizations from scratch with detailed guidance
- **Performance Analysis**: Learned to measure and compare implementations systematically
- **Real-world Connection**: Connected basic algorithms to production optimization
### ✅ Optimization Mastery
- **SIMD Vectorization**: Using hardware parallel arithmetic units
- **Memory Hierarchy**: Organizing computation for cache efficiency
- **Parallel Computing**: Distributing work across multiple cores
- **Profiling Methodology**: Measuring performance correctly and fairly
### ✅ Professional Skills Developed
- **Systems thinking**: Understanding hardware-software interaction
- **Optimization mindset**: Identifying performance bottlenecks and solutions
- **Benchmarking skills**: Fair comparison and analysis techniques
- **Production awareness**: How real ML systems achieve high performance
### ✅ Ready for Next Steps
Your optimization skills are now ready for:
- **Module 12**: Comprehensive benchmarking and performance analysis
- **Production Systems**: Understanding how PyTorch, TensorFlow optimize operations
- **Custom Kernels**: Writing specialized operations for specific hardware
- **Distributed Computing**: Scaling optimizations across multiple machines
### 🔗 Connection to Real ML Systems
Your implementations demonstrate the foundations of:
- **BLAS Libraries**: Intel MKL, OpenBLAS, cuBLAS optimization strategies
- **Framework Internals**: How PyTorch and TensorFlow achieve performance
- **Hardware Acceleration**: GPU kernels, TPU operations, specialized chips
- **Production Deployment**: Optimizing ML inference for real-world constraints
### 🎯 The Power of Hardware-Aware Programming
You've learned the essential mindset for high-performance computing:
- **Know your hardware**: Understanding system architecture guides optimization
- **Profile everything**: Measurement drives optimization decisions
- **Optimize systematically**: Vectorization → Memory → Parallelization
- **Think like production**: Real systems demand hardware-aware implementations
### 🧠 Optimization Insights
- **Why optimization matters**: Performance gaps can be 1000x+ between naive and optimized code
- **Hardware evolution**: Modern ML requires understanding of specialized accelerators
- **System design**: Optimization considerations influence software architecture
- **Continuous learning**: Hardware advances require constant learning and adaptation
**Next Module**: Benchmarking - Comprehensive performance analysis and systematic optimization!
You've built the optimizations. Now let's learn to analyze and benchmark them like production ML engineers!
"""


@@ -0,0 +1,77 @@
# TinyTorch Module Metadata
# Essential system information for CLI tools and build systems
name: "11_kernels"
title: "Kernels - Hardware-Aware Optimization"
description: "Custom operations, performance optimization, and hardware-aware computing for ML systems"
version: "1.0.0"
author: "TinyTorch Team"
# Dependencies - Used by CLI for module ordering and prerequisites
dependencies:
prerequisites: [
"00_setup", "01_tensor", "02_activations", "03_layers",
"04_networks", "05_cnn", "06_dataloader", "07_autograd",
"08_optimizers", "09_training", "10_compression"
]
enables: ["12_benchmarking", "13_mlops"]
# Package Export - What gets built into tinytorch package
exports_to: "tinytorch.core.kernels"
# File Structure - What files exist in this module
files:
dev_file: "kernels_dev.py"
test_file: "tests/test_kernels.py"
readme: "README.md"
benchmark_dir: "benchmarks/"
# Components - What's implemented in this module
components:
# Custom Operations
- "matmul_custom"
- "relu_custom"
- "conv2d_custom"
# Optimized Implementations
- "matmul_vectorized"
- "matmul_cache_optimized"
- "matmul_parallel"
# Compressed Model Kernels
- "quantized_matmul"
- "sparse_matmul"
- "pruned_conv2d"
# Performance Tools
- "KernelProfiler"
- "PerformanceBenchmark"
- "HardwareProfiler"
# Learning Objectives - What students will achieve
learning_objectives:
- "Implement custom ML operations beyond NumPy"
- "Apply SIMD vectorization and CPU optimization"
- "Optimize memory layout and cache efficiency"
- "Understand GPU-style parallel computing"
- "Build performance profiling tools"
- "Create hardware-optimized compressed model operations"
# Educational Approach
pedagogy:
framework: "Build → Use → Optimize"
difficulty: "Expert"
time_estimate: "8-10 hours"
# Integration Points - How this connects to other modules
integration:
builds_on: "10_compression" # Extends compression with hardware optimization
enables: "12_benchmarking" # Provides optimized kernels for benchmarking
connects_to: "13_mlops" # Hardware optimization for production deployment
# Testing Strategy
testing:
inline_tests: true
performance_tests: true
integration_tests: true
benchmark_tests: true


@@ -0,0 +1,9 @@
"""
TinyTorch Utils Package
Shared utilities for TinyTorch modules.
"""
from .profiler import SimpleProfiler, profile_function
__all__ = ['SimpleProfiler', 'profile_function']


@@ -0,0 +1,226 @@
"""
TinyTorch Utils: Simple Educational Profiler
A lightweight profiling utility for measuring performance of ML operations.
Focused on measuring individual functions - students do their own comparisons.
"""
import time
import sys
import gc
import numpy as np
from typing import Callable, Dict, Any, Optional
try:
import psutil
HAS_PSUTIL = True
except ImportError:
HAS_PSUTIL = False
try:
import tracemalloc
HAS_TRACEMALLOC = True
except ImportError:
HAS_TRACEMALLOC = False
class SimpleProfiler:
"""
Simple profiler for measuring individual function performance.
Measures timing, memory usage, and other key metrics for a single function.
Students collect multiple measurements and compare results themselves.
"""
def __init__(self, track_memory: bool = True, track_cpu: bool = True):
self.track_memory = track_memory and HAS_TRACEMALLOC
self.track_cpu = track_cpu and HAS_PSUTIL
if self.track_memory:
tracemalloc.start()
def _get_memory_info(self) -> Dict[str, Any]:
"""Get current memory information."""
if not self.track_memory:
return {}
try:
current, peak = tracemalloc.get_traced_memory()
return {
'current_memory_mb': current / 1024 / 1024,
'peak_memory_mb': peak / 1024 / 1024
}
except Exception:
return {}
def _get_cpu_info(self) -> Dict[str, Any]:
"""Get current CPU information."""
if not self.track_cpu:
return {}
try:
process = psutil.Process()
return {
'cpu_percent': process.cpu_percent(),
'memory_percent': process.memory_percent(),
'num_threads': process.num_threads()
}
except Exception:
return {}
def _get_array_info(self, result: Any) -> Dict[str, Any]:
"""Get information about numpy arrays."""
if not isinstance(result, np.ndarray):
return {}
return {
'result_shape': result.shape,
'result_dtype': str(result.dtype),
'result_size_mb': result.nbytes / 1024 / 1024,
'result_elements': result.size
}
def profile(self, func: Callable, *args, name: Optional[str] = None, warmup: bool = True, **kwargs) -> Dict[str, Any]:
"""
Profile a single function execution with comprehensive metrics.
Args:
func: Function to profile
*args: Arguments to pass to function
name: Optional name for the function (defaults to func.__name__)
warmup: Whether to do a warmup run (recommended for fair timing)
**kwargs: Keyword arguments to pass to function
Returns:
Dictionary with comprehensive performance metrics
Example:
profiler = SimpleProfiler()
result = profiler.profile(my_function, arg1, arg2, name="My Function")
print(f"Time: {result['wall_time']:.4f}s")
print(f"Memory: {result['memory_delta_mb']:.2f}MB")
"""
func_name = name or func.__name__
# Reset memory tracking
if self.track_memory:
tracemalloc.clear_traces()
# Warm up (important for fair comparison)
if warmup:
try:
warmup_result = func(*args, **kwargs)
del warmup_result
except Exception:
pass
# Force garbage collection for clean measurement
gc.collect()
# Get baseline measurements
memory_before = self._get_memory_info()
cpu_before = self._get_cpu_info()
# Time the actual execution
start_time = time.perf_counter()  # monotonic, high-resolution wall clock
start_cpu_time = time.process_time()
result = func(*args, **kwargs)
end_time = time.perf_counter()
end_cpu_time = time.process_time()
# Get post-execution measurements
memory_after = self._get_memory_info()
cpu_after = self._get_cpu_info()
# Calculate metrics
wall_time = end_time - start_time
cpu_time = end_cpu_time - start_cpu_time
profile_result = {
'name': func_name,
'wall_time': wall_time,
'cpu_time': cpu_time,
'cpu_efficiency': (cpu_time / wall_time) if wall_time > 0 else 0,
'result': result
}
# Add memory metrics
if self.track_memory and memory_before and memory_after:
profile_result.update({
'memory_before_mb': memory_before.get('current_memory_mb', 0),
'memory_after_mb': memory_after.get('current_memory_mb', 0),
'peak_memory_mb': memory_after.get('peak_memory_mb', 0),
'memory_delta_mb': memory_after.get('current_memory_mb', 0) - memory_before.get('current_memory_mb', 0)
})
# Add CPU metrics
if self.track_cpu and cpu_after:
profile_result.update({
'cpu_percent': cpu_after.get('cpu_percent', 0),
'memory_percent': cpu_after.get('memory_percent', 0),
'num_threads': cpu_after.get('num_threads', 1)
})
# Add array information
profile_result.update(self._get_array_info(result))
return profile_result
def print_result(self, profile_result: Dict[str, Any], show_details: bool = False) -> None:
"""
Print profiling results in a readable format.
Args:
profile_result: Result from profile() method
show_details: Whether to show detailed metrics
"""
name = profile_result['name']
wall_time = profile_result['wall_time']
print(f"📊 {name}: {wall_time:.4f}s")
if show_details:
if 'memory_delta_mb' in profile_result:
print(f" 💾 Memory: {profile_result['memory_delta_mb']:.2f}MB delta, {profile_result['peak_memory_mb']:.2f}MB peak")
if 'result_size_mb' in profile_result:
print(f" 🔢 Output: {profile_result['result_shape']} ({profile_result['result_size_mb']:.2f}MB)")
if 'cpu_efficiency' in profile_result:
print(f" ⚡ CPU efficiency: {profile_result['cpu_efficiency']:.2f}")
def get_capabilities(self) -> Dict[str, bool]:
"""Get information about profiler capabilities."""
return {
'memory_tracking': self.track_memory,
'cpu_tracking': self.track_cpu,
'has_psutil': HAS_PSUTIL,
'has_tracemalloc': HAS_TRACEMALLOC
}
# Convenience function for quick profiling
def profile_function(func: Callable, *args, name: Optional[str] = None,
show_details: bool = False, **kwargs) -> Dict[str, Any]:
"""
Quick profiling of a single function.
Args:
func: Function to profile
*args: Arguments to pass to function
name: Optional name for the function
show_details: Whether to print detailed metrics
**kwargs: Keyword arguments to pass to function
Returns:
Dictionary with profiling results
Example:
result = profile_function(my_matmul, A, B, name="Custom MatMul", show_details=True)
print(f"Execution time: {result['wall_time']:.4f}s")
"""
profiler = SimpleProfiler(track_memory=True, track_cpu=True)
result = profiler.profile(func, *args, name=name, **kwargs)
if show_details:
profiler.print_result(result, show_details=True)
return result