diff --git a/modules/source/03_layers/layers_dev.py b/modules/source/03_layers/layers_dev.py
index 2a83eb8b..2a5a90b6 100644
--- a/modules/source/03_layers/layers_dev.py
+++ b/modules/source/03_layers/layers_dev.py
@@ -144,33 +144,161 @@ C = A @ B
 
 Each element C[i,j] is the **dot product** of row i from A and column j from B.
 
-### Why Matrix Multiplication in Neural Networks?
-- **Dense layers**: Transform inputs through learned weights
-- **Batch processing**: Handle multiple samples at once
-- **Feature learning**: Each neuron learns different patterns
-- **Efficiency**: GPUs are optimized for matrix operations
+### The Mathematical Foundation: Linear Algebra in Neural Networks
 
-### Visual Example
+#### **Why Matrix Multiplication in Neural Networks?**
+Neural networks are fundamentally about **linear transformations** followed by **nonlinear activations**:
+
+```python
+# The core neural network operation:
+linear_output = weights @ input + bias  # Linear transformation (matrix multiplication)
+activation_output = activation_function(linear_output)  # Nonlinear transformation
+```
+
+#### **The Geometric Interpretation**
+Together with the bias term, matrix multiplication represents **geometric transformations** in high-dimensional space:
+
+- **Rotation**: Changing the orientation of data
+- **Scaling**: Stretching or compressing along certain dimensions
+- **Projection**: Mapping to lower or higher dimensional spaces
+- **Translation**: Shifting data (via bias terms)
+
+#### **Why This Matters for Learning**
+Each layer learns to transform the input space to make the final task easier:
+
+```python
+# Example: Image classification
+# raw_pixels → [Layer 1] → edges → [Layer 2] → shapes → [Layer 3] → objects → [Layer 4] → classes
+```
+
+### The Computational Perspective
+
+#### **Batch Processing Power**
+Matrix multiplication enables efficient batch processing:
+
+```python
+# Single sample (inefficient):
+for sample in batch:
+    output = weights @ sample + bias  # Process one at a time
+
+# Batch processing (efficient):
+batch_output = weights @ batch + bias  # Process all samples simultaneously
+```
+
+#### **Parallelization Benefits**
+- **CPU**: Multiple cores can compute different parts simultaneously
+- **GPU**: Thousands of cores excel at matrix operations
+- **TPU**: Specialized hardware designed for matrix multiplication
+- **Memory**: Contiguous memory access patterns improve cache efficiency
+
+#### **Computational Complexity**
+For matrices A(m×n) and B(n×p):
+- **Time complexity**: O(mnp), which is O(n³) for square n×n matrices
+- **Space complexity**: O(mp) - for the output matrix
+- **Optimization**: Libraries such as BLAS keep this O(mnp) algorithm but speed it up with blocking and vectorization; asymptotically faster algorithms (Strassen, etc.) exist but are less common in practice
+
+### Real-World Applications: Where Matrix Multiplication Shines
+
+#### **Computer Vision**
+```python
+# Convolutional layers can be expressed as matrix multiplication:
+# Image patches → Matrix A
+# Convolutional filters → Matrix B
+# Feature maps → Matrix C = A @ B
+```
+
+#### **Natural Language Processing**
+```python
+# Transformer attention mechanism:
+# Query matrix Q, Key matrix K, Value matrix V
+# Attention weights = softmax(Q @ K.T / sqrt(d_k))
+# Output = Attention_weights @ V
+```
+
+#### **Recommendation Systems**
+```python
+# Matrix factorization:
+# User-item matrix R ≈ User_factors @ Item_factors.T
+# Collaborative filtering through matrix operations
+```
+
+### The Algorithm: Understanding Every Step
+
+For matrices A(m×n) and B(n×p) → C(m×p):
+```python
+for i in range(m):  # For each row of A
+    for j in range(p):  # For each column of B
+        for k in range(n):  # Compute dot product
+            C[i,j] += A[i,k] * B[k,j]
+```
+
+#### **Visual Breakdown**
 ```
 A = [[1, 2],    B = [[5, 6],    C = [[19, 22],
      [3, 4]]         [7, 8]]         [43, 50]]
 
-C[0,0] = 1*5 + 2*7 = 19
-C[0,1] = 1*6 + 2*8 = 22
-C[1,0] = 3*5 + 4*7 = 43
-C[1,1] = 3*6 + 4*8 = 50
+C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] = 1*5 + 2*7 = 19
+C[0,1] = A[0,0]*B[0,1] + A[0,1]*B[1,1] = 1*6 + 2*8 = 22
+C[1,0] = A[1,0]*B[0,0] + A[1,1]*B[1,0] = 3*5 + 4*7 = 43
+C[1,1] = A[1,0]*B[0,1] + A[1,1]*B[1,1] = 3*6 + 4*8 = 50
 ```
 
-### The Algorithm
-For matrices A(m×n) and B(n×p) → C(m×p):
-```
-for i in range(m):
-    for j in range(p):
-        for k in range(n):
-            C[i,j] += A[i,k] * B[k,j]
+#### **Memory Access Pattern**
+- **Row-major order**: Access elements row by row for cache efficiency
+- **Cache locality**: Nearby elements are likely to be accessed together
+- **Blocking**: Divide large matrices into blocks for better cache usage
+
+### Performance Considerations: Making It Fast
+
+#### **Optimization Strategies**
+1. **Vectorization**: Use SIMD instructions for parallel element operations
+2. **Blocking**: Divide matrices into cache-friendly blocks
+3. **Loop unrolling**: Reduce loop overhead
+4. **Memory alignment**: Ensure data is aligned for optimal access
+
+#### **Modern Libraries**
+- **BLAS (Basic Linear Algebra Subprograms)**: Optimized matrix operations
+- **Intel MKL**: Highly optimized for Intel processors
+- **OpenBLAS**: Open-source optimized BLAS
+- **cuBLAS**: GPU-accelerated BLAS from NVIDIA
+
+#### **Why We Implement the Naive Version**
+Understanding the basic algorithm helps you:
+- **Debug performance issues**: Know what's happening under the hood
+- **Optimize for specific cases**: Custom implementations for special matrices
+- **Understand complexity**: Appreciate the optimizations in modern libraries
+- **Educational value**: See the mathematical foundation clearly
+
+### Connection to Neural Network Architecture
+
+#### **Layer Composition**
+```python
+# Each layer is a matrix multiplication (activations omitted for brevity):
+layer1_output = W1 @ input + b1
+layer2_output = W2 @ layer1_output + b2
+layer3_output = W3 @ layer2_output + b3
+
+# This is equivalent to:
+final_output = W3 @ (W2 @ (W1 @ input + b1) + b2) + b3
 ```
 
-Let's implement this to truly understand it!
+#### **Gradient Flow**
+During backpropagation, gradients flow through matrix operations:
+```python
+# Forward: y = W @ x + b
+# Backward:
+# dW = dy @ x.T
+# dx = W.T @ dy
+# db = dy.sum(axis=1)  # sum over the batch axis (samples are columns of x)
+```
+
+#### **Weight Initialization**
+The scale of each layer's matrix product depends on how the weights are initialized:
+- **Xavier/Glorot**: Maintains variance across layers
+- **He initialization**: Scales variance for ReLU activations
+- **Orthogonal**: Helps preserve gradient norms
+
+Let's implement matrix multiplication to truly understand it!
 """
 
 # %% nbgrader={"grade": false, "grade_id": "matmul-naive", "locked": false, "schema_version": 3, "solution": true, "task": false}
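
The triple loop described in the docstring can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the module's actual solution: the name `matmul_naive` and the list-of-rows representation are hypothetical, chosen so the example runs without any dependencies. It reproduces the 2×2 worked example from the docstring.

```python
def matmul_naive(A, B):
    """Multiply two matrices given as lists of rows (hypothetical helper)."""
    m, n = len(A), len(A[0])
    n_b, p = len(B), len(B[0])
    if n != n_b:
        # The inner dimensions must agree: A is m×n, B is n×p.
        raise ValueError(f"inner dimensions differ: {n} vs {n_b}")

    # Output matrix C is m×p, initialized to zeros.
    C = [[0] * p for _ in range(m)]
    for i in range(m):          # each row of A
        for j in range(p):      # each column of B
            for k in range(n):  # accumulate the dot product
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_naive(A, B)
print(C)  # [[19, 22], [43, 50]]
```

The loop body runs m·n·p times, which is where the O(mnp) time cost comes from; the only extra storage is the m×p output.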