mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-01 17:59:38 -05:00
Enhance layers module with comprehensive linear algebra foundations
- Added detailed mathematical foundation of matrix multiplication in neural networks
- Enhanced geometric interpretation of linear transformations
- Included computational perspective with batch processing and parallelization
- Added real-world applications (computer vision, NLP, recommendation systems)
- Comprehensive performance considerations and optimization strategies
- Connection to neural network architecture and gradient flow
- Educational focus on understanding the algorithm before optimization
@@ -144,33 +144,161 @@ C = A @ B

Each element C[i,j] is the **dot product** of row i from A and column j from B.

### Why Matrix Multiplication in Neural Networks?
- **Dense layers**: Transform inputs through learned weights
- **Batch processing**: Handle multiple samples at once
- **Feature learning**: Each neuron learns different patterns
- **Efficiency**: GPUs are optimized for matrix operations

### The Mathematical Foundation: Linear Algebra in Neural Networks

#### **Why Matrix Multiplication in Neural Networks?**
Neural networks are fundamentally about **linear transformations** followed by **nonlinear activations**:

```python
# The core neural network operation:
linear_output = weights @ input + bias                   # Linear transformation (matrix multiplication)
activation_output = activation_function(linear_output)   # Nonlinear transformation
```
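
As a minimal sketch of the operation above, assuming NumPy and ReLU as the activation. The `relu` helper, the shapes, and the numbers are illustrative choices, not part of the module:

```python
import numpy as np

def relu(z):
    # Elementwise ReLU, used here only as an example activation.
    return np.maximum(z, 0.0)

# Hypothetical layer with 3 inputs and 2 outputs.
weights = np.array([[1.0, -1.0, 0.5],
                    [0.0,  2.0, -1.0]])    # shape (2, 3)
bias = np.array([0.5, -2.0])               # shape (2,)
x = np.array([1.0, 2.0, 3.0])              # shape (3,)

linear_output = weights @ x + bias         # [1.0, -1.0]
activation_output = relu(linear_output)    # [1.0, 0.0]
```

The negative entry survives the linear step and is zeroed only by the activation, which is where the nonlinearity enters.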

#### **The Geometric Interpretation**
Matrix multiplication represents **geometric transformations** in high-dimensional space:

- **Rotation**: Changing the orientation of data
- **Scaling**: Stretching or compressing along certain dimensions
- **Projection**: Mapping to lower or higher dimensional spaces
- **Translation**: Shifting data. This comes from adding the bias term rather than from the matrix product, which makes a full layer an affine transformation
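
The transformations listed above can be sketched in two dimensions with NumPy. The matrices and the sample point are illustrative assumptions chosen so the results are easy to check by hand:

```python
import numpy as np

theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])   # rotate 90 degrees counterclockwise
scaling = np.array([[2.0, 0.0],
                    [0.0, 0.5]])                         # stretch x, compress y
projection = np.array([[1.0, 0.0]])                      # keep only the x-coordinate (2D to 1D)
shift = np.array([3.0, -1.0])                            # translation, added like a bias

point = np.array([1.0, 1.0])

rotated = rotation @ point        # approximately [-1, 1]
scaled = scaling @ point          # [2, 0.5]
projected = projection @ point    # [1]
translated = point + shift        # [4, 0]
```

The first three are matrix products; the last is an addition, which is why a layer needs a bias to shift data away from the origin.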

#### **Why This Matters for Learning**
Each layer learns to transform the input space to make the final task easier:

```
# Example: Image classification
raw_pixels → [Layer 1] → edges → [Layer 2] → shapes → [Layer 3] → objects → [Layer 4] → classes
```

### The Computational Perspective

#### **Batch Processing Power**
Matrix multiplication enables efficient batch processing. With one sample per row of `batch`:

```python
# Shapes: weights (n_out, n_in), bias (n_out,), batch (batch_size, n_in)

# Single sample (inefficient):
outputs = []
for sample in batch:
    outputs.append(weights @ sample + bias)   # Process one at a time

# Batch processing (efficient):
batch_output = batch @ weights.T + bias       # Process all samples simultaneously
```
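
A quick way to convince yourself that the loop and the single batched product compute the same thing is to compare them on random data. This sketch assumes NumPy and that samples are stored as rows of `batch`; the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch_size = 4, 3, 5
weights = rng.standard_normal((n_out, n_in))
bias = rng.standard_normal(n_out)
batch = rng.standard_normal((batch_size, n_in))    # one sample per row

# One matrix-vector product per sample, stacked into (batch_size, n_out).
looped = np.stack([weights @ sample + bias for sample in batch])

# One matrix-matrix product for the whole batch; bias broadcasts over rows.
batched = batch @ weights.T + bias

same = np.allclose(looped, batched)
```

The batched form hands the whole computation to one optimized routine instead of paying Python loop overhead per sample.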

#### **Parallelization Benefits**
- **CPU**: Multiple cores can compute different parts simultaneously
- **GPU**: Thousands of cores excel at matrix operations
- **TPU**: Specialized hardware designed for matrix multiplication
- **Memory**: Contiguous memory access patterns improve cache efficiency

#### **Computational Complexity**
For matrices A(m×n) and B(n×p):
- **Time complexity**: O(mnp) for the standard algorithm, which is O(n³) for square n×n matrices
- **Space complexity**: O(mp) - for the output matrix
- **Optimization**: Asymptotically faster algorithms exist (Strassen, etc.), but optimized libraries get most of their speed from blocking, vectorization, and parallelism
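
To make the O(mnp) count concrete, the following sketch tallies the work done by the standard triple-loop algorithm. The helper name `matmul_cost` is hypothetical and exists only for this illustration:

```python
def matmul_cost(m, n, p):
    # Operation counts for the standard algorithm on A (m x n) and B (n x p).
    multiply_adds = m * n * p      # one multiply-add per (i, j, k) triple
    output_elements = m * p        # number of entries in C
    return multiply_adds, output_elements

small = matmul_cost(2, 2, 2)       # (8, 4)
doubled = matmul_cost(4, 4, 4)     # (64, 16)
```

Doubling every dimension multiplies the arithmetic by eight but the output size by only four, which is why large matrix products are dominated by compute rather than by the size of the result.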

### Real-World Applications: Where Matrix Multiplication Shines

#### **Computer Vision**
```python
# Convolutional layers can be expressed as matrix multiplication:
# Image patches → Matrix A
# Convolutional filters → Matrix B
# Feature maps → Matrix C = A @ B
```
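
The patch-matrix idea above is often called im2col. Below is a minimal sketch for a single-channel image and a single filter, assuming stride 1, no padding, and the cross-correlation convention (no kernel flip) that deep learning frameworks typically use. The `im2col` helper here is written for this example and is not a library function:

```python
import numpy as np

def im2col(image, kh, kw):
    # Stack every kh x kw patch of a 2D image as one row (stride 1, no padding).
    h, w = image.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    rows = [image[i:i + kh, j:j + kw].ravel()
            for i in range(out_h) for j in range(out_w)]
    return np.array(rows), (out_h, out_w)

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0,  0.0],
                   [0.0, -1.0]])

patches, out_shape = im2col(image, 2, 2)                      # Matrix A: (9, 4)
feature_map = (patches @ kernel.ravel()).reshape(out_shape)   # C = A @ B, reshaped to (3, 3)
```

With several filters, `kernel.ravel()` becomes a matrix with one column per filter, and the same product yields all feature maps at once.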

#### **Natural Language Processing**
```python
# Transformer attention mechanism:
# Query matrix Q, Key matrix K, Value matrix V
# Attention weights = softmax(Q @ K.T / sqrt(d_k))
# Output = Attention_weights @ V
```
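
The attention formula above translates almost line for line into NumPy. This is a single-head sketch without masking or batching; the `softmax` and `attention` helpers and the random inputs are assumptions made for illustration:

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; subtracting the row max avoids overflow in exp.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n_queries, n_keys)
    return weights @ V, weights                 # output: (n_queries, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 queries of dimension d_k = 4
K = rng.standard_normal((5, 4))   # 5 keys
V = rng.standard_normal((5, 2))   # 5 values of dimension 2
output, weights = attention(Q, K, V)
```

Both heavy steps are matrix multiplications: `Q @ K.T` scores every query against every key, and `weights @ V` mixes the values according to those scores.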

#### **Recommendation Systems**
```python
# Matrix factorization:
# User-item matrix R ≈ User_factors @ Item_factors.T
# Collaborative filtering through matrix operations
```
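
As a small sketch of the factorization above, assume the factors have already been learned; the numbers here are made up for illustration. A predicted rating is then just the dot product of a user row and an item row:

```python
import numpy as np

# Hypothetical latent factors: 3 users, 4 items, 2 latent dimensions.
user_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0]])
item_factors = np.array([[5.0, 1.0],
                         [4.0, 2.0],
                         [1.0, 5.0],
                         [0.0, 3.0]])

predicted = user_factors @ item_factors.T   # (3 users, 4 items)
score = predicted[2, 1]                     # user 2, item 1: 1*4 + 1*2 = 6
```

One matrix product fills in a score for every user-item pair, including pairs that were never observed.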

### The Algorithm: Understanding Every Step

For matrices A(m×n) and B(n×p) → C(m×p):
```python
C = np.zeros((m, p))           # Start from an all-zero output (NumPy assumed imported as np)
for i in range(m):             # For each row of A
    for j in range(p):         # For each column of B
        for k in range(n):     # Compute dot product
            C[i, j] += A[i, k] * B[k, j]
```

#### **Visual Breakdown**
```
A = [[1, 2],    B = [[5, 6],    C = [[19, 22],
     [3, 4]]         [7, 8]]         [43, 50]]

C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] = 1*5 + 2*7 = 19
C[0,1] = A[0,0]*B[0,1] + A[0,1]*B[1,1] = 1*6 + 2*8 = 22
C[1,0] = A[1,0]*B[0,0] + A[1,1]*B[1,0] = 3*5 + 4*7 = 43
C[1,1] = A[1,0]*B[0,1] + A[1,1]*B[1,1] = 3*6 + 4*8 = 50
```
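
The worked example can be checked by wrapping the triple loop in a function and comparing against NumPy's `@` operator. The name `matmul_naive` is a placeholder chosen for this sketch and is not necessarily the function the module asks you to write:

```python
import numpy as np

def matmul_naive(A, B):
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("inner dimensions must match")
    C = np.zeros((m, p))
    for i in range(m):             # each row of A
        for j in range(p):         # each column of B
            for k in range(n):     # accumulate the dot product
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = matmul_naive(A, B)             # [[19, 22], [43, 50]]
```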
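#### **Memory Access Pattern**
- **Row-major order**: NumPy arrays are row-major by default, so walking along a row reads contiguous memory, while walking down a column (as `B[k,j]` does in the inner loop) jumps between rows
- **Cache locality**: Nearby elements are loaded into cache together, so accessing them consecutively is cheap
- **Blocking**: Divide large matrices into blocks that fit in cache, so each loaded block is reused many times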
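
Blocking can be sketched by looping over tiles instead of single elements. This version, assuming NumPy, uses `@` on each pair of tiles purely to show the structure; the function name and the tiny block size are illustrative, and real libraries choose block sizes to match the cache hierarchy:

```python
import numpy as np

def matmul_blocked(A, B, block=2):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i0 in range(0, m, block):            # tile rows of C
        for j0 in range(0, p, block):        # tile columns of C
            for k0 in range(0, n, block):    # tiles along the shared dimension
                # Slices that run past the edge are truncated, so ragged
                # edges are handled without special cases.
                C[i0:i0 + block, j0:j0 + block] += (
                    A[i0:i0 + block, k0:k0 + block] @ B[k0:k0 + block, j0:j0 + block]
                )
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))
B = rng.standard_normal((7, 3))
matches = np.allclose(matmul_blocked(A, B), A @ B)
```

The arithmetic is identical to the naive algorithm; only the order of the work changes, so that each tile is reused while it is still in cache.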

### Performance Considerations: Making It Fast

#### **Optimization Strategies**
1. **Vectorization**: Use SIMD instructions for parallel element operations
2. **Blocking**: Divide matrices into cache-friendly blocks
3. **Loop unrolling**: Reduce loop overhead
4. **Memory alignment**: Ensure data is aligned for optimal access
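
In Python, the closest analogue to the first strategy is replacing the innermost loop with one array operation, so the per-element work runs in compiled code. Whether SIMD instructions are actually used depends on the NumPy build and the hardware. The function name below is a hypothetical choice for this sketch:

```python
import numpy as np

def matmul_row_col(A, B):
    m, _ = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            # One vectorized dot product replaces the explicit loop over k.
            C[i, j] = A[i, :] @ B[:, j]
    return C

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
C = matmul_row_col(A, B)
```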

#### **Modern Libraries**
- **BLAS (Basic Linear Algebra Subprograms)**: Optimized matrix operations
- **Intel MKL**: Highly optimized for Intel processors
- **OpenBLAS**: Open-source optimized BLAS
- **cuBLAS**: GPU-accelerated BLAS from NVIDIA

#### **Why We Implement the Naive Version**
Understanding the basic algorithm helps you:
- **Debug performance issues**: Know what's happening under the hood
- **Optimize for specific cases**: Custom implementations for special matrices
- **Understand complexity**: Appreciate the optimizations in modern libraries
- **Educational value**: See the mathematical foundation clearly

### Connection to Neural Network Architecture

#### **Layer Composition**
```python
# Each layer is a matrix multiplication (activations between layers omitted for brevity):
layer1_output = W1 @ input + b1
layer2_output = W2 @ layer1_output + b2
layer3_output = W3 @ layer2_output + b3

# This is equivalent to:
final_output = W3 @ (W2 @ (W1 @ input + b1) + b2) + b3
```
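
One consequence of composing layers this way is worth checking numerically: without activations in between, stacked layers collapse into a single affine map. The following sketch, assuming NumPy and random two-layer weights, shows the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Two layers applied one after the other, no activation in between.
stacked = W2 @ (W1 @ x + b1) + b2

# The same map expressed as one layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
collapsed = W_combined @ x + b_combined

equivalent = np.allclose(stacked, collapsed)
```

This is why the nonlinear activation after each matrix multiplication matters: it is what lets depth add expressive power.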
#### **Gradient Flow**
During backpropagation, gradients flow through matrix operations:
```python
# Forward: y = W @ x + b      (x: (n_in, batch), y: (n_out, batch), samples as columns)
# Backward, given dy = dL/dy:
# dW = dy @ x.T               (n_out, n_in)
# dx = W.T @ dy               (n_in, batch)
# db = dy.sum(axis=1)         (n_out,), summed over the batch; for a single vector x, db = dy
```
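
These backward formulas can be sanity-checked against a finite-difference estimate. The sketch below assumes NumPy, samples stored as columns, a bias of shape `(n_out, 1)` so it broadcasts across the batch, and a simple squared-sum loss chosen only because its gradient with respect to `y` is `y` itself:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
x = rng.standard_normal((3, 4))          # 4 samples as columns

def loss(W, x, b):
    # Illustrative loss: half the sum of squared outputs.
    return 0.5 * np.sum((W @ x + b) ** 2)

y = W @ x + b
dy = y                                   # dL/dy for this particular loss
dW = dy @ x.T                            # (2, 3), same shape as W
dx = W.T @ dy                            # (3, 4), same shape as x
db = dy.sum(axis=1, keepdims=True)       # (2, 1), summed over the batch axis

# Central finite difference on a single entry of W.
eps = 1e-6
W_plus = W.copy()
W_plus[0, 1] += eps
W_minus = W.copy()
W_minus[0, 1] -= eps
numeric = (loss(W_plus, x, b) - loss(W_minus, x, b)) / (2 * eps)
gradient_matches = np.isclose(numeric, dW[0, 1], rtol=1e-4)
```

Each gradient has the same shape as the quantity it belongs to, which is a useful first check when a backward pass misbehaves.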

#### **Weight Initialization**
Matrix multiplication behavior depends on weight initialization:
- **Xavier/Glorot**: Maintains variance across layers
- **He initialization**: Optimized for ReLU activations
- **Orthogonal**: Preserves gradient norms
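
The three schemes above can be sketched with NumPy as follows. The function names, the uniform variant of Xavier, the normal variant of He, and the layer sizes are assumptions for illustration; frameworks offer several variants of each:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    # Uniform in [-limit, limit] with limit = sqrt(6 / (n_in + n_out)).
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    # Zero-mean normal with standard deviation sqrt(2 / n_in).
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

def orthogonal(n, rng):
    # The Q factor of a QR decomposition of a random matrix is orthogonal.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
W_xavier = xavier_uniform(256, 128, rng)
W_he = he_normal(256, 128, rng)
W_orth = orthogonal(64, rng)
```

For an orthogonal matrix, `W_orth.T @ W_orth` is the identity, so multiplying by it leaves vector lengths unchanged.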

Let's implement matrix multiplication to truly understand it!
"""

# %% nbgrader={"grade": false, "grade_id": "matmul-naive", "locked": false, "schema_version": 3, "solution": true, "task": false}