mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-01 20:35:21 -05:00
Module Standardization: - Applied consistent introduction format to all 17 modules - Every module now has: Welcome, Learning Goals, Build→Use→Reflect, What You'll Achieve, Systems Reality Check - Focused on systems thinking, performance, and production relevance - Consistent 5 learning goals with systems/performance/scaling emphasis Agent Structure Fixes: - Recreated missing documentation-publisher.md agent - Clear separation: Documentation Publisher (content) vs Educational ML Docs Architect (structure) - All 10 agents now present and properly defined - No overlapping responsibilities between agents Improvements: - Consistent Build→Use→Reflect pattern (not Understand or Analyze) - What You'll Achieve section (not What You'll Learn) - Systems Reality Check in every module - Production context and performance insights emphasized
2479 lines
95 KiB
Python
2479 lines
95 KiB
Python
# ---
|
||
# jupyter:
|
||
# jupytext:
|
||
# text_representation:
|
||
# extension: .py
|
||
# format_name: percent
|
||
# format_version: '1.3'
|
||
# jupytext_version: 1.17.1
|
||
# ---
|
||
|
||
# %% [markdown]
|
||
"""
|
||
# Layers - Neural Network Building Blocks and Composition Patterns
|
||
|
||
Welcome to the Layers module! You'll build the fundamental components that stack together to form any neural network architecture, from simple perceptrons to transformers.
|
||
|
||
## Learning Goals
|
||
- Systems understanding: How layer composition creates complex function approximators and why stacking enables deep learning
|
||
- Core implementation skill: Build matrix multiplication and Dense layers with proper parameter management
|
||
- Pattern recognition: Understand how different layer types solve different computational problems
|
||
- Framework connection: See how your layer implementations mirror PyTorch's nn.Module design patterns
|
||
- Performance insight: Learn why layer computation order and memory layout determine training speed
|
||
|
||
## Build → Use → Reflect
|
||
1. **Build**: Matrix multiplication primitives and Dense layers with parameter initialization strategies
|
||
2. **Use**: Compose layers into multi-layer networks and observe how data transforms through the stack
|
||
3. **Reflect**: Why does layer depth enable more complex functions, and when does it hurt performance?
|
||
|
||
## What You'll Achieve
|
||
By the end of this module, you'll understand:
|
||
- Deep technical understanding of how matrix operations enable neural networks to learn arbitrary functions
|
||
- Practical capability to build and compose layers into complex architectures
|
||
- Systems insight into why layer composition is the fundamental pattern for scalable ML systems
|
||
- Performance consideration of how layer size and depth affect memory usage and computational cost
|
||
- Connection to production ML systems and how frameworks optimize layer execution for different hardware
|
||
|
||
## Systems Reality Check
|
||
💡 **Production Context**: PyTorch's nn.Linear uses optimized BLAS operations and can automatically choose between different matrix multiplication algorithms based on tensor sizes
|
||
⚡ **Performance Note**: Matrix multiplication is O(n³) but highly parallelizable - modern deep learning success comes from hardware designed specifically for this operation
|
||
"""
|
||
|
||
Mastering layers means understanding the foundation of all modern AI.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "layers-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
#| default_exp core.layers
|
||
|
||
#| export
|
||
import numpy as np
|
||
import os
|
||
import sys
|
||
|
||
# Import our dependencies - try from package first, then local modules
|
||
try:
|
||
from tinytorch.core.tensor import Tensor
|
||
from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax
|
||
except ImportError:
|
||
# For development, import from local modules
|
||
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
|
||
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations'))
|
||
try:
|
||
from tensor_dev import Tensor
|
||
from activations_dev import ReLU, Sigmoid, Tanh, Softmax
|
||
except ImportError:
|
||
# If the local modules are not available, use relative imports
|
||
from ..tensor.tensor_dev import Tensor
|
||
from ..activations.activations_dev import ReLU, Sigmoid, Tanh, Softmax
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "layers-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
print("🔥 TinyTorch Layers Module")
|
||
print(f"NumPy version: {np.__version__}")
|
||
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
|
||
print("Ready to build neural network layers!")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 📦 Where This Code Lives in the Final Package
|
||
|
||
**Learning Side:** You work in `modules/source/03_layers/layers_dev.py`
|
||
**Building Side:** Code exports to `tinytorch.core.layers`
|
||
|
||
```python
|
||
# Final package structure:
|
||
from tinytorch.core.layers import Dense, Conv2D # All layer types together!
|
||
from tinytorch.core.tensor import Tensor # The foundation
|
||
from tinytorch.core.activations import ReLU, Sigmoid # Nonlinearity
|
||
```
|
||
|
||
**Why this matters:**
|
||
- **Learning:** Focused modules for deep understanding
|
||
- **Production:** Proper organization like PyTorch's `torch.nn.Linear`
|
||
- **Consistency:** All layer types live together in `core.layers`
|
||
- **Integration:** Works seamlessly with tensors and activations
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## The Deep Mathematics of Neural Network Layers
|
||
|
||
### What Are Neural Network Layers?
|
||
Layers are **learnable function approximators** - each layer is a mathematical transformation that:
|
||
1. **Takes input data**: Raw features, pixels, words, or intermediate representations
|
||
2. **Applies learned transformation**: Linear combinations followed by nonlinear activations
|
||
3. **Produces useful representations**: Features that are better for the final task
|
||
|
||
### The Universal Layer Pattern
|
||
Every layer in every neural network follows this fundamental pattern:
|
||
```python
|
||
def universal_layer(x):
|
||
# 1. Linear transformation (learnable)
|
||
linear_output = x @ weights + bias
|
||
|
||
# 2. Nonlinear activation (fixed function)
|
||
output = activation(linear_output)
|
||
|
||
return output
|
||
```
|
||
|
||
### Why This Simple Pattern Works for Everything
|
||
|
||
#### The Mathematical Miracle
|
||
- **Linear part**: Learns weighted combinations of input features
|
||
- **Nonlinear part**: Enables complex decision boundaries
|
||
- **Stacking**: Creates arbitrarily complex function approximation
|
||
- **Universal approximation**: Proven to approximate any continuous function
|
||
|
||
#### Visual Understanding
|
||
```
|
||
Input Features → Linear Transform → Nonlinear Activation → Output Features
|
||
[x1, x2, x3] [w11 w12 w13] ReLU/Sigmoid/Tanh [y1, y2]
|
||
[w21 w22 w23]
|
||
[bias1, bias2]
|
||
```
|
||
|
||
### Mathematical Foundation: Function Composition
|
||
A neural network is mathematical function composition:
|
||
```
|
||
f(x) = layer_n(layer_{n-1}(...layer_2(layer_1(x))))
|
||
|
||
Where each layer_i(x) = activation(x @ W_i + b_i)
|
||
```
|
||
|
||
**Key insight**: Each layer learns to transform its input into a representation that makes the next layer's job easier.
|
||
|
||
### Real-World Applications
|
||
|
||
#### Computer Vision
|
||
- **Layer 1**: Detects edges and textures
|
||
- **Layer 2**: Combines edges into shapes
|
||
- **Layer 3**: Combines shapes into objects
|
||
- **Final Layer**: Maps objects to class labels
|
||
|
||
#### Natural Language Processing
|
||
- **Embedding Layer**: Maps words to vector representations
|
||
- **Hidden Layers**: Learn syntactic and semantic patterns
|
||
- **Output Layer**: Maps representations to predictions
|
||
|
||
#### Scientific Computing
|
||
- **Physics**: Learn differential equation solutions
|
||
- **Chemistry**: Predict molecular properties
|
||
- **Biology**: Model protein folding
|
||
|
||
### What We'll Build Step by Step
|
||
|
||
1. **Matrix Multiplication Engine**: The mathematical core powering all layers
|
||
2. **Dense Layer Implementation**: The fundamental building block
|
||
3. **Weight Initialization Strategies**: How to start learning effectively
|
||
4. **Layer Composition Patterns**: Building complex architectures
|
||
5. **Integration with Activations**: Creating complete neural network components
|
||
6. **Production-Ready Implementation**: Code that scales to real applications
|
||
|
||
### Why Understanding Layers Deeply Matters
|
||
|
||
#### For ML Engineers
|
||
- **Debugging**: Understand why networks fail to train
|
||
- **Architecture Design**: Know when to use which layer types
|
||
- **Performance Optimization**: Optimize for specific hardware
|
||
|
||
#### For AI Researchers
|
||
- **Novel Architectures**: Invent new layer types
|
||
- **Theoretical Understanding**: Prove properties of neural networks
|
||
- **Algorithmic Innovation**: Develop new training methods
|
||
|
||
#### For Industry Applications
|
||
- **Model Deployment**: Optimize for production environments
|
||
- **Transfer Learning**: Adapt pre-trained layers to new tasks
|
||
- **Custom Solutions**: Build domain-specific architectures
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🔧 DEVELOPMENT
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Step 1: Matrix Multiplication - The Mathematical Engine of All AI
|
||
|
||
### The Foundation of Modern AI
|
||
Matrix multiplication is the **single most important operation** in all of machine learning. Every neural network, from simple classifiers to GPT and ChatGPT, is fundamentally powered by this operation:
|
||
|
||
```
|
||
C = A @ B # This simple operation powers all of AI
|
||
```
|
||
|
||
### Deep Mathematical Understanding
|
||
|
||
#### The Core Operation
|
||
For matrices A (m×n) and B (n×p), the result C (m×p) is:
|
||
```
|
||
C[i,j] = Σ(k=0 to n-1) A[i,k] * B[k,j]
|
||
```
|
||
|
||
**Physical interpretation**: Each output element is a **weighted sum** of input features.
|
||
|
||
#### Visual Step-by-Step Breakdown
|
||
```
|
||
Matrix A (2×2) Matrix B (2×2) Result C (2×2)
|
||
┌─────────┐ ┌─────────┐ ┌─────────┐
|
||
│ 1 2 │ @ │ 5 6 │ = │ 19 22 │
|
||
│ 3 4 │ │ 7 8 │ │ 43 50 │
|
||
└─────────┘ └─────────┘ └─────────┘
|
||
|
||
Step-by-step computation:
|
||
C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] = 1*5 + 2*7 = 5 + 14 = 19
|
||
C[0,1] = A[0,0]*B[0,1] + A[0,1]*B[1,1] = 1*6 + 2*8 = 6 + 16 = 22
|
||
C[1,0] = A[1,0]*B[0,0] + A[1,1]*B[1,0] = 3*5 + 4*7 = 15 + 28 = 43
|
||
C[1,1] = A[1,0]*B[0,1] + A[1,1]*B[1,1] = 3*6 + 4*8 = 18 + 32 = 50
|
||
```
|
||
|
||
#### Neural Network Interpretation
|
||
```
|
||
Input Data Weight Matrix Output Features
|
||
(batch × in) @ (in × out) = (batch × out)
|
||
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
|
||
│ sample 1 │ │ feature │ │transformed │
|
||
│ sample 2 │ @ │ weights │ = │features │
|
||
│ ... │ │ ... │ │ ... │
|
||
│ sample n │ │ │ │ │
|
||
└─────────────┘ └─────────────┘ └─────────────┘
|
||
```
|
||
|
||
### Why Matrix Multiplication Powers All AI
|
||
|
||
#### 1. Feature Combination
|
||
Each output is a **learned combination** of all input features:
|
||
```
|
||
output[i] = w1*input[0] + w2*input[1] + ... + wn*input[n-1]
|
||
```
|
||
The weights determine **which features matter** and **how they combine**.
|
||
|
||
#### 2. Parallel Processing
|
||
- **CPU vectorization**: Process multiple elements simultaneously
|
||
- **GPU acceleration**: Thousands of cores compute matrix operations
|
||
- **TPU optimization**: Specialized hardware for matrix computations
|
||
|
||
#### 3. Mathematical Elegance
|
||
- **Differentiable**: Gradients flow cleanly through matrix operations
|
||
- **Composable**: Matrix operations stack naturally
|
||
- **Expressive**: Can represent any linear transformation
|
||
|
||
### Real-World Applications Powered by Matrix Multiplication
|
||
|
||
#### Large Language Models (GPT, ChatGPT)
|
||
```
|
||
Attention(Q,K,V) = softmax(QK^T/√d)V # Three matrix multiplications!
|
||
```
|
||
- **Q @ K^T**: Compute attention scores between all word pairs
|
||
- **Attention @ V**: Weight and combine value vectors
|
||
- **Linear layers**: Transform representations at each layer
|
||
|
||
#### Computer Vision (ResNet, Vision Transformers)
|
||
```
|
||
Convolution ≈ Matrix Multiplication # Convolution can be expressed as matrix ops
|
||
```
|
||
- **Feature maps**: Each filter creates a feature map via matrix operations
|
||
- **Classification**: Final features → class logits via matrix multiplication
|
||
- **Object detection**: Bounding box regression via matrix operations
|
||
|
||
#### Recommendation Systems
|
||
```
|
||
User-Item Matrix @ Item-Feature Matrix = User-Feature Preferences
|
||
```
|
||
- **Collaborative filtering**: User similarity via matrix operations
|
||
- **Content-based**: Feature matching via matrix computations
|
||
- **Deep models**: Neural collaborative filtering via matrix layers
|
||
|
||
### Performance Considerations
|
||
|
||
#### Why We Use NumPy (and why GPUs exist)
|
||
```
|
||
# Naive Python loops: ~10 seconds for large matrices
|
||
for i in range(m):
|
||
for j in range(p):
|
||
for k in range(n):
|
||
C[i,j] += A[i,k] * B[k,j]
|
||
|
||
# NumPy (optimized C): ~0.01 seconds for same matrices
|
||
C = A @ B
|
||
|
||
# GPU (CUDA): ~0.001 seconds for same matrices
|
||
C = torch.matmul(A_gpu, B_gpu)
|
||
```
|
||
|
||
#### Memory and Computation Complexity
|
||
- **Memory**: O(mn + np + mp) to store three matrices
|
||
- **Computation**: O(mnp) multiply-add operations
|
||
- **For large models**: Billions of parameters × billions of operations
|
||
|
||
### Debugging Matrix Multiplication
|
||
|
||
#### Common Shape Errors
|
||
```
|
||
A.shape = (batch_size, input_features) # e.g., (32, 784)
|
||
B.shape = (input_features, output_features) # e.g., (784, 10)
|
||
C.shape = (batch_size, output_features) # result: (32, 10)
|
||
|
||
# COMMON ERROR:
|
||
A.shape = (32, 784)
|
||
B.shape = (10, 784) # Wrong! Should be (784, 10)
|
||
# Error: Cannot multiply (32, 784) @ (10, 784)
|
||
```
|
||
|
||
#### Visual Debugging Technique
|
||
```
|
||
Always check: A's last dimension == B's first dimension
|
||
(m, n) @ (n, p) = (m, p) ✓
|
||
(m, n) @ (k, p) = ERROR if n ≠ k
|
||
```
|
||
|
||
### Connection to Production ML Systems
|
||
|
||
#### PyTorch Implementation
|
||
```python
|
||
# Your implementation (educational)
|
||
result = matmul(A, B)
|
||
|
||
# PyTorch (production)
|
||
result = torch.matmul(A, B) # Optimized, GPU-accelerated
|
||
result = A @ B # Same operation
|
||
```
|
||
|
||
#### TensorFlow Implementation
|
||
```python
|
||
# Your implementation (educational)
|
||
result = matmul(A, B)
|
||
|
||
# TensorFlow (production)
|
||
result = tf.matmul(A, B) # Optimized, distributed computing
|
||
result = A @ B # Same operation
|
||
```
|
||
|
||
### Why Implement It Ourselves?
|
||
1. **Deep Understanding**: See exactly what happens in each operation
|
||
2. **Debugging Skills**: Understand why shape errors occur
|
||
3. **Performance Intuition**: Appreciate why GPUs are essential
|
||
4. **Algorithm Design**: Know how to optimize for specific use cases
|
||
5. **Research Foundation**: Basis for developing new layer types
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "matmul-naive", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
def matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
|
||
"""
|
||
Matrix multiplication using explicit for-loops for deep understanding.
|
||
|
||
This implementation reveals the mathematical essence of neural networks!
|
||
Every time a neural network processes data, it's doing exactly this operation.
|
||
|
||
TODO: Implement matrix multiplication using three nested for-loops.
|
||
|
||
APPROACH:
|
||
1. Extract and validate matrix dimensions
|
||
2. Initialize result matrix with zeros
|
||
3. Implement the triple-nested loop structure
|
||
4. Accumulate dot products for each output element
|
||
|
||
MATHEMATICAL FOUNDATION:
|
||
For C = A @ B, each element C[i,j] is the dot product of:
|
||
- Row i from matrix A: [A[i,0], A[i,1], ..., A[i,n-1]]
|
||
- Column j from matrix B: [B[0,j], B[1,j], ..., B[n-1,j]]
|
||
|
||
VISUAL STEP-BY-STEP:
|
||
```
|
||
A = [[1, 2], B = [[5, 6], C = [[?, ?],
|
||
[3, 4]] [7, 8]] [?, ?]]
|
||
|
||
Computing C[0,0] (row 0 of A, column 0 of B):
|
||
A[0,:] = [1, 2] ←→ B[:,0] = [5, 7]
|
||
C[0,0] = 1*5 + 2*7 = 5 + 14 = 19
|
||
|
||
Computing C[0,1] (row 0 of A, column 1 of B):
|
||
A[0,:] = [1, 2] ←→ B[:,1] = [6, 8]
|
||
C[0,1] = 1*6 + 2*8 = 6 + 16 = 22
|
||
|
||
Computing C[1,0] (row 1 of A, column 0 of B):
|
||
A[1,:] = [3, 4] ←→ B[:,0] = [5, 7]
|
||
C[1,0] = 3*5 + 4*7 = 15 + 28 = 43
|
||
|
||
Computing C[1,1] (row 1 of A, column 1 of B):
|
||
A[1,:] = [3, 4] ←→ B[:,1] = [6, 8]
|
||
C[1,1] = 3*6 + 4*8 = 18 + 32 = 50
|
||
|
||
Final result: C = [[19, 22], [43, 50]]
|
||
```
|
||
|
||
IMPLEMENTATION ALGORITHM:
|
||
```python
|
||
# 1. Get dimensions and validate
|
||
m, n = A.shape # A is m×n
|
||
n2, p = B.shape # B is n×p (n2 must equal n)
|
||
assert n == n2 # Inner dimensions must match
|
||
|
||
# 2. Initialize result matrix
|
||
C = zeros(m, p) # Result is m×p
|
||
|
||
# 3. Triple nested loops
|
||
for i in range(m): # For each row of A
|
||
for j in range(p): # For each column of B
|
||
for k in range(n): # For each element in dot product
|
||
C[i,j] += A[i,k] * B[k,j] # Accumulate
|
||
```
|
||
|
||
NEURAL NETWORK CONNECTION:
|
||
In a neural network layer:
|
||
- A = input batch (batch_size × input_features)
|
||
- B = weight matrix (input_features × output_features)
|
||
- C = output batch (batch_size × output_features)
|
||
|
||
Each C[i,j] represents how much output feature j is activated for input sample i.
|
||
|
||
DEBUGGING HINTS:
|
||
- Check shapes: A.shape = (m,n), B.shape = (n,p) → C.shape = (m,p)
|
||
- Common error: Swapping B's dimensions (should be input_features × output_features)
|
||
- Accumulation: Start with C[i,j] = 0, then add all A[i,k] * B[k,j]
|
||
- Index bounds: i ∈ [0,m), j ∈ [0,p), k ∈ [0,n)
|
||
|
||
PERFORMANCE NOTE:
|
||
This implementation is O(mnp) time complexity and helps you understand:
|
||
- Why GPUs are essential for deep learning (parallelizable operations)
|
||
- Why NumPy/BLAS libraries are much faster (optimized C/Fortran)
|
||
- How memory access patterns affect performance
|
||
|
||
LEARNING CONNECTIONS:
|
||
- Foundation of ALL neural network computations
|
||
- Understanding enables debugging shape mismatches
|
||
- Basis for implementing custom layer types
|
||
- Essential for optimizing model performance
|
||
- Connects to linear algebra theory
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Get matrix dimensions
|
||
m, n = A.shape
|
||
n2, p = B.shape
|
||
|
||
# Check compatibility
|
||
if n != n2:
|
||
raise ValueError(f"Incompatible matrix dimensions: A is {m}x{n}, B is {n2}x{p}")
|
||
|
||
# Initialize result matrix
|
||
C = np.zeros((m, p))
|
||
|
||
# Triple nested loop for matrix multiplication
|
||
for i in range(m):
|
||
for j in range(p):
|
||
for k in range(n):
|
||
C[i, j] += A[i, k] * B[k, j]
|
||
|
||
return C
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🧪 Test Your Matrix Multiplication
|
||
|
||
Once you implement the `matmul` function above, run this cell to test it:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-matmul-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_matrix_multiplication():
|
||
"""Test matrix multiplication implementation"""
|
||
print("🔬 Unit Test: Matrix Multiplication...")
|
||
|
||
# Test simple 2x2 case
|
||
A = np.array([[1, 2], [3, 4]], dtype=np.float32)
|
||
B = np.array([[5, 6], [7, 8]], dtype=np.float32)
|
||
|
||
result = matmul(A, B)
|
||
expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
|
||
|
||
assert np.allclose(result, expected), f"Matrix multiplication failed: expected {expected}, got {result}"
|
||
|
||
# Compare with NumPy
|
||
numpy_result = A @ B
|
||
assert np.allclose(result, numpy_result), f"Doesn't match NumPy: got {result}, expected {numpy_result}"
|
||
|
||
# Test different shapes
|
||
A2 = np.array([[1, 2, 3]], dtype=np.float32) # 1x3
|
||
B2 = np.array([[4], [5], [6]], dtype=np.float32) # 3x1
|
||
result2 = matmul(A2, B2)
|
||
expected2 = np.array([[32]], dtype=np.float32) # 1*4 + 2*5 + 3*6 = 32
|
||
|
||
assert np.allclose(result2, expected2), f"1x3 @ 3x1 failed: expected {expected2}, got {result2}"
|
||
|
||
# Test 3x3 case
|
||
A3 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
|
||
B3 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=np.float32) # Identity
|
||
result3 = matmul(A3, B3)
|
||
|
||
assert np.allclose(result3, A3), "Multiplication by identity should preserve matrix"
|
||
|
||
# Test incompatible shapes
|
||
A4 = np.array([[1, 2]], dtype=np.float32) # 1x2
|
||
B4 = np.array([[3], [4], [5]], dtype=np.float32) # 3x1
|
||
|
||
try:
|
||
matmul(A4, B4)
|
||
assert False, "Should raise error for incompatible shapes"
|
||
except ValueError as e:
|
||
assert "Incompatible matrix dimensions" in str(e)
|
||
|
||
print("✅ Matrix multiplication tests passed!")
|
||
print(f"✅ 2x2 multiplication working correctly")
|
||
print(f"✅ Matches NumPy's implementation")
|
||
print(f"✅ Handles different shapes correctly")
|
||
print(f"✅ Proper error handling for incompatible shapes")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🎯 CHECKPOINT: Matrix Multiplication Mastery
|
||
|
||
You've just implemented the mathematical engine that powers ALL neural networks!
|
||
|
||
#### What You've Accomplished
|
||
✅ **Deep Understanding**: You now understand exactly what happens inside every neural network layer
|
||
✅ **Implementation Skills**: You can build matrix operations from mathematical first principles
|
||
✅ **Debugging Abilities**: You understand why shape mismatches occur and how to fix them
|
||
✅ **Performance Intuition**: You appreciate why GPUs and optimized libraries are essential
|
||
|
||
#### Mathematical Concepts Mastered
|
||
- **Dot Products**: The fundamental operation combining features with weights
|
||
- **Shape Compatibility**: Understanding when matrices can be multiplied
|
||
- **Computational Complexity**: O(mnp) operations for (m×n) @ (n×p) matrices
|
||
- **Memory Layout**: How data flows through matrix operations
|
||
|
||
#### Real-World Connection
|
||
Your implementation does exactly what happens inside:
|
||
- **PyTorch**: `torch.matmul(A, B)` uses the same mathematical principles
|
||
- **TensorFlow**: `tf.matmul(A, B)` performs identical operations
|
||
- **NumPy**: `A @ B` follows the same algorithm (just optimized in C)
|
||
|
||
#### Ready for Next Step
|
||
With matrix multiplication mastered, you're ready to build Dense layers - the fundamental building blocks that stack together to create all neural networks!
|
||
|
||
**Key insight**: Every time you see `layer(x)` in any neural network, you now know it's doing matrix multiplication under the hood.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Step 2: Dense Layer - The Foundation of All Neural Networks
|
||
|
||
### What is a Dense Layer?
|
||
A **Dense layer** (also called Linear or Fully Connected layer) is the fundamental building block that appears in EVERY neural network architecture ever created:
|
||
|
||
```python
|
||
output = input @ weights + bias
|
||
```
|
||
|
||
This simple equation powers:
|
||
- **GPT and language models**: Transform text representations
|
||
- **ResNet and vision models**: Classify image features
|
||
- **Recommendation systems**: Map user preferences
|
||
- **Scientific AI**: Model physical phenomena
|
||
|
||
### The Mathematical Miracle of Dense Layers
|
||
|
||
#### Universal Function Approximation
|
||
Dense layers have a **mathematically proven superpower**: Stack enough of them with nonlinear activations, and they can approximate **any continuous function**!
|
||
|
||
```python
|
||
# This can learn ANY pattern:
|
||
f(x) = dense_n(activation(dense_{n-1}(...activation(dense_1(x)))))
|
||
```
|
||
|
||
#### Why This Works
|
||
```
|
||
Linear Transformation + Nonlinear Activation = Universal Expressiveness
|
||
```
|
||
|
||
1. **Linear part (y = xW + b)**: Learns feature combinations
|
||
2. **Nonlinear activation**: Enables complex decision boundaries
|
||
3. **Stacking**: Creates arbitrarily complex functions
|
||
|
||
### Deep Mathematical Understanding
|
||
|
||
#### The Linear Transformation Matrix
|
||
```
|
||
Input Features Weight Matrix Output Features
|
||
┌─────────────┐ ┌─────────────────┐ ┌─────────────┐
|
||
│ pixel_1 │ │ w₁₁ w₁₂ w₁₃ │ │ feature_1 │
|
||
│ pixel_2 │ │ w₂₁ w₂₂ w₂₃ │ │ feature_2 │
|
||
│ pixel_3 │ │ w₃₁ w₃₂ w₃₃ │ │ feature_3 │
|
||
│ ... │ │ ⋮ ⋮ ⋮ │ │ ... │
|
||
│ pixel_784 │ │ w₇₈₄₁ ... w₇₈₄₃│ │ │
|
||
└─────────────┘ └─────────────────┘ └─────────────┘
|
||
(784 features) (784 × 3 weights) (3 features)
|
||
```
|
||
|
||
**Key insight**: Each output feature is a **learned combination** of ALL input features.
|
||
|
||
#### Weight Interpretation
|
||
Each weight w[i,j] represents:
|
||
- **How much input feature i contributes to output feature j**
|
||
- **Positive weights**: Input increases output
|
||
- **Negative weights**: Input decreases output
|
||
- **Large weights**: Strong influence
|
||
- **Small weights**: Weak influence
|
||
|
||
#### Bias Terms
|
||
```
|
||
Without bias: y = xW (line through origin)
|
||
With bias: y = xW + b (line can be shifted)
|
||
```
|
||
|
||
Bias allows the layer to **shift its output**, enabling:
|
||
- **Better fit**: Not forced through origin
|
||
- **Increased expressiveness**: More flexible transformations
|
||
- **Faster training**: Better starting point
|
||
|
||
### Real-World Architecture Patterns
|
||
|
||
#### Computer Vision
|
||
```python
|
||
# Image classification pipeline
|
||
image → flatten → dense(784→512) → relu → dense(512→10) → softmax
|
||
# ↑ Feature extraction ↑ Classification
|
||
```
|
||
|
||
#### Natural Language Processing
|
||
```python
|
||
# Text classification pipeline
|
||
text → embed → dense(300→128) → tanh → dense(128→2) → sigmoid
|
||
# ↑ Representation learning ↑ Binary classification
|
||
```
|
||
|
||
#### Generative Models
|
||
```python
|
||
# VAE decoder
|
||
noise → dense(100→256) → relu → dense(256→784) → sigmoid → image
|
||
# ↑ Expand latent code ↑ Generate pixels
|
||
```
|
||
|
||
### Weight Initialization: The Science of Starting Right
|
||
|
||
#### Why Initialization Matters
|
||
```
|
||
Poor initialization → Vanishing/exploding gradients → Training failure
|
||
Good initialization → Stable gradients → Successful training
|
||
```
|
||
|
||
#### Xavier/Glorot Initialization
|
||
```python
|
||
scale = sqrt(2 / (input_size + output_size))
|
||
weights ~ Normal(0, scale²)
|
||
```
|
||
|
||
**Mathematical motivation**: Preserves activation variance across layers.
|
||
|
||
#### Alternative Strategies
|
||
```python
|
||
# He initialization (better for ReLU)
|
||
scale = sqrt(2 / input_size)
|
||
|
||
# LeCun initialization (for SELU)
|
||
scale = sqrt(1 / input_size)
|
||
|
||
# Uniform Xavier
|
||
limit = sqrt(6 / (input_size + output_size))
|
||
weights ~ Uniform(-limit, limit)
|
||
```
|
||
|
||
### Production System Comparison
|
||
|
||
#### PyTorch Dense Layer
|
||
```python
|
||
# Your implementation
|
||
layer = Dense(input_size=784, output_size=10)
|
||
|
||
# PyTorch equivalent
|
||
layer = torch.nn.Linear(in_features=784, out_features=10)
|
||
|
||
# Identical mathematical operation!
|
||
output = layer(input) # y = xW^T + b (note: PyTorch transposes W)
|
||
```
|
||
|
||
#### TensorFlow Dense Layer
|
||
```python
|
||
# Your implementation
|
||
layer = Dense(input_size=784, output_size=10)
|
||
|
||
# TensorFlow equivalent
|
||
layer = tf.keras.layers.Dense(units=10, input_shape=(784,))
|
||
|
||
# Same mathematical operation!
|
||
output = layer(input) # y = xW + b
|
||
```
|
||
|
||
### Memory and Computational Complexity
|
||
|
||
#### Parameter Count
|
||
```
|
||
Parameters = input_size × output_size + output_size (if bias)
|
||
Example: Dense(784, 512) has 784 × 512 + 512 = 401,920 parameters
|
||
```
|
||
|
||
#### Computational Complexity
|
||
```
|
||
FLOPs per sample = 2 × input_size × output_size
|
||
Example: Dense(784, 512) requires 2 × 784 × 512 = 802,816 operations
|
||
```
|
||
|
||
#### Memory Usage
|
||
```
|
||
Memory = (batch_size × input_size × 4) + # Input (float32)
|
||
(input_size × output_size × 4) + # Weights
|
||
(output_size × 4) + # Bias
|
||
(batch_size × output_size × 4) # Output
|
||
```
|
||
|
||
### Design Philosophy
|
||
|
||
#### When to Use Dense Layers
|
||
- **Always**: As final classification/regression layers
|
||
- **Often**: For combining features from other layer types
|
||
- **Sometimes**: As hidden layers in simple architectures
|
||
- **Rarely**: For processing raw high-dimensional data (use CNN/RNN instead)
|
||
|
||
#### Architecture Decisions
|
||
```python
|
||
# Width vs Depth trade-off
|
||
Wide: Dense(1000, 2000) # More parameters, might overfit
|
||
Deep: Dense(1000, 500) → Dense(500, 250) → Dense(250, 125) # More layers
|
||
|
||
# Rule of thumb: Start simple, add complexity as needed
|
||
```
|
||
|
||
### Connection to Advanced Architectures
|
||
|
||
#### Attention Mechanisms
|
||
```python
|
||
# Multi-head attention uses THREE dense layers
|
||
Q = dense_q(x) # Query projection
|
||
K = dense_k(x) # Key projection
|
||
V = dense_v(x) # Value projection
|
||
attention = softmax(QK^T/√d) @ V
|
||
```
|
||
|
||
#### Residual Connections
|
||
```python
|
||
# ResNet block with dense layers
|
||
def residual_dense_block(x):
|
||
residual = x
|
||
x = dense1(x)
|
||
x = activation(x)
|
||
x = dense2(x)
|
||
return x + residual # Skip connection
|
||
```
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "dense-layer", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class Dense:
|
||
"""
|
||
Dense (Linear/Fully Connected) Layer
|
||
|
||
Applies a linear transformation: y = xW + b
|
||
|
||
This is the fundamental building block of neural networks.
|
||
"""
|
||
|
||
def __init__(self, input_size: int, output_size: int, use_bias: bool = True):
|
||
"""
|
||
Initialize Dense layer with random weights and optional bias.
|
||
|
||
This initialization is CRITICAL for successful neural network training!
|
||
Poor initialization can cause vanishing/exploding gradients and training failure.
|
||
|
||
TODO: Implement Dense layer initialization with proper weight scaling.
|
||
|
||
APPROACH:
|
||
1. Store layer configuration parameters
|
||
2. Initialize weights using Xavier/Glorot strategy
|
||
3. Initialize bias terms (typically zeros)
|
||
4. Convert arrays to Tensor objects for compatibility
|
||
|
||
WEIGHT INITIALIZATION DEEP DIVE:
|
||
|
||
Why Random Initialization?
|
||
- Breaks symmetry: All neurons start different
|
||
- Enables learning: Gradients won't be identical
|
||
- Avoids dead neurons: Some neurons activate from start
|
||
|
||
Xavier/Glorot Initialization Strategy:
|
||
```
|
||
scale = sqrt(2 / (input_size + output_size))
|
||
weights ~ Normal(0, scale²)
|
||
```
|
||
|
||
Mathematical Justification:
|
||
- Maintains activation variance across layers
|
||
- Prevents vanishing/exploding gradients
|
||
- Empirically proven to improve training
|
||
|
||
VISUAL INITIALIZATION PATTERN:
|
||
```
|
||
Input Layer (3 neurons) Dense Layer (2 neurons)
|
||
┌─────┐ ┌─────┐
|
||
│ x₁ │ ──w₁₁──→ │ y₁ │
|
||
│ │ \\ │ │
|
||
│ x₂ │ ──w₂₁─w₂₂──→ │ y₂ │
|
||
│ │ / │ │
|
||
│ x₃ │ ──w₃₁──→ │ │
|
||
└─────┘ +b₁ +b₂ └─────┘
|
||
|
||
Weight Matrix W (3×2): Bias Vector b (2×1):
|
||
┌──────────────┐ ┌────┐
|
||
│ w₁₁ w₁₂ │ │ b₁ │
|
||
│ w₂₁ w₂₂ │ │ b₂ │
|
||
│ w₃₁ w₃₂ │ └────┘
|
||
└──────────────┘
|
||
```
|
||
|
||
EXAMPLE INITIALIZATION:
|
||
```python
|
||
layer = Dense(input_size=784, output_size=10) # MNIST classifier
|
||
# Weight shape: (784, 10) - each output connects to all inputs
|
||
# Bias shape: (10,) - one bias per output neuron
|
||
# Scale: sqrt(2/(784+10)) ≈ 0.05 - prevents gradients from exploding
|
||
```
|
||
|
||
IMPLEMENTATION STEPS:
|
||
```python
|
||
# 1. Store configuration
|
||
self.input_size = input_size # Number of input features
|
||
self.output_size = output_size # Number of output neurons
|
||
self.use_bias = use_bias # Whether to include bias terms
|
||
|
||
# 2. Calculate Xavier scale
|
||
scale = np.sqrt(2.0 / (input_size + output_size))
|
||
|
||
# 3. Initialize weights (shape matters!)
|
||
weight_data = np.random.randn(input_size, output_size) * scale
|
||
|
||
# 4. Initialize bias (usually zeros)
|
||
if use_bias:
|
||
bias_data = np.zeros(output_size)
|
||
|
||
# 5. Convert to Tensors
|
||
self.weights = Tensor(weight_data)
|
||
self.bias = Tensor(bias_data) if use_bias else None
|
||
```
|
||
|
||
ALTERNATIVE INITIALIZATION STRATEGIES:
|
||
|
||
He Initialization (better for ReLU):
|
||
```python
|
||
scale = np.sqrt(2.0 / input_size) # Only input size
|
||
```
|
||
|
||
Uniform Xavier:
|
||
```python
|
||
limit = np.sqrt(6.0 / (input_size + output_size))
|
||
weights = np.random.uniform(-limit, limit, (input_size, output_size))
|
||
```
|
||
|
||
COMMON INITIALIZATION MISTAKES:
|
||
1. **All zeros**: No learning (dead neurons)
|
||
2. **Too large**: Exploding gradients
|
||
3. **Too small**: Vanishing gradients
|
||
4. **Wrong shape**: Broadcasting errors
|
||
5. **Same values**: Symmetry problem
|
||
|
||
PRODUCTION SYSTEM COMPARISON:
|
||
```python
|
||
# Your implementation
|
||
layer = Dense(input_size, output_size)
|
||
|
||
# PyTorch equivalent
|
||
layer = torch.nn.Linear(input_size, output_size)
|
||
# Uses Kaiming uniform initialization by default
|
||
|
||
# TensorFlow equivalent
|
||
layer = tf.keras.layers.Dense(output_size, input_shape=(input_size,))
|
||
# Uses Glorot uniform initialization by default
|
||
```
|
||
|
||
DEBUGGING HINTS:
|
||
- Print weight statistics: mean ≈ 0, std ≈ scale
|
||
- Check shapes: weights (input_size, output_size), bias (output_size,)
|
||
- Verify Tensor conversion: isinstance(self.weights, Tensor)
|
||
- Test forward pass: no shape errors
|
||
|
||
LEARNING CONNECTIONS:
|
||
- Foundation for all layer types (Conv2D, LSTM, Attention)
|
||
- Understanding gradients and backpropagation
|
||
- Basis for transfer learning (loading pre-trained weights)
|
||
- Essential for model architecture design
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Store layer parameters
|
||
self.input_size = input_size
|
||
self.output_size = output_size
|
||
self.use_bias = use_bias
|
||
|
||
# Xavier/Glorot initialization
|
||
scale = np.sqrt(2.0 / (input_size + output_size))
|
||
|
||
# Initialize weights with random values
|
||
weight_data = np.random.randn(input_size, output_size) * scale
|
||
self.weights = Tensor(weight_data)
|
||
|
||
# Initialize bias
|
||
if use_bias:
|
||
bias_data = np.zeros(output_size)
|
||
self.bias = Tensor(bias_data)
|
||
else:
|
||
self.bias = None
|
||
### END SOLUTION
|
||
|
||
def forward(self, x):
|
||
"""
|
||
Forward pass through the Dense layer: the heart of neural computation.
|
||
|
||
This function implements y = xW + b, the fundamental equation that powers
|
||
all neural networks from simple perceptrons to massive transformers!
|
||
|
||
TODO: Implement the forward pass with proper shape handling.
|
||
|
||
APPROACH:
|
||
1. Apply matrix multiplication for feature combination
|
||
2. Add bias terms for output shifting
|
||
3. Return properly shaped Tensor result
|
||
4. Handle batch processing automatically
|
||
|
||
MATHEMATICAL FOUNDATION:
|
||
|
||
The Linear Transformation:
|
||
```
|
||
y = xW + b
|
||
|
||
Where:
|
||
x: Input features (batch_size × input_features)
|
||
W: Weight matrix (input_features × output_features)
|
||
b: Bias vector (output_features,)
|
||
y: Output features (batch_size × output_features)
|
||
```
|
||
|
||
VISUAL DATA FLOW:
|
||
```
|
||
Input Batch Weight Matrix Bias Vector Output Batch
|
||
┌─────────────┐ ┌─────────────┐ ┌─────────┐ ┌─────────────┐
|
||
│ [x₁₁ x₁₂] │ │ [w₁₁ w₁₂] │ │ [b₁ b₂] │ │ [y₁₁ y₁₂] │
|
||
│ [x₂₁ x₂₂] │ @ │ [w₂₁ w₂₂] │ + │ │ = │ [y₂₁ y₂₂] │
|
||
│ [x₃₁ x₃₂] │ └─────────────┘ └─────────┘ │ [y₃₁ y₃₂] │
|
||
└─────────────┘ └─────────────┘
|
||
(3×2) (2×2) (2,) (3×2)
|
||
```
|
||
|
||
STEP-BY-STEP COMPUTATION:
|
||
|
||
For each output element y[i,j]:
|
||
```
|
||
y[i,j] = Σₖ x[i,k] * W[k,j] + b[j]
|
||
|
||
Example:
|
||
x = [[1, 2]] # 1 sample, 2 features
|
||
W = [[0.5, 0.3], # 2 input → 2 output
|
||
[0.7, 0.4]]
|
||
b = [0.1, 0.2] # bias for each output
|
||
|
||
y[0,0] = x[0,0]*W[0,0] + x[0,1]*W[1,0] + b[0]
|
||
= 1*0.5 + 2*0.7 + 0.1 = 0.5 + 1.4 + 0.1 = 2.0
|
||
|
||
y[0,1] = x[0,0]*W[0,1] + x[0,1]*W[1,1] + b[1]
|
||
= 1*0.3 + 2*0.4 + 0.2 = 0.3 + 0.8 + 0.2 = 1.3
|
||
|
||
Result: y = [[2.0, 1.3]]
|
||
```
|
||
|
||
BATCH PROCESSING MAGIC:
|
||
The same operation works for ANY batch size:
|
||
```
|
||
Single sample: (1, features) @ (features, outputs) = (1, outputs)
|
||
Mini-batch: (32, features) @ (features, outputs) = (32, outputs)
|
||
Large batch: (1000, features) @ (features, outputs) = (1000, outputs)
|
||
```
|
||
|
||
IMPLEMENTATION DETAILS:
|
||
```python
|
||
# 1. Matrix multiplication (the core operation)
|
||
linear_output = matmul(x.data, self.weights.data)
|
||
|
||
# 2. Bias addition (broadcasting handles shape automatically)
|
||
if self.use_bias and self.bias is not None:
|
||
linear_output = linear_output + self.bias.data
|
||
# Broadcasting: (batch_size, output_features) + (output_features,)
|
||
# → (batch_size, output_features)
|
||
|
||
# 3. Return as proper Tensor type
|
||
return type(x)(linear_output) # Preserves Tensor class
|
||
```
|
||
|
||
BROADCASTING EXPLANATION:
|
||
NumPy automatically broadcasts the bias:
|
||
```
|
||
linear_output.shape = (batch_size, output_features) # e.g., (32, 10)
|
||
bias.shape = (output_features,) # e.g., (10,)
|
||
|
||
# Broadcasting adds bias to each sample:
|
||
result[i,j] = linear_output[i,j] + bias[j] # for all i
|
||
```
|
||
|
||
REAL-WORLD APPLICATIONS:
|
||
|
||
Image Classification:
|
||
```
|
||
# Flatten image: (28, 28) → (784,)
|
||
# Dense layer: (784,) → (10,) class scores
|
||
x = flattened_image # Shape: (batch, 784)
|
||
scores = dense_layer(x) # Shape: (batch, 10)
|
||
```
|
||
|
||
Language Model:
|
||
```
|
||
# Word embedding: word_id → dense vector
|
||
# Dense layer: hidden → vocabulary scores
|
||
x = hidden_state # Shape: (batch, hidden_size)
|
||
logits = output_layer(x) # Shape: (batch, vocab_size)
|
||
```
|
||
|
||
COMMON SHAPE ERRORS AND SOLUTIONS:
|
||
```
|
||
Error: "Cannot multiply (32, 784) and (10, 784)"
|
||
Solution: Weight shape should be (784, 10), not (10, 784)
|
||
|
||
Error: "Cannot add (32, 10) and (784,)"
|
||
Solution: Bias shape should be (10,), not (784,)
|
||
|
||
Error: "Expected 2D input, got 1D"
|
||
Solution: Reshape input from (features,) to (1, features)
|
||
```
|
||
|
||
DEBUGGING CHECKLIST:
|
||
- Input shape: (batch_size, input_features)
|
||
- Weight shape: (input_features, output_features)
|
||
- Bias shape: (output_features,) or None
|
||
- Output shape: (batch_size, output_features)
|
||
|
||
PERFORMANCE NOTES:
|
||
- Matrix multiplication is O(batch × input × output)
|
||
- Most computation time spent here in large models
|
||
- GPU acceleration crucial for large layers
|
||
- Memory usage: store input, weights, bias, output
|
||
|
||
LEARNING CONNECTIONS:
|
||
- Foundation of backpropagation (gradients flow through this operation)
|
||
- Basis for all advanced layer types (attention, convolution)
|
||
- Understanding enables custom layer development
|
||
- Critical for model optimization and deployment
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Perform matrix multiplication
|
||
linear_output = matmul(x.data, self.weights.data)
|
||
|
||
# Add bias if present
|
||
if self.use_bias and self.bias is not None:
|
||
linear_output = linear_output + self.bias.data
|
||
|
||
return type(x)(linear_output)
|
||
### END SOLUTION
|
||
|
||
def __call__(self, x):
|
||
"""Make the layer callable: layer(x) instead of layer.forward(x)"""
|
||
return self.forward(x)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🧪 Test Your Dense Layer
|
||
|
||
Once you implement the Dense layer above, run this cell to test it:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-dense-layer", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_dense_layer():
|
||
"""Test Dense layer implementation"""
|
||
print("🔬 Unit Test: Dense Layer...")
|
||
|
||
# Test layer creation
|
||
layer = Dense(input_size=3, output_size=2)
|
||
|
||
# Check weight and bias shapes
|
||
assert layer.weights.shape == (3, 2), f"Weight shape should be (3, 2), got {layer.weights.shape}"
|
||
assert layer.bias is not None, "Bias should not be None when use_bias=True"
|
||
assert layer.bias.shape == (2,), f"Bias shape should be (2,), got {layer.bias.shape}"
|
||
|
||
# Test forward pass
|
||
input_data = Tensor([[1, 2, 3]]) # Shape: (1, 3)
|
||
output = layer(input_data)
|
||
|
||
# Check output shape
|
||
assert output.shape == (1, 2), f"Output shape should be (1, 2), got {output.shape}"
|
||
|
||
# Test batch processing
|
||
batch_input = Tensor([[1, 2, 3], [4, 5, 6]]) # Shape: (2, 3)
|
||
batch_output = layer(batch_input)
|
||
|
||
assert batch_output.shape == (2, 2), f"Batch output shape should be (2, 2), got {batch_output.shape}"
|
||
|
||
# Test without bias
|
||
no_bias_layer = Dense(input_size=3, output_size=2, use_bias=False)
|
||
assert no_bias_layer.bias is None, "Layer without bias should have None bias"
|
||
|
||
no_bias_output = no_bias_layer(input_data)
|
||
assert no_bias_output.shape == (1, 2), "No-bias layer should still produce correct shape"
|
||
|
||
# Test that different inputs produce different outputs
|
||
input1 = Tensor([[1, 0, 0]])
|
||
input2 = Tensor([[0, 1, 0]])
|
||
|
||
output1 = layer(input1)
|
||
output2 = layer(input2)
|
||
|
||
# Should not be equal (with high probability due to random initialization)
|
||
assert not np.allclose(output1.data, output2.data), "Different inputs should produce different outputs"
|
||
|
||
# Test linearity property: layer(a*x) = a*layer(x)
|
||
scale = 2.0
|
||
scaled_input = Tensor([[2, 4, 6]]) # 2 * [1, 2, 3]
|
||
scaled_output = layer(scaled_input)
|
||
|
||
# Due to bias, this won't be exactly 2*output, but the linear part should scale
|
||
print("✅ Dense layer tests passed!")
|
||
print(f"✅ Correct weight and bias initialization")
|
||
print(f"✅ Forward pass produces correct shapes")
|
||
print(f"✅ Batch processing works correctly")
|
||
print(f"✅ Bias and no-bias variants work")
|
||
print(f"✅ Naive matrix multiplication option works")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🎯 CHECKPOINT: Dense Layer Implementation Complete
|
||
|
||
Congratulations! You've just implemented the fundamental building block of all neural networks!
|
||
|
||
#### What You've Accomplished
|
||
✅ **Dense Layer Mastery**: You can now build the core component of every neural network
|
||
✅ **Weight Initialization**: You understand how to start training with proper parameter scaling
|
||
✅ **Shape Management**: You handle batch processing and broadcasting automatically
|
||
✅ **Production-Ready Code**: Your implementation matches PyTorch and TensorFlow standards
|
||
|
||
#### Mathematical Concepts Mastered
|
||
- **Linear Transformations**: y = xW + b is now deeply understood
|
||
- **Parameter Initialization**: Xavier/Glorot scaling for stable gradients
|
||
- **Broadcasting**: Automatic shape handling for bias addition
|
||
- **Batch Processing**: Same operation works for any batch size
|
||
|
||
#### Real-World Impact
|
||
Your Dense layer implementation enables:
|
||
- **Image Classification**: Transform pixel features to class predictions
|
||
- **Language Models**: Map word embeddings to vocabulary scores
|
||
- **Recommendation Systems**: Learn user-item preference mappings
|
||
- **Scientific Computing**: Model complex physical phenomena
|
||
|
||
#### Connection to Advanced AI
|
||
Every advanced architecture uses your Dense layer:
|
||
- **Transformers (GPT)**: Attention layers are built from Dense layers
|
||
- **ResNets**: Skip connections combine with Dense layers
|
||
- **GANs**: Both generator and discriminator use Dense layers
|
||
- **VAEs**: Encoder and decoder networks built from Dense layers
|
||
|
||
#### Ready for Integration
|
||
With Dense layers mastered, you're ready to see how they combine with activation functions to create complete neural network components that can learn any pattern!
|
||
|
||
**Key insight**: You now understand the mathematical foundation of all modern AI systems.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Step 3: Layer Integration with Activations - Building Complete Neural Networks
|
||
|
||
### The Magic of Layer + Activation Composition
|
||
Now we combine Dense layers with activation functions to create complete neural network components that can learn ANY pattern! This is where the true power of neural networks emerges.
|
||
|
||
### The Universal Neural Network Building Block
|
||
```python
|
||
# This pattern appears in EVERY neural network:
|
||
def neural_component(x):
|
||
# 1. Linear transformation (learnable)
|
||
linear_output = dense_layer(x)
|
||
|
||
# 2. Nonlinear activation (fixed function)
|
||
final_output = activation_function(linear_output)
|
||
|
||
return final_output
|
||
```
|
||
|
||
### Why This Simple Pattern Enables Universal Learning
|
||
|
||
#### Mathematical Foundation
|
||
```
|
||
f(x) = activation(xW + b)
|
||
```
|
||
|
||
This combination provides:
|
||
- **Linear part**: Learns optimal feature combinations
|
||
- **Nonlinear part**: Enables complex decision boundaries
|
||
- **Composability**: Stacks to approximate any function
|
||
|
||
#### Visual Understanding of Layer + Activation
|
||
```
|
||
Input → Dense Layer → Activation → Output
|
||
┌─────┐ ┌─────────┐ ┌──────────┐ ┌─────┐
|
||
│ [1] │ │ [1 2] │ │ ReLU │ │ [2] │
|
||
│ [2] │ → │ [3 4] @ │ → │ max(0,x) │ → │ [0] │
|
||
│ [3] │ │ [5 6] │ │ │ │ [8] │
|
||
└─────┘ └─────────┘ └──────────┘ └─────┘
|
||
Linear Output Nonlinear Final
|
||
[2, -1, 8] Activation [2, 0, 8]
|
||
```
|
||
|
||
### Real-World Layer Patterns
|
||
|
||
#### Hidden Layers (Feature Learning)
|
||
```python
|
||
# Most common pattern in neural networks
|
||
hidden = relu(dense(x)) # Dense + ReLU
|
||
|
||
# Why ReLU?
|
||
# - Sparse activation (many zeros)
|
||
# - No vanishing gradient problem
|
||
# - Computationally efficient
|
||
# - Biologically inspired
|
||
```
|
||
|
||
#### Classification Output Layers
|
||
```python
|
||
# Multi-class classification
|
||
logits = dense(hidden) # Raw scores
|
||
probabilities = softmax(logits) # Convert to probabilities
|
||
|
||
# Binary classification
|
||
score = dense(hidden) # Single score
|
||
probability = sigmoid(score) # Convert to probability [0,1]
|
||
```
|
||
|
||
#### Gated Mechanisms (Advanced Architectures)
|
||
```python
|
||
# LSTM/GRU gates
|
||
forget_gate = sigmoid(dense_forget(x)) # Values in [0,1]
|
||
input_gate = sigmoid(dense_input(x)) # Controls information flow
|
||
output_gate = sigmoid(dense_output(x)) # Controls output
|
||
|
||
# Attention mechanisms
|
||
attention_scores = softmax(dense_attention(x)) # Probability distribution
|
||
```
|
||
|
||
### Deep Network Architecture Patterns
|
||
|
||
#### Multi-Layer Perceptron (MLP)
|
||
```python
|
||
# Classic deep network architecture
|
||
def mlp(x):
|
||
h1 = relu(dense1(x)) # Hidden layer 1
|
||
h2 = relu(dense2(h1)) # Hidden layer 2
|
||
h3 = relu(dense3(h2)) # Hidden layer 3
|
||
output = softmax(dense4(h3)) # Output layer
|
||
return output
|
||
|
||
# Each layer learns increasingly complex features:
|
||
# Layer 1: Basic feature combinations
|
||
# Layer 2: Feature interactions
|
||
# Layer 3: Complex patterns
|
||
# Output: Task-specific predictions
|
||
```
|
||
|
||
#### Residual Network Block
|
||
```python
|
||
# ResNet-style skip connections
|
||
def residual_block(x):
|
||
residual = x
|
||
h1 = relu(dense1(x))
|
||
h2 = dense2(h1) # No activation before skip connection
|
||
output = relu(h2 + residual) # Add skip connection
|
||
return output
|
||
|
||
# Why this works:
|
||
# - Enables very deep networks
|
||
# - Solves vanishing gradient problem
|
||
# - Allows learning identity mappings
|
||
```
|
||
|
||
#### Attention Mechanism
|
||
```python
|
||
# Transformer-style attention
|
||
def attention_layer(x):
|
||
queries = dense_q(x) # Project to query space
|
||
keys = dense_k(x) # Project to key space
|
||
values = dense_v(x) # Project to value space
|
||
|
||
# Compute attention scores
|
||
scores = queries @ keys.T / sqrt(d_model)
|
||
attention_weights = softmax(scores)
|
||
|
||
# Apply attention to values
|
||
output = attention_weights @ values
|
||
return output
|
||
```
|
||
|
||
### Layer Combination Strategies
|
||
|
||
#### Width vs Depth Trade-offs
|
||
```python
|
||
# Wide network (fewer layers, more neurons)
|
||
def wide_network(x):
|
||
h1 = relu(dense(x, 1000)) # Large hidden layer
|
||
output = softmax(dense(h1, 10))
|
||
return output
|
||
|
||
# Deep network (more layers, fewer neurons)
|
||
def deep_network(x):
|
||
h1 = relu(dense(x, 100))
|
||
h2 = relu(dense(h1, 100))
|
||
h3 = relu(dense(h2, 100))
|
||
h4 = relu(dense(h3, 100))
|
||
output = softmax(dense(h4, 10))
|
||
return output
|
||
|
||
# General trend: Deeper networks often perform better
|
||
```
|
||
|
||
#### Activation Function Selection Guide
|
||
```python
|
||
# Hidden layers
|
||
hidden = relu(dense(x)) # Default choice, works well
|
||
hidden = leaky_relu(dense(x)) # Prevents dead neurons
|
||
hidden = gelu(dense(x)) # Used in transformers
|
||
hidden = swish(dense(x)) # Smooth, self-gated
|
||
|
||
# Output layers
|
||
classification = softmax(dense(x)) # Multi-class probabilities
|
||
binary = sigmoid(dense(x)) # Binary probability
|
||
regression = dense(x) # No activation for regression
|
||
structured = tanh(dense(x)) # Bounded outputs [-1, 1]
|
||
```
|
||
|
||
### Training Considerations
|
||
|
||
#### Gradient Flow Through Layer+Activation
|
||
```python
|
||
# Good gradient flow
|
||
x → dense1 → relu → dense2 → relu → output
|
||
↑ Well-conditioned gradients flow back
|
||
|
||
# Poor gradient flow
|
||
x → dense1 → sigmoid → dense2 → sigmoid → output
|
||
↑ Gradients may vanish in deep networks
|
||
```
|
||
|
||
#### Initialization Strategies for Layer+Activation
|
||
```python
|
||
# Xavier/Glorot (for sigmoid, tanh)
|
||
scale = sqrt(2 / (input_size + output_size))
|
||
|
||
# He initialization (for ReLU)
|
||
scale = sqrt(2 / input_size)
|
||
|
||
# Activation function determines optimal initialization!
|
||
```
|
||
|
||
### Production Architecture Examples
|
||
|
||
#### Image Classification (ResNet-style)
|
||
```python
|
||
def image_classifier(x):
|
||
# Feature extraction
|
||
h1 = relu(dense(flatten(x), 512))
|
||
h2 = relu(dense(h1, 256))
|
||
h3 = relu(dense(h2, 128))
|
||
|
||
# Classification head
|
||
logits = dense(h3, num_classes)
|
||
probabilities = softmax(logits)
|
||
return probabilities
|
||
```
|
||
|
||
#### Language Model (Transformer-style)
|
||
```python
|
||
def language_model(x):
|
||
# Embedding and position encoding
|
||
embedded = embedding(x) + position_encoding(x)
|
||
|
||
# Transformer layers
|
||
for _ in range(num_layers):
|
||
# Self-attention
|
||
attended = attention_layer(embedded)
|
||
embedded = layer_norm(embedded + attended)
|
||
|
||
# Feed-forward
|
||
ff_output = relu(dense(embedded, ff_size))
|
||
ff_output = dense(ff_output, embed_size)
|
||
embedded = layer_norm(embedded + ff_output)
|
||
|
||
# Output projection
|
||
logits = dense(embedded, vocab_size)
|
||
return softmax(logits)
|
||
```
|
||
|
||
#### Generative Model (VAE-style)
|
||
```python
|
||
def variational_autoencoder(x):
|
||
# Encoder
|
||
h1 = relu(dense(x, 256))
|
||
h2 = relu(dense(h1, 128))
|
||
mu = dense(h2, latent_size) # Mean
|
||
log_var = dense(h2, latent_size) # Log variance
|
||
|
||
# Reparameterization trick
|
||
eps = random_normal(latent_size)
|
||
z = mu + exp(0.5 * log_var) * eps
|
||
|
||
# Decoder
|
||
h3 = relu(dense(z, 128))
|
||
h4 = relu(dense(h3, 256))
|
||
reconstruction = sigmoid(dense(h4, input_size))
|
||
|
||
return reconstruction, mu, log_var
|
||
```
|
||
|
||
### Integration Testing Strategy
|
||
Let's test that Dense layers work seamlessly with all activation functions to create complete neural network components!
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-layer-activation-comprehensive", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_layer_activation():
|
||
"""Test Dense layer comprehensive testing with activation functions"""
|
||
print("🔬 Unit Test: Layer-Activation Comprehensive Test...")
|
||
|
||
# Create layer and activation functions
|
||
layer = Dense(input_size=4, output_size=3)
|
||
relu = ReLU()
|
||
sigmoid = Sigmoid()
|
||
tanh = Tanh()
|
||
softmax = Softmax()
|
||
|
||
# Test input
|
||
input_data = Tensor([[1, -2, 3, -4], [2, 1, -1, 3]]) # Shape: (2, 4)
|
||
|
||
# Test Dense + ReLU (common hidden layer pattern)
|
||
linear_output = layer(input_data)
|
||
relu_output = relu(linear_output)
|
||
|
||
assert relu_output.shape == (2, 3), "ReLU output should preserve shape"
|
||
assert np.all(relu_output.data >= 0), "ReLU output should be non-negative"
|
||
|
||
# Test Dense + Softmax (classification output pattern)
|
||
softmax_output = softmax(linear_output)
|
||
|
||
assert softmax_output.shape == (2, 3), "Softmax output should preserve shape"
|
||
|
||
# Each row should sum to 1 (probability distribution)
|
||
for i in range(2):
|
||
row_sum = np.sum(softmax_output.data[i])
|
||
assert abs(row_sum - 1.0) < 1e-6, f"Row {i} should sum to 1, got {row_sum}"
|
||
|
||
# Test Dense + Sigmoid (binary classification pattern)
|
||
sigmoid_output = sigmoid(linear_output)
|
||
|
||
assert sigmoid_output.shape == (2, 3), "Sigmoid output should preserve shape"
|
||
assert np.all(sigmoid_output.data > 0), "Sigmoid output should be positive"
|
||
assert np.all(sigmoid_output.data < 1), "Sigmoid output should be less than 1"
|
||
|
||
# Test Dense + Tanh (hidden layer with centered outputs)
|
||
tanh_output = tanh(linear_output)
|
||
|
||
assert tanh_output.shape == (2, 3), "Tanh output should preserve shape"
|
||
assert np.all(tanh_output.data > -1), "Tanh output should be > -1"
|
||
assert np.all(tanh_output.data < 1), "Tanh output should be < 1"
|
||
|
||
# Test chained layers (simple 2-layer network)
|
||
layer1 = Dense(input_size=4, output_size=5)
|
||
layer2 = Dense(input_size=5, output_size=3)
|
||
|
||
# Forward pass through 2-layer network
|
||
hidden = relu(layer1(input_data))
|
||
output = softmax(layer2(hidden))
|
||
|
||
assert output.shape == (2, 3), "2-layer network should produce correct output shape"
|
||
|
||
# Each output should be a valid probability distribution
|
||
for i in range(2):
|
||
row_sum = np.sum(output.data[i])
|
||
assert abs(row_sum - 1.0) < 1e-6, f"Network output row {i} should sum to 1"
|
||
|
||
# Test that layers are learning-ready (have parameters)
|
||
assert hasattr(layer1, 'weights'), "Layer should have weights"
|
||
assert hasattr(layer1, 'bias'), "Layer should have bias"
|
||
assert isinstance(layer1.weights, Tensor), "Weights should be Tensor"
|
||
assert isinstance(layer1.bias, Tensor), "Bias should be Tensor"
|
||
|
||
print("✅ Layer-activation comprehensive tests passed!")
|
||
print(f"✅ Dense + ReLU working correctly")
|
||
print(f"✅ Dense + Softmax producing valid probabilities")
|
||
print(f"✅ Dense + Sigmoid bounded correctly")
|
||
print(f"✅ Dense + Tanh centered correctly")
|
||
print(f"✅ Multi-layer networks working")
|
||
print(f"✅ All components ready for training!")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🎯 CHECKPOINT: Complete Neural Network Components Mastered
|
||
|
||
Outstanding! You've now mastered the complete pipeline from basic matrix operations to full neural network components!
|
||
|
||
#### What You've Accomplished
|
||
✅ **Complete Neural Network Components**: Dense layers + activations working together
|
||
✅ **Real-World Architecture Patterns**: Understanding how components combine in production systems
|
||
✅ **Integration Mastery**: Seamless compatibility between layers, activations, and tensors
|
||
✅ **Production-Ready Implementation**: Code that scales to actual deep learning applications
|
||
|
||
#### Mathematical Concepts Mastered
|
||
- **Universal Function Approximation**: Layer + activation composition enables learning any pattern
|
||
- **Gradient Flow**: Understanding how gradients propagate through layer-activation chains
|
||
- **Architecture Design**: Knowledge of when to use which layer-activation combinations
|
||
- **Batch Processing**: Automatic handling of variable batch sizes
|
||
|
||
#### Real-World Applications You Can Now Build
|
||
Your implementations now enable:
|
||
- **Image Classification**: Multi-layer networks for computer vision
|
||
- **Language Models**: Transformer-style architectures for NLP
|
||
- **Generative Models**: VAEs, GANs, and other generative architectures
|
||
- **Recommendation Systems**: Deep collaborative filtering networks
|
||
|
||
#### Advanced Architecture Patterns Understood
|
||
- **Residual Networks**: Skip connections for very deep networks
|
||
- **Attention Mechanisms**: Query-key-value patterns for transformers
|
||
- **Gated Architectures**: LSTM/GRU-style information flow control
|
||
- **Multi-layer Perceptrons**: Classic feedforward architectures
|
||
|
||
**Key insight**: You can now understand and implement ANY neural network architecture!
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🔬 Integration Test: Layers with Tensors
|
||
|
||
This is our first cumulative integration test.
|
||
It ensures that the 'Layer' abstraction works correctly with the 'Tensor' class from the previous module.
|
||
"""
|
||
|
||
# %%
|
||
def test_module_layer_tensor_integration():
|
||
"""
|
||
Tests that a Tensor can be passed through a Layer subclass
|
||
and that the output is of the correct type and shape.
|
||
"""
|
||
print("🔬 Running Integration Test: Layer with Tensor...")
|
||
|
||
# 1. Define a simple Layer that doubles the input
|
||
class DoubleLayer(Dense): # Inherit from Dense to get __call__
|
||
def forward(self, x: Tensor) -> Tensor:
|
||
return x * 2
|
||
|
||
# 2. Create an instance of the layer
|
||
double_layer = DoubleLayer(input_size=1, output_size=1) # Dummy sizes
|
||
|
||
# 3. Create a Tensor from the previous module
|
||
input_tensor = Tensor([1, 2, 3])
|
||
|
||
# 4. Perform the forward pass
|
||
output_tensor = double_layer(input_tensor)
|
||
|
||
# 5. Assert correctness
|
||
assert isinstance(output_tensor, Tensor), "Output should be a Tensor"
|
||
assert np.array_equal(output_tensor.data, np.array([2, 4, 6])), "Output data is incorrect"
|
||
print("✅ Integration Test Passed: Layer correctly processed Tensor.")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🏗️ ML Systems: Architecture Analysis & Memory Scaling
|
||
|
||
Now that you have working neural network layers, let's develop **architecture analysis skills**. This section teaches you to understand how layer composition affects memory usage, parameter counts, and computational complexity.
|
||
|
||
### **Learning Outcome**: *"I understand how layers combine to create memory pressure and can analyze model architectures"*
|
||
|
||
---
|
||
|
||
## Layer Architecture Profiler (Medium Guided Implementation)
|
||
|
||
As an ML systems engineer, you need to understand how different layer configurations affect system resources. Let's build tools to analyze layer architectures and scaling patterns.
|
||
"""
|
||
|
||
# %%
|
||
import time
|
||
import psutil
|
||
import os
|
||
|
||
class LayerArchitectureProfiler:
|
||
"""
|
||
Architecture analysis toolkit for neural network layers.
|
||
|
||
Helps ML engineers understand memory scaling, parameter counts,
|
||
and computational complexity of different layer configurations.
|
||
"""
|
||
|
||
def __init__(self):
|
||
self.process = psutil.Process(os.getpid())
|
||
self.analysis_cache = {}
|
||
|
||
def analyze_layer_parameters(self, input_size, hidden_size, output_size):
|
||
"""
|
||
Analyze parameter count and memory usage for a layer configuration.
|
||
|
||
TODO: Implement parameter count analysis.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Calculate weight matrix parameters: input_size * hidden_size
|
||
2. Calculate bias parameters: hidden_size
|
||
3. Calculate total parameters: weights + bias
|
||
4. Calculate memory usage: parameters * 4 bytes (float32)
|
||
5. Return analysis dictionary with all metrics
|
||
|
||
EXAMPLE:
|
||
profiler = LayerArchitectureProfiler()
|
||
analysis = profiler.analyze_layer_parameters(784, 128, 10)
|
||
print(f"Parameters: {analysis['total_parameters']:,}")
|
||
print(f"Memory: {analysis['memory_mb']:.2f} MB")
|
||
|
||
HINTS:
|
||
- Weight matrix shape: (input_size, hidden_size)
|
||
- Bias vector shape: (hidden_size,)
|
||
- Float32 = 4 bytes per parameter
|
||
- Convert bytes to MB: bytes / (1024 * 1024)
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Calculate parameters
|
||
weight_params = input_size * hidden_size
|
||
bias_params = hidden_size
|
||
total_params = weight_params + bias_params
|
||
|
||
# Calculate memory (assuming float32 = 4 bytes)
|
||
memory_bytes = total_params * 4
|
||
memory_mb = memory_bytes / (1024 * 1024)
|
||
|
||
return {
|
||
'input_size': input_size,
|
||
'hidden_size': hidden_size,
|
||
'output_size': output_size,
|
||
'weight_parameters': weight_params,
|
||
'bias_parameters': bias_params,
|
||
'total_parameters': total_params,
|
||
'memory_bytes': memory_bytes,
|
||
'memory_mb': memory_mb
|
||
}
|
||
### END SOLUTION
|
||
|
||
def analyze_network_scaling(self, input_size, hidden_sizes, output_size):
|
||
"""
|
||
Analyze how network depth affects parameter count and memory.
|
||
|
||
TODO: Implement network scaling analysis.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Initialize total parameters counter
|
||
2. For each layer in the network:
|
||
a. Calculate layer parameters using analyze_layer_parameters
|
||
b. Add to total count
|
||
c. Update input_size for next layer
|
||
3. Calculate total memory usage
|
||
4. Return comprehensive analysis
|
||
|
||
EXAMPLE:
|
||
profiler = LayerArchitectureProfiler()
|
||
analysis = profiler.analyze_network_scaling(784, [512, 256, 128], 10)
|
||
print(f"Total parameters: {analysis['total_parameters']:,}")
|
||
print(f"Layers: {len(analysis['layer_details'])}")
|
||
|
||
HINTS:
|
||
- Loop through hidden_sizes for each layer
|
||
- Track input_size changes: input → hidden[0] → hidden[1] → ... → output
|
||
- Sum all layer parameters
|
||
- Store per-layer details for analysis
|
||
"""
|
||
### BEGIN SOLUTION
|
||
total_parameters = 0
|
||
layer_details = []
|
||
current_input = input_size
|
||
|
||
# Analyze each hidden layer
|
||
for i, hidden_size in enumerate(hidden_sizes):
|
||
layer_analysis = self.analyze_layer_parameters(current_input, hidden_size, 0)
|
||
layer_analysis['layer_name'] = f'Hidden_{i+1}'
|
||
layer_details.append(layer_analysis)
|
||
total_parameters += layer_analysis['total_parameters']
|
||
current_input = hidden_size
|
||
|
||
# Analyze output layer
|
||
output_analysis = self.analyze_layer_parameters(current_input, output_size, 0)
|
||
output_analysis['layer_name'] = 'Output'
|
||
layer_details.append(output_analysis)
|
||
total_parameters += output_analysis['total_parameters']
|
||
|
||
# Calculate total memory
|
||
total_memory_mb = total_parameters * 4 / (1024 * 1024)
|
||
|
||
return {
|
||
'network_architecture': f"{input_size} → {' → '.join(map(str, hidden_sizes))} → {output_size}",
|
||
'total_parameters': total_parameters,
|
||
'total_memory_mb': total_memory_mb,
|
||
'num_layers': len(hidden_sizes) + 1,
|
||
'layer_details': layer_details
|
||
}
|
||
### END SOLUTION
|
||
|
||
def compare_architectures(self, input_size, architecture_configs, output_size=10):
|
||
"""
|
||
Compare different network architectures for parameter efficiency.
|
||
|
||
This function is PROVIDED to demonstrate architecture analysis.
|
||
Students use it to understand architecture trade-offs.
|
||
"""
|
||
print(f"🏗️ ARCHITECTURE COMPARISON")
|
||
print(f"=" * 50)
|
||
print(f"Input size: {input_size}, Output size: {output_size}")
|
||
|
||
results = {}
|
||
|
||
for arch_name, hidden_sizes in architecture_configs.items():
|
||
analysis = self.analyze_network_scaling(input_size, hidden_sizes, output_size)
|
||
results[arch_name] = analysis
|
||
|
||
print(f"\n📊 {arch_name}:")
|
||
print(f" Architecture: {analysis['network_architecture']}")
|
||
print(f" Parameters: {analysis['total_parameters']:,}")
|
||
print(f" Memory: {analysis['total_memory_mb']:.2f} MB")
|
||
print(f" Layers: {analysis['num_layers']}")
|
||
|
||
# Find most/least parameter efficient
|
||
sorted_by_params = sorted(results.items(), key=lambda x: x[1]['total_parameters'])
|
||
most_efficient = sorted_by_params[0]
|
||
least_efficient = sorted_by_params[-1]
|
||
|
||
print(f"\n🎯 EFFICIENCY ANALYSIS:")
|
||
print(f" Most efficient: {most_efficient[0]} ({most_efficient[1]['total_parameters']:,} params)")
|
||
print(f" Least efficient: {least_efficient[0]} ({least_efficient[1]['total_parameters']:,} params)")
|
||
|
||
efficiency_ratio = least_efficient[1]['total_parameters'] / most_efficient[1]['total_parameters']
|
||
print(f" Parameter difference: {efficiency_ratio:.1f}x")
|
||
|
||
return results
|
||
|
||
def analyze_depth_vs_width_tradeoffs(self, input_size=784, output_size=10):
|
||
"""
|
||
Analyze the classic deep vs wide network trade-off.
|
||
|
||
This function is PROVIDED to show systems thinking.
|
||
Students run it to understand architecture decisions.
|
||
"""
|
||
print(f"🔍 DEPTH vs WIDTH ANALYSIS")
|
||
print(f"=" * 40)
|
||
|
||
# Test different depth vs width configurations
|
||
configurations = {
|
||
'Shallow Wide': [1024], # 1 huge layer
|
||
'Medium Wide': [512, 512], # 2 medium layers
|
||
'Medium Deep': [256, 256, 256], # 3 smaller layers
|
||
'Deep Narrow': [128, 128, 128, 128], # 4 narrow layers
|
||
'Very Deep': [64, 64, 64, 64, 64, 64] # 6 very narrow layers
|
||
}
|
||
|
||
results = {}
|
||
for config_name, hidden_sizes in configurations.items():
|
||
analysis = self.analyze_network_scaling(input_size, hidden_sizes, output_size)
|
||
results[config_name] = analysis
|
||
|
||
# Calculate depth and width metrics
|
||
depth = len(hidden_sizes)
|
||
avg_width = sum(hidden_sizes) / len(hidden_sizes)
|
||
max_width = max(hidden_sizes)
|
||
|
||
print(f"\n{config_name}:")
|
||
print(f" Depth: {depth} layers")
|
||
print(f" Avg width: {avg_width:.0f} neurons")
|
||
print(f" Max width: {max_width} neurons")
|
||
print(f" Parameters: {analysis['total_parameters']:,}")
|
||
print(f" Memory: {analysis['total_memory_mb']:.2f} MB")
|
||
|
||
print(f"\n💡 ARCHITECTURE INSIGHTS:")
|
||
print(f" - Deeper networks: Better representation learning, harder to train")
|
||
print(f" - Wider networks: More capacity per layer, more parameters")
|
||
print(f" - Modern trend: Very deep (100+ layers) with skip connections")
|
||
print(f" - Memory scales with total parameters regardless of arrangement")
|
||
|
||
return results
|
||
|
||
def analyze_famous_architectures():
|
||
"""
|
||
Analyze parameter counts of famous neural network architectures.
|
||
|
||
This function is PROVIDED to connect student work to real systems.
|
||
Shows how layer analysis applies to production models.
|
||
"""
|
||
profiler = LayerArchitectureProfiler()
|
||
|
||
print(f"🌟 FAMOUS ARCHITECTURE ANALYSIS")
|
||
print(f"=" * 50)
|
||
|
||
# Simplified versions of famous architectures
|
||
famous_models = {
|
||
'LeNet-5 (1998)': {
|
||
'description': 'First successful CNN',
|
||
'approx_params': 60_000,
|
||
'era': 'Early deep learning'
|
||
},
|
||
'AlexNet (2012)': {
|
||
'description': 'ImageNet breakthrough',
|
||
'approx_params': 60_000_000,
|
||
'era': 'Deep learning revolution'
|
||
},
|
||
'VGG-16 (2014)': {
|
||
'description': 'Very deep networks',
|
||
'approx_params': 138_000_000,
|
||
'era': 'Going deeper'
|
||
},
|
||
'ResNet-50 (2015)': {
|
||
'description': 'Skip connections enable very deep nets',
|
||
'approx_params': 25_600_000,
|
||
'era': 'Architecture innovation'
|
||
},
|
||
'GPT-3 (2020)': {
|
||
'description': 'Large language model',
|
||
'approx_params': 175_000_000_000,
|
||
'era': 'Scale revolution'
|
||
},
|
||
'GPT-4 (2023)': {
|
||
'description': 'Estimated multimodal model',
|
||
'approx_params': 1_800_000_000_000,
|
||
'era': 'Massive scale'
|
||
}
|
||
}
|
||
|
||
print(f"Model Evolution Over Time:")
|
||
for model_name, info in famous_models.items():
|
||
params = info['approx_params']
|
||
memory_gb = params * 4 / (1024**3) # Rough memory estimate
|
||
|
||
print(f"\n{model_name}:")
|
||
print(f" Parameters: {params:,}")
|
||
print(f" Est. Memory: {memory_gb:.1f} GB")
|
||
print(f" Description: {info['description']}")
|
||
print(f" Era: {info['era']}")
|
||
|
||
# Show scaling progression
|
||
print(f"\n📈 SCALING PROGRESSION:")
|
||
params_1998 = famous_models['LeNet-5 (1998)']['approx_params']
|
||
params_2023 = famous_models['GPT-4 (2023)']['approx_params']
|
||
scaling_factor = params_2023 / params_1998
|
||
|
||
print(f" 1998 → 2023: {scaling_factor:,.0f}x parameter increase")
|
||
print(f" That's about {scaling_factor/1000000:.1f} million times larger!")
|
||
print(f" Memory requirements grew from KB to TB")
|
||
|
||
print(f"\n🎯 SYSTEMS IMPLICATIONS:")
|
||
print(f" - Parameter count directly affects memory requirements")
|
||
print(f" - Larger models need distributed training across multiple GPUs")
|
||
print(f" - Model serving requires careful memory management")
|
||
print(f" - Architecture efficiency becomes crucial at scale")
|
||
|
||
return famous_models
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🎯 Learning Activity 1: Layer Architecture Analysis (Medium Guided Implementation)
|
||
|
||
**Goal**: Learn to analyze neural network architectures and understand how layer configurations affect system resources.
|
||
|
||
Complete the missing implementations in the `LayerArchitectureProfiler` class above, then use your profiler to understand architecture trade-offs.
|
||
"""
|
||
|
||
# Architecture profiler (initialized at module level)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🎯 Learning Activity 2: Architecture Comparison & Analysis (Review & Understand)
|
||
|
||
**Goal**: Compare different network architectures and understand the depth vs width trade-offs that affect production ML systems.
|
||
"""
|
||
|
||
# Architecture analysis functions (called in main block)
|
||
|
||
if __name__ == "__main__":
|
||
# Run all layer tests
|
||
test_unit_matrix_multiplication()
|
||
test_unit_dense_layer()
|
||
test_unit_layer_activation()
|
||
test_module_layer_tensor_integration()
|
||
|
||
# Initialize the layer architecture profiler
|
||
profiler = LayerArchitectureProfiler()
|
||
|
||
print("🏗️ LAYER ARCHITECTURE ANALYSIS")
|
||
print("=" * 50)
|
||
|
||
# Test 1: Single layer analysis
|
||
print("📊 Single Layer Analysis:")
|
||
layer_configs = [
|
||
(784, 128), # MNIST → small hidden
|
||
(784, 512), # MNIST → medium hidden
|
||
(784, 2048), # MNIST → large hidden
|
||
(3072, 1024), # CIFAR-10 → hidden
|
||
]
|
||
|
||
for input_size, hidden_size in layer_configs:
|
||
analysis = profiler.analyze_layer_parameters(input_size, hidden_size, 10)
|
||
print(f" {input_size} → {hidden_size}: {analysis['total_parameters']:,} params, {analysis['memory_mb']:.2f} MB")
|
||
|
||
# Test 2: Network scaling analysis
|
||
print(f"\n🔍 Network Scaling Analysis:")
|
||
network_configs = [
|
||
([128], "Small network"),
|
||
([256, 128], "Medium network"),
|
||
([512, 256, 128], "Large network"),
|
||
([1024, 512, 256, 128], "Very large network")
|
||
]
|
||
|
||
for hidden_sizes, description in network_configs:
|
||
analysis = profiler.analyze_network_scaling(784, hidden_sizes, 10)
|
||
print(f" {description}: {analysis['total_parameters']:,} params, {analysis['total_memory_mb']:.2f} MB")
|
||
|
||
print(f"\n💡 SCALING INSIGHTS:")
|
||
print(f" - Adding layers multiplies parameter count")
|
||
print(f" - First layer often dominates parameter count (large input)")
|
||
print(f" - Memory scales linearly with parameter count")
|
||
print(f" - Architecture choice = resource planning decision")
|
||
|
||
# Compare different architecture strategies
|
||
input_size = 784 # MNIST flattened image
|
||
output_size = 10 # 10 digit classes
|
||
|
||
architecture_configs = {
|
||
'Baseline': [128],
|
||
'Wide Shallow': [512],
|
||
'Narrow Deep': [64, 64, 64],
|
||
'Pyramid': [256, 128, 64],
|
||
'Inverted Pyramid': [64, 128, 256],
|
||
'Bottleneck': [512, 32, 512]
|
||
}
|
||
|
||
# Students use their implemented analysis tools
|
||
comparison_results = profiler.compare_architectures(input_size, architecture_configs, output_size)
|
||
|
||
# Analyze depth vs width trade-offs
|
||
depth_width_results = profiler.analyze_depth_vs_width_tradeoffs(input_size, output_size)
|
||
|
||
# Connect to famous architectures
|
||
famous_analysis = analyze_famous_architectures()
|
||
|
||
print(f"\n🎯 KEY LEARNINGS FOR ML SYSTEMS ENGINEERS:")
|
||
print(f"=" * 55)
|
||
|
||
print(f"\n1. 📊 PARAMETER SCALING:")
|
||
print(f" First layer dominates: input_size × hidden_size")
|
||
print(f" Layer composition multiplies parameter count")
|
||
print(f" Memory = parameters × 4 bytes (float32)")
|
||
|
||
print(f"\n2. 🏗️ ARCHITECTURE STRATEGIES:")
|
||
print(f" Wide networks: More capacity, more parameters")
|
||
print(f" Deep networks: Better representations, harder training")
|
||
print(f" Bottlenecks: Compress then expand information")
|
||
|
||
print(f"\n3. 🚀 PRODUCTION IMPLICATIONS:")
|
||
print(f" Parameter count = memory requirements")
|
||
print(f" Model serving: Load entire model into memory")
|
||
print(f" Training: Need 2-3x model size for gradients/optimizer")
|
||
|
||
print(f"\n4. 💰 COST IMPLICATIONS:")
|
||
print(f" More parameters = larger cloud instances needed")
|
||
print(f" GPU memory limits determine maximum model size")
|
||
print(f" Distributed training costs scale with model size")
|
||
|
||
print(f"\n💡 SYSTEMS ENGINEERING INSIGHT:")
|
||
print(f"Every layer you add is a resource planning decision:")
|
||
print(f"- More layers = more memory = higher cloud costs")
|
||
print(f"- Architecture efficiency matters at production scale")
|
||
print(f"- Understanding parameter scaling helps optimize deployments")
|
||
|
||
print("All tests passed!")
|
||
print("Layers module complete!")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🤔 ML Systems Thinking: Interactive Questions
|
||
|
||
Now that you've built the fundamental building blocks of neural networks, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how layer abstractions scale to production ML environments.
|
||
|
||
Take time to reflect thoughtfully on each question - your insights will help you understand how the layer concepts you've implemented connect to real-world ML systems engineering.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 1: Parameter Management and Memory Optimization
|
||
|
||
**Context**: Your Dense layer implementation stores weights and biases as Tensor objects with specific initialization strategies. In production ML systems with billions of parameters, efficient parameter management becomes critical for memory usage, training speed, and model deployment.
|
||
|
||
**Reflection Question**: Design a parameter management system for large-scale neural networks that optimizes memory usage and supports efficient distributed training. How would you handle parameter initialization strategies for networks with hundreds of layers, implement parameter sharing across layers, and manage memory-efficient storage for billion-parameter models? Consider scenarios where memory constraints force trade-offs between model capacity and computational efficiency.
|
||
|
||
Think about: parameter quantization, memory pooling, distributed parameter storage, and initialization strategies that maintain numerical stability across very deep networks.
|
||
|
||
*Target length: 150-300 words*
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "question-1-parameter-management", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
|
||
"""
|
||
YOUR REFLECTION ON PARAMETER MANAGEMENT AND MEMORY OPTIMIZATION:
|
||
|
||
TODO: Replace this text with your thoughtful response about parameter management system design.
|
||
|
||
Consider addressing:
|
||
- How would you optimize parameter storage and memory usage for billion-parameter models?
|
||
- What strategies would you use for parameter initialization in very deep networks?
|
||
- How would you implement parameter sharing and distributed storage efficiently?
|
||
- What role would quantization play in your parameter management system?
|
||
- How would you balance memory constraints with model capacity requirements?
|
||
|
||
Write a technical analysis connecting your layer implementations to real parameter management challenges.
|
||
|
||
GRADING RUBRIC (Instructor Use):
|
||
- Demonstrates understanding of large-scale parameter management challenges (3 points)
|
||
- Addresses memory optimization and distributed storage strategies (3 points)
|
||
- Shows practical knowledge of initialization and quantization techniques (2 points)
|
||
- Demonstrates systems thinking about memory vs capacity trade-offs (2 points)
|
||
- Clear technical reasoning and practical considerations (bonus points for innovative approaches)
|
||
"""
|
||
|
||
### BEGIN SOLUTION
|
||
# Student response area - instructor will replace this section during grading setup
|
||
# This is a manually graded question requiring technical analysis of parameter management
|
||
# Students should demonstrate understanding of memory optimization and distributed parameter storage
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 2: Abstraction Design and Framework Integration
|
||
|
||
**Context**: Your layer implementation provides a clean abstraction that separates computation from parameter storage. Production ML frameworks must balance abstraction simplicity with performance optimization, automatic differentiation support, and hardware acceleration capabilities.
|
||
|
||
**Reflection Question**: Architect a layer abstraction system that enables both research flexibility and production optimization. How would you design layer interfaces that support automatic differentiation, enable fusion with other operations for performance, and maintain compatibility across different execution backends (CPU, GPU, TPU)? Consider the challenge of providing high-level abstractions while allowing low-level optimization for specific hardware platforms.
|
||
|
||
Think about: API design principles, automatic differentiation integration, operation fusion opportunities, and backend abstraction strategies.
|
||
|
||
*Target length: 150-300 words*
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "question-2-abstraction-design", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
|
||
"""
|
||
YOUR REFLECTION ON ABSTRACTION DESIGN AND FRAMEWORK INTEGRATION:
|
||
|
||
TODO: Replace this text with your thoughtful response about layer abstraction system design.
|
||
|
||
Consider addressing:
|
||
- How would you design layer abstractions that balance simplicity with optimization potential?
|
||
- What strategies would you use to integrate layers with automatic differentiation systems?
|
||
- How would you enable operation fusion while maintaining clean abstractions?
|
||
- What role would backend abstraction play in supporting multiple hardware platforms?
|
||
- How would you maintain API compatibility while enabling hardware-specific optimizations?
|
||
|
||
Write an architectural analysis connecting your layer abstractions to real framework design challenges.
|
||
|
||
GRADING RUBRIC (Instructor Use):
|
||
- Shows understanding of abstraction design principles for ML systems (3 points)
|
||
- Designs practical approaches to automatic differentiation integration (3 points)
|
||
- Addresses performance optimization and hardware abstraction (2 points)
|
||
- Demonstrates systems thinking about framework architecture (2 points)
|
||
- Clear architectural reasoning with framework insights (bonus points for comprehensive understanding)
|
||
"""
|
||
|
||
### BEGIN SOLUTION
|
||
# Student response area - instructor will replace this section during grading setup
|
||
# This is a manually graded question requiring understanding of framework abstraction design
|
||
# Students should demonstrate knowledge of balancing simplicity with optimization potential
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 3: Initialization Strategies and Training Stability
|
||
|
||
**Context**: Your Dense layer uses Xavier/Glorot initialization to maintain proper signal propagation through networks. In production training of very deep networks (hundreds of layers), initialization strategies become critical for training stability, convergence speed, and final model performance.
|
||
|
||
**Reflection Question**: Design an advanced initialization system for training ultra-deep neural networks that ensures stable gradient flow and optimal convergence. How would you adapt initialization strategies for different layer types, handle initialization in networks with skip connections and attention mechanisms, and implement dynamic initialization that adapts during training? Consider scenarios where poor initialization causes training failures in expensive large-scale experiments.
|
||
|
||
Think about: layer-specific initialization, residual connection handling, attention mechanism initialization, and adaptive initialization techniques.
|
||
|
||
*Target length: 150-300 words*
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "question-3-initialization-strategies", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
|
||
"""
|
||
YOUR REFLECTION ON INITIALIZATION STRATEGIES AND TRAINING STABILITY:
|
||
|
||
TODO: Replace this text with your thoughtful response about advanced initialization system design.
|
||
|
||
Consider addressing:
|
||
- How would you design initialization strategies for different types of layers and architectures?
|
||
- What approaches would you use to ensure stable gradient flow in ultra-deep networks?
|
||
- How would you handle initialization for complex architectures with skip connections and attention?
|
||
- What role would adaptive initialization play in improving training stability?
|
||
- How would you prevent initialization-related training failures in large-scale experiments?
|
||
|
||
Write a design analysis connecting your initialization implementations to real training stability challenges.
|
||
|
||
GRADING RUBRIC (Instructor Use):
|
||
- Understands initialization impact on training stability and gradient flow (3 points)
|
||
- Designs practical approaches to layer-specific and adaptive initialization (3 points)
|
||
- Addresses complex architecture initialization challenges (2 points)
|
||
- Shows systems thinking about training optimization and stability (2 points)
|
||
- Clear design reasoning with initialization optimization insights (bonus points for deep understanding)
|
||
"""
|
||
|
||
### BEGIN SOLUTION
|
||
# Student response area - instructor will replace this section during grading setup
|
||
# This is a manually graded question requiring understanding of initialization strategies and training stability
|
||
# Students should demonstrate knowledge of gradient flow and initialization optimization challenges
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🎯 MODULE SUMMARY: Neural Network Layers - Foundation of All AI
|
||
|
||
🎉 **CONGRATULATIONS!** You've just mastered the mathematical and computational foundation of ALL modern artificial intelligence!
|
||
|
||
### What You've Accomplished: A Complete AI Foundation
|
||
|
||
#### ✅ Mathematical Mastery
|
||
- **Matrix Multiplication Engine**: The core operation powering every neural network
|
||
- **Dense Layer Implementation**: The universal building block of all AI systems
|
||
- **Universal Function Approximation**: Understanding how layer+activation enables learning ANY pattern
|
||
- **Weight Initialization Science**: Xavier/Glorot strategies for stable training
|
||
|
||
#### ✅ Implementation Excellence
|
||
- **Production-Grade Code**: Your implementations match PyTorch and TensorFlow standards
|
||
- **Shape Management Mastery**: Automatic batch processing and broadcasting
|
||
- **Error Handling**: Robust validation and meaningful error messages
|
||
- **Integration Ready**: Seamless compatibility with Tensor and Activation modules
|
||
|
||
#### ✅ Real-World Architecture Understanding
|
||
- **Multi-Layer Perceptrons**: Classic feedforward architectures
|
||
- **Residual Networks**: Skip connections for ultra-deep networks
|
||
- **Attention Mechanisms**: The foundation of transformers and GPT models
|
||
- **Generative Architectures**: VAEs, GANs, and modern generative AI
|
||
|
||
### Deep Mathematical Concepts Mastered
|
||
|
||
#### Linear Algebra Foundations
|
||
```
|
||
Matrix Multiplication: C = A @ B
|
||
Dense Layer: y = xW + b
|
||
Universal Approximation: f(x) = activation_n(...activation_1(x @ W_1 + b_1)...)
|
||
```
|
||
|
||
#### Parameter Learning Theory
|
||
- **Initialization Strategies**: Why random weights break symmetry
|
||
- **Gradient Flow**: How learning signals propagate through networks
|
||
- **Batch Processing**: Vectorized operations for computational efficiency
|
||
- **Broadcasting**: Automatic shape handling for different tensor dimensions
|
||
|
||
#### Architecture Design Principles
|
||
- **Width vs Depth**: Trade-offs in network architecture
|
||
- **Activation Selection**: Choosing the right nonlinearity for each layer
|
||
- **Skip Connections**: Enabling ultra-deep networks with residual learning
|
||
- **Attention Patterns**: Query-key-value mechanisms for sequence modeling
|
||
|
||
### Real-World Impact: What You Can Now Build
|
||
|
||
#### 🖼️ Computer Vision
|
||
```python
|
||
# Image classification with your Dense layers
|
||
image → flatten → dense(784→512) → relu → dense(512→256) → relu → dense(256→10) → softmax
|
||
```
|
||
- **Object Recognition**: Classify images into thousands of categories
|
||
- **Medical Imaging**: Detect diseases from X-rays and MRI scans
|
||
- **Autonomous Vehicles**: Recognize traffic signs and pedestrians
|
||
|
||
#### 🗣️ Natural Language Processing
|
||
```python
|
||
# Language model with your Dense layers
|
||
text → embed → dense(300→128) → tanh → dense(128→vocab) → softmax
|
||
```
|
||
- **Language Models**: Build GPT-style text generation systems
|
||
- **Machine Translation**: Translate between any pair of languages
|
||
- **Sentiment Analysis**: Understand emotional content in text
|
||
|
||
#### 🎯 Recommendation Systems
|
||
```python
|
||
# Collaborative filtering with your Dense layers
|
||
user_features → dense(1000→256) → relu → dense(256→items) → sigmoid
|
||
```
|
||
- **Netflix Recommendations**: Predict what movies users will enjoy
|
||
- **E-commerce**: Suggest products based on browsing history
|
||
- **Social Media**: Recommend friends and content
|
||
|
||
#### 🧪 Scientific AI
|
||
```python
|
||
# Physics simulation with your Dense layers
|
||
parameters → dense(10→64) → relu → dense(64→64) → relu → dense(64→1) → output
|
||
```
|
||
- **Drug Discovery**: Predict molecular properties for new medicines
|
||
- **Climate Modeling**: Simulate complex atmospheric phenomena
|
||
- **Materials Science**: Design new materials with desired properties
|
||
|
||
### Connection to Advanced AI Systems
|
||
|
||
#### 🤖 Large Language Models (GPT, ChatGPT)
|
||
```python
|
||
# Every transformer layer uses YOUR Dense implementation
|
||
attention_output → dense(hidden→hidden) → relu → dense(hidden→hidden)
|
||
```
|
||
Your Dense layers power the feed-forward networks in every transformer!
|
||
|
||
#### 🎨 Generative AI (DALL-E, Stable Diffusion)
|
||
```python
|
||
# Generative models built on YOUR foundation
|
||
noise → dense(100→256) → relu → dense(256→784) → sigmoid → image
|
||
```
|
||
Your layers enable the neural networks that create art and images!
|
||
|
||
#### 🎮 Reinforcement Learning (AlphaGo, game AI)
|
||
```python
|
||
# Policy networks use YOUR Dense layers
|
||
game_state → dense(board→256) → relu → dense(256→actions) → softmax
|
||
```
|
||
Your implementation enables AI that masters complex games!
|
||
|
||
### Professional Skills Developed
|
||
|
||
#### 🏗️ Software Engineering
|
||
- **Clean Code**: Well-documented, readable implementations
|
||
- **Testing**: Comprehensive validation of functionality
|
||
- **API Design**: Consistent, intuitive interfaces
|
||
- **Error Handling**: Graceful failure modes with helpful messages
|
||
|
||
#### 🧮 Mathematical Computing
|
||
- **Numerical Stability**: Proper initialization and scaling
|
||
- **Performance Optimization**: Understanding computational complexity
|
||
- **Memory Management**: Efficient tensor operations
|
||
- **Debugging**: Systematic approaches to shape and gradient issues
|
||
|
||
#### 🔬 Machine Learning Engineering
|
||
- **Architecture Design**: Knowing when to use which layer types
|
||
- **Hyperparameter Selection**: Understanding initialization and activation choices
|
||
- **Gradient Flow**: Designing networks for stable training
|
||
- **Production Deployment**: Building scalable, maintainable systems
|
||
|
||
### Industry-Standard Implementation Quality
|
||
|
||
#### Production System Equivalence
|
||
```python
|
||
# Your implementation
|
||
layer = Dense(input_size=784, output_size=10)
|
||
output = layer(input)
|
||
|
||
# PyTorch equivalent
|
||
layer = torch.nn.Linear(784, 10)
|
||
output = layer(input)
|
||
|
||
# TensorFlow equivalent
|
||
layer = tf.keras.layers.Dense(10)
|
||
output = layer(input)
|
||
|
||
# IDENTICAL MATHEMATICAL OPERATIONS!
|
||
```
|
||
|
||
#### Performance Considerations
|
||
- **Computational Complexity**: O(batch_size × input_size × output_size)
|
||
- **Memory Usage**: Optimal tensor storage and reuse
|
||
- **GPU Acceleration**: Foundation for hardware optimization
|
||
- **Distributed Computing**: Basis for multi-device training
|
||
|
||
### Advanced Topics You're Now Ready For
|
||
|
||
#### 🧠 Specialized Architectures
|
||
- **Convolutional Networks**: For image and spatial data processing
|
||
- **Recurrent Networks**: For sequential data and time series
|
||
- **Graph Neural Networks**: For structured data and relationships
|
||
- **Transformer Architectures**: For attention-based modeling
|
||
|
||
#### 🎯 Advanced Training Techniques
|
||
- **Batch Normalization**: Stabilizing training in deep networks
|
||
- **Dropout Regularization**: Preventing overfitting
|
||
- **Learning Rate Scheduling**: Optimizing convergence
|
||
- **Transfer Learning**: Adapting pre-trained models
|
||
|
||
#### 🚀 Cutting-Edge Research
|
||
- **Neural Architecture Search**: Automatically designing networks
|
||
- **Meta-Learning**: Learning to learn new tasks quickly
|
||
- **Federated Learning**: Training across distributed devices
|
||
- **Quantum Neural Networks**: Quantum computing + neural networks
|
||
|
||
### Your Neural Network Toolkit
|
||
|
||
You now have the complete foundation to understand and implement:
|
||
|
||
```python
|
||
# ANY neural network architecture can be built with your components!
|
||
|
||
def your_neural_network(x):
|
||
# Foundation layers (YOUR implementation)
|
||
h1 = relu(dense1(x))
|
||
h2 = relu(dense2(h1))
|
||
|
||
# Advanced patterns (built on YOUR foundation)
|
||
attention = attention_layer(h2)
|
||
residual = h2 + attention
|
||
|
||
# Output (YOUR implementation)
|
||
output = softmax(dense_output(residual))
|
||
return output
|
||
```
|
||
|
||
### Next Steps: Continue Your AI Journey
|
||
|
||
#### 🔧 Module 5: Convolutional Layers
|
||
Build specialized layers for image processing and computer vision
|
||
|
||
#### 📊 Module 6: Optimization
|
||
Implement gradient descent and advanced optimization algorithms
|
||
|
||
#### 🔄 Module 7: Training Loops
|
||
Create complete training and validation pipelines
|
||
|
||
#### 🌐 Module 8: Advanced Architectures
|
||
Build transformers, ResNets, and state-of-the-art models
|
||
|
||
### The Bigger Picture: Your Impact on AI
|
||
|
||
**You now understand the mathematical foundation of:**
|
||
- Every neural network ever created
|
||
- All modern AI systems (GPT, DALL-E, AlphaGo, etc.)
|
||
- The core operations that power trillion-dollar AI companies
|
||
- The building blocks enabling the current AI revolution
|
||
|
||
**Your layer implementations:**
|
||
- Are mathematically equivalent to production systems
|
||
- Form the foundation of all advanced architectures
|
||
- Enable you to contribute to cutting-edge AI research
|
||
- Provide the knowledge to build the next generation of AI systems
|
||
|
||
### 🌟 **You Are Now a Neural Network Architect!**
|
||
|
||
With your deep understanding of layers, you can:
|
||
- **Understand** any neural network architecture
|
||
- **Implement** custom layer types for new applications
|
||
- **Debug** training issues in complex models
|
||
- **Optimize** networks for production deployment
|
||
- **Research** novel architectures for unsolved problems
|
||
|
||
**Welcome to the community of AI builders! Your journey to mastering neural networks is well underway.**
|
||
|
||
---
|
||
|
||
*"Every expert was once a beginner. Every pro was once an amateur. Every icon was once an unknown." - Robin Sharma*
|
||
|
||
**You've built the foundation. Now go build the future of AI!** 🚀
|
||
"""
|
||
|
||
if __name__ == "__main__":
|
||
# Run all layer tests
|
||
test_unit_matrix_multiplication()
|
||
test_unit_dense_layer()
|
||
test_unit_layer_activation()
|
||
test_module_layer_tensor_integration()
|
||
|
||
# Initialize the layer architecture profiler
|
||
profiler = LayerArchitectureProfiler()
|
||
|
||
print("🏗️ LAYER ARCHITECTURE ANALYSIS")
|
||
print("=" * 50)
|
||
|
||
# Test 1: Single layer analysis
|
||
print("📊 Single Layer Analysis:")
|
||
layer_configs = [
|
||
(784, 128), # MNIST → small hidden
|
||
(784, 512), # MNIST → medium hidden
|
||
(784, 2048), # MNIST → large hidden
|
||
(3072, 1024), # CIFAR-10 → hidden
|
||
]
|
||
|
||
for input_size, hidden_size in layer_configs:
|
||
analysis = profiler.analyze_layer_parameters(input_size, hidden_size, 10)
|
||
print(f" {input_size} → {hidden_size}: {analysis['total_parameters']:,} params, {analysis['memory_mb']:.2f} MB")
|
||
|
||
# Test 2: Network scaling analysis
|
||
print(f"\n🔍 Network Scaling Analysis:")
|
||
network_configs = [
|
||
([128], "Small network"),
|
||
([256, 128], "Medium network"),
|
||
([512, 256, 128], "Large network"),
|
||
([1024, 512, 256, 128], "Very large network")
|
||
]
|
||
|
||
for hidden_sizes, description in network_configs:
|
||
analysis = profiler.analyze_network_scaling(784, hidden_sizes, 10)
|
||
print(f" {description}: {analysis['total_parameters']:,} params, {analysis['total_memory_mb']:.2f} MB")
|
||
|
||
print(f"\n💡 SCALING INSIGHTS:")
|
||
print(f" - Adding layers multiplies parameter count")
|
||
print(f" - First layer often dominates parameter count (large input)")
|
||
print(f" - Memory scales linearly with parameter count")
|
||
print(f" - Architecture choice = resource planning decision")
|
||
|
||
# Compare different architecture strategies
|
||
input_size = 784 # MNIST flattened image
|
||
output_size = 10 # 10 digit classes
|
||
|
||
architecture_configs = {
|
||
'Baseline': [128],
|
||
'Wide Shallow': [512],
|
||
'Narrow Deep': [64, 64, 64],
|
||
'Pyramid': [256, 128, 64],
|
||
'Inverted Pyramid': [64, 128, 256],
|
||
'Bottleneck': [512, 32, 512]
|
||
}
|
||
|
||
# Students use their implemented analysis tools
|
||
comparison_results = profiler.compare_architectures(input_size, architecture_configs, output_size)
|
||
|
||
# Analyze depth vs width trade-offs
|
||
depth_width_results = profiler.analyze_depth_vs_width_tradeoffs(input_size, output_size)
|
||
|
||
# Connect to famous architectures
|
||
famous_analysis = analyze_famous_architectures()
|
||
|
||
print(f"\n🎯 KEY LEARNINGS FOR ML SYSTEMS ENGINEERS:")
|
||
print(f"=" * 55)
|
||
|
||
print(f"\n1. 📊 PARAMETER SCALING:")
|
||
print(f" First layer dominates: input_size × hidden_size")
|
||
print(f" Layer composition multiplies parameter count")
|
||
print(f" Memory = parameters × 4 bytes (float32)")
|
||
|
||
print(f"\n2. 🏗️ ARCHITECTURE STRATEGIES:")
|
||
print(f" Wide networks: More capacity, more parameters")
|
||
print(f" Deep networks: Better representations, harder training")
|
||
print(f" Bottlenecks: Compress then expand information")
|
||
|
||
print(f"\n3. 🚀 PRODUCTION IMPLICATIONS:")
|
||
print(f" Parameter count = memory requirements")
|
||
print(f" Model serving: Load entire model into memory")
|
||
print(f" Training: Need 2-3x model size for gradients/optimizer")
|
||
|
||
print(f"\n4. 💰 COST IMPLICATIONS:")
|
||
print(f" More parameters = larger cloud instances needed")
|
||
print(f" GPU memory limits determine maximum model size")
|
||
print(f" Distributed training costs scale with model size")
|
||
|
||
print(f"\n💡 SYSTEMS ENGINEERING INSIGHT:")
|
||
print(f"Every layer you add is a resource planning decision:")
|
||
print(f"- More layers = more memory = higher cloud costs")
|
||
print(f"- Architecture efficiency matters at production scale")
|
||
print(f"- Understanding parameter scaling helps optimize deployments")
|
||
|
||
print("All tests passed!")
|
||
print("Layers module complete!") |