From bd0aa792bc4036cc26289935e5b8dd9e6a26fb63 Mon Sep 17 00:00:00 2001
From: Vijay Janapa Reddi
Date: Sat, 12 Jul 2025 21:13:52 -0400
Subject: [PATCH] Enhance networks module with comprehensive composition theory

- Added detailed mathematical foundation of function composition
- Enhanced architectural design principles (depth vs width trade-offs)
- Included real-world architecture examples (MLP, CNN, RNN, Transformer)
- Comprehensive network design process and optimization considerations
- Performance characteristics and scaling laws
- Connection to deep learning revolution and hierarchical feature learning
- Better integration with previous modules (tensor, activations, layers)
---
 modules/source/04_networks/networks_dev.py | 196 ++++++++++++++++++---
 1 file changed, 174 insertions(+), 22 deletions(-)

diff --git a/modules/source/04_networks/networks_dev.py b/modules/source/04_networks/networks_dev.py
index 564867d8..ba6e908b 100644
--- a/modules/source/04_networks/networks_dev.py
+++ b/modules/source/04_networks/networks_dev.py
@@ -150,33 +150,185 @@ A **network** is a composition of layers that transforms input data into output
 Input → Layer1 → Layer2 → Layer3 → Output
 ```
 
-### Why Networks Matter
-- **Function composition**: Complex behavior from simple building blocks
-- **Learnable parameters**: Each layer has weights that can be learned
-- **Architecture design**: Different layouts solve different problems
-- **Real-world applications**: Classification, regression, generation, etc.
+### The Mathematical Foundation: Function Composition Theory
 
-### The Fundamental Insight
-**Neural networks are just function composition!**
-- Each layer is a function: `f_i(x)`
-- The network is: `f(x) = f_n(...f_2(f_1(x)))`
-- Complex behavior emerges from simple building blocks
+#### **Function Composition in Mathematics**
+In mathematics, function composition combines simple functions to create complex ones:
 
-### Real-World Examples
-- **MLP (Multi-Layer Perceptron)**: Classic feedforward network
-- **CNN (Convolutional Neural Network)**: For image processing
-- **RNN (Recurrent Neural Network)**: For sequential data
-- **Transformer**: For attention-based processing
+```python
+# Mathematical composition: (f ∘ g)(x) = f(g(x))
+def compose(f, g):
+    return lambda x: f(g(x))
 
-### Visual Intuition
-```
-Input: [1, 2, 3] (3 features)
-Layer1: [1.4, 2.8] (linear transformation)
-Layer2: [1.4, 2.8] (nonlinearity)
-Layer3: [0.7] (final prediction)
+from functools import reduce  # fold for h(x) = f_n(f_{n-1}(...f_2(f_1(x))))
+def network(layers):
+    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)
 ```
 
-Let's start by building the most fundamental network: **Sequential**.
+#### **Why Composition is Powerful**
+1. **Modularity**: Each layer has a specific, well-defined purpose
+2. **Composability**: Simple functions combine to create complex behaviors
+3. **Hierarchical learning**: Early layers learn simple features, later layers learn complex patterns
+4. **Universal approximation**: Sufficiently large networks with nonlinear activations can approximate any continuous function on a compact domain
+
+#### **The Emergence of Intelligence**
+Complex behavior emerges from simple layer composition:
+
+```python
+# Example: Image classification
+raw_pixels → [Edge detectors] → [Shape detectors] → [Object detectors] → [Class predictor]
+    ↓              ↓                   ↓                    ↓                   ↓
+ [28x28]     [64 features]      [128 features]       [256 features]       [10 classes]
+```
+
+### Architectural Design Principles
+
+#### **1. Depth vs. Width Trade-offs**
+- **Deep networks**: More layers → more complex representations
+  - **Advantages**: Better feature hierarchies, parameter efficiency
+  - **Disadvantages**: Harder to train, vanishing/exploding gradients
+- **Wide networks**: More neurons per layer → more capacity per layer
+  - **Advantages**: Easier to train, parallel computation
+  - **Disadvantages**: More parameters, potential overfitting
+
+#### **2. Information Flow Patterns**
+```python
+# Sequential flow (what we're building):
+x → layer1 → layer2 → layer3 → output
+
+# Residual flow (advanced): the input x is added back after layer2
+x → layer1 → layer2 → (+ x) → layer3 → output
+
+# Attention flow (transformers):
+x → attention(x, x, x) → feedforward → output
+```
+
+#### **3. Activation Function Placement**
+```python
+# Standard pattern:
+linear_transformation → nonlinear_activation → next_layer
+
+# Why this works:
+# Linear + Linear = Linear (no increase in expressiveness)
+# Linear + Nonlinear + Linear = Nonlinear (can represent functions no single linear map can)
+```
+
+### Real-World Architecture Examples
+
+#### **Multi-Layer Perceptron (MLP)**
+```python
+# Classic feedforward network
+input → dense(512) → relu → dense(256) → relu → dense(10) → softmax
+```
+- **Use cases**: Tabular data, feature learning, classification
+- **Strengths**: Universal approximation, well-understood
+- **Weaknesses**: Doesn't exploit spatial/temporal structure
+
+#### **Convolutional Neural Network (CNN)**
+```python
+# Exploits spatial structure
+input → conv2d → relu → pool → conv2d → relu → pool → dense → softmax
+```
+- **Use cases**: Image processing, computer vision
+- **Strengths**: Translation equivariance (approximate invariance with pooling), parameter sharing
+- **Weaknesses**: Fixed receptive field, not great for sequences
+
+#### **Recurrent Neural Network (RNN)**
+```python
+# Processes sequences
+input_t → rnn_cell(hidden_{t-1}) → hidden_t → output_t
+```
+- **Use cases**: Natural language processing, time series
+- **Strengths**: Variable length sequences, memory
+- **Weaknesses**: Sequential computation, vanishing/exploding gradients
+
+#### **Transformer**
+```python
+# Attention-based processing
+input → attention → feedforward → attention → feedforward → output
+```
+- **Use cases**: Language models, machine translation
+- **Strengths**: Parallelizable, long-range dependencies
+- **Weaknesses**: Attention cost quadratic in sequence length, large memory requirements
+
+### The Network Design Process
+
+#### **1. Problem Analysis**
+- **Data type**: Images, text, tabular, time series?
+- **Task type**: Classification, regression, generation?
+- **Constraints**: Latency, memory, accuracy requirements?
+
+#### **2. Architecture Selection**
+- **Start simple**: Begin with basic MLP
+- **Add structure**: Incorporate domain-specific inductive biases
+- **Scale up**: Increase depth/width as needed
+
+#### **3. Component Design**
+- **Input layer**: Match data dimensions
+- **Hidden layers**: Gradual dimension reduction is typical
+- **Output layer**: Match task requirements (classes, regression targets)
+- **Activation functions**: ReLU for hidden, task-specific for output
+
+#### **4. Optimization Considerations**
+- **Gradient flow**: Ensure gradients can flow through the network
+- **Computational efficiency**: Balance expressiveness with speed
+- **Memory usage**: Consider intermediate activation storage
+
+### Performance Characteristics
+
+#### **Forward Pass Complexity**
+For a network with L layers, each with n neurons:
+- **Time complexity**: O(L × n²) per sample for dense layers
+- **Space complexity**: O(L × n) per sample for activations, if every layer's output is kept
+- **Parallelization**: Work within a layer parallelizes; the layers themselves run in sequence
+
+#### **Memory Management**
+```python
+# Activation memory during forward pass (element counts; multiply by dtype size for bytes):
+input_memory = batch_size * input_size
+hidden_memory = batch_size * hidden_size * num_layers
+output_memory = batch_size * output_size
+total_memory = input_memory + hidden_memory + output_memory
+```
+
+#### **Computational Optimization**
+- **Batch processing**: Process multiple samples simultaneously
+- **Vectorization**: Use optimized matrix operations
+- **Hardware acceleration**: Leverage GPUs/TPUs for parallel computation
+
+### Connection to Previous Modules
+
+#### **From Module 1 (Tensor)**
+- **Data flow**: Tensors flow through the network
+- **Shape management**: Ensure compatible dimensions between layers
+
+#### **From Module 2 (Activations)**
+- **Nonlinearity**: Activation functions between layers enable complex learning
+- **Function choice**: Different activations for different purposes
+
+#### **From Module 3 (Layers)**
+- **Building blocks**: Layers are the fundamental components
+- **Composition**: Networks compose layers into complete architectures
+
+### Why Networks Matter: The Scaling Laws
+
+#### **Empirical Observations**
+- **More parameters**: Generally better performance (up to a point)
+- **More data**: Enables training of larger networks
+- **More compute**: Allows exploration of larger architectures
+
+#### **The Deep Learning Revolution**
+```python
+# Pre-2012: Shallow networks
+input → hidden(100) → output
+
+# Post-2012: Deep networks
+input → hidden(512) → hidden(512) → hidden(512) → ... → output
+```
+
+The key insight: **Depth enables hierarchical feature learning**
+
+Let's start building our Sequential network architecture!
 """
 
 # %% nbgrader={"grade": false, "grade_id": "sequential-class", "locked": false, "schema_version": 3, "solution": true, "task": false}