--- title: "Layers" description: "Build the fundamental neural network building blocks: Linear layers with weight initialization and Dropout for regularization" difficulty: "⭐⭐" time_estimate: "4-5 hours" prerequisites: ["01_tensor", "02_activations"] next_steps: ["04_losses"] learning_objectives: - "Understand layer abstractions as composable transformations" - "Implement Linear layers with Xavier initialization" - "Build Dropout regularization for preventing overfitting" - "Master parameter management for gradient-based training" - "Compose layers into multi-layer architectures" --- # 03. Layers **FOUNDATION TIER** | Difficulty: ⭐⭐ (2/4) | Time: 4-5 hours ## Overview Build the fundamental building blocks that compose into neural networks. This module teaches you that layers are simply functions that transform tensors, with learnable parameters that define the transformation. You'll implement Linear layers (the workhorse of deep learning) and Dropout regularization, understanding how these simple abstractions enable arbitrarily complex architectures through composition. ## Learning Objectives By the end of this module, you will be able to: - **Understand Layer Abstraction**: Recognize layers as composable functions with parameters, mirroring PyTorch's `torch.nn.Module` design pattern - **Implement Linear Transformations**: Build `y = xW + b` with proper Xavier initialization to prevent gradient vanishing/explosion - **Master Parameter Management**: Track trainable parameters using `parameters()` method for optimizer integration - **Build Dropout Regularization**: Implement training/inference mode switching with proper scaling to prevent overfitting - **Analyze Memory Scaling**: Calculate parameter counts and understand how network architecture affects memory footprint ## Build → Use → Reflect This module follows TinyTorch's **Build → Use → Reflect** framework: 1. **Build**: Implement Linear and Dropout layer classes with proper initialization, forward passes, and parameter tracking 2. **Use**: Compose layers manually to create multi-layer networks for MNIST digit classification 3. **Reflect**: Analyze memory scaling, computational complexity, and the trade-offs between model capacity and efficiency ## Implementation Guide ### Linear Layer: The Neural Network Workhorse The Linear layer implements the fundamental transformation `y = xW + b`: ```python from tinytorch.core.layers import Linear # Create a linear transformation: 784 input features → 256 output features layer = Linear(784, 256) # Forward pass: transform input batch x = Tensor(np.random.randn(32, 784)) # 32 images, 784 pixels each y = layer(x) # Output: (32, 256) # Access trainable parameters print(f"Weight shape: {layer.weight.shape}") # (784, 256) print(f"Bias shape: {layer.bias.shape}") # (256,) print(f"Total params: {784 * 256 + 256}") # 200,960 parameters ``` **Key Design Decisions:** - **Xavier Initialization**: Weights scaled by `sqrt(1/in_features)` to maintain gradient flow through deep networks - **Parameter Tracking**: `parameters()` method returns list of tensors with `requires_grad=True` for optimizer compatibility - **Bias Handling**: Optional bias parameter (`bias=False` for architectures like batch normalization) ### Dropout: Preventing Overfitting Dropout randomly zeros elements during training to force network robustness: ```python from tinytorch.core.layers import Dropout # Create dropout with 50% probability dropout = Dropout(p=0.5) x = Tensor([1.0, 2.0, 3.0, 4.0]) # Training mode: randomly zero elements and scale by 1/(1-p) y_train = dropout(x, training=True) # Example output: [2.0, 0.0, 6.0, 0.0] - survivors scaled by 2.0 # Inference mode: pass through unchanged y_eval = dropout(x, training=False) # Output: [1.0, 2.0, 3.0, 4.0] - no dropout applied ``` **Why Inverted Dropout?** During training, surviving elements are scaled by `1/(1-p)` so that expected values match during inference. This eliminates the need to scale during evaluation, making deployment simpler. ### Layer Composition: Building Neural Networks Layers compose through sequential application - no container needed: ```python from tinytorch.core.layers import Linear, Dropout from tinytorch.core.activations import ReLU # Build 3-layer MNIST classifier manually layer1 = Linear(784, 256) activation1 = ReLU() dropout1 = Dropout(0.5) layer2 = Linear(256, 128) activation2 = ReLU() dropout2 = Dropout(0.3) layer3 = Linear(128, 10) # Forward pass: explicit composition shows data flow def forward(x): x = layer1(x) x = activation1(x) x = dropout1(x, training=True) x = layer2(x) x = activation2(x) x = dropout2(x, training=True) x = layer3(x) return x # Process batch x = Tensor(np.random.randn(32, 784)) # 32 MNIST images output = forward(x) # Shape: (32, 10) - class logits # Collect all parameters for training all_params = layer1.parameters() + layer2.parameters() + layer3.parameters() print(f"Total trainable parameters: {len(all_params)}") # 6 tensors (3 weights, 3 biases) ``` ## Getting Started ### Prerequisites Ensure you've completed the prerequisite modules: ```bash # Activate TinyTorch environment source bin/activate-tinytorch.sh # Verify Module 01 (Tensor) is complete tito test --module tensor # Verify Module 02 (Activations) is complete tito test --module activations ``` ### Development Workflow 1. **Open the development file**: `modules/03_layers/layers_dev.py` 2. **Implement Linear layer**: Build `__init__` with Xavier initialization, `forward` with matrix multiplication, and `parameters()` method 3. **Add Dropout layer**: Implement training/inference mode switching with proper mask generation and scaling 4. **Test layer composition**: Verify manual composition of multi-layer networks with mixed layer types 5. **Analyze systems behavior**: Run memory analysis to understand parameter scaling with network size 6. **Export and verify**: `tito module complete 03 && tito test --module layers` ## Testing ### Comprehensive Test Suite Run the full test suite to verify layer functionality: ```bash # TinyTorch CLI (recommended) tito test --module layers # Direct pytest execution python -m pytest tests/ -k layers -v ``` ### Test Coverage Areas - ✅ **Linear Layer Functionality**: Verify `y = xW + b` computation with correct matrix dimensions and broadcasting - ✅ **Xavier Initialization**: Ensure weights scaled by `sqrt(1/in_features)` for gradient stability - ✅ **Parameter Management**: Confirm `parameters()` returns all trainable tensors with `requires_grad=True` - ✅ **Dropout Training Mode**: Validate probabilistic masking with correct `1/(1-p)` scaling - ✅ **Dropout Inference Mode**: Verify passthrough behavior without modification during evaluation - ✅ **Layer Composition**: Test multi-layer forward passes with mixed layer types - ✅ **Edge Cases**: Handle empty batches, single samples, no-bias configurations, and probability boundaries ### Inline Testing & Validation The module includes comprehensive inline tests with educational feedback: ```python # Example inline test output 🔬 Unit Test: Linear Layer... ✅ Linear layer computes y = xW + b correctly ✅ Weight initialization within expected Xavier range ✅ Bias initialized to zeros ✅ Output shape matches expected dimensions (32, 256) ✅ Parameter list contains weight and bias tensors 📈 Progress: Linear Layer ✓ 🔬 Unit Test: Dropout Layer... ✅ Inference mode passes through unchanged ✅ Training mode zeros ~50% of elements ✅ Survivors scaled by 1/(1-p) = 2.0 ✅ Zero dropout (p=0.0) preserves all values ✅ Full dropout (p=1.0) zeros everything 📈 Progress: Dropout Layer ✓ 🔬 Integration Test: Multi-layer Network... ✅ 3-layer network processes batch: (32, 784) → (32, 10) ✅ Parameter count: 235,146 parameters across 6 tensors ✅ All parameters have requires_grad=True 📈 Progress: Layer Composition ✓ ``` ### Manual Testing Examples ```python from tinytorch.core.tensor import Tensor from tinytorch.core.layers import Linear, Dropout from tinytorch.core.activations import ReLU # Test Linear layer forward pass layer = Linear(784, 256) x = Tensor(np.random.randn(1, 784)) # Single MNIST image y = layer(x) print(f"Input: {x.shape} → Output: {y.shape}") # (1, 784) → (1, 256) # Test parameter counting params = layer.parameters() total = sum(p.data.size for p in params) print(f"Parameters: {total}") # 200,960 # Test Dropout behavior dropout = Dropout(0.5) x = Tensor(np.ones((1, 100))) y_train = dropout(x, training=True) y_eval = dropout(x, training=False) print(f"Training: ~{np.count_nonzero(y_train.data)} survived") # ~50 print(f"Inference: {np.count_nonzero(y_eval.data)} survived") # 100 # Test composition net = lambda x: layer3(dropout2(activation2(layer2(dropout1(activation1(layer1(x))))))) ``` ## Systems Thinking Questions ### Real-World Applications - **Computer Vision**: How do Linear layers in ResNet-50's final classification head transform 2048 feature maps to 1000 class logits? What determines this bottleneck layer's size? - **Language Models**: GPT-3 uses Linear layers with 12,288 input features. How much memory do these layers consume, and why does this limit model deployment? - **Recommendation Systems**: Netflix uses multi-layer networks with Dropout. How does `p=0.5` affect training time vs model accuracy on sparse user-item interactions? - **Edge Deployment**: A mobile CNN has 5 Linear layers totaling 2MB. How do you decide which layers to quantize or prune when targeting 500KB model size? ### Mathematical Foundations - **Xavier Initialization**: Why does `scale = sqrt(1/fan_in)` preserve gradient variance through layers? What happens in a 20-layer network without proper initialization? - **Matrix Multiplication Complexity**: A Linear(1024, 1024) layer with batch size 128 performs how many FLOPs? How does this compare to a Dropout layer on the same tensor? - **Dropout Mathematics**: During training with `p=0.5`, what's the expected value of each element? Why must we scale by `1/(1-p)` to match inference behavior? - **Parameter Growth**: If you double the hidden layer size from 256 to 512, how many times more parameters do you have in Linear(784, hidden) + Linear(hidden, 10)? ### Architecture Design Patterns - **Layer Width vs Depth**: A 784→512→10 network vs 784→256→256→10 - which has more parameters? Which typically generalizes better and why? - **Dropout Placement**: Should you place Dropout before or after activation functions? What's the difference between `Linear → ReLU → Dropout` vs `Linear → Dropout → ReLU`? - **Bias Necessity**: When can you safely use `bias=False`? How does batch normalization (Module 09) interact with bias terms? - **Composition Philosophy**: We deliberately avoided a Sequential container. What trade-offs do explicit composition and container abstractions make for debugging vs convenience? ### Performance Characteristics - **Memory Hierarchy**: A Linear(4096, 4096) layer has 16M parameters (64MB). Does this fit in L3 cache? How does cache performance affect training speed? - **Batch Size Scaling**: Measuring throughput from batch_size=1 to 512, why does samples/sec increase but eventually plateau? What's the bottleneck? - **Dropout Overhead**: Profiling shows Dropout adds 2% overhead to training time. Where is this cost - mask generation, element-wise multiply, or memory bandwidth? - **Parameter Memory vs Activation Memory**: In a 100-layer network, which dominates memory usage during training? How does gradient checkpointing address this? ## Ready to Build? You're about to implement the abstractions that power every neural network in production. Linear layers might seem deceptively simple - just matrix multiplication and bias addition - but this simplicity is the foundation of extraordinary complexity. From ResNet's 25 million parameters to GPT-3's 175 billion, every learned transformation ultimately reduces to chains of `y = xW + b`. Understanding layer composition is crucial for systems thinking. When you see "ResNet-50," you'll know exactly how parameter counts scale with depth. When debugging vanishing gradients, you'll understand why Xavier initialization matters. When deploying to mobile devices, you'll calculate memory footprints in your head. Take your time with this module. Test each component thoroughly. Analyze the memory patterns. Build the intuition for how these simple building blocks compose into intelligence. This is where deep learning becomes real. Choose your preferred way to engage with this module: ````{grid} 1 2 3 3 ```{grid-item-card} 🚀 Launch Binder :link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/03_layers/layers_dev.ipynb :class-header: bg-light Run this module interactively in your browser. No installation required! ``` ```{grid-item-card} ⚡ Open in Colab :link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/03_layers/layers_dev.ipynb :class-header: bg-light Use Google Colab for GPU access and cloud compute power. ``` ```{grid-item-card} 📖 View Source :link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/03_layers/layers_dev.py :class-header: bg-light Browse the Python source code and understand the implementation. ``` ```` ```{admonition} 💾 Save Your Progress :class: tip **Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. ``` ---
← Previous Module Next Module →