TinyTorch/modules/source/03_layers/layers_dev.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "794e99a4",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Module 03: Layers - Building Blocks of Neural Networks\n",
    "\n",
    "Welcome to Module 03! You're about to build the fundamental building blocks that make neural networks possible.\n",
    "\n",
    "## 🔗 Prerequisites & Progress\n",
    "**You've Built**: Tensor class (Module 01) with all operations and activations (Module 02)\n",
    "**You'll Build**: Linear layers and Dropout regularization\n",
    "**You'll Enable**: Multi-layer neural networks, trainable parameters, and forward passes\n",
    "\n",
    "**Connection Map**:\n",
    "```\n",
    "Tensor → Activations → Layers → Networks\n",
    "(data)   (intelligence) (building blocks) (architectures)\n",
    "```\n",
    "\n",
    "## Learning Objectives\n",
    "By the end of this module, you will:\n",
    "1. Implement Linear layers with proper weight initialization\n",
    "2. Add Dropout for regularization during training\n",
    "3. Understand parameter management and counting\n",
    "4. Test individual layer components\n",
    "\n",
    "Let's get started!\n",
    "\n",
    "## 📦 Where This Code Lives in the Final Package\n",
    "\n",
    "**Learning Side:** You work in modules/03_layers/layers_dev.py\n",
    "**Building Side:** Code exports to tinytorch.core.layers\n",
    "\n",
    "```python\n",
    "# Final package structure:\n",
    "from tinytorch.core.layers import Linear, Dropout  # This module\n",
    "from tinytorch.core.tensor import Tensor  # Module 01 - foundation\n",
    "from tinytorch.core.activations import ReLU, Sigmoid  # Module 02 - intelligence\n",
    "```\n",
    "\n",
    "**Why this matters:**\n",
    "- **Learning:** Complete layer system in one focused module for deep understanding\n",
    "- **Production:** Proper organization like PyTorch's torch.nn with all layer building blocks together\n",
    "- **Consistency:** All layer operations and parameter management in core.layers\n",
    "- **Integration:** Works seamlessly with tensors and activations for complete neural networks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "901fe04d",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "imports",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| default_exp core.layers\n",
    "#| export\n",
    "\n",
    "import numpy as np\n",
    "import sys\n",
    "import os\n",
    "\n",
    "# Import dependencies from tinytorch package\n",
    "from tinytorch.core.tensor import Tensor\n",
    "from tinytorch.core.activations import ReLU, Sigmoid"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "967152a3",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 1. Introduction: What are Neural Network Layers?\n",
    "\n",
    "Neural network layers are the fundamental building blocks that transform data as it flows through a network. Each layer performs a specific computation:\n",
    "\n",
    "- **Linear layers** apply learned transformations: `y = xW + b`\n",
    "- **Dropout layers** randomly zero elements for regularization\n",
    "\n",
    "Think of layers as processing stations in a factory:\n",
    "```\n",
    "Input Data → Layer 1 → Layer 2 → Layer 3 → Output\n",
    "    ↓          ↓         ↓         ↓         ↓\n",
    "  Features   Hidden   Hidden   Hidden   Predictions\n",
    "```\n",
    "\n",
    "Each layer learns its own piece of the puzzle. Linear layers learn which features matter, while dropout prevents overfitting by forcing robustness."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec1e941b",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 2. Foundations: Mathematical Background\n",
    "\n",
    "### Linear Layer Mathematics\n",
    "A linear layer implements: **y = xW + b**\n",
    "\n",
    "```\n",
    "Input x (batch_size, in_features)  @  Weight W (in_features, out_features)  +  Bias b (out_features)\n",
    "                                   =  Output y (batch_size, out_features)\n",
    "```\n",
    "\n",
    "### Weight Initialization\n",
    "Random initialization is crucial for breaking symmetry:\n",
    "- **Xavier/Glorot**: Scale by sqrt(1/fan_in) for stable gradients\n",
    "- **He**: Scale by sqrt(2/fan_in) for ReLU activation\n",
    "- **Too small**: Gradients vanish, learning is slow\n",
    "- **Too large**: Gradients explode, training unstable\n",
    "\n",
    "### Parameter Counting\n",
    "```\n",
    "Linear(784, 256): 784 × 256 + 256 = 200,960 parameters\n",
    "\n",
    "Manual composition:\n",
    "    layer1 = Linear(784, 256)  # 200,960 params\n",
    "    activation = ReLU()        # 0 params\n",
    "    layer2 = Linear(256, 10)   # 2,570 params\n",
    "                               # Total: 203,530 params\n",
    "```\n",
    "\n",
    "Memory usage: 4 bytes/param × 203,530 = ~814KB for weights alone"
   ]
  },
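  {
   "cell_type": "markdown",
   "id": "sketch-init-scales",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The effect of the initialization scale can be checked numerically. The sketch below uses plain NumPy rather than the Tensor class; the variable names, the seed, and the two deliberately bad scales (0.001 and 1.0) are illustrative assumptions, not part of the module's API. It measures the standard deviation of `x @ W` for unit-variance inputs under each scale:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)  # seeded for reproducibility\n",
    "fan_in, fan_out = 784, 256      # illustrative layer size\n",
    "\n",
    "# Standard deviation of the weights under each rule\n",
    "scales = {\n",
    "    'sqrt(1/fan_in)': np.sqrt(1.0 / fan_in),\n",
    "    'He sqrt(2/fan_in)': np.sqrt(2.0 / fan_in),\n",
    "    'too small (0.001)': 0.001,\n",
    "    'too large (1.0)': 1.0,\n",
    "}\n",
    "\n",
    "x = rng.standard_normal((1000, fan_in))  # unit-variance inputs\n",
    "out_std = {}\n",
    "for name, scale in scales.items():\n",
    "    W = rng.standard_normal((fan_in, fan_out)) * scale\n",
    "    out_std[name] = float((x @ W).std())\n",
    "    print(f'{name:>20}: std(xW) = {out_std[name]:.3f}')\n",
    "```\n",
    "\n",
    "With unit-variance inputs, the sqrt(1/fan_in) scale keeps the output standard deviation near 1. The too-small scale shrinks the signal and the too-large scale inflates it, and the effect compounds across stacked layers."
   ]
  },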
  {
   "cell_type": "markdown",
   "id": "908da7b4",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 3. Implementation: Building Layer Foundation\n",
    "\n",
    "Let's build our layer system step by step. We'll implement two essential layer types:\n",
    "\n",
    "1. **Linear Layer** - The workhorse of neural networks\n",
    "2. **Dropout Layer** - Prevents overfitting\n",
    "\n",
    "### Key Design Principles:\n",
    "- All methods defined INSIDE classes (no monkey-patching)\n",
    "- Parameter tensors have requires_grad=True (ready for Module 05)\n",
    "- Forward methods return new tensors, preserving immutability\n",
    "- parameters() method enables optimizer integration"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dad822a3",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🏗️ Linear Layer - The Foundation of Neural Networks\n",
    "\n",
    "Linear layers (also called Dense or Fully Connected layers) are the fundamental building blocks of neural networks. They implement the mathematical operation:\n",
    "\n",
    "**y = xW + b**\n",
    "\n",
    "Where:\n",
    "- **x**: Input features (what we know)\n",
    "- **W**: Weight matrix (what we learn)\n",
    "- **b**: Bias vector (adjusts the output)\n",
    "- **y**: Output features (what we predict)\n",
    "\n",
    "### Why Linear Layers Matter\n",
    "\n",
    "Linear layers learn **feature combinations**. Each output neuron asks: \"What combination of input features is most useful for my task?\" The network discovers these combinations through training.\n",
    "\n",
    "### Data Flow Visualization\n",
    "```\n",
    "Input Features     Weight Matrix        Bias Vector      Output Features\n",
    "[batch, in_feat] @ [in_feat, out_feat] + [out_feat]  =  [batch, out_feat]\n",
    "\n",
    "Example: MNIST Digit Recognition\n",
    "[32, 784]       @  [784, 10]          + [10]        =  [32, 10]\n",
    "  ↑                   ↑                    ↑             ↑\n",
    "32 images         784 pixels          10 classes    10 probabilities\n",
    "                  to 10 classes       adjustments   per image\n",
    "```\n",
    "\n",
    "### Memory Layout\n",
    "```\n",
    "Linear(784, 256) Parameters:\n",
    "┌─────────────────────────────┐\n",
    "│ Weight Matrix W             │  784 × 256 = 200,704 params\n",
    "│ [784, 256] float32          │  × 4 bytes = 802.8 KB\n",
    "├─────────────────────────────┤\n",
    "│ Bias Vector b               │  256 params\n",
    "│ [256] float32               │  × 4 bytes = 1.0 KB\n",
    "└─────────────────────────────┘\n",
    "                Total: 803.8 KB for one layer\n",
    "```"
   ]
  },
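  {
   "cell_type": "markdown",
   "id": "sketch-linear-shapes",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "Before implementing the class, the shape arithmetic in the data-flow diagram can be checked in plain NumPy. This is only a sketch: it uses raw arrays instead of the Tensor class, and the variable names and seed are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "\n",
    "x = rng.standard_normal((32, 784))                         # 32 flattened images\n",
    "W = rng.standard_normal((784, 10)) * np.sqrt(1.0 / 784)    # weight matrix\n",
    "b = np.zeros(10)                                           # one bias per class\n",
    "\n",
    "y = x @ W + b  # b has shape (10,) and is broadcast across all 32 rows\n",
    "print(y.shape)  # (32, 10)\n",
    "```\n",
    "\n",
    "The bias needs no explicit batch dimension because NumPy broadcasting applies the same `(10,)` vector to every row of the `(32, 10)` product."
   ]
  },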
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac6dc79d",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "linear-layer",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class Linear:\n",
    "    \"\"\"\n",
    "    Linear (fully connected) layer: y = xW + b\n",
    "\n",
    "    This is the fundamental building block of neural networks.\n",
    "    Applies a linear transformation to incoming data.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, in_features, out_features, bias=True):\n",
    "        \"\"\"\n",
    "        Initialize linear layer with proper weight initialization.\n",
    "\n",
    "        TODO: Initialize weights and bias with Xavier initialization\n",
    "\n",
    "        APPROACH:\n",
    "        1. Create weight matrix (in_features, out_features) with Xavier scaling\n",
    "        2. Create bias vector (out_features,) initialized to zeros if bias=True\n",
    "        3. Set requires_grad=True for parameters (ready for Module 05)\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> layer = Linear(784, 10)  # MNIST classifier final layer\n",
    "        >>> print(layer.weight.shape)\n",
    "        (784, 10)\n",
    "        >>> print(layer.bias.shape)\n",
    "        (10,)\n",
    "\n",
    "        HINTS:\n",
    "        - Xavier init: scale = sqrt(1/in_features)\n",
    "        - Use np.random.randn() for normal distribution\n",
    "        - bias=None when bias=False\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.in_features = in_features\n",
    "        self.out_features = out_features\n",
    "\n",
    "        # Xavier/Glorot initialization for stable gradients\n",
    "        scale = np.sqrt(1.0 / in_features)\n",
    "        weight_data = np.random.randn(in_features, out_features) * scale\n",
    "        self.weight = Tensor(weight_data, requires_grad=True)\n",
    "\n",
    "        # Initialize bias to zeros or None\n",
    "        if bias:\n",
    "            bias_data = np.zeros(out_features)\n",
    "            self.bias = Tensor(bias_data, requires_grad=True)\n",
    "        else:\n",
    "            self.bias = None\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def forward(self, x):\n",
    "        \"\"\"\n",
    "        Forward pass through linear layer.\n",
    "\n",
    "        TODO: Implement y = xW + b\n",
    "\n",
    "        APPROACH:\n",
    "        1. Matrix multiply input with weights: xW\n",
    "        2. Add bias if it exists\n",
    "        3. Return result as new Tensor\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> layer = Linear(3, 2)\n",
    "        >>> x = Tensor([[1, 2, 3], [4, 5, 6]])  # 2 samples, 3 features\n",
    "        >>> y = layer.forward(x)\n",
    "        >>> print(y.shape)\n",
    "        (2, 2)  # 2 samples, 2 outputs\n",
    "\n",
    "        HINTS:\n",
    "        - Use tensor.matmul() for matrix multiplication\n",
    "        - Handle bias=None case\n",
    "        - Broadcasting automatically handles bias addition\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Linear transformation: y = xW\n",
    "        output = x.matmul(self.weight)\n",
    "\n",
    "        # Add bias if present\n",
    "        if self.bias is not None:\n",
    "            output = output + self.bias\n",
    "\n",
    "        return output\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def __call__(self, x):\n",
    "        \"\"\"Allows the layer to be called like a function.\"\"\"\n",
    "        return self.forward(x)\n",
    "\n",
    "    def parameters(self):\n",
    "        \"\"\"\n",
    "        Return list of trainable parameters.\n",
    "\n",
    "        TODO: Return all tensors that need gradients\n",
    "\n",
    "        APPROACH:\n",
    "        1. Start with weight (always present)\n",
    "        2. Add bias if it exists\n",
    "        3. Return as list for optimizer\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        params = [self.weight]\n",
    "        if self.bias is not None:\n",
    "            params.append(self.bias)\n",
    "        return params\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def __repr__(self):\n",
    "        \"\"\"String representation for debugging.\"\"\"\n",
    "        bias_str = f\", bias={self.bias is not None}\"\n",
    "        return f\"Linear(in_features={self.in_features}, out_features={self.out_features}{bias_str})\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff32f81b",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: Linear Layer\n",
    "This test validates our Linear layer implementation works correctly.\n",
    "**What we're testing**: Weight initialization, forward pass, parameter management\n",
    "**Why it matters**: Foundation for all neural network architectures\n",
    "**Expected**: Proper shapes, Xavier scaling, parameter counting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5b2ca52",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-linear",
     "locked": true,
     "points": 15
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_linear_layer():\n",
    "    \"\"\"🔬 Test Linear layer implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Linear Layer...\")\n",
    "\n",
    "    # Test layer creation\n",
    "    layer = Linear(784, 256)\n",
    "    assert layer.in_features == 784\n",
    "    assert layer.out_features == 256\n",
    "    assert layer.weight.shape == (784, 256)\n",
    "    assert layer.bias.shape == (256,)\n",
    "    assert layer.weight.requires_grad == True\n",
    "    assert layer.bias.requires_grad == True\n",
    "\n",
    "    # Test Xavier initialization (weights should be reasonably scaled)\n",
    "    weight_std = np.std(layer.weight.data)\n",
    "    expected_std = np.sqrt(1.0 / 784)\n",
    "    assert 0.5 * expected_std < weight_std < 2.0 * expected_std, f\"Weight std {weight_std} not close to Xavier {expected_std}\"\n",
    "\n",
    "    # Test bias initialization (should be zeros)\n",
    "    assert np.allclose(layer.bias.data, 0), \"Bias should be initialized to zeros\"\n",
    "\n",
    "    # Test forward pass\n",
    "    x = Tensor(np.random.randn(32, 784))  # Batch of 32 samples\n",
    "    y = layer.forward(x)\n",
    "    assert y.shape == (32, 256), f\"Expected shape (32, 256), got {y.shape}\"\n",
    "\n",
    "    # Test no bias option\n",
    "    layer_no_bias = Linear(10, 5, bias=False)\n",
    "    assert layer_no_bias.bias is None\n",
    "    params = layer_no_bias.parameters()\n",
    "    assert len(params) == 1  # Only weight, no bias\n",
    "\n",
    "    # Test parameters method\n",
    "    params = layer.parameters()\n",
    "    assert len(params) == 2  # Weight and bias\n",
    "    assert params[0] is layer.weight\n",
    "    assert params[1] is layer.bias\n",
    "\n",
    "    print(\"✅ Linear layer works correctly!\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_linear_layer()\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba15fcbb",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🎲 Dropout Layer - Preventing Overfitting\n",
    "\n",
    "Dropout is a regularization technique that randomly \"turns off\" neurons during training. This forces the network to not rely too heavily on any single neuron, making it more robust and generalizable.\n",
    "\n",
    "### Why Dropout Matters\n",
    "\n",
    "**The Problem**: Neural networks can memorize training data instead of learning generalizable patterns. This leads to poor performance on new, unseen data.\n",
    "\n",
    "**The Solution**: Dropout randomly zeros out neurons, forcing the network to learn multiple independent ways to solve the problem.\n",
    "\n",
    "### Dropout in Action\n",
    "```\n",
    "Training Mode (p=0.5 dropout):\n",
    "Input:  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]\n",
    "         ↓ Random mask with 50% survival rate\n",
    "Mask:   [1,   0,   1,   0,   1,   1,   0,   1  ]\n",
    "         ↓ Apply mask and scale by 1/(1-p) = 2.0\n",
    "Output: [2.0, 0.0, 6.0, 0.0, 10.0, 12.0, 0.0, 16.0]\n",
    "\n",
    "Inference Mode (no dropout):\n",
    "Input:  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]\n",
    "         ↓ Pass through unchanged\n",
    "Output: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]\n",
    "```\n",
    "\n",
    "### Training vs Inference Behavior\n",
    "```\n",
    "                Training Mode              Inference Mode\n",
    "               ┌─────────────────┐        ┌─────────────────┐\n",
    "Input Features │ [×] [ ] [×] [×] │        │ [×] [×] [×] [×] │\n",
    "               │ Active Dropped  │   →    │   All Active    │\n",
    "               │ Active Active   │        │                 │\n",
    "               └─────────────────┘        └─────────────────┘\n",
    "                      ↓                           ↓\n",
    "                \"Learn robustly\"            \"Use all knowledge\"\n",
    "```\n",
    "\n",
    "### Memory and Performance\n",
    "```\n",
    "Dropout Memory Usage:\n",
    "┌─────────────────────────────┐\n",
    "│ Input Tensor: X MB          │\n",
    "├─────────────────────────────┤\n",
    "│ Random Mask: X/4 MB         │  (boolean mask, 1 byte/element)\n",
    "├─────────────────────────────┤\n",
    "│ Output Tensor: X MB         │\n",
    "└─────────────────────────────┘\n",
    "        Total: ~2.25X MB peak memory\n",
    "\n",
    "Computational Overhead: Minimal (element-wise operations)\n",
    "```"
   ]
  },
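  {
   "cell_type": "markdown",
   "id": "sketch-inverted-dropout",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The mask-and-scale step shown in the diagram can be sketched in plain NumPy to see why scaling by 1/(1-p) preserves the expected value. `inverted_dropout` is a hypothetical helper written for this illustration. It is not the Dropout class you are about to build, and it assumes p < 1.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def inverted_dropout(x, p, rng):\n",
    "    # Keep each element with probability 1 - p, then rescale the survivors\n",
    "    keep_prob = 1.0 - p\n",
    "    mask = rng.random(x.shape) < keep_prob\n",
    "    return x * mask / keep_prob\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "x = np.ones(100_000)\n",
    "y = inverted_dropout(x, p=0.5, rng=rng)\n",
    "\n",
    "print(np.unique(y))  # [0. 2.]\n",
    "print(y.mean())      # close to 1.0, the mean of the input\n",
    "```\n",
    "\n",
    "Because the scaling happens during training, inference needs no correction and can pass the input through unchanged."
   ]
  },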
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "644af0ae",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "dropout-layer",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class Dropout:\n",
    "    \"\"\"\n",
    "    Dropout layer for regularization.\n",
    "\n",
    "    During training: randomly zeros elements with probability p\n",
    "    During inference: scales outputs by (1-p) to maintain expected value\n",
    "\n",
    "    This prevents overfitting by forcing the network to not rely on specific neurons.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, p=0.5):\n",
    "        \"\"\"\n",
    "        Initialize dropout layer.\n",
    "\n",
    "        TODO: Store dropout probability\n",
    "\n",
    "        Args:\n",
    "            p: Probability of zeroing each element (0.0 = no dropout, 1.0 = zero everything)\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> dropout = Dropout(0.5)  # Zero 50% of elements during training\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if not 0.0 <= p <= 1.0:\n",
    "            raise ValueError(f\"Dropout probability must be between 0 and 1, got {p}\")\n",
    "        self.p = p\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def forward(self, x, training=True):\n",
    "        \"\"\"\n",
    "        Forward pass through dropout layer.\n",
    "\n",
    "        TODO: Apply dropout during training, pass through during inference\n",
    "\n",
    "        APPROACH:\n",
    "        1. If not training, return input unchanged\n",
    "        2. If training, create random mask with probability (1-p)\n",
    "        3. Multiply input by mask and scale by 1/(1-p)\n",
    "        4. Return result as new Tensor\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> dropout = Dropout(0.5)\n",
    "        >>> x = Tensor([1, 2, 3, 4])\n",
    "        >>> y_train = dropout.forward(x, training=True)   # Some elements zeroed\n",
    "        >>> y_eval = dropout.forward(x, training=False)   # All elements preserved\n",
    "\n",
    "        HINTS:\n",
    "        - Use np.random.random() < keep_prob for mask\n",
    "        - Scale by 1/(1-p) to maintain expected value\n",
    "        - training=False should return input unchanged\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if not training or self.p == 0.0:\n",
    "            # During inference or no dropout, pass through unchanged\n",
    "            return x\n",
    "\n",
    "        if self.p == 1.0:\n",
    "            # Drop everything (preserve requires_grad for gradient flow)\n",
    "            return Tensor(np.zeros_like(x.data), requires_grad=x.requires_grad if hasattr(x, 'requires_grad') else False)\n",
    "\n",
    "        # During training, apply dropout\n",
    "        keep_prob = 1.0 - self.p\n",
    "\n",
    "        # Create random mask: True where we keep elements\n",
    "        mask = np.random.random(x.data.shape) < keep_prob\n",
    "\n",
    "        # Apply mask and scale using Tensor operations to preserve gradients!\n",
    "        mask_tensor = Tensor(mask.astype(np.float32), requires_grad=False)  # Mask doesn't need gradients\n",
    "        scale = Tensor(np.array(1.0 / keep_prob), requires_grad=False)\n",
    "        \n",
    "        # Use Tensor operations: x * mask * scale\n",
    "        output = x * mask_tensor * scale\n",
    "        return output\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def __call__(self, x, training=True):\n",
    "        \"\"\"Allows the layer to be called like a function.\"\"\"\n",
    "        return self.forward(x, training)\n",
    "\n",
    "    def parameters(self):\n",
    "        \"\"\"Dropout has no parameters.\"\"\"\n",
    "        return []\n",
    "\n",
    "    def __repr__(self):\n",
    "        return f\"Dropout(p={self.p})\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62a0de23",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: Dropout Layer\n",
    "This test validates our Dropout layer implementation works correctly.\n",
    "**What we're testing**: Training vs inference behavior, probability scaling, randomness\n",
    "**Why it matters**: Essential for preventing overfitting in neural networks\n",
    "**Expected**: Correct masking during training, passthrough during inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3877feeb",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-dropout",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_dropout_layer():\n",
    "    \"\"\"🔬 Test Dropout layer implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Dropout Layer...\")\n",
    "\n",
    "    # Test dropout creation\n",
    "    dropout = Dropout(0.5)\n",
    "    assert dropout.p == 0.5\n",
    "\n",
    "    # Test inference mode (should pass through unchanged)\n",
    "    x = Tensor([1, 2, 3, 4])\n",
    "    y_inference = dropout.forward(x, training=False)\n",
    "    assert np.array_equal(x.data, y_inference.data), \"Inference should pass through unchanged\"\n",
    "\n",
    "    # Test training mode with zero dropout (should pass through unchanged)\n",
    "    dropout_zero = Dropout(0.0)\n",
    "    y_zero = dropout_zero.forward(x, training=True)\n",
    "    assert np.array_equal(x.data, y_zero.data), \"Zero dropout should pass through unchanged\"\n",
    "\n",
    "    # Test training mode with full dropout (should zero everything)\n",
    "    dropout_full = Dropout(1.0)\n",
    "    y_full = dropout_full.forward(x, training=True)\n",
    "    assert np.allclose(y_full.data, 0), \"Full dropout should zero everything\"\n",
    "\n",
    "    # Test training mode with partial dropout\n",
    "    # Note: This is probabilistic, so we test statistical properties\n",
    "    np.random.seed(42)  # For reproducible test\n",
    "    x_large = Tensor(np.ones((1000,)))  # Large tensor for statistical significance\n",
    "    y_train = dropout.forward(x_large, training=True)\n",
    "\n",
    "    # Count non-zero elements (approximately 50% should survive)\n",
    "    non_zero_count = np.count_nonzero(y_train.data)\n",
    "    expected_survival = 1000 * 0.5\n",
    "    # Allow 10% tolerance for randomness\n",
    "    assert 0.4 * 1000 < non_zero_count < 0.6 * 1000, f\"Expected ~500 survivors, got {non_zero_count}\"\n",
    "\n",
    "    # Test scaling (surviving elements should be scaled by 1/(1-p) = 2.0)\n",
    "    surviving_values = y_train.data[y_train.data != 0]\n",
    "    expected_value = 2.0  # 1.0 / (1 - 0.5)\n",
    "    assert np.allclose(surviving_values, expected_value), f\"Surviving values should be {expected_value}\"\n",
    "\n",
    "    # Test no parameters\n",
    "    params = dropout.parameters()\n",
    "    assert len(params) == 0, \"Dropout should have no parameters\"\n",
    "\n",
    "    # Test invalid probability\n",
    "    try:\n",
    "        Dropout(-0.1)\n",
    "        assert False, \"Should raise ValueError for negative probability\"\n",
    "    except ValueError:\n",
    "        pass\n",
    "\n",
    "    try:\n",
    "        Dropout(1.1)\n",
    "        assert False, \"Should raise ValueError for probability > 1\"\n",
    "    except ValueError:\n",
    "        pass\n",
    "\n",
    "    print(\"✅ Dropout layer works correctly!\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_dropout_layer()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbb58951",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 2
   },
   "source": [
    "## 4. Integration: Bringing It Together\n",
    "\n",
    "Now that we've built both layer types, let's see how they work together to create a complete neural network architecture. We'll manually compose a realistic 3-layer MLP for MNIST digit classification.\n",
    "\n",
    "### Network Architecture Visualization\n",
    "```\n",
    "MNIST Classification Network (3-Layer MLP):\n",
    "\n",
    "    Input Layer          Hidden Layer 1        Hidden Layer 2        Output Layer\n",
    "┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐\n",
    "│     784         │    │      256        │    │      128        │    │       10        │\n",
    "│   Pixels        │───▶│   Features      │───▶│   Features      │───▶│    Classes      │\n",
    "│  (28×28 image)  │    │   + ReLU        │    │   + ReLU        │    │  (0-9 digits)   │\n",
    "│                 │    │   + Dropout     │    │   + Dropout     │    │                 │\n",
    "└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘\n",
    "        ↓                       ↓                       ↓                       ↓\n",
    "   \"Raw pixels\"          \"Edge detectors\"        \"Shape detectors\"        \"Digit classifier\"\n",
    "\n",
    "Data Flow:\n",
    "[32, 784] → Linear(784,256) → ReLU → Dropout(0.5) → Linear(256,128) → ReLU → Dropout(0.3) → Linear(128,10) → [32, 10]\n",
    "```\n",
    "\n",
    "### Parameter Count Analysis\n",
    "```\n",
    "Parameter Breakdown (Manual Layer Composition):\n",
    "┌─────────────────────────────────────────────────────────────┐\n",
    "│ layer1 = Linear(784 → 256)                               │\n",
    "│   Weights: 784 × 256 = 200,704 params                      │\n",
    "│   Bias:    256 params                                       │\n",
    "│   Subtotal: 200,960 params                                  │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│ activation1 = ReLU(), dropout1 = Dropout(0.5)              │\n",
    "│   Parameters: 0 (no learnable weights)                      │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│ layer2 = Linear(256 → 128)                               │\n",
    "│   Weights: 256 × 128 = 32,768 params                       │\n",
    "│   Bias:    128 params                                       │\n",
    "│   Subtotal: 32,896 params                                   │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│ activation2 = ReLU(), dropout2 = Dropout(0.3)              │\n",
    "│   Parameters: 0 (no learnable weights)                      │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│ layer3 = Linear(128 → 10)                                │\n",
    "│   Weights: 128 × 10 = 1,280 params                         │\n",
    "│   Bias:    10 params                                        │\n",
    "│   Subtotal: 1,290 params                                    │\n",
    "└─────────────────────────────────────────────────────────────┘\n",
    "                    TOTAL: 235,146 parameters\n",
    "                    Memory: ~940 KB (float32)\n",
    "```"
   ]
  },
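  {
   "cell_type": "markdown",
   "id": "sketch-param-count",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The breakdown above can be reproduced with a few lines of Python. `linear_param_count` is a hypothetical helper used only for this check; it is not part of tinytorch.\n",
    "\n",
    "```python\n",
    "def linear_param_count(in_features, out_features, bias=True):\n",
    "    # Weight matrix entries plus one bias per output feature (if enabled)\n",
    "    return in_features * out_features + (out_features if bias else 0)\n",
    "\n",
    "layers = [(784, 256), (256, 128), (128, 10)]\n",
    "counts = [linear_param_count(i, o) for i, o in layers]\n",
    "total = sum(counts)\n",
    "\n",
    "print(counts)                        # [200960, 32896, 1290]\n",
    "print(total)                         # 235146\n",
    "print(f'{total * 4 / 1000:.1f} KB')  # 940.6 KB at 4 bytes per float32\n",
    "```\n",
    "\n",
    "ReLU and Dropout contribute nothing to the total, so only the three Linear layers appear in the list."
   ]
  },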
  {
   "cell_type": "markdown",
   "id": "fee73cb8",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 5. Systems Analysis: Memory and Performance\n",
    "\n",
    "Now let's analyze the systems characteristics of our layer implementations. Understanding memory usage and computational complexity helps us build efficient neural networks.\n",
    "\n",
    "### Memory Analysis Overview\n",
    "```\n",
    "Layer Memory Components:\n",
    "┌─────────────────────────────────────────────────────────────┐\n",
    "│                    PARAMETER MEMORY                         │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│ • Weights: Persistent, shared across batches               │\n",
    "│ • Biases: Small but necessary for output shifting          │\n",
    "│ • Total: Grows with network width and depth                │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│                   ACTIVATION MEMORY                         │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│ • Input tensors: batch_size × features × 4 bytes           │\n",
    "│ • Output tensors: batch_size × features × 4 bytes          │\n",
    "│ • Intermediate results during forward pass                  │\n",
    "│ • Total: Grows with batch size and layer width             │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│                   TEMPORARY MEMORY                          │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│ • Dropout masks: batch_size × features × 1 byte            │\n",
    "│ • Computation buffers for matrix operations                 │\n",
    "│ • Total: Peak during forward/backward passes               │\n",
    "└─────────────────────────────────────────────────────────────┘\n",
    "```\n",
    "\n",
    "### Computational Complexity Overview\n",
    "```\n",
    "Layer Operation Complexity:\n",
    "┌─────────────────────────────────────────────────────────────┐\n",
    "│ Linear Layer Forward Pass:                                  │\n",
    "│   Matrix Multiply: O(batch × in_features × out_features)    │\n",
    "│   Bias Addition: O(batch × out_features)                    │\n",
    "│   Dominant: Matrix multiplication                           │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│ Multi-layer Forward Pass:                                   │\n",
    "│   Sum of all layer complexities                             │\n",
    "│   Memory: Peak of all intermediate activations              │\n",
    "├─────────────────────────────────────────────────────────────┤\n",
    "│ Dropout Forward Pass:                                        │\n",
    "│   Mask Generation: O(elements)                              │\n",
    "│   Element-wise Multiply: O(elements)                        │\n",
    "│   Overhead: Minimal compared to linear layers               │\n",
    "└─────────────────────────────────────────────────────────────┘\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fc6a34e",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "analyze-layer-memory",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def analyze_layer_memory():\n",
    "    \"\"\"📊 Analyze memory usage patterns in layer operations.\"\"\"\n",
    "    print(\"📊 Analyzing Layer Memory Usage...\")\n",
    "\n",
    "    # Test different layer sizes\n",
    "    layer_configs = [\n",
    "        (784, 256),   # MNIST → hidden\n",
    "        (256, 256),   # Hidden → hidden\n",
    "        (256, 10),    # Hidden → output\n",
    "        (2048, 2048), # Large hidden\n",
    "    ]\n",
    "\n",
    "    print(\"\\nLinear Layer Memory Analysis:\")\n",
    "    print(\"Configuration → Weight Memory → Bias Memory → Total Memory\")\n",
    "\n",
    "    for in_feat, out_feat in layer_configs:\n",
    "        # Calculate memory usage\n",
    "        weight_memory = in_feat * out_feat * 4  # 4 bytes per float32\n",
    "        bias_memory = out_feat * 4\n",
    "        total_memory = weight_memory + bias_memory\n",
    "\n",
    "        print(f\"({in_feat:4d}, {out_feat:4d}) → {weight_memory/1024:7.1f} KB → {bias_memory/1024:6.1f} KB → {total_memory/1024:7.1f} KB\")\n",
    "\n",
    "    # Analyze multi-layer memory scaling\n",
    "    print(\"\\n💡 Multi-layer Model Memory Scaling:\")\n",
    "    hidden_sizes = [128, 256, 512, 1024, 2048]\n",
    "\n",
    "    for hidden_size in hidden_sizes:\n",
    "        # 3-layer MLP: 784 → hidden → hidden/2 → 10\n",
    "        layer1_params = 784 * hidden_size + hidden_size\n",
    "        layer2_params = hidden_size * (hidden_size // 2) + (hidden_size // 2)\n",
    "        layer3_params = (hidden_size // 2) * 10 + 10\n",
    "\n",
    "        total_params = layer1_params + layer2_params + layer3_params\n",
    "        memory_mb = total_params * 4 / (1024 * 1024)\n",
    "\n",
    "        print(f\"Hidden={hidden_size:4d}: {total_params:7,} params = {memory_mb:5.1f} MB\")\n",
    "\n",
    "# Analysis will be run in main block"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16816429",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "analyze-layer-performance",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def analyze_layer_performance():\n",
    "    \"\"\"📊 Analyze computational complexity of layer operations.\"\"\"\n",
    "    print(\"📊 Analyzing Layer Computational Complexity...\")\n",
    "\n",
    "    # Test forward pass FLOPs\n",
    "    batch_sizes = [1, 32, 128, 512]\n",
    "    layer = Linear(784, 256)\n",
    "\n",
    "    print(\"\\nLinear Layer FLOPs Analysis:\")\n",
    "    print(\"Batch Size → Matrix Multiply FLOPs → Bias Add FLOPs → Total FLOPs\")\n",
    "\n",
    "    for batch_size in batch_sizes:\n",
    "        # Matrix multiplication: (batch, in) @ (in, out) = batch * in * out FLOPs\n",
    "        matmul_flops = batch_size * 784 * 256\n",
    "        # Bias addition: batch * out FLOPs\n",
    "        bias_flops = batch_size * 256\n",
    "        total_flops = matmul_flops + bias_flops\n",
    "\n",
    "        print(f\"{batch_size:10d} → {matmul_flops:15,} → {bias_flops:13,} → {total_flops:11,}\")\n",
    "\n",
    "    print(\"\\n💡 Key Insights:\")\n",
    "    print(\"🚀 Linear layer complexity: O(batch_size × in_features × out_features)\")\n",
    "    print(\"🚀 Memory grows linearly with batch size, quadratically with layer width\")\n",
    "    print(\"🚀 Dropout adds minimal computational overhead (element-wise operations)\")\n",
    "\n",
    "# Analysis will be run in main block"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b80cd94",
   "metadata": {
    "lines_to_next_cell": 1
   },
   "source": [
    "\"\"\"\n",
    "# 🧪 Module Integration Test\n",
    "\n",
    "Final validation that everything works together correctly.\n",
    "\"\"\"\n",
    "\n",
    "def import_previous_module(module_name: str, component_name: str):\n",
    "    import sys\n",
    "    import os\n",
    "    sys.path.append(os.path.join(os.path.dirname(__file__), '..', module_name))\n",
    "    module = __import__(f\"{module_name.split('_')[1]}_dev\")\n",
    "    return getattr(module, component_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a80be9e",
   "metadata": {
    "lines_to_next_cell": 2,
    "nbgrader": {
     "grade": true,
     "grade_id": "module-integration",
     "locked": true,
     "points": 20
    }
   },
   "outputs": [],
   "source": [
    "def test_module():\n",
    "    \"\"\"\n",
    "    Comprehensive test of entire module functionality.\n",
    "\n",
    "    This final test runs before module summary to ensure:\n",
    "    - All unit tests pass\n",
    "    - Functions work together correctly\n",
    "    - Module is ready for integration with TinyTorch\n",
    "    \"\"\"\n",
    "    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
    "    print(\"=\" * 50)\n",
    "\n",
    "    # Run all unit tests\n",
    "    print(\"Running unit tests...\")\n",
    "    test_unit_linear_layer()\n",
    "    test_unit_dropout_layer()\n",
    "\n",
    "    print(\"\\nRunning integration scenarios...\")\n",
    "\n",
    "    # Test realistic neural network construction with manual composition\n",
    "    print(\"🔬 Integration Test: Multi-layer Network...\")\n",
    "\n",
    "    # Import real activation from module 02 using standardized helper\n",
    "    ReLU = import_previous_module('02_activations', 'ReLU')\n",
    "\n",
    "    # Build individual layers for manual composition\n",
    "    layer1 = Linear(784, 128)\n",
    "    activation1 = ReLU()\n",
    "    dropout1 = Dropout(0.5)\n",
    "    layer2 = Linear(128, 64)\n",
    "    activation2 = ReLU()\n",
    "    dropout2 = Dropout(0.3)\n",
    "    layer3 = Linear(64, 10)\n",
    "\n",
    "    # Test end-to-end forward pass with manual composition\n",
    "    batch_size = 16\n",
    "    x = Tensor(np.random.randn(batch_size, 784))\n",
    "\n",
    "    # Manual forward pass\n",
    "    x = layer1.forward(x)\n",
    "    x = activation1.forward(x)\n",
    "    x = dropout1.forward(x)\n",
    "    x = layer2.forward(x)\n",
    "    x = activation2.forward(x)\n",
    "    x = dropout2.forward(x)\n",
    "    output = layer3.forward(x)\n",
    "\n",
    "    assert output.shape == (batch_size, 10), f\"Expected output shape ({batch_size}, 10), got {output.shape}\"\n",
    "\n",
    "    # Test parameter counting from individual layers\n",
    "    all_params = layer1.parameters() + layer2.parameters() + layer3.parameters()\n",
    "    expected_params = 6  # 3 weights + 3 biases from 3 Linear layers\n",
    "    assert len(all_params) == expected_params, f\"Expected {expected_params} parameters, got {len(all_params)}\"\n",
    "\n",
    "    # Test all parameters have requires_grad=True\n",
    "    for param in all_params:\n",
    "        assert param.requires_grad == True, \"All parameters should have requires_grad=True\"\n",
    "\n",
    "    # Test individual layer functionality\n",
    "    test_x = Tensor(np.random.randn(4, 784))\n",
    "    # Test dropout in training vs inference\n",
    "    dropout_test = Dropout(0.5)\n",
    "    train_output = dropout_test.forward(test_x, training=True)\n",
    "    infer_output = dropout_test.forward(test_x, training=False)\n",
    "    assert np.array_equal(test_x.data, infer_output.data), \"Inference mode should pass through unchanged\"\n",
    "\n",
    "    print(\"✅ Multi-layer network integration works!\")\n",
    "\n",
    "    print(\"\\n\" + \"=\" * 50)\n",
    "    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
    "    print(\"Run: tito module complete 03_layers\")\n",
    "\n",
    "# Run comprehensive module test\n",
    "if __name__ == \"__main__\":\n",
    "    test_module()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93360ac7",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🎯 MODULE SUMMARY: Layers\n",
    "\n",
    "Congratulations! You've built the fundamental building blocks that make neural networks possible!\n",
    "\n",
    "### Key Accomplishments\n",
    "- Built Linear layers with proper Xavier initialization and parameter management\n",
    "- Created Dropout layers for regularization with training/inference mode handling\n",
    "- Demonstrated manual layer composition for building neural networks\n",
    "- Analyzed memory scaling and computational complexity of layer operations\n",
    "- All tests pass ✅ (validated by `test_module()`)\n",
    "\n",
    "### Ready for Next Steps\n",
    "Your layer implementation enables building complete neural networks! The Linear layer provides learnable transformations, manual composition chains them together, and Dropout prevents overfitting.\n",
    "\n",
    "Export with: `tito module complete 03_layers`\n",
    "\n",
    "**Next**: Module 04 will add loss functions (CrossEntropyLoss, MSELoss) that measure how wrong your model is - the foundation for learning!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}