TinyTorch/modules/04_layers/layers_dev.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6cd42919",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Layers - Neural Network Building Blocks and Composition Patterns\n",
    "\n",
    "Welcome to the Layers module! You'll build the fundamental components that stack together to form any neural network architecture, from simple perceptrons to transformers.\n",
    "\n",
    "## Learning Goals\n",
    "- Systems understanding: How layer composition creates complex function approximators and why stacking enables deep learning\n",
    "- Core implementation skill: Build matrix multiplication and Dense layers with proper parameter management\n",
    "- Pattern recognition: Understand how different layer types solve different computational problems\n",
    "- Framework connection: See how your layer implementations mirror PyTorch's nn.Module design patterns\n",
    "- Performance insight: Learn why layer computation order and memory layout determine training speed\n",
    "\n",
    "## Build → Use → Reflect\n",
    "1. **Build**: Matrix multiplication primitives and Dense layers with parameter initialization strategies\n",
    "2. **Use**: Compose layers into multi-layer networks and observe how data transforms through the stack\n",
    "3. **Reflect**: Why does layer depth enable more complex functions, and when does it hurt performance?\n",
    "\n",
    "## What You'll Achieve\n",
    "By the end of this module, you'll understand:\n",
    "- Deep technical understanding of how matrix operations enable neural networks to learn arbitrary functions\n",
    "- Practical capability to build and compose layers into complex architectures\n",
    "- Systems insight into why layer composition is the fundamental pattern for scalable ML systems\n",
    "- Performance consideration of how layer size and depth affect memory usage and computational cost\n",
    "- Connection to production ML systems and how frameworks optimize layer execution for different hardware\n",
    "\n",
    "## Systems Reality Check\n",
    "💡 **Production Context**: PyTorch's nn.Linear uses optimized BLAS operations and can automatically select GPU vs CPU execution based on data size\n",
    "⚡ **Performance Note**: Large matrix multiplications can be memory-bound rather than compute-bound - understanding this shapes how production systems optimize layer execution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "921f1b43",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "layers-imports",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| default_exp core.layers\n",
    "\n",
    "#| export\n",
    "import numpy as np\n",
    "import sys\n",
    "import os\n",
    "from typing import Union, Tuple, Optional, Any\n",
    "\n",
    "# Import our building blocks - try package first, then local modules\n",
    "try:\n",
    "    from tinytorch.core.tensor import Tensor, Parameter\n",
    "except ImportError:\n",
    "    # For development, import from local modules\n",
    "    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor'))\n",
    "    from tensor_dev import Tensor, Parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d342e264",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "layers-setup",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "print(\"🔥 TinyTorch Layers Module\")\n",
    "print(f\"NumPy version: {np.__version__}\")\n",
    "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
    "print(\"Ready to build neural network layers!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37720590",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Module Base Class - Neural Network Foundation\n",
    "\n",
    "Before building specific layers like Dense and Conv2d, we need a base class that handles parameter management and provides a clean interface. This is the foundation that makes neural networks composable and easy to use.\n",
    "\n",
    "### Why We Need a Module Base Class\n",
    "\n",
    "🏗️ **Organization**: Automatic parameter collection across all layers  \n",
    "🔄 **Composition**: Modules can contain other modules (networks of networks)  \n",
    "🎯 **Clean API**: Enable `model(input)` instead of `model.forward(input)`  \n",
    "📦 **PyTorch Compatibility**: Same patterns as `torch.nn.Module`  \n",
    "\n",
    "Let's build the foundation that will make all our neural network code clean and powerful:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c167643",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "module-base-class",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class Module:\n",
    "    \"\"\"\n",
    "    Base class for all neural network modules.\n",
    "    \n",
    "    Provides automatic parameter collection, forward pass management,\n",
    "    and clean composition patterns. All layers (Dense, Conv2d, etc.)\n",
    "    inherit from this class.\n",
    "    \n",
    "    Key Features:\n",
    "    - Automatic parameter registration when you assign Tensors with requires_grad=True\n",
    "    - Recursive parameter collection from sub-modules\n",
    "    - Clean __call__ interface: model(x) instead of model.forward(x)\n",
    "    - Extensible for custom layers\n",
    "    \n",
    "    Example Usage:\n",
    "        class MLP(Module):\n",
    "            def __init__(self):\n",
    "                super().__init__()\n",
    "                self.layer1 = Dense(784, 128)  # Auto-registered!\n",
    "                self.layer2 = Dense(128, 10)   # Auto-registered!\n",
    "                \n",
    "            def forward(self, x):\n",
    "                x = self.layer1(x)\n",
    "                return self.layer2(x)\n",
    "                \n",
    "        model = MLP()\n",
    "        params = model.parameters()  # Gets all parameters automatically!\n",
    "        output = model(input)        # Clean interface!\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self):\n",
    "        \"\"\"Initialize module with empty parameter and sub-module storage.\"\"\"\n",
    "        self._parameters = []\n",
    "        self._modules = []\n",
    "    \n",
    "    def __setattr__(self, name, value):\n",
    "        \"\"\"\n",
    "        Intercept attribute assignment to auto-register parameters and modules.\n",
    "        \n",
    "        When you do self.weight = Parameter(...), this automatically adds\n",
    "        the parameter to our collection for easy optimization.\n",
    "        \"\"\"\n",
    "        # Check if it's a tensor that needs gradients (a parameter)\n",
    "        if hasattr(value, 'requires_grad') and value.requires_grad:\n",
    "            self._parameters.append(value)\n",
    "        # Check if it's another Module (sub-module)\n",
    "        elif isinstance(value, Module):\n",
    "            self._modules.append(value)\n",
    "        \n",
    "        # Always call parent to actually set the attribute\n",
    "        super().__setattr__(name, value)\n",
    "    \n",
    "    def parameters(self):\n",
    "        \"\"\"\n",
    "        Recursively collect all parameters from this module and sub-modules.\n",
    "        \n",
    "        Returns:\n",
    "            List of all parameters (Tensors with requires_grad=True)\n",
    "            \n",
    "        This enables: optimizer = Adam(model.parameters())\n",
    "        \"\"\"\n",
    "        # Start with our own parameters\n",
    "        params = list(self._parameters)\n",
    "        \n",
    "        # Add parameters from sub-modules recursively\n",
    "        for module in self._modules:\n",
    "            params.extend(module.parameters())\n",
    "            \n",
    "        return params\n",
    "    \n",
    "    def __call__(self, *args, **kwargs):\n",
    "        \"\"\"\n",
    "        Makes modules callable: model(x) instead of model.forward(x).\n",
    "        \n",
    "        This is the magic that enables clean syntax like:\n",
    "            output = model(input)\n",
    "        instead of:\n",
    "            output = model.forward(input)\n",
    "        \"\"\"\n",
    "        return self.forward(*args, **kwargs)\n",
    "    \n",
    "    def forward(self, *args, **kwargs):\n",
    "        \"\"\"\n",
    "        Forward pass - must be implemented by subclasses.\n",
    "        \n",
    "        This is where the actual computation happens. Every layer\n",
    "        defines its own forward() method.\n",
    "        \"\"\"\n",
    "        raise NotImplementedError(\"Subclasses must implement forward()\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91f83f11",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## Where This Code Lives in the Final Package\n",
    "\n",
    "**Learning Side:** You work in modules/source/04_layers/layers_dev.py  \n",
    "**Building Side:** Code exports to tinytorch.core.layers\n",
    "\n",
    "```python\n",
    "# Final package structure:\n",
    "from tinytorch.core.layers import Dense, matmul  # All layer types together!\n",
    "from tinytorch.core.tensor import Tensor  # The foundation\n",
    "from tinytorch.core.activations import ReLU, Sigmoid  # Nonlinearity\n",
    "```\n",
    "\n",
    "**Why this matters:**\n",
    "- **Learning:** Focused modules for deep understanding\n",
    "- **Production:** Proper organization like PyTorch's torch.nn.Linear\n",
    "- **Consistency:** All layer types live together in core.layers\n",
    "- **Integration:** Works seamlessly with tensors and activations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d1cbf04",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "# Matrix Multiplication - The Heart of Neural Networks\n",
    "\n",
    "Every neural network operation ultimately reduces to matrix multiplication. Let's build the foundation that powers everything from simple perceptrons to transformers.\n",
    "\n",
    "## Why Matrix Multiplication Matters\n",
    "\n",
    "🧠 **Neural Network Core**: Every layer applies: output = input @ weights + bias  \n",
    "⚡ **Parallel Processing**: Matrix ops utilize vectorized CPU instructions and GPU parallelism  \n",
    "🏗️ **Scalable Architecture**: Stacking matrix operations creates arbitrarily complex function approximators  \n",
    "📈 **Performance Critical**: 90%+ of neural network compute time is spent in matrix multiplication  \n",
    "\n",
    "## Learning Objectives\n",
    "By implementing matrix multiplication, you'll understand:\n",
    "- How neural networks transform data through linear algebra\n",
    "- Why matrix operations are the building blocks of all modern ML frameworks\n",
    "- How proper implementation affects performance by orders of magnitude\n",
    "- The connection between mathematical operations and computational efficiency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adb83e78",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "matmul-implementation",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "def matmul(a: Tensor, b: Tensor) -> Tensor:\n",
    "    \"\"\"\n",
    "    Matrix multiplication for tensors.\n",
    "    \n",
    "    Args:\n",
    "        a: Left tensor (shape: ..., m, k)\n",
    "        b: Right tensor (shape: ..., k, n)\n",
    "    \n",
    "    Returns:\n",
    "        Result tensor (shape: ..., m, n)\n",
    "    \n",
    "    TODO: Implement matrix multiplication using numpy's @ operator.\n",
    "    \n",
    "    STEP-BY-STEP IMPLEMENTATION:\n",
    "    1. Extract numpy arrays from both tensors using .data\n",
    "    2. Perform matrix multiplication: result_data = a_data @ b_data\n",
    "    3. Wrap result in a new Tensor and return\n",
    "    \n",
    "    LEARNING CONNECTIONS:\n",
    "    - This is the core operation in Dense layers: output = input @ weights\n",
    "    - PyTorch uses optimized BLAS libraries for this operation\n",
    "    - GPU implementations parallelize this across thousands of cores\n",
    "    - Understanding this operation is key to neural network performance\n",
    "    \n",
    "    EXAMPLE:\n",
    "    ```python\n",
    "    a = Tensor([[1, 2], [3, 4]])  # shape (2, 2)\n",
    "    b = Tensor([[5, 6], [7, 8]])  # shape (2, 2)\n",
    "    result = matmul(a, b)\n",
    "    # result.data = [[19, 22], [43, 50]]\n",
    "    ```\n",
    "    \n",
    "    IMPLEMENTATION HINTS:\n",
    "    - Use the @ operator for clean matrix multiplication\n",
    "    - Ensure you return a Tensor, not a numpy array\n",
    "    - The operation should work for any compatible matrix shapes\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Check if we're dealing with Variables (autograd) or plain Tensors\n",
    "    a_is_variable = hasattr(a, 'requires_grad') and hasattr(a, 'grad_fn')\n",
    "    b_is_variable = hasattr(b, 'requires_grad') and hasattr(b, 'grad_fn')\n",
    "    \n",
    "    # Extract numpy data appropriately\n",
    "    if a_is_variable:\n",
    "        a_data = a.data.data  # Variable.data is a Tensor, so .data.data gets numpy array\n",
    "    else:\n",
    "        a_data = a.data  # Tensor.data is numpy array directly\n",
    "    \n",
    "    if b_is_variable:\n",
    "        b_data = b.data.data\n",
    "    else:\n",
    "        b_data = b.data\n",
    "    \n",
    "    # Perform matrix multiplication\n",
    "    result_data = a_data @ b_data\n",
    "    \n",
    "    # If any input is a Variable, return Variable with gradient tracking\n",
    "    if a_is_variable or b_is_variable:\n",
    "        # Import Variable locally to avoid circular imports\n",
    "        if 'Variable' not in globals():\n",
    "            try:\n",
    "                from tinytorch.core.autograd import Variable\n",
    "            except ImportError:\n",
    "                from autograd_dev import Variable\n",
    "        \n",
    "        # Create gradient function for matrix multiplication\n",
    "        def grad_fn(grad_output):\n",
    "            # Matrix multiplication backward pass:\n",
    "            # If C = A @ B, then:\n",
    "            # dA = grad_output @ B^T\n",
    "            # dB = A^T @ grad_output\n",
    "            \n",
    "            if a_is_variable and a.requires_grad:\n",
    "                # Gradient w.r.t. A: grad_output @ B^T\n",
    "                grad_a_data = grad_output.data.data @ b_data.T\n",
    "                a.backward(Variable(grad_a_data))\n",
    "            \n",
    "            if b_is_variable and b.requires_grad:\n",
    "                # Gradient w.r.t. B: A^T @ grad_output  \n",
    "                grad_b_data = a_data.T @ grad_output.data.data\n",
    "                b.backward(Variable(grad_b_data))\n",
    "        \n",
    "        # Determine if result should require gradients\n",
    "        requires_grad = (a_is_variable and a.requires_grad) or (b_is_variable and b.requires_grad)\n",
    "        \n",
    "        return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)\n",
    "    else:\n",
    "        # Both inputs are Tensors, return Tensor (backward compatible)\n",
    "        return Tensor(result_data)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7691910",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Testing Matrix Multiplication\n",
    "\n",
    "Let's verify our matrix multiplication works correctly with some test cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d10bd1ed",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-matmul",
     "locked": true,
     "points": 2,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_matmul():\n",
    "    \"\"\"Test matrix multiplication implementation.\"\"\"\n",
    "    print(\"🧪 Testing Matrix Multiplication...\")\n",
    "    \n",
    "    # Test case 1: Simple 2x2 matrices\n",
    "    a = Tensor([[1, 2], [3, 4]])\n",
    "    b = Tensor([[5, 6], [7, 8]])\n",
    "    result = matmul(a, b)\n",
    "    expected = np.array([[19, 22], [43, 50]])\n",
    "    \n",
    "    assert np.allclose(result.data, expected), f\"Expected {expected}, got {result.data}\"\n",
    "    print(\"✅ 2x2 matrix multiplication\")\n",
    "    \n",
    "    # Test case 2: Non-square matrices\n",
    "    a = Tensor([[1, 2, 3], [4, 5, 6]])  # 2x3\n",
    "    b = Tensor([[7, 8], [9, 10], [11, 12]])  # 3x2\n",
    "    result = matmul(a, b)\n",
    "    expected = np.array([[58, 64], [139, 154]])\n",
    "    \n",
    "    assert np.allclose(result.data, expected), f\"Expected {expected}, got {result.data}\"\n",
    "    print(\"✅ Non-square matrix multiplication\")\n",
    "    \n",
    "    # Test case 3: Vector-matrix multiplication\n",
    "    a = Tensor([[1, 2, 3]])  # 1x3 (row vector)\n",
    "    b = Tensor([[4], [5], [6]])  # 3x1 (column vector)\n",
    "    result = matmul(a, b)\n",
    "    expected = np.array([[32]])  # 1*4 + 2*5 + 3*6 = 32\n",
    "    \n",
    "    assert np.allclose(result.data, expected), f\"Expected {expected}, got {result.data}\"\n",
    "    print(\"✅ Vector-matrix multiplication\")\n",
    "    \n",
    "    print(\"🎉 All matrix multiplication tests passed!\")\n",
    "\n",
    "test_matmul()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f512ed2",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "# Dense Layer - The Fundamental Neural Network Component\n",
    "\n",
    "Dense layers (also called Linear or Fully Connected layers) are the building blocks of neural networks. They apply the transformation: **output = input @ weights + bias**\n",
    "\n",
    "## Why Dense Layers Matter\n",
    "\n",
    "🧠 **Universal Function Approximators**: Dense layers can approximate any continuous function when stacked  \n",
    "🔧 **Parameter Learning**: Weights and biases are learned through backpropagation  \n",
    "🏗️ **Modular Design**: Dense layers compose into complex architectures (MLPs, transformers, etc.)  \n",
    "⚡ **Computational Efficiency**: Matrix operations leverage optimized linear algebra libraries  \n",
    "\n",
    "## Learning Objectives\n",
    "By implementing Dense layers, you'll understand:\n",
    "- How neural networks learn through adjustable parameters\n",
    "- The mathematical foundation underlying all neural network layers\n",
    "- Why proper parameter initialization is crucial for training success\n",
    "- How layer composition enables complex function approximation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5b4e929",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "dense-implementation",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class Linear(Module):\n",
    "    \"\"\"\n",
    "    Linear (Fully Connected) Layer implementation.\n",
    "    \n",
    "    Applies the transformation: output = input @ weights + bias\n",
    "    \n",
    "    Inherits from Module for automatic parameter management and clean API.\n",
    "    This is PyTorch's nn.Linear equivalent with the same name for familiarity.\n",
    "    \n",
    "    Features:\n",
    "    - Automatic parameter registration (weights and bias)\n",
    "    - Clean call interface: layer(input) instead of layer.forward(input)\n",
    "    - Works with optimizers via model.parameters()\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, input_size: int, output_size: int, use_bias: bool = True):\n",
    "        \"\"\"\n",
    "        Initialize Linear layer with random weights and optional bias.\n",
    "        \n",
    "        Args:\n",
    "            input_size: Number of input features\n",
    "            output_size: Number of output features  \n",
    "            use_bias: Whether to include bias term\n",
    "        \n",
    "        TODO: Implement Linear layer initialization.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Store input_size and output_size as instance variables\n",
    "        2. Initialize weights as Tensor with shape (input_size, output_size)\n",
    "        3. Use small random values: np.random.randn(...) * 0.1\n",
    "        4. Initialize bias as Tensor with shape (output_size,) if use_bias is True\n",
    "        5. Set bias to None if use_bias is False\n",
    "        \n",
    "        LEARNING CONNECTIONS:\n",
    "        - Small random initialization prevents symmetry breaking\n",
    "        - Weight shape (input_size, output_size) enables matrix multiplication\n",
    "        - Bias allows shifting the output (like y-intercept in linear regression)\n",
    "        - PyTorch uses more sophisticated initialization (Xavier, Kaiming)\n",
    "        \n",
    "        IMPLEMENTATION HINTS:\n",
    "        - Use np.random.randn() for Gaussian random numbers\n",
    "        - Scale by 0.1 to keep initial values small\n",
    "        - Remember to wrap numpy arrays in Tensor()\n",
    "        - Store use_bias flag for forward pass logic\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        super().__init__()  # Initialize Module base class\n",
    "        \n",
    "        self.input_size = input_size\n",
    "        self.output_size = output_size\n",
    "        self.use_bias = use_bias\n",
    "        \n",
    "        # Initialize weights with small random values using Parameter\n",
    "        # Shape: (input_size, output_size) for matrix multiplication\n",
    "        weight_data = np.random.randn(input_size, output_size) * 0.1\n",
    "        self.weights = Parameter(weight_data)  # Auto-registers for optimization!\n",
    "        \n",
    "        # Initialize bias if requested\n",
    "        if use_bias:\n",
    "            bias_data = np.random.randn(output_size) * 0.1\n",
    "            self.bias = Parameter(bias_data)  # Auto-registers for optimization!\n",
    "        else:\n",
    "            self.bias = None\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def forward(self, x: Union[Tensor, 'Variable']) -> Union[Tensor, 'Variable']:\n",
    "        \"\"\"\n",
    "        Forward pass through the Linear layer.\n",
    "        \n",
    "        Args:\n",
    "            x: Input tensor or Variable (shape: ..., input_size)\n",
    "        \n",
    "        Returns:\n",
    "            Output tensor or Variable (shape: ..., output_size)\n",
    "            Preserves Variable type for gradient tracking in training\n",
    "        \n",
    "        TODO: Implement autograd-aware forward pass: output = input @ weights + bias\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Perform matrix multiplication: output = matmul(x, self.weights)\n",
    "        2. If bias exists, add it appropriately based on input type\n",
    "        3. Preserve Variable type for gradient tracking if input is Variable\n",
    "        4. Return result maintaining autograd capabilities\n",
    "        \n",
    "        AUTOGRAD CONSIDERATIONS:\n",
    "        - If x is Variable: weights and bias should also be Variables for training\n",
    "        - Preserve gradient tracking through the entire computation\n",
    "        - Enable backpropagation through this layer's parameters\n",
    "        - Handle mixed Tensor/Variable scenarios gracefully\n",
    "        \n",
    "        LEARNING CONNECTIONS:\n",
    "        - This is the core neural network transformation\n",
    "        - Matrix multiplication scales input features to output features  \n",
    "        - Bias provides offset (like y-intercept in linear equations)\n",
    "        - Broadcasting handles different batch sizes automatically\n",
    "        - Autograd support enables automatic parameter optimization\n",
    "        \n",
    "        IMPLEMENTATION HINTS:\n",
    "        - Use the matmul function you implemented above (now autograd-aware)\n",
    "        - Handle bias addition based on input/output types\n",
    "        - Variables support + operator for gradient-tracked addition\n",
    "        - Check if self.bias is not None before adding\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Matrix multiplication: input @ weights (now autograd-aware)\n",
    "        output = matmul(x, self.weights)\n",
    "        \n",
    "        # Add bias if it exists\n",
    "        # The addition will preserve Variable type if output is Variable\n",
    "        if self.bias is not None:\n",
    "            # Check if we need Variable-aware addition\n",
    "            if hasattr(output, 'requires_grad'):\n",
    "                # output is a Variable, use Variable addition\n",
    "                if hasattr(self.bias, 'requires_grad'):\n",
    "                    # bias is also Variable, direct addition works\n",
    "                    output = output + self.bias\n",
    "                else:\n",
    "                    # bias is Tensor, convert to Variable for addition\n",
    "                    # Import Variable if not already available\n",
    "                    if 'Variable' not in globals():\n",
    "                        try:\n",
    "                            from tinytorch.core.autograd import Variable\n",
    "                        except ImportError:\n",
    "                            from autograd_dev import Variable\n",
    "                    \n",
    "                    bias_var = Variable(self.bias.data, requires_grad=False)\n",
    "                    output = output + bias_var\n",
    "            else:\n",
    "                # output is Tensor, use regular addition\n",
    "                output = output + self.bias\n",
    "        \n",
    "        return output\n",
    "        ### END SOLUTION\n",
    "\n",
    "# Backward compatibility alias\n",
    "#| export  \n",
    "Dense = Linear"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df5cd843",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Testing Linear Layer\n",
    "\n",
    "Let's verify our Linear layer works correctly with comprehensive tests.\n",
    "The tests use Dense for backward compatibility, but Dense is now an alias for Linear."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "385374fa",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-dense",
     "locked": true,
     "points": 3,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_dense_layer():\n",
    "    \"\"\"Test Dense layer implementation.\"\"\"\n",
    "    print(\"🧪 Testing Dense Layer...\")\n",
    "    \n",
    "    # Test case 1: Basic functionality\n",
    "    layer = Dense(input_size=3, output_size=2)\n",
    "    input_tensor = Tensor([[1.0, 2.0, 3.0]])  # Shape: (1, 3)\n",
    "    output = layer.forward(input_tensor)\n",
    "    \n",
    "    # Check output shape\n",
    "    assert output.shape == (1, 2), f\"Expected shape (1, 2), got {output.shape}\"\n",
    "    print(\"✅ Output shape correct\")\n",
    "    \n",
    "    # Test case 2: No bias\n",
    "    layer_no_bias = Dense(input_size=2, output_size=3, use_bias=False)\n",
    "    assert layer_no_bias.bias is None, \"Bias should be None when use_bias=False\"\n",
    "    print(\"✅ No bias option works\")\n",
    "    \n",
    "    # Test case 3: Multiple samples (batch processing)\n",
    "    batch_input = Tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # Shape: (3, 2)\n",
    "    layer_batch = Dense(input_size=2, output_size=2)\n",
    "    batch_output = layer_batch.forward(batch_input)\n",
    "    \n",
    "    assert batch_output.shape == (3, 2), f\"Expected shape (3, 2), got {batch_output.shape}\"\n",
    "    print(\"✅ Batch processing works\")\n",
    "    \n",
    "    # Test case 4: Callable interface\n",
    "    callable_output = layer_batch(batch_input)\n",
    "    assert np.allclose(callable_output.data, batch_output.data), \"Callable interface should match forward()\"\n",
    "    print(\"✅ Callable interface works\")\n",
    "    \n",
    "    # Test case 5: Parameter initialization\n",
    "    layer_init = Dense(input_size=10, output_size=5)\n",
    "    assert layer_init.weights.shape == (10, 5), f\"Expected weights shape (10, 5), got {layer_init.weights.shape}\"\n",
    "    assert layer_init.bias.shape == (5,), f\"Expected bias shape (5,), got {layer_init.bias.shape}\"\n",
    "    \n",
    "    # Check that weights are reasonably small (good initialization)\n",
    "    assert np.abs(layer_init.weights.data).mean() < 1.0, \"Weights should be small for good initialization\"\n",
    "    print(\"✅ Parameter initialization correct\")\n",
    "    \n",
    "    print(\"🎉 All Dense layer tests passed!\")\n",
    "\n",
    "test_dense_layer()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f9bb46b",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Testing Autograd Integration\n",
    "\n",
    "Now let's test that our Dense layer works correctly with Variables for gradient tracking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df791018",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-dense-autograd",
     "locked": true,
     "points": 3,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_dense_layer_autograd():\n",
    "    \"\"\"Test Dense layer with autograd Variable support.\"\"\"\n",
    "    print(\"🧪 Testing Dense Layer Autograd Integration...\")\n",
    "    \n",
    "    try:\n",
    "        # Import Variable locally to handle import issues\n",
    "        try:\n",
    "            from tinytorch.core.autograd import Variable\n",
    "        except ImportError:\n",
    "            sys.path.append(os.path.join(os.path.dirname(__file__), '..', '09_autograd'))\n",
    "            from autograd_dev import Variable\n",
    "        \n",
    "        # Test case 1: Variable input with Tensor weights (inference mode)\n",
    "        layer = Dense(input_size=3, output_size=2)\n",
    "        variable_input = Variable([[1.0, 2.0, 3.0]], requires_grad=True)\n",
    "        output = layer.forward(variable_input)\n",
    "        \n",
    "        # Check that output is Variable and preserves gradient tracking\n",
    "        assert hasattr(output, 'requires_grad'), \"Output should be Variable with gradient tracking\"\n",
    "        assert output.shape == (1, 2), f\"Expected shape (1, 2), got {output.shape}\"\n",
    "        print(\"✅ Variable input preserves gradient tracking\")\n",
    "        \n",
    "        # Test case 2: Variable weights for training\n",
    "        # Convert weights and bias to Variables for training\n",
    "        layer_trainable = Dense(input_size=2, output_size=2)\n",
    "        layer_trainable.weights = Variable(layer_trainable.weights.data, requires_grad=True)\n",
    "        layer_trainable.bias = Variable(layer_trainable.bias.data, requires_grad=True)\n",
    "        \n",
    "        variable_input_2 = Variable([[1.0, 2.0]], requires_grad=True)\n",
    "        output_2 = layer_trainable.forward(variable_input_2)\n",
    "        \n",
    "        assert hasattr(output_2, 'requires_grad'), \"Output should support gradients\"\n",
    "        assert output_2.requires_grad, \"Output should require gradients when weights require gradients\"\n",
    "        print(\"✅ Variable weights enable training mode\")\n",
    "        \n",
    "        # Test case 3: Gradient flow through Dense layer\n",
    "        # Simple backward pass to check gradient computation\n",
    "        try:\n",
    "            # Create a simple loss (sum of outputs)\n",
    "            loss = Variable(np.sum(output_2.data.data))\n",
    "            loss.backward()\n",
    "            \n",
    "            # Check that gradients were computed\n",
    "            assert layer_trainable.weights.grad is not None, \"Weights should have gradients\"\n",
    "            assert layer_trainable.bias.grad is not None, \"Bias should have gradients\"\n",
    "            assert variable_input_2.grad is not None, \"Input should have gradients\"\n",
    "            print(\"✅ Gradient computation works\")\n",
    "        except Exception as e:\n",
    "            print(f\"⚠️  Gradient computation test skipped: {e}\")\n",
    "            print(\"   (This is expected if full autograd integration isn't complete yet)\")\n",
    "        \n",
    "        # Test case 4: Mixed Tensor/Variable scenarios\n",
    "        tensor_input = Tensor([[1.0, 2.0, 3.0]])\n",
    "        variable_layer = Dense(input_size=3, output_size=2)\n",
    "        mixed_output = variable_layer.forward(tensor_input)\n",
    "        \n",
    "        assert isinstance(mixed_output, Tensor), \"Tensor input should produce Tensor output\"\n",
    "        print(\"✅ Mixed Tensor/Variable handling works\")\n",
    "        \n",
    "        # Test case 5: Batch processing with Variables\n",
    "        batch_variable_input = Variable([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], requires_grad=True)\n",
    "        batch_layer = Dense(input_size=2, output_size=2)\n",
    "        batch_variable_output = batch_layer.forward(batch_variable_input)\n",
    "        \n",
    "        assert batch_variable_output.shape == (3, 2), f\"Expected batch shape (3, 2), got {batch_variable_output.shape}\"\n",
    "        assert hasattr(batch_variable_output, 'requires_grad'), \"Batch output should support gradients\"\n",
    "        print(\"✅ Batch processing with Variables works\")\n",
    "        \n",
    "        print(\"🎉 All Dense layer autograd tests passed!\")\n",
    "        \n",
    "    except ImportError as e:\n",
    "        print(f\"⚠️  Autograd tests skipped: {e}\")\n",
    "        print(\"   (Variable class not available - this is expected during development)\")\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Autograd test failed: {e}\")\n",
    "        print(\"   (This indicates an implementation issue that needs fixing)\")\n",
    "\n",
    "test_dense_layer_autograd()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f047fbc8",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "# Systems Analysis: Memory and Performance Characteristics\n",
    "\n",
    "Let's analyze the memory usage and computational complexity of our layer implementations.\n",
    "\n",
    "## Memory Analysis\n",
    "- **Dense Layer Storage**: input_size × output_size weights + output_size bias terms\n",
    "- **Forward Pass Memory**: Input tensor + weight tensor + output tensor (temporary storage)\n",
    "- **Scaling Behavior**: Memory grows quadratically with layer size\n",
    "\n",
    "## Computational Complexity\n",
    "- **Matrix Multiplication**: O(batch_size × input_size × output_size)\n",
    "- **Bias Addition**: O(batch_size × output_size)\n",
    "- **Total**: Dominated by matrix multiplication for large layers\n",
    "\n",
    "## Production Insights\n",
    "In production ML systems:\n",
    "- **Memory Management**: PyTorch uses memory pools to avoid frequent allocation/deallocation\n",
    "- **Compute Optimization**: BLAS libraries (MKL, OpenBLAS) optimize matrix operations for specific hardware\n",
    "- **GPU Acceleration**: CUDA kernels parallelize matrix operations across thousands of cores\n",
    "- **Mixed Precision**: Using float16 instead of float32 can halve memory usage with minimal accuracy loss"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7825066",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "memory-analysis",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def analyze_layer_memory():\n",
    "    \"\"\"Analyze memory usage of different layer sizes.\"\"\"\n",
    "    print(\"📊 Layer Memory Analysis\")\n",
    "    print(\"=\" * 40)\n",
    "    \n",
    "    layer_sizes = [(10, 10), (100, 100), (1000, 1000), (784, 128), (128, 10)]\n",
    "    \n",
    "    for input_size, output_size in layer_sizes:\n",
    "        # Calculate parameter count\n",
    "        weight_params = input_size * output_size\n",
    "        bias_params = output_size\n",
    "        total_params = weight_params + bias_params\n",
    "        \n",
    "        # Calculate memory usage (assuming float32 = 4 bytes)\n",
    "        memory_mb = total_params * 4 / (1024 * 1024)\n",
    "        \n",
    "        print(f\"  {input_size:4d} → {output_size:4d}: {total_params:,} params, {memory_mb:.3f} MB\")\n",
    "    \n",
    "    print(\"\\n🔍 Key Insights:\")\n",
    "    print(\"  • Memory grows quadratically with layer width\")\n",
    "    print(\"  • Large layers (1000×1000) use significant memory\")\n",
    "    print(\"  • Modern networks balance width vs depth for efficiency\")\n",
    "\n",
    "analyze_layer_memory()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c04fb2c",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "# ML Systems Thinking: Interactive Questions\n",
    "\n",
    "Let's explore the deeper implications of our layer implementations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ce66a00",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "systems-thinking",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def explore_layer_scaling():\n",
    "    \"\"\"Explore how layer operations scale with size.\"\"\"\n",
    "    print(\"🤔 Scaling Analysis: Matrix Multiplication Performance\")\n",
    "    print(\"=\" * 55)\n",
    "    \n",
    "    sizes = [64, 128, 256, 512]\n",
    "    \n",
    "    for size in sizes:\n",
    "        # Estimate FLOPs for square matrix multiplication\n",
    "        flops = 2 * size * size * size  # 2 operations per multiply-add\n",
    "        \n",
    "        # Estimate memory bandwidth (reading A, B, writing C)\n",
    "        memory_ops = 3 * size * size  # Elements read/written\n",
    "        memory_mb = memory_ops * 4 / (1024 * 1024)  # float32 = 4 bytes\n",
    "        \n",
    "        print(f\"  Size {size:3d}×{size:3d}: {flops/1e6:.1f} MFLOPS, {memory_mb:.2f} MB transfers\")\n",
    "    \n",
    "    print(\"\\n💡 Performance Insights:\")\n",
    "    print(\"  • FLOPs grow cubically (O(n³)) with matrix size\")\n",
    "    print(\"  • Memory bandwidth grows quadratically (O(n²))\")\n",
    "    print(\"  • Large matrices become memory-bound, not compute-bound\")\n",
    "    print(\"  • This is why GPUs excel: high memory bandwidth + parallel compute\")\n",
    "\n",
    "explore_layer_scaling()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2de5338c",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🤔 ML Systems Thinking: Interactive Questions\n",
    "\n",
    "Now that you've implemented the core components, let's think about their implications for ML systems:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9dc3f4e",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "question-1",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# Question 1: Memory vs Computation Trade-offs\n",
    "\"\"\"\n",
    "🤔 **Question 1: Memory vs Computation Analysis**\n",
    "\n",
    "You're designing a neural network for deployment on a mobile device with limited memory (1GB RAM) but decent compute power.\n",
    "\n",
    "You have two architecture options:\n",
    "A) Wide network: 784 → 2048 → 2048 → 10 (3 layers, wide)\n",
    "B) Deep network: 784 → 256 → 256 → 256 → 256 → 10 (5 layers, narrow)\n",
    "\n",
    "Calculate the memory requirements for each option and explain which you'd choose for mobile deployment and why.\n",
    "\n",
    "Consider:\n",
    "- Parameter storage requirements\n",
    "- Intermediate activation storage during forward pass\n",
    "- Training vs inference memory requirements\n",
    "- How your choice affects model capacity and accuracy\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13d2171d",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "question-2",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# Question 2: Performance Optimization\n",
    "\"\"\"\n",
    "🤔 **Question 2: Production Performance Optimization**\n",
    "\n",
    "Your Dense layer implementation works correctly, but you notice it's slower than PyTorch's nn.Linear on the same hardware.\n",
    "\n",
    "Investigate and explain:\n",
    "1. Why might our implementation be slower? (Hint: think about underlying linear algebra libraries)\n",
    "2. What optimization techniques do production frameworks use?\n",
    "3. How would you modify our implementation to approach production performance?\n",
    "4. When might our simple implementation actually be preferable?\n",
    "\n",
    "Research areas to consider:\n",
    "- BLAS (Basic Linear Algebra Subprograms) libraries\n",
    "- Memory layout and cache efficiency\n",
    "- Vectorization and SIMD instructions\n",
    "- GPU kernel optimization\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc136b4d",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "question-3",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# Question 3: Scaling and Architecture Design\n",
    "\"\"\"\n",
    "🤔 **Question 3: Systems Architecture Scaling**\n",
    "\n",
    "Modern transformer models like GPT-3 have billions of parameters, primarily in Dense layers.\n",
    "\n",
    "Analyze the scaling challenges:\n",
    "1. How does memory requirement scale with model size? Calculate the memory needed for a 175B parameter model.\n",
    "2. What are the computational bottlenecks during training vs inference?\n",
    "3. How do systems like distributed training address these scaling challenges?\n",
    "4. Why do large models use techniques like gradient checkpointing and model parallelism?\n",
    "\n",
    "Systems considerations:\n",
    "- Memory hierarchy (L1/L2/L3 cache, RAM, storage)\n",
    "- Network bandwidth for distributed training\n",
    "- GPU memory constraints and model sharding\n",
    "- Inference optimization for production serving\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "264e2bd3",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "# Comprehensive Testing and Integration\n",
    "\n",
    "Let's run a comprehensive test suite to verify all our implementations work correctly together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e45d1bfe",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "comprehensive-tests",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def run_comprehensive_tests():\n",
    "    \"\"\"Run comprehensive tests of all layer functionality.\"\"\"\n",
    "    print(\"🔬 Comprehensive Layer Testing Suite\")\n",
    "    print(\"=\" * 45)\n",
    "    \n",
    "    # Test 1: Matrix multiplication edge cases\n",
    "    print(\"\\n1. Matrix Multiplication Edge Cases:\")\n",
    "    \n",
    "    # Single element\n",
    "    a = Tensor([[5]])\n",
    "    b = Tensor([[3]])\n",
    "    result = matmul(a, b)\n",
    "    assert result.data[0, 0] == 15, \"Single element multiplication failed\"\n",
    "    print(\"   ✅ Single element multiplication\")\n",
    "    \n",
    "    # Identity matrix\n",
    "    identity = Tensor([[1, 0], [0, 1]])\n",
    "    test_matrix = Tensor([[2, 3], [4, 5]])\n",
    "    result = matmul(test_matrix, identity)\n",
    "    assert np.allclose(result.data, test_matrix.data), \"Identity multiplication failed\"\n",
    "    print(\"   ✅ Identity matrix multiplication\")\n",
    "    \n",
    "    # Test 2: Dense layer composition\n",
    "    print(\"\\n2. Dense Layer Composition:\")\n",
    "    \n",
    "    # Create a simple 2-layer network\n",
    "    layer1 = Dense(4, 3)\n",
    "    layer2 = Dense(3, 2)\n",
    "    \n",
    "    # Test data flow\n",
    "    input_data = Tensor([[1, 2, 3, 4]])\n",
    "    hidden = layer1(input_data)\n",
    "    output = layer2(hidden)\n",
    "    \n",
    "    assert output.shape == (1, 2), f\"Expected final output shape (1, 2), got {output.shape}\"\n",
    "    print(\"   ✅ Multi-layer composition\")\n",
    "    \n",
    "    # Test 3: Batch processing\n",
    "    print(\"\\n3. Batch Processing:\")\n",
    "    \n",
    "    batch_size = 10\n",
    "    batch_input = Tensor(np.random.randn(batch_size, 4))\n",
    "    batch_hidden = layer1(batch_input)\n",
    "    batch_output = layer2(batch_hidden)\n",
    "    \n",
    "    assert batch_output.shape == (batch_size, 2), f\"Expected batch output shape ({batch_size}, 2), got {batch_output.shape}\"\n",
    "    print(\"   ✅ Batch processing\")\n",
    "    \n",
    "    # Test 4: Parameter access and modification\n",
    "    print(\"\\n4. Parameter Management:\")\n",
    "    \n",
    "    layer = Dense(5, 3)\n",
    "    original_weights = layer.weights.data.copy()\n",
    "    \n",
    "    # Simulate parameter update\n",
    "    layer.weights = Tensor(original_weights + 0.1)\n",
    "    \n",
    "    assert not np.allclose(layer.weights.data, original_weights), \"Parameter update failed\"\n",
    "    print(\"   ✅ Parameter modification\")\n",
    "    \n",
    "    print(\"\\n🎉 All comprehensive tests passed!\")\n",
    "    print(\"   Your layer implementations are ready for neural network construction!\")\n",
    "\n",
    "run_comprehensive_tests()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b9bc103",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Autograd Integration Demo\n",
    "\n",
    "Let's demonstrate how the Dense layer now works seamlessly with autograd Variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9d3d3c8",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "autograd-demo",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def demonstrate_autograd_integration():\n",
    "    \"\"\"Demonstrate Dense layer working with autograd Variables.\"\"\"\n",
    "    print(\"🔥 Dense Layer Autograd Integration Demo\")\n",
    "    print(\"=\" * 50)\n",
    "    \n",
    "    try:\n",
    "        # Import Variable\n",
    "        try:\n",
    "            from tinytorch.core.autograd import Variable\n",
    "        except ImportError:\n",
    "            sys.path.append(os.path.join(os.path.dirname(__file__), '..', '09_autograd'))\n",
    "            from autograd_dev import Variable\n",
    "        \n",
    "        print(\"\\n1. Creating trainable Dense layer:\")\n",
    "        layer = Dense(input_size=3, output_size=2)\n",
    "        \n",
    "        # Convert to trainable parameters (Variables)\n",
    "        layer.weights = Variable(layer.weights.data, requires_grad=True)\n",
    "        layer.bias = Variable(layer.bias.data, requires_grad=True)\n",
    "        \n",
    "        print(f\"   Weights shape: {layer.weights.shape}\")\n",
    "        print(f\"   Weights require grad: {layer.weights.requires_grad}\")\n",
    "        print(f\"   Bias shape: {layer.bias.shape}\")\n",
    "        print(f\"   Bias require grad: {layer.bias.requires_grad}\")\n",
    "        \n",
    "        print(\"\\n2. Forward pass with Variable input:\")\n",
    "        x = Variable([[1.0, 2.0, 3.0]], requires_grad=True)\n",
    "        print(f\"   Input: {x.data.data.tolist()}\")\n",
    "        \n",
    "        y = layer(x)\n",
    "        print(f\"   Output shape: {y.shape}\")\n",
    "        print(f\"   Output requires grad: {y.requires_grad}\")\n",
    "        print(f\"   Output values: {y.data.data.tolist()}\")\n",
    "        \n",
    "        print(\"\\n3. Backward pass demonstration:\")\n",
    "        try:\n",
    "            # Simple loss: sum of all outputs\n",
    "            loss = Variable(np.sum(y.data.data))\n",
    "            print(f\"   Loss: {loss.data.data}\")\n",
    "            \n",
    "            # Clear gradients\n",
    "            layer.weights.zero_grad()\n",
    "            layer.bias.zero_grad() \n",
    "            x.zero_grad()\n",
    "            \n",
    "            # Backward pass\n",
    "            loss.backward()\n",
    "            \n",
    "            print(f\"   Weight gradients computed: {layer.weights.grad is not None}\")\n",
    "            print(f\"   Bias gradients computed: {layer.bias.grad is not None}\")\n",
    "            print(f\"   Input gradients computed: {x.grad is not None}\")\n",
    "            \n",
    "            if layer.weights.grad is not None:\n",
    "                print(f\"   Weight gradient shape: {layer.weights.grad.shape}\")\n",
    "            if layer.bias.grad is not None:\n",
    "                print(f\"   Bias gradient shape: {layer.bias.grad.shape}\")\n",
    "            \n",
    "        except Exception as e:\n",
    "            print(f\"   ⚠️ Backward pass demo limited: {e}\")\n",
    "        \n",
    "        print(\"\\n4. Backward compatibility with Tensors:\")\n",
    "        tensor_input = Tensor([[1.0, 2.0, 3.0]])\n",
    "        tensor_layer = Dense(input_size=3, output_size=2)\n",
    "        tensor_output = tensor_layer(tensor_input)\n",
    "        \n",
    "        print(f\"   Input type: {type(tensor_input).__name__}\")\n",
    "        print(f\"   Output type: {type(tensor_output).__name__}\")\n",
    "        print(\"   ✅ Tensor-only operations still work perfectly\")\n",
    "        \n",
    "        print(\"\\n🎉 Dense layer now supports both Tensors and Variables!\")\n",
    "        print(\"   • Tensors: Fast inference without gradient tracking\")\n",
    "        print(\"   • Variables: Full training with automatic differentiation\")\n",
    "        print(\"   • Seamless interoperability for different use cases\")\n",
    "        \n",
    "    except ImportError as e:\n",
    "        print(f\"⚠️ Autograd demo skipped: {e}\")\n",
    "        print(\"  (Variable class not available)\")\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Demo failed: {e}\")\n",
    "\n",
    "demonstrate_autograd_integration()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "deffd3a3",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Module Summary\n",
    "\n",
    "## 🎯 What You've Accomplished\n",
    "\n",
    "You've successfully implemented the fundamental building blocks of neural networks:\n",
    "\n",
    "### ✅ **Core Implementations**\n",
    "- **Matrix Multiplication**: The computational primitive underlying all neural network operations (now with autograd support)\n",
    "- **Dense Layer**: Complete implementation with proper parameter initialization, forward propagation, and Variable support\n",
    "- **Autograd Integration**: Seamless support for both Tensors (inference) and Variables (training with gradients)\n",
    "- **Composition Patterns**: How layers stack together to form complex function approximators\n",
    "\n",
    "### ✅ **Systems Understanding**\n",
    "- **Memory Analysis**: How layer size affects memory usage and why this matters for deployment\n",
    "- **Performance Characteristics**: Understanding computational complexity and scaling behavior\n",
    "- **Production Context**: Connection to real-world ML systems and optimization techniques\n",
    "\n",
    "### ✅ **ML Engineering Skills**\n",
    "- **Parameter Management**: How neural networks store and update learnable parameters\n",
    "- **Batch Processing**: Efficient handling of multiple data samples simultaneously\n",
    "- **Architecture Design**: Trade-offs between network width, depth, and resource requirements\n",
    "\n",
    "## 🔗 **Connection to Production ML Systems**\n",
    "\n",
    "Your implementations mirror the core concepts used in:\n",
    "- **PyTorch's nn.Linear**: Same mathematical operations with production optimizations\n",
    "- **TensorFlow's Dense layers**: Identical parameter structure and forward pass logic\n",
    "- **Transformer architectures**: Dense layers form the foundation of modern language models\n",
    "- **Computer vision models**: ConvNets use similar principles with spatial structure\n",
    "\n",
    "## 🚀 **What's Next**\n",
    "\n",
    "With solid layer implementations, you're ready to:\n",
    "- **Compose** these layers into complete neural networks\n",
    "- **Add** nonlinear activations to enable complex function approximation\n",
    "- **Implement** training algorithms to learn from data\n",
    "- **Scale** to larger, more sophisticated architectures\n",
    "\n",
    "## 💡 **Key Systems Insights**\n",
    "\n",
    "1. **Matrix multiplication is the computational bottleneck** in neural networks\n",
    "2. **Memory layout and access patterns** often matter more than raw compute power\n",
    "3. **Layer composition** is the fundamental abstraction for building complex ML systems\n",
    "4. **Parameter initialization and management** directly affects training success\n",
    "\n",
    "You now understand the mathematical and computational foundations that enable neural networks to learn complex patterns from data!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4d045ea",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "final-demo",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "if __name__ == \"__main__\":\n",
    "    print(\"🔥 TinyTorch Layers Module - Final Demo\")\n",
    "    print(\"=\" * 50)\n",
    "    \n",
    "    # Create a simple neural network architecture\n",
    "    print(\"\\n🏗️ Building a 3-layer neural network:\")\n",
    "    layer1 = Dense(784, 128)  # Input layer (like MNIST images)\n",
    "    layer2 = Dense(128, 64)   # Hidden layer\n",
    "    layer3 = Dense(64, 10)    # Output layer (10 classes)\n",
    "    \n",
    "    print(f\"  Layer 1: {layer1.input_size} → {layer1.output_size} ({layer1.weights.data.size:,} parameters)\")\n",
    "    print(f\"  Layer 2: {layer2.input_size} → {layer2.output_size} ({layer2.weights.data.size:,} parameters)\")\n",
    "    print(f\"  Layer 3: {layer3.input_size} → {layer3.output_size} ({layer3.weights.data.size:,} parameters)\")\n",
    "    \n",
    "    # Simulate forward pass\n",
    "    print(\"\\n🚀 Forward pass through network:\")\n",
    "    batch_size = 32\n",
    "    input_data = Tensor(np.random.randn(batch_size, 784))\n",
    "    \n",
    "    print(f\"  Input shape: {input_data.shape}\")\n",
    "    hidden1 = layer1(input_data)\n",
    "    print(f\"  After layer 1: {hidden1.shape}\")\n",
    "    hidden2 = layer2(hidden1)\n",
    "    print(f\"  After layer 2: {hidden2.shape}\")\n",
    "    output = layer3(hidden2)\n",
    "    print(f\"  Final output: {output.shape}\")\n",
    "    \n",
    "    # Calculate total parameters\n",
    "    total_params = (layer1.weights.data.size + layer1.bias.data.size + \n",
    "                   layer2.weights.data.size + layer2.bias.data.size +\n",
    "                   layer3.weights.data.size + layer3.bias.data.size)\n",
    "    \n",
    "    print(f\"\\n📊 Network Statistics:\")\n",
    "    print(f\"  Total parameters: {total_params:,}\")\n",
    "    print(f\"  Memory usage: ~{total_params * 4 / 1024 / 1024:.2f} MB (float32)\")\n",
    "    print(f\"  Forward pass: {batch_size} samples processed simultaneously\")\n",
    "    \n",
    "    print(\"\\n✅ Neural network construction complete!\")\n",
    "    print(\"Ready for activation functions and training algorithms!\")\n",
    "    \n",
    "    # Run autograd integration demo\n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    demonstrate_autograd_integration()"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}