# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.17.1
# ---

# %% [markdown]
"""
# Module 2: Layers - Neural Network Building Blocks

Welcome to the Layers module! This is where neural networks begin. You'll implement the fundamental building blocks that transform tensors.

## Learning Goals
- Understand layers as functions that transform tensors: `y = f(x)`
- Implement Dense layers with linear transformations: `y = Wx + b`
- Use activation functions from the activations module for nonlinearity
- See how neural networks are just function composition
- Build intuition before diving into training

## Build → Use → Understand
1. **Build**: Dense layers using activation functions as building blocks
2. **Use**: Transform tensors and see immediate results
3. **Understand**: How neural networks transform information

## Module Dependencies
This module builds on the **activations** module:
- **activations** → **layers** → **networks**
- Clean separation of concerns: math functions → layer building blocks → full networks
"""

# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/layers/layers_dev.py`
**Building Side:** Code exports to `tinytorch.core.layers`

```python
# Final package structure:
from tinytorch.core.layers import Dense, Conv2D  # All layers together!
from tinytorch.core.activations import ReLU, Sigmoid, Tanh
from tinytorch.core.tensor import Tensor
```

**Why this matters:**
- **Learning:** Focused modules for deep understanding
- **Production:** Proper organization like PyTorch's `torch.nn`
- **Consistency:** All layers (Dense, Conv2D) live together in `core.layers`
"""

# %%
#| default_exp core.layers

# Setup and imports
import numpy as np
import sys
from typing import Union, Optional, Callable
import math

# %%
#| export
import numpy as np
import math
import sys
from typing import Union, Optional, Callable

# Import from the main package (rock solid foundation)
from tinytorch.core.tensor import Tensor
from tinytorch.core.activations import ReLU, Sigmoid, Tanh

# print("🔥 TinyTorch Layers Module")
# print(f"NumPy version: {np.__version__}")
# print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
# print("Ready to build neural network layers!")

# %% [markdown]
"""
## Step 1: What is a Layer?

### Definition
A **layer** is a function that transforms tensors. Think of it as a mathematical operation that takes input data and produces output data:

```
Input Tensor → Layer → Output Tensor
```

### Why Layers Matter in Neural Networks
Layers are the fundamental building blocks of all neural networks because:

- **Modularity**: Each layer has a specific job (linear transformation, nonlinearity, etc.)
- **Composability**: Layers can be combined to create complex functions
- **Learnability**: Each layer has parameters that can be learned from data
- **Interpretability**: Different layers learn different features

### The Fundamental Insight
**Neural networks are just function composition!**

```
x → Layer1 → Layer2 → Layer3 → y
```

Each layer transforms the data, and the final output is the composition of all these transformations.
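
In code, composition is nothing more than nested function calls. Here is a conceptual sketch (the names `layer1`, `activation`, and `layer2` are placeholders; you will build the real `Dense` and `ReLU` versions below):

```python
# Conceptual sketch: a "network" is just function composition
h = layer1(x)        # first transformation
h = activation(h)    # nonlinearity
y = layer2(h)        # second transformation
# equivalently: y = layer2(activation(layer1(x)))
```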
### Real-World Examples
- **Dense Layer**: Learns linear relationships between features
- **Convolutional Layer**: Learns spatial patterns in images
- **Recurrent Layer**: Learns temporal patterns in sequences
- **Activation Layer**: Adds nonlinearity to make networks powerful

### Visual Intuition
```
Input: [1, 2, 3]   (3 features)

Dense Layer: y = Wx + b
Weights W: [[0.1, 0.2, 0.3],
            [0.4, 0.5, 0.6]]   (2×3 matrix)
Bias b:    [0.1, 0.2]          (2 values)

Output: [0.1*1 + 0.2*2 + 0.3*3 + 0.1,
         0.4*1 + 0.5*2 + 0.6*3 + 0.2]
      = [1.5, 3.4]
```

Let's start with the most important layer: **Dense** (also called Linear or Fully Connected).
"""

# %% [markdown]
"""
## Step 2: Understanding Matrix Multiplication

Before we build layers, let's understand the core operation: **matrix multiplication**. This is what powers all neural network computations.

### Why Matrix Multiplication Matters
- **Efficiency**: Process multiple inputs at once
- **Parallelization**: GPU acceleration works great with matrix operations
- **Batch processing**: Handle multiple samples simultaneously
- **Mathematical foundation**: Linear algebra is the language of neural networks

### The Math Behind It
For matrices A (m×n) and B (n×p), the result C (m×p) is:

```
C[i,j] = sum(A[i,k] * B[k,j] for k in range(n))
```

### Visual Example
```
A = [[1, 2],    B = [[5, 6],
     [3, 4]]         [7, 8]]

C = A @ B = [[1*5 + 2*7, 1*6 + 2*8],
             [3*5 + 4*7, 3*6 + 4*8]]
          = [[19, 22],
             [43, 50]]
```

Let's implement this step by step!
"""

# %%
#| export
def matmul_naive(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Naive matrix multiplication using explicit for-loops.

    This helps you understand what matrix multiplication really does!

    Args:
        A: Matrix of shape (m, n)
        B: Matrix of shape (n, p)

    Returns:
        Matrix of shape (m, p) where C[i,j] = sum(A[i,k] * B[k,j] for k in range(n))

    TODO: Implement matrix multiplication using three nested for-loops.

    APPROACH:
    1. Get the dimensions: m, n from A and n2, p from B
    2. Check that n == n2 (matrices must be compatible)
    3. Create output matrix C of shape (m, p) filled with zeros
    4. Use three nested loops:
       - i loop: rows of A (0 to m-1)
       - j loop: columns of B (0 to p-1)
       - k loop: shared dimension (0 to n-1)
    5. For each (i,j), compute: C[i,j] += A[i,k] * B[k,j]

    EXAMPLE:
    A = [[1, 2],    B = [[5, 6],
         [3, 4]]         [7, 8]]

    C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] = 1*5 + 2*7 = 19
    C[0,1] = A[0,0]*B[0,1] + A[0,1]*B[1,1] = 1*6 + 2*8 = 22
    C[1,0] = A[1,0]*B[0,0] + A[1,1]*B[1,0] = 3*5 + 4*7 = 43
    C[1,1] = A[1,0]*B[0,1] + A[1,1]*B[1,1] = 3*6 + 4*8 = 50

    HINTS:
    - Start with C = np.zeros((m, p))
    - Use three nested for loops: for i in range(m): for j in range(p): for k in range(n):
    - Accumulate the sum: C[i,j] += A[i,k] * B[k,j]
    """
    raise NotImplementedError("Student implementation required")

# %%
#| hide
#| export
def matmul_naive(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Naive matrix multiplication using explicit for-loops.

    This helps you understand what matrix multiplication really does!
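
    Complexity note: the triple loop below performs m*n*p multiply-adds (O(m*n*p)),
    which is why the Step 5 cell compares it against NumPy's optimized `A @ B`.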
""" m, n = A.shape n2, p = B.shape assert n == n2, f"Matrix shapes don't match: A({m},{n}) @ B({n2},{p})" C = np.zeros((m, p)) for i in range(m): for j in range(p): for k in range(n): C[i, j] += A[i, k] * B[k, j] return C # %% [markdown] """ ### 🧪 Test Your Matrix Multiplication """ # %% # Test matrix multiplication print("Testing matrix multiplication...") try: # Test case 1: Simple 2x2 matrices A = np.array([[1, 2], [3, 4]], dtype=np.float32) B = np.array([[5, 6], [7, 8]], dtype=np.float32) result = matmul_naive(A, B) expected = np.array([[19, 22], [43, 50]], dtype=np.float32) print(f"✅ Matrix A:\n{A}") print(f"✅ Matrix B:\n{B}") print(f"✅ Your result:\n{result}") print(f"✅ Expected:\n{expected}") assert np.allclose(result, expected), "❌ Result doesn't match expected!" print("🎉 Matrix multiplication works!") # Test case 2: Compare with NumPy numpy_result = A @ B assert np.allclose(result, numpy_result), "❌ Doesn't match NumPy result!" print("✅ Matches NumPy implementation!") except Exception as e: print(f"❌ Error: {e}") print("Make sure to implement matmul_naive above!") # %% [markdown] """ ## Step 3: Building the Dense Layer Now let's build the **Dense layer**, the most fundamental building block of neural networks. A Dense layer performs a linear transformation: `y = Wx + b` ### What is a Dense Layer? - **Linear transformation**: `y = Wx + b` - **W**: Weight matrix (learnable parameters) - **x**: Input tensor - **b**: Bias vector (learnable parameters) - **y**: Output tensor ### Why Dense Layers Matter - **Universal approximation**: Can approximate any function with enough neurons - **Feature learning**: Each neuron learns a different feature - **Nonlinearity**: When combined with activation functions, becomes very powerful - **Foundation**: All other layers build on this concept ### The Math For input x of shape (batch_size, input_size): - **W**: Weight matrix of shape (input_size, output_size) - **b**: Bias vector of shape (output_size) - **y**: Output of shape (batch_size, output_size) ### Visual Example ``` Input: x = [1, 2, 3] (3 features) Weights: W = [[0.1, 0.2], Bias: b = [0.1, 0.2] [0.3, 0.4], [0.5, 0.6]] Step 1: Wx = [0.1*1 + 0.3*2 + 0.5*3, 0.2*1 + 0.4*2 + 0.6*3] = [2.2, 3.2] Step 2: y = Wx + b = [2.2 + 0.1, 3.2 + 0.2] = [2.3, 3.4] ``` Let's implement this! """ # %% #| export class Dense: """ Dense (Linear) Layer: y = Wx + b The fundamental building block of neural networks. Performs linear transformation: matrix multiplication + bias addition. Args: input_size: Number of input features output_size: Number of output features use_bias: Whether to include bias term (default: True) use_naive_matmul: Whether to use naive matrix multiplication (for learning) TODO: Implement the Dense layer with weight initialization and forward pass. APPROACH: 1. Store layer parameters (input_size, output_size, use_bias, use_naive_matmul) 2. Initialize weights with small random values (Xavier/Glorot initialization) 3. Initialize bias to zeros (if use_bias=True) 4. 
    EXAMPLE:
    layer = Dense(input_size=3, output_size=2)
    x = Tensor([[1, 2, 3]])  # batch_size=1, input_size=3
    y = layer(x)             # shape: (1, 2)

    HINTS:
    - Use np.random.randn() for random initialization
    - Scale weights by sqrt(2/(input_size + output_size)) for Xavier init
    - Store weights and bias as numpy arrays
    - Use matmul_naive or @ operator based on use_naive_matmul flag
    """

    def __init__(self, input_size: int, output_size: int, use_bias: bool = True, use_naive_matmul: bool = False):
        """
        Initialize Dense layer with random weights.

        Args:
            input_size: Number of input features
            output_size: Number of output features
            use_bias: Whether to include bias term
            use_naive_matmul: Use naive matrix multiplication (for learning)

        TODO:
        1. Store layer parameters (input_size, output_size, use_bias, use_naive_matmul)
        2. Initialize weights with small random values
        3. Initialize bias to zeros (if use_bias=True)

        STEP-BY-STEP:
        1. Store the parameters as instance variables
        2. Calculate scale factor for Xavier initialization: sqrt(2/(input_size + output_size))
        3. Initialize weights: np.random.randn(input_size, output_size) * scale
        4. If use_bias=True, initialize bias: np.zeros(output_size)
        5. If use_bias=False, set bias to None

        EXAMPLE:
        Dense(3, 2) creates:
        - weights: shape (3, 2) with small random values
        - bias: shape (2,) with zeros
        """
        raise NotImplementedError("Student implementation required")

    def forward(self, x: Tensor) -> Tensor:
        """
        Forward pass: y = Wx + b

        Args:
            x: Input tensor of shape (batch_size, input_size)

        Returns:
            Output tensor of shape (batch_size, output_size)

        TODO: Implement matrix multiplication and bias addition
        - Use self.use_naive_matmul to choose between NumPy and naive implementation
        - If use_naive_matmul=True, use matmul_naive(x.data, self.weights)
        - If use_naive_matmul=False, use x.data @ self.weights
        - Add bias if self.use_bias=True

        STEP-BY-STEP:
        1. Perform matrix multiplication: Wx
           - If use_naive_matmul: result = matmul_naive(x.data, self.weights)
           - Else: result = x.data @ self.weights
        2. Add bias if use_bias: result += self.bias
        3. Return Tensor(result)

        EXAMPLE:
        Input x:  Tensor([[1, 2, 3]])     # shape (1, 3)
        Weights:  shape (3, 2)
        Output:   Tensor([[val1, val2]])  # shape (1, 2)

        HINTS:
        - x.data gives you the numpy array
        - self.weights is your weight matrix
        - Use broadcasting for bias addition: result + self.bias
        - Return Tensor(result) to wrap the result
        """
        raise NotImplementedError("Student implementation required")

    def __call__(self, x: Tensor) -> Tensor:
        """Make layer callable: layer(x) same as layer.forward(x)"""
        return self.forward(x)

# %%
#| hide
#| export
class Dense:
    """
    Dense (Linear) Layer: y = Wx + b

    The fundamental building block of neural networks.
    Performs linear transformation: matrix multiplication + bias addition.
    """

    def __init__(self, input_size: int, output_size: int, use_bias: bool = True, use_naive_matmul: bool = False):
        """
        Initialize Dense layer with random weights.
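
        Weights use Xavier/Glorot scaling: std = sqrt(2 / (input_size + output_size));
        biases start at zero (or None when use_bias=False).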
        Args:
            input_size: Number of input features
            output_size: Number of output features
            use_bias: Whether to include bias term
            use_naive_matmul: Use naive matrix multiplication (for learning)
        """
        # Store parameters
        self.input_size = input_size
        self.output_size = output_size
        self.use_bias = use_bias
        self.use_naive_matmul = use_naive_matmul

        # Xavier/Glorot initialization
        scale = np.sqrt(2.0 / (input_size + output_size))
        self.weights = np.random.randn(input_size, output_size).astype(np.float32) * scale

        # Initialize bias
        if use_bias:
            self.bias = np.zeros(output_size, dtype=np.float32)
        else:
            self.bias = None

    def forward(self, x: Tensor) -> Tensor:
        """
        Forward pass: y = Wx + b

        Args:
            x: Input tensor of shape (batch_size, input_size)

        Returns:
            Output tensor of shape (batch_size, output_size)
        """
        # Matrix multiplication
        if self.use_naive_matmul:
            result = matmul_naive(x.data, self.weights)
        else:
            result = x.data @ self.weights

        # Add bias
        if self.use_bias:
            result += self.bias

        return Tensor(result)

    def __call__(self, x: Tensor) -> Tensor:
        """Make layer callable: layer(x) same as layer.forward(x)"""
        return self.forward(x)

# %% [markdown]
"""
### 🧪 Test Your Dense Layer
"""

# %%
# Test Dense layer
print("Testing Dense layer...")

try:
    # Test basic Dense layer
    layer = Dense(input_size=3, output_size=2, use_bias=True)
    x = Tensor([[1, 2, 3]])  # batch_size=1, input_size=3

    print(f"✅ Input shape: {x.shape}")
    print(f"✅ Layer weights shape: {layer.weights.shape}")
    print(f"✅ Layer bias shape: {layer.bias.shape}")

    y = layer(x)
    print(f"✅ Output shape: {y.shape}")
    print(f"✅ Output: {y}")

    # Test without bias
    layer_no_bias = Dense(input_size=2, output_size=1, use_bias=False)
    x2 = Tensor([[1, 2]])
    y2 = layer_no_bias(x2)
    print(f"✅ No bias output: {y2}")

    # Test naive matrix multiplication
    layer_naive = Dense(input_size=2, output_size=2, use_naive_matmul=True)
    x3 = Tensor([[1, 2]])
    y3 = layer_naive(x3)
    print(f"✅ Naive matmul output: {y3}")

    print("\n🎉 All Dense layer tests passed!")

except Exception as e:
    print(f"❌ Error: {e}")
    print("Make sure to implement the Dense layer above!")

# %% [markdown]
"""
## Step 4: Composing Layers with Activations

Now let's see how layers work together! A neural network is just layers composed with activation functions.

### Why Layer Composition Matters
- **Nonlinearity**: Activation functions make networks powerful
- **Feature learning**: Each layer learns different levels of features
- **Universal approximation**: Can approximate any function
- **Modularity**: Easy to experiment with different architectures

### The Pattern
```
Input → Dense → Activation → Dense → Activation → Output
```

### Real-World Example
```
Input:      [1, 2, 3]   (3 features)
Dense(3→2): [1.4, 2.8]  (linear transformation)
ReLU:       [1.4, 2.8]  (nonlinearity)
Dense(2→1): [3.2]       (final prediction)
```
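
Before running the test cell below, it is worth counting how many learnable parameters this tiny network has. This is a side note (the counts simply follow from the layer sizes above):

```python
# Parameter count for the Dense(3→2) → ReLU → Dense(2→1) network
params_dense1 = 3 * 2 + 2   # weights + bias = 8
params_dense2 = 2 * 1 + 1   # weights + bias = 3
total_params = params_dense1 + params_dense2   # 11 learnable parameters
```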
""" # %% # Test layer composition print("Testing layer composition...") try: # Create a simple network: Dense → ReLU → Dense dense1 = Dense(input_size=3, output_size=2) relu = ReLU() dense2 = Dense(input_size=2, output_size=1) # Test input x = Tensor([[1, 2, 3]]) print(f"✅ Input: {x}") # Forward pass through the network h1 = dense1(x) print(f"✅ After Dense1: {h1}") h2 = relu(h1) print(f"✅ After ReLU: {h2}") y = dense2(h2) print(f"✅ Final output: {y}") print("\n🎉 Layer composition works!") print("This is how neural networks work: layers + activations!") except Exception as e: print(f"❌ Error: {e}") print("Make sure all your layers and activations are working!") # %% [markdown] """ ## Step 5: Performance Comparison Let's compare our naive matrix multiplication with NumPy's optimized version to understand why optimization matters in ML. ### Why Performance Matters - **Training time**: Neural networks train for hours/days - **Inference speed**: Real-time applications need fast predictions - **GPU utilization**: Optimized operations use hardware efficiently - **Scalability**: Large models need efficient implementations """ # %% # Performance comparison print("Comparing naive vs NumPy matrix multiplication...") try: import time # Create test matrices A = np.random.randn(100, 100).astype(np.float32) B = np.random.randn(100, 100).astype(np.float32) # Time naive implementation start_time = time.time() result_naive = matmul_naive(A, B) naive_time = time.time() - start_time # Time NumPy implementation start_time = time.time() result_numpy = A @ B numpy_time = time.time() - start_time print(f"✅ Naive time: {naive_time:.4f} seconds") print(f"✅ NumPy time: {numpy_time:.4f} seconds") print(f"✅ Speedup: {naive_time/numpy_time:.1f}x faster") # Verify correctness assert np.allclose(result_naive, result_numpy), "Results don't match!" print("✅ Results are identical!") print("\n💡 This is why we use optimized libraries in production!") except Exception as e: print(f"❌ Error: {e}") # %% [markdown] """ ## 🎯 Module Summary Congratulations! You've built the foundation of neural network layers: ### What You've Accomplished ✅ **Matrix Multiplication**: Understanding the core operation ✅ **Dense Layer**: Linear transformation with weights and bias ✅ **Layer Composition**: Combining layers with activations ✅ **Performance Awareness**: Understanding optimization importance ✅ **Testing**: Immediate feedback on your implementations ### Key Concepts You've Learned - **Layers** are functions that transform tensors - **Matrix multiplication** powers all neural network computations - **Dense layers** perform linear transformations: `y = Wx + b` - **Layer composition** creates complex functions from simple building blocks - **Performance** matters for real-world ML applications ### What's Next In the next modules, you'll build on this foundation: - **Networks**: Compose layers into complete models - **Training**: Learn parameters with gradients and optimization - **Convolutional layers**: Process spatial data like images - **Recurrent layers**: Process sequential data like text ### Real-World Connection Your Dense layer is now ready to: - Learn patterns in data through weight updates - Transform features for classification and regression - Serve as building blocks for complex architectures - Integrate with the rest of the TinyTorch ecosystem **Ready for the next challenge?** Let's move on to building complete neural networks! 
""" # %% # Final verification print("\n" + "="*50) print("🎉 LAYERS MODULE COMPLETE!") print("="*50) print("✅ Matrix multiplication understanding") print("✅ Dense layer implementation") print("✅ Layer composition with activations") print("✅ Performance awareness") print("✅ Comprehensive testing") print("\n🚀 Ready to build networks in the next module!")