# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.17.1
# ---

# %% [markdown]
"""
# Module 3: Layers - Building Blocks of Neural Networks

Welcome to the Layers module! This is where we build the fundamental components that stack together to form neural networks.

## Learning Goals
- Understand how matrix multiplication powers neural networks
- Implement naive matrix multiplication from scratch for deep understanding
- Build the Dense (Linear) layer - the foundation of all neural networks
- Learn weight initialization strategies and their importance
- See how layers compose with activations to create powerful networks

## Build → Use → Understand
1. **Build**: Matrix multiplication and Dense layers from scratch
2. **Use**: Create and test layers with real data
3. **Understand**: How linear transformations enable feature learning
"""

# %% nbgrader={"grade": false, "grade_id": "layers-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| default_exp core.layers

#| export
import numpy as np
import matplotlib.pyplot as plt
import os
import sys
from typing import Union, List, Tuple, Optional

# Import our dependencies - try from package first, then local modules
try:
    from tinytorch.core.tensor import Tensor
    from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax
except ImportError:
    # For development, import from local modules
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations'))
    from tensor_dev import Tensor
    from activations_dev import ReLU, Sigmoid, Tanh, Softmax

# %% nbgrader={"grade": false, "grade_id": "layers-setup", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| hide
#| export
def _should_show_plots():
    """Check if we should show plots (disable during testing)"""
    # Check multiple conditions that indicate we're in test mode
    is_pytest = (
        'pytest' in sys.modules or
        'test' in sys.argv or
        os.environ.get('PYTEST_CURRENT_TEST') is not None or
        any('test' in arg for arg in sys.argv) or
        any('pytest' in arg for arg in sys.argv)
    )

    # Show plots in development mode (when not in test mode)
    return not is_pytest

# %% nbgrader={"grade": false, "grade_id": "layers-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false}
print("🔥 TinyTorch Layers Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
print("Ready to build neural network layers!")

# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/source/03_layers/layers_dev.py`
**Building Side:** Code exports to `tinytorch.core.layers`

```python
# Final package structure:
from tinytorch.core.layers import Dense, Conv2D  # All layer types together!
from tinytorch.core.tensor import Tensor  # The foundation
from tinytorch.core.activations import ReLU, Sigmoid  # Nonlinearity
```

**Why this matters:**
- **Learning:** Focused modules for deep understanding
- **Production:** Proper organization like PyTorch's `torch.nn.Linear`
- **Consistency:** All layer types live together in `core.layers`
- **Integration:** Works seamlessly with tensors and activations
"""

# %% [markdown]
"""
## 🧠 The Mathematical Foundation of Neural Layers

### Linear Algebra at the Heart of ML
Neural networks are fundamentally about **linear transformations** followed by **nonlinear activations**:

```
Layer:      y = Wx + b    (linear transformation)
Activation: z = σ(y)      (nonlinear transformation)
```
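
The two-step pattern above is easy to see in plain NumPy. Below is a minimal sketch, written with the input as a row (`x @ W`), which is the layout our Dense layer uses later; the names `x`, `W`, `b` and the sizes are illustrative, not part of TinyTorch:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])       # one input with 3 features
W = np.random.randn(3, 4) * 0.1      # weight matrix: 3 inputs -> 4 outputs
b = np.zeros(4)                      # bias vector

y = x @ W + b                        # linear transformation
z = np.maximum(y, 0.0)               # nonlinear activation (ReLU)
print(y.shape, z.shape)              # (4,) (4,)
```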

### Matrix Multiplication: The Engine of Deep Learning
Every forward pass in a neural network involves matrix multiplication:
- **Dense layers**: Matrix multiplication between inputs and weights
- **Convolutional layers**: Convolution as matrix multiplication
- **Attention**: Query-key-value matrix operations
- **Transformers**: Self-attention through matrix operations

### Why Matrix Multiplication Matters
- **Parallel computation**: GPUs excel at matrix operations
- **Batch processing**: Handle multiple samples simultaneously
- **Feature learning**: Each row/column learns different patterns
- **Composability**: Layers stack naturally through matrix chains

### Connection to Real ML Systems
Every framework optimizes matrix multiplication:
- **PyTorch**: `torch.nn.Linear` uses optimized BLAS
- **TensorFlow**: `tf.keras.layers.Dense` uses optimized GEMM kernels (cuBLAS on GPU)
- **JAX**: `jax.numpy.dot` uses XLA compilation
- **TinyTorch**: `tinytorch.core.layers.Dense` (what we're building!)

### Performance Considerations
- **Memory layout**: Contiguous arrays for cache efficiency
- **Vectorization**: SIMD operations for speed
- **Parallelization**: Multi-threading and GPU acceleration
- **Numerical stability**: Proper initialization and normalization
"""

# %% [markdown]
"""
## Step 1: Understanding Matrix Multiplication

### What is Matrix Multiplication?
Matrix multiplication is the **fundamental operation** that powers neural networks. When we multiply matrices A and B:

```
C = A @ B
```

Each element C[i,j] is the **dot product** of row i from A and column j from B.

### The Mathematical Foundation: Linear Algebra in Neural Networks

#### **Why Matrix Multiplication in Neural Networks?**
Neural networks are fundamentally about **linear transformations** followed by **nonlinear activations**:

```python
# The core neural network operation:
linear_output = weights @ input + bias  # Linear transformation (matrix multiplication)
activation_output = activation_function(linear_output)  # Nonlinear transformation
```

#### **The Geometric Interpretation**
Matrix multiplication represents **geometric transformations** in high-dimensional space:

- **Rotation**: Changing the orientation of data
- **Scaling**: Stretching or compressing along certain dimensions
- **Projection**: Mapping to lower or higher dimensional spaces
- **Translation**: Shifting data (via bias terms)

#### **Why This Matters for Learning**
Each layer learns to transform the input space to make the final task easier:

```python
# Example: Image classification
raw_pixels → [Layer 1] → edges → [Layer 2] → shapes → [Layer 3] → objects → [Layer 4] → classes
```

### The Computational Perspective

#### **Batch Processing Power**
Matrix multiplication enables efficient batch processing:

```python
# Single sample (inefficient):
for sample in batch:
    output = weights @ sample + bias  # Process one at a time

# Batch processing (efficient):
batch_output = weights @ batch + bias  # Process all samples simultaneously
```
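
The pseudocode above treats each sample as a column; the Dense layer we build below stores samples as rows, so the batched form becomes `batch @ weights + bias`. Either way, one matrix multiplication replaces the per-sample loop. A small runnable check (shapes and names are illustrative):

```python
import numpy as np

batch = np.random.randn(8, 3)        # 8 samples, 3 features each (rows = samples)
weights = np.random.randn(3, 4)      # 3 inputs -> 4 outputs
bias = np.random.randn(4)

# One sample at a time
looped = np.stack([sample @ weights + bias for sample in batch])

# All samples at once
batched = batch @ weights + bias

print(np.allclose(looped, batched))  # True — same math, one call instead of 8
```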

#### **Parallelization Benefits**
- **CPU**: Multiple cores can compute different parts simultaneously
- **GPU**: Thousands of cores excel at matrix operations
- **TPU**: Specialized hardware designed for matrix multiplication
- **Memory**: Contiguous memory access patterns improve cache efficiency

#### **Computational Complexity**
For matrices A(m×n) and B(n×p):
- **Time complexity**: O(mnp) multiply-adds - cubic, O(n³), when the matrices are square (the quick count below makes this concrete)
- **Space complexity**: O(mp) - for the output matrix
- **Optimization**: Modern libraries rely on blocked, vectorized kernels (asymptotically faster algorithms such as Strassen's exist, but careful implementation matters more in practice)
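
To make the O(mnp) count concrete, here is a quick back-of-the-envelope calculation (the sizes are arbitrary examples):

```python
m, n, p = 128, 256, 512       # A is (128, 256), B is (256, 512)
mults = m * n * p             # one multiply per (i, j, k) triple
adds = m * n * p              # and one add to accumulate it
print(f"{mults:,} multiplies + {adds:,} adds = {2 * m * n * p:,} FLOPs")
# 16,777,216 multiplies + 16,777,216 adds = 33,554,432 FLOPs
```

Doubling every dimension multiplies the work by eight, which is why large layers are dominated by matrix multiplication time.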

### Real-World Applications: Where Matrix Multiplication Shines

#### **Computer Vision**
```python
# Convolutional layers can be expressed as matrix multiplication:
# Image patches → Matrix A
# Convolutional filters → Matrix B
# Feature maps → Matrix C = A @ B
```

#### **Natural Language Processing**
```python
# Transformer attention mechanism:
# Query matrix Q, Key matrix K, Value matrix V
# Attention weights = softmax(Q @ K.T / sqrt(d_k))
# Output = Attention_weights @ V
```
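
Those two comment lines are the entire attention computation. A minimal NumPy sketch makes that explicit (the sequence length, dimension, and names are illustrative, not a TinyTorch API):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 5, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

weights = softmax(Q @ K.T / np.sqrt(d_k))  # (5, 5) attention weights
output = weights @ V                       # (5, 8) attended values
print(weights.shape, output.shape, np.allclose(weights.sum(axis=1), 1.0))
```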

#### **Recommendation Systems**
```python
# Matrix factorization:
# User-item matrix R ≈ User_factors @ Item_factors.T
# Collaborative filtering through matrix operations
```

### The Algorithm: Understanding Every Step

For matrices A(m×n) and B(n×p) → C(m×p):
```python
for i in range(m):          # For each row of A
    for j in range(p):      # For each column of B
        for k in range(n):  # Compute dot product
            C[i,j] += A[i,k] * B[k,j]
```

#### **Visual Breakdown**
```
A = [[1, 2],    B = [[5, 6],    C = [[19, 22],
     [3, 4]]         [7, 8]]         [43, 50]]

C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] = 1*5 + 2*7 = 19
C[0,1] = A[0,0]*B[0,1] + A[0,1]*B[1,1] = 1*6 + 2*8 = 22
C[1,0] = A[1,0]*B[0,0] + A[1,1]*B[1,0] = 3*5 + 4*7 = 43
C[1,1] = A[1,0]*B[0,1] + A[1,1]*B[1,1] = 3*6 + 4*8 = 50
```

#### **Memory Access Pattern**
- **Row-major order**: Access elements row by row for cache efficiency
- **Cache locality**: Nearby elements are likely to be accessed together
- **Blocking**: Divide large matrices into blocks for better cache usage (see the sketch below)
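
Blocking is easier to see in code than in prose. The sketch below multiplies block-by-block so each pair of blocks can stay in cache while it is reused; the block size of 32 is an arbitrary illustrative choice, and NumPy still does the per-block products:

```python
import numpy as np

def matmul_blocked(A, B, block=32):
    # Blocked (tiled) matrix multiplication: same result as A @ B
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must match"
    C = np.zeros((m, p), dtype=A.dtype)
    for i0 in range(0, m, block):           # block of rows of A
        for j0 in range(0, p, block):       # block of columns of B
            for k0 in range(0, n, block):   # block of the shared dimension
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                )
    return C

A = np.random.randn(100, 70)
B = np.random.randn(70, 90)
print(np.allclose(matmul_blocked(A, B), A @ B))  # True
```

Real BLAS libraries do the same thing with block sizes tuned to the cache hierarchy, plus SIMD and multi-threading.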

### Performance Considerations: Making It Fast

#### **Optimization Strategies**
1. **Vectorization**: Use SIMD instructions for parallel element operations
2. **Blocking**: Divide matrices into cache-friendly blocks
3. **Loop unrolling**: Reduce loop overhead
4. **Memory alignment**: Ensure data is aligned for optimal access

#### **Modern Libraries**
- **BLAS (Basic Linear Algebra Subprograms)**: Optimized matrix operations
- **Intel MKL**: Highly optimized for Intel processors
- **OpenBLAS**: Open-source optimized BLAS
- **cuBLAS**: GPU-accelerated BLAS from NVIDIA

#### **Why We Implement the Naive Version**
Understanding the basic algorithm helps you:
- **Debug performance issues**: Know what's happening under the hood
- **Optimize for specific cases**: Custom implementations for special matrices
- **Understand complexity**: Appreciate the optimizations in modern libraries
- **Educational value**: See the mathematical foundation clearly

### Connection to Neural Network Architecture

#### **Layer Composition**
```python
# Each layer is a matrix multiplication:
layer1_output = W1 @ input + b1
layer2_output = W2 @ layer1_output + b2
layer3_output = W3 @ layer2_output + b3

# This is equivalent to:
final_output = W3 @ (W2 @ (W1 @ input + b1) + b2) + b3
```
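
Note what happens if there is no activation between these layers: the whole stack collapses into a single linear map, because a chain of matrix multiplications is itself one matrix multiplication. A tiny check (bias omitted for brevity, names illustrative):

```python
import numpy as np

x = np.random.randn(1, 5)
W1 = np.random.randn(5, 4)
W2 = np.random.randn(4, 3)

stacked = (x @ W1) @ W2        # two "layers" with no nonlinearity
collapsed = x @ (W1 @ W2)      # one equivalent layer
print(np.allclose(stacked, collapsed))  # True — this is why we need activations
```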

#### **Gradient Flow**
During backpropagation, gradients flow through matrix operations. Using the same row-per-sample convention as the Dense layer we build below:
```python
# Forward:  y = x @ W + b
# Backward, given dy = dL/dy:
# dW = x.T @ dy
# dx = dy @ W.T
# db = dy.sum(axis=0)
```
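
These formulas can be checked numerically with finite differences: nudge one weight, see how the loss moves, and compare against the analytic gradient. A small sketch (the loss `L = sum(y * G)` is an arbitrary choice so that dL/dy is simply G):

```python
import numpy as np

x = np.random.randn(4, 3)     # batch of 4 samples, 3 features
W = np.random.randn(3, 2)
b = np.random.randn(2)
G = np.random.randn(4, 2)     # fixed, so that dL/dy = G

def loss(W_):
    return np.sum((x @ W_ + b) * G)

dW_analytic = x.T @ G          # the formula above

eps = 1e-6
i, j = 1, 0                    # spot-check one weight entry
W_plus = W.copy(); W_plus[i, j] += eps
W_minus = W.copy(); W_minus[i, j] -= eps
dW_numeric = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(dW_analytic[i, j], dW_numeric)  # the two values should agree closely
```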

#### **Weight Initialization**
Matrix multiplication behavior depends on weight initialization:
- **Xavier/Glorot**: Maintains variance across layers (demonstrated below)
- **He initialization**: Optimized for ReLU activations
- **Orthogonal**: Preserves gradient norms
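
A quick experiment shows why the Xavier/Glorot scale matters. With unscaled random weights the outputs grow by roughly sqrt(fan_in) at every layer; with Xavier scaling they stay near their original size (the sizes below are arbitrary):

```python
import numpy as np

fan_in, fan_out = 256, 256
x = np.random.randn(1000, fan_in)              # 1000 inputs with unit variance

W_unscaled = np.random.randn(fan_in, fan_out)  # no scaling
W_xavier = W_unscaled * np.sqrt(2.0 / (fan_in + fan_out))

print("input std:          ", x.std())                   # ~1.0
print("unscaled output std:", (x @ W_unscaled).std())     # ~16 (≈ sqrt(256)) — blows up
print("Xavier output std:  ", (x @ W_xavier).std())       # ~1.0 — variance preserved
```

Stacking many layers amplifies this effect, which is why the Dense layer below scales its weights by sqrt(2 / (input_size + output_size)).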

Let's implement matrix multiplication to truly understand it!
"""

# %% nbgrader={"grade": false, "grade_id": "matmul-naive", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def matmul_naive(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Naive matrix multiplication using explicit for-loops.

    This helps you understand what matrix multiplication really does!

    Args:
        A: Matrix of shape (m, n)
        B: Matrix of shape (n, p)

    Returns:
        Matrix of shape (m, p) where C[i,j] = sum(A[i,k] * B[k,j] for k in range(n))

    TODO: Implement matrix multiplication using three nested for-loops.

    APPROACH:
    1. Get the dimensions: m, n from A and n2, p from B
    2. Check that n == n2 (matrices must be compatible)
    3. Create output matrix C of shape (m, p) filled with zeros
    4. Use three nested loops:
       - i loop: rows of A (0 to m-1)
       - j loop: columns of B (0 to p-1)
       - k loop: shared dimension (0 to n-1)
    5. For each (i,j), compute: C[i,j] += A[i,k] * B[k,j]

    EXAMPLE:
    A = [[1, 2],    B = [[5, 6],
         [3, 4]]         [7, 8]]

    C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] = 1*5 + 2*7 = 19
    C[0,1] = A[0,0]*B[0,1] + A[0,1]*B[1,1] = 1*6 + 2*8 = 22
    C[1,0] = A[1,0]*B[0,0] + A[1,1]*B[1,0] = 3*5 + 4*7 = 43
    C[1,1] = A[1,0]*B[0,1] + A[1,1]*B[1,1] = 3*6 + 4*8 = 50

    HINTS:
    - Start with C = np.zeros((m, p))
    - Use three nested for loops: for i in range(m): for j in range(p): for k in range(n):
    - Accumulate the sum: C[i,j] += A[i,k] * B[k,j]
    """
    ### BEGIN SOLUTION
    # Get matrix dimensions
    m, n = A.shape
    n2, p = B.shape

    # Check compatibility
    if n != n2:
        raise ValueError(f"Incompatible matrix dimensions: A is {m}x{n}, B is {n2}x{p}")

    # Initialize result matrix
    C = np.zeros((m, p))

    # Triple nested loop for matrix multiplication
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]

    return C
    ### END SOLUTION

# %% [markdown]
"""
### 🧪 Unit Test: Matrix Multiplication

Let's test your matrix multiplication implementation right away! This is the foundation of neural networks.

**This is a unit test** - it tests one specific function (matmul_naive) in isolation.
"""

# %% nbgrader={"grade": true, "grade_id": "test-matmul-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
# Test matrix multiplication immediately after implementation
print("🔬 Unit Test: Matrix Multiplication...")

# Test simple 2x2 case
try:
    A = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B = np.array([[5, 6], [7, 8]], dtype=np.float32)

    result = matmul_naive(A, B)
    expected = np.array([[19, 22], [43, 50]], dtype=np.float32)

    assert np.allclose(result, expected), f"Matrix multiplication failed: expected {expected}, got {result}"
    print(f"✅ Simple 2x2 test: {A.tolist()} @ {B.tolist()} = {result.tolist()}")

    # Compare with NumPy
    numpy_result = A @ B
    assert np.allclose(result, numpy_result), f"Doesn't match NumPy: got {result}, expected {numpy_result}"
    print("✅ Matches NumPy's result")

except Exception as e:
    print(f"❌ Matrix multiplication test failed: {e}")
    raise

# Test different shapes
try:
    A2 = np.array([[1, 2, 3]], dtype=np.float32)  # 1x3
    B2 = np.array([[4], [5], [6]], dtype=np.float32)  # 3x1
    result2 = matmul_naive(A2, B2)
    expected2 = np.array([[32]], dtype=np.float32)  # 1*4 + 2*5 + 3*6 = 32

    assert np.allclose(result2, expected2), f"Different shapes failed: got {result2}, expected {expected2}"
    print(f"✅ Different shapes test: {A2.tolist()} @ {B2.tolist()} = {result2.tolist()}")

except Exception as e:
    print(f"❌ Different shapes test failed: {e}")
    raise

# Show the algorithm in action
print("🎯 Matrix multiplication algorithm:")
print("   C[i,j] = Σ(A[i,k] * B[k,j]) for all k")
print("   Triple nested loops compute each element")
print("📈 Progress: Matrix multiplication ✓")

# %% [markdown]
"""
## Step 2: Building the Dense Layer

Now let's build the **Dense layer**, the most fundamental building block of neural networks. A Dense layer performs a linear transformation: `y = Wx + b`

### What is a Dense Layer?
- **Linear transformation**: `y = Wx + b`
- **W**: Weight matrix (learnable parameters)
- **x**: Input tensor
- **b**: Bias vector (learnable parameters)
- **y**: Output tensor

### Why Dense Layers Matter
- **Universal approximation**: With a nonlinear activation and enough neurons, a single hidden Dense layer can approximate any continuous function
- **Feature learning**: Each neuron learns a different feature
- **Nonlinearity**: When combined with activation functions, becomes very powerful
- **Foundation**: All other layers build on this concept

### The Math
For input x of shape (batch_size, input_size):
- **W**: Weight matrix of shape (input_size, output_size)
- **b**: Bias vector of shape (output_size,)
- **y**: Output of shape (batch_size, output_size)

### Visual Example
```
Input:   x = [1, 2, 3]  (3 features)
Weights: W = [[0.1, 0.2],   Bias: b = [0.1, 0.2]
              [0.3, 0.4],
              [0.5, 0.6]]

Step 1: Wx = [0.1*1 + 0.3*2 + 0.5*3, 0.2*1 + 0.4*2 + 0.6*3]
           = [2.2, 2.8]

Step 2: y = Wx + b = [2.2 + 0.1, 2.8 + 0.2] = [2.3, 3.0]
```
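
You can confirm the worked example with NumPy; this mirrors exactly what the `forward` method below will compute:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])      # shape (1, 3)
W = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])           # shape (3, 2)
b = np.array([0.1, 0.2])

y = x @ W + b
print(y)                             # [[2.3 3.0]]
```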

Let's implement this!
"""

# %% nbgrader={"grade": false, "grade_id": "dense-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class Dense:
    """
    Dense (Linear) Layer: y = Wx + b

    The fundamental building block of neural networks.
    Performs linear transformation: matrix multiplication + bias addition.
    """

    def __init__(self, input_size: int, output_size: int, use_bias: bool = True,
                 use_naive_matmul: bool = False):
        """
        Initialize Dense layer with random weights.

        Args:
            input_size: Number of input features
            output_size: Number of output features
            use_bias: Whether to include bias term (default: True)
            use_naive_matmul: Whether to use naive matrix multiplication (for learning)

        TODO: Implement Dense layer initialization with proper weight initialization.

        APPROACH:
        1. Store layer parameters (input_size, output_size, use_bias, use_naive_matmul)
        2. Initialize weights with Xavier/Glorot initialization
        3. Initialize bias to zeros (if use_bias=True)
        4. Convert to float32 for consistency

        EXAMPLE:
        Dense(3, 2) creates:
        - weights: shape (3, 2) with small random values
        - bias: shape (2,) with zeros

        HINTS:
        - Use np.random.randn() for random initialization
        - Scale weights by sqrt(2/(input_size + output_size)) for Xavier init
        - Use np.zeros() for bias initialization
        - Convert to float32 with .astype(np.float32)
        """
        ### BEGIN SOLUTION
        # Store parameters
        self.input_size = input_size
        self.output_size = output_size
        self.use_bias = use_bias
        self.use_naive_matmul = use_naive_matmul

        # Xavier/Glorot initialization
        scale = np.sqrt(2.0 / (input_size + output_size))
        self.weights = np.random.randn(input_size, output_size).astype(np.float32) * scale

        # Initialize bias
        if use_bias:
            self.bias = np.zeros(output_size, dtype=np.float32)
        else:
            self.bias = None
        ### END SOLUTION

    def forward(self, x: Tensor) -> Tensor:
        """
        Forward pass: y = Wx + b

        Args:
            x: Input tensor of shape (batch_size, input_size)

        Returns:
            Output tensor of shape (batch_size, output_size)

        TODO: Implement matrix multiplication and bias addition.

        APPROACH:
        1. Choose matrix multiplication method based on use_naive_matmul flag
        2. Perform matrix multiplication: Wx
        3. Add bias if use_bias=True
        4. Return result wrapped in Tensor

        EXAMPLE:
        Input x: Tensor([[1, 2, 3]])  # shape (1, 3)
        Weights: shape (3, 2)
        Output: Tensor([[val1, val2]])  # shape (1, 2)

        HINTS:
        - Use self.use_naive_matmul to choose between matmul_naive and @
        - x.data gives you the numpy array
        - Use broadcasting for bias addition: result + self.bias
        - Return Tensor(result) to wrap the result
        """
        ### BEGIN SOLUTION
        # Matrix multiplication
        if self.use_naive_matmul:
            result = matmul_naive(x.data, self.weights)
        else:
            result = x.data @ self.weights

        # Add bias
        if self.use_bias:
            result += self.bias

        return Tensor(result)
        ### END SOLUTION

    def __call__(self, x: Tensor) -> Tensor:
        """Make layer callable: layer(x) same as layer.forward(x)"""
        return self.forward(x)

# %% [markdown]
"""
### 🧪 Unit Test: Dense Layer

Let's test your Dense layer implementation! This is the fundamental building block of neural networks.

**This is a unit test** - it tests one specific class (Dense layer) in isolation.
"""

# %% nbgrader={"grade": true, "grade_id": "test-dense-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
# Test Dense layer immediately after implementation
print("🔬 Unit Test: Dense Layer...")

# Test basic Dense layer
try:
    layer = Dense(input_size=3, output_size=2, use_bias=True)
    x = Tensor([[1, 2, 3]])  # batch_size=1, input_size=3

    print(f"Input shape: {x.shape}")
    print(f"Layer weights shape: {layer.weights.shape}")
    if layer.bias is not None:
        print(f"Layer bias shape: {layer.bias.shape}")

    y = layer(x)
    print(f"Output shape: {y.shape}")
    print(f"Output: {y}")

    # Test shape compatibility
    assert y.shape == (1, 2), f"Output shape should be (1, 2), got {y.shape}"
    print("✅ Dense layer produces correct output shape")

    # Test weights initialization
    assert layer.weights.shape == (3, 2), f"Weights shape should be (3, 2), got {layer.weights.shape}"
    if layer.bias is not None:
        assert layer.bias.shape == (2,), f"Bias shape should be (2,), got {layer.bias.shape}"
    print("✅ Dense layer has correct weight and bias shapes")

    # Test that weights are not all zeros (proper initialization)
    assert not np.allclose(layer.weights, 0), "Weights should not be all zeros"
    if layer.bias is not None:
        assert np.allclose(layer.bias, 0), "Bias should be initialized to zeros"
    print("✅ Dense layer has proper weight initialization")

except Exception as e:
    print(f"❌ Dense layer test failed: {e}")
    raise

# Test without bias
try:
    layer_no_bias = Dense(input_size=2, output_size=1, use_bias=False)
    x2 = Tensor([[1, 2]])
    y2 = layer_no_bias(x2)

    assert y2.shape == (1, 1), f"No bias output shape should be (1, 1), got {y2.shape}"
    assert layer_no_bias.bias is None, "Bias should be None when use_bias=False"
    print("✅ Dense layer works without bias")

except Exception as e:
    print(f"❌ Dense layer no-bias test failed: {e}")
    raise

# Test naive matrix multiplication
try:
    layer_naive = Dense(input_size=2, output_size=2, use_naive_matmul=True)
    x3 = Tensor([[1, 2]])
    y3 = layer_naive(x3)

    assert y3.shape == (1, 2), f"Naive matmul output shape should be (1, 2), got {y3.shape}"
    print("✅ Dense layer works with naive matrix multiplication")

except Exception as e:
    print(f"❌ Dense layer naive matmul test failed: {e}")
    raise

# Show the linear transformation in action
print("🎯 Dense layer behavior:")
print("   y = Wx + b (linear transformation)")
print("   W: learnable weight matrix")
print("   b: learnable bias vector")
print("📈 Progress: Matrix multiplication ✓, Dense layer ✓")

# %% [markdown]
"""
### 🧪 Test Your Implementations

Once you implement the functions above, run these cells to test them:
"""

# %% nbgrader={"grade": true, "grade_id": "test-matmul-naive", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false}
# Test matrix multiplication
print("Testing matrix multiplication...")

# Test case 1: Simple 2x2 matrices
A = np.array([[1, 2], [3, 4]], dtype=np.float32)
B = np.array([[5, 6], [7, 8]], dtype=np.float32)

result = matmul_naive(A, B)
expected = np.array([[19, 22], [43, 50]], dtype=np.float32)

print(f"Matrix A:\n{A}")
print(f"Matrix B:\n{B}")
print(f"Your result:\n{result}")
print(f"Expected:\n{expected}")

assert np.allclose(result, expected), f"Result doesn't match expected: got {result}, expected {expected}"

# Test case 2: Compare with NumPy
numpy_result = A @ B
assert np.allclose(result, numpy_result), f"Doesn't match NumPy result: got {result}, expected {numpy_result}"

# Test case 3: Different shapes
A2 = np.array([[1, 2, 3]], dtype=np.float32)  # 1x3
B2 = np.array([[4], [5], [6]], dtype=np.float32)  # 3x1
result2 = matmul_naive(A2, B2)
expected2 = np.array([[32]], dtype=np.float32)  # 1*4 + 2*5 + 3*6 = 32
assert np.allclose(result2, expected2), f"Different shapes failed: got {result2}, expected {expected2}"

print("✅ Matrix multiplication tests passed!")

# %% nbgrader={"grade": true, "grade_id": "test-dense-layer", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false}
# Test Dense layer
print("Testing Dense layer...")

# Test basic Dense layer
layer = Dense(input_size=3, output_size=2, use_bias=True)
x = Tensor([[1, 2, 3]])  # batch_size=1, input_size=3

print(f"Input shape: {x.shape}")
print(f"Layer weights shape: {layer.weights.shape}")
if layer.bias is not None:
    print(f"Layer bias shape: {layer.bias.shape}")
else:
    print("Layer bias: None")

y = layer(x)
print(f"Output shape: {y.shape}")
print(f"Output: {y}")

# Test shape compatibility
assert y.shape == (1, 2), f"Output shape should be (1, 2), got {y.shape}"

# Test without bias
layer_no_bias = Dense(input_size=2, output_size=1, use_bias=False)
x2 = Tensor([[1, 2]])
y2 = layer_no_bias(x2)
assert y2.shape == (1, 1), f"No bias output shape should be (1, 1), got {y2.shape}"
assert layer_no_bias.bias is None, "Bias should be None when use_bias=False"

# Test naive matrix multiplication
layer_naive = Dense(input_size=2, output_size=2, use_naive_matmul=True)
x3 = Tensor([[1, 2]])
y3 = layer_naive(x3)
assert y3.shape == (1, 2), f"Naive matmul output shape should be (1, 2), got {y3.shape}"

print("✅ Dense layer tests passed!")

# %% nbgrader={"grade": true, "grade_id": "test-layer-composition", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false}
# Test layer composition
print("Testing layer composition...")

# Create a simple network: Dense → ReLU → Dense
dense1 = Dense(input_size=3, output_size=2)
relu = ReLU()
dense2 = Dense(input_size=2, output_size=1)

# Test input
x = Tensor([[1, 2, 3]])
print(f"Input: {x}")

# Forward pass through the network
h1 = dense1(x)
print(f"After Dense1: {h1}")

h2 = relu(h1)
print(f"After ReLU: {h2}")

h3 = dense2(h2)
print(f"After Dense2: {h3}")

# Test shapes
assert h1.shape == (1, 2), f"Dense1 output should be (1, 2), got {h1.shape}"
assert h2.shape == (1, 2), f"ReLU output should be (1, 2), got {h2.shape}"
assert h3.shape == (1, 1), f"Dense2 output should be (1, 1), got {h3.shape}"

# Test that ReLU actually applied (non-negative values)
assert np.all(h2.data >= 0), "ReLU should produce non-negative values"

print("✅ Layer composition tests passed!")

# %% [markdown]
"""
## 🧪 Comprehensive Testing: Matrix Multiplication and Dense Layers

Let's thoroughly test your implementations to make sure they work correctly in all scenarios.
This comprehensive testing ensures your layers are robust and ready for real neural networks.
"""

# %% nbgrader={"grade": true, "grade_id": "test-layers-comprehensive", "locked": true, "points": 30, "schema_version": 3, "solution": false, "task": false}
def test_layers_comprehensive():
    """Comprehensive test of matrix multiplication and Dense layers."""
    print("🔬 Testing matrix multiplication and Dense layers comprehensively...")

    tests_passed = 0
    total_tests = 10

    # Test 1: Matrix Multiplication Basic Cases
    try:
        # Test 2x2 matrices
        A = np.array([[1, 2], [3, 4]], dtype=np.float32)
        B = np.array([[5, 6], [7, 8]], dtype=np.float32)
        result = matmul_naive(A, B)
        expected = np.array([[19, 22], [43, 50]], dtype=np.float32)

        assert np.allclose(result, expected), f"2x2 multiplication failed: expected {expected}, got {result}"

        # Compare with NumPy
        numpy_result = A @ B
        assert np.allclose(result, numpy_result), f"Doesn't match NumPy: expected {numpy_result}, got {result}"

        print(f"✅ Matrix multiplication 2x2: {A.shape} × {B.shape} = {result.shape}")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Matrix multiplication basic failed: {e}")

    # Test 2: Matrix Multiplication Different Shapes
    try:
        # Test 1x3 × 3x1 = 1x1
        A1 = np.array([[1, 2, 3]], dtype=np.float32)
        B1 = np.array([[4], [5], [6]], dtype=np.float32)
        result1 = matmul_naive(A1, B1)
        expected1 = np.array([[32]], dtype=np.float32)  # 1*4 + 2*5 + 3*6 = 32
        assert np.allclose(result1, expected1), f"1x3 × 3x1 failed: expected {expected1}, got {result1}"

        # Test 3x2 × 2x4 = 3x4
        A2 = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32)
        B2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.float32)
        result2 = matmul_naive(A2, B2)
        expected2 = A2 @ B2
        assert np.allclose(result2, expected2), f"3x2 × 2x4 failed: expected {expected2}, got {result2}"

        print(f"✅ Matrix multiplication shapes: (1,3)×(3,1), (3,2)×(2,4)")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Matrix multiplication shapes failed: {e}")

    # Test 3: Matrix Multiplication Edge Cases
    try:
        # Test with zeros
        A_zero = np.zeros((2, 3), dtype=np.float32)
        B_zero = np.zeros((3, 2), dtype=np.float32)
        result_zero = matmul_naive(A_zero, B_zero)
        expected_zero = np.zeros((2, 2), dtype=np.float32)
        assert np.allclose(result_zero, expected_zero), "Zero matrix multiplication failed"

        # Test with identity
        A_id = np.array([[1, 2]], dtype=np.float32)
        B_id = np.array([[1, 0], [0, 1]], dtype=np.float32)
        result_id = matmul_naive(A_id, B_id)
        expected_id = np.array([[1, 2]], dtype=np.float32)
        assert np.allclose(result_id, expected_id), "Identity matrix multiplication failed"

        # Test with negative values
        A_neg = np.array([[-1, 2]], dtype=np.float32)
        B_neg = np.array([[3], [-4]], dtype=np.float32)
        result_neg = matmul_naive(A_neg, B_neg)
        expected_neg = np.array([[-11]], dtype=np.float32)  # -1*3 + 2*(-4) = -11
        assert np.allclose(result_neg, expected_neg), "Negative matrix multiplication failed"

        print("✅ Matrix multiplication edge cases: zeros, identity, negatives")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Matrix multiplication edge cases failed: {e}")

    # Test 4: Dense Layer Initialization
    try:
        # Test with bias
        layer_bias = Dense(input_size=3, output_size=2, use_bias=True)
        assert layer_bias.weights.shape == (3, 2), f"Weights shape should be (3, 2), got {layer_bias.weights.shape}"
        assert layer_bias.bias is not None, "Bias should not be None when use_bias=True"
        assert layer_bias.bias.shape == (2,), f"Bias shape should be (2,), got {layer_bias.bias.shape}"

        # Check weight initialization (should not be all zeros)
        assert not np.allclose(layer_bias.weights, 0), "Weights should not be all zeros"
        assert np.allclose(layer_bias.bias, 0), "Bias should be initialized to zeros"

        # Test without bias
        layer_no_bias = Dense(input_size=4, output_size=3, use_bias=False)
        assert layer_no_bias.weights.shape == (4, 3), f"No-bias weights shape should be (4, 3), got {layer_no_bias.weights.shape}"
        assert layer_no_bias.bias is None, "Bias should be None when use_bias=False"

        print("✅ Dense layer initialization: weights, bias, shapes")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Dense layer initialization failed: {e}")

    # Test 5: Dense Layer Forward Pass
    try:
        layer = Dense(input_size=3, output_size=2, use_bias=True)

        # Test single sample
        x_single = Tensor([[1, 2, 3]])  # shape: (1, 3)
        y_single = layer(x_single)
        assert y_single.shape == (1, 2), f"Single sample output should be (1, 2), got {y_single.shape}"

        # Test batch of samples
        x_batch = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # shape: (3, 3)
        y_batch = layer(x_batch)
        assert y_batch.shape == (3, 2), f"Batch output should be (3, 2), got {y_batch.shape}"

        # Verify computation manually for single sample
        expected_single = np.dot(x_single.data, layer.weights) + layer.bias
        assert np.allclose(y_single.data, expected_single), "Single sample computation incorrect"

        print("✅ Dense layer forward pass: single sample, batch processing")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Dense layer forward pass failed: {e}")

    # Test 6: Dense Layer Without Bias
    try:
        layer_no_bias = Dense(input_size=2, output_size=3, use_bias=False)
        x = Tensor([[1, 2]])
        y = layer_no_bias(x)

        assert y.shape == (1, 3), f"No-bias output should be (1, 3), got {y.shape}"

        # Verify computation (should be just matrix multiplication)
        expected = np.dot(x.data, layer_no_bias.weights)
        assert np.allclose(y.data, expected), "No-bias computation incorrect"

        print("✅ Dense layer without bias: correct computation")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Dense layer without bias failed: {e}")

    # Test 7: Dense Layer with Naive Matrix Multiplication
    try:
        layer_naive = Dense(input_size=2, output_size=2, use_naive_matmul=True)
        layer_optimized = Dense(input_size=2, output_size=2, use_naive_matmul=False)

        # Set same weights for comparison
        layer_optimized.weights = layer_naive.weights.copy()
        layer_optimized.bias = layer_naive.bias.copy() if layer_naive.bias is not None else None

        x = Tensor([[1, 2]])
        y_naive = layer_naive(x)
        y_optimized = layer_optimized(x)

        # Both should give same results
        assert np.allclose(y_naive.data, y_optimized.data), "Naive and optimized should give same results"

        print("✅ Dense layer naive vs optimized: consistent results")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Dense layer naive matmul failed: {e}")

    # Test 8: Layer Composition
    try:
        # Create a simple network: Dense → ReLU → Dense
        dense1 = Dense(input_size=3, output_size=4)
        relu = ReLU()
        dense2 = Dense(input_size=4, output_size=2)

        x = Tensor([[1, -2, 3]])

        # Forward pass
        h1 = dense1(x)
        h2 = relu(h1)
        h3 = dense2(h2)

        # Check shapes
        assert h1.shape == (1, 4), f"Dense1 output should be (1, 4), got {h1.shape}"
        assert h2.shape == (1, 4), f"ReLU output should be (1, 4), got {h2.shape}"
        assert h3.shape == (1, 2), f"Dense2 output should be (1, 2), got {h3.shape}"

        # Check ReLU effect
        assert np.all(h2.data >= 0), "ReLU should produce non-negative values"

        print("✅ Layer composition: Dense → ReLU → Dense pipeline")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Layer composition failed: {e}")

    # Test 9: Different Layer Sizes
    try:
        # Test various layer sizes
        test_configs = [
            (1, 1),      # Minimal
            (10, 5),     # Medium
            (100, 50),   # Large
            (784, 128)   # MNIST-like
        ]

        for input_size, output_size in test_configs:
            layer = Dense(input_size=input_size, output_size=output_size)

            # Test with single sample
            x = Tensor(np.random.randn(1, input_size))
            y = layer(x)

            assert y.shape == (1, output_size), f"Size ({input_size}, {output_size}) failed: got {y.shape}"
            assert layer.weights.shape == (input_size, output_size), f"Weights shape wrong for ({input_size}, {output_size})"

        print("✅ Different layer sizes: (1,1), (10,5), (100,50), (784,128)")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Different layer sizes failed: {e}")

    # Test 10: Real Neural Network Scenario
    try:
        # Simulate MNIST-like scenario: 784 → 128 → 64 → 10
        input_layer = Dense(input_size=784, output_size=128)
        hidden_layer = Dense(input_size=128, output_size=64)
        output_layer = Dense(input_size=64, output_size=10)

        relu1 = ReLU()
        relu2 = ReLU()
        softmax = Softmax()

        # Simulate flattened MNIST image
        x = Tensor(np.random.randn(32, 784))  # Batch of 32 images

        # Forward pass through network
        h1 = input_layer(x)
        h1_activated = relu1(h1)
        h2 = hidden_layer(h1_activated)
        h2_activated = relu2(h2)
        logits = output_layer(h2_activated)
        probabilities = softmax(logits)

        # Check final output
        assert probabilities.shape == (32, 10), f"Final output should be (32, 10), got {probabilities.shape}"

        # Check that probabilities sum to 1 for each sample
        row_sums = np.sum(probabilities.data, axis=1)
        assert np.allclose(row_sums, 1.0), "Each sample should have probabilities summing to 1"

        # Check that all intermediate shapes are correct
        assert h1.shape == (32, 128), f"Hidden 1 shape should be (32, 128), got {h1.shape}"
        assert h2.shape == (32, 64), f"Hidden 2 shape should be (32, 64), got {h2.shape}"
        assert logits.shape == (32, 10), f"Logits shape should be (32, 10), got {logits.shape}"

        print("✅ Real neural network scenario: MNIST-like 784→128→64→10 classification")
        tests_passed += 1
    except Exception as e:
        print(f"❌ Real neural network scenario failed: {e}")

    # Results summary
    print(f"\n📊 Layers Module Results: {tests_passed}/{total_tests} tests passed")

    if tests_passed == total_tests:
        print("🎉 All layers tests passed! Your implementations support:")
        print("   • Matrix multiplication: naive implementation from scratch")
        print("   • Dense layers: linear transformations with learnable parameters")
        print("   • Weight initialization: proper random initialization")
        print("   • Bias handling: optional bias terms")
        print("   • Batch processing: multiple samples at once")
        print("   • Layer composition: building complete neural networks")
        print("   • Real ML scenarios: MNIST-like classification networks")
        print("📈 Progress: All Layer Functionality ✓")
        return True
    else:
        print("⚠️ Some layers tests failed. Common issues:")
        print("   • Check matrix multiplication implementation (triple nested loops)")
        print("   • Verify Dense layer forward pass (y = Wx + b)")
        print("   • Ensure proper weight initialization (not all zeros)")
        print("   • Check shape handling for different input/output sizes")
        print("   • Verify bias handling when use_bias=False")
        return False

# Run the comprehensive test
success = test_layers_comprehensive()

# %% [markdown]
"""
### 🧪 Integration Test: Layers in Complete Neural Networks

Let's test how your layers work in realistic neural network architectures.
"""

# %% nbgrader={"grade": true, "grade_id": "test-layers-integration", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false}
def test_layers_integration():
    """Integration test with complete neural network architectures."""
    print("🔬 Testing layers in complete neural network architectures...")

    try:
        print("🧠 Building and testing different network architectures...")

        # Architecture 1: Simple Binary Classifier
        print("\n📊 Architecture 1: Binary Classification Network")
        binary_net = [
            Dense(input_size=4, output_size=8),
            ReLU(),
            Dense(input_size=8, output_size=4),
            ReLU(),
            Dense(input_size=4, output_size=1),
            Sigmoid()
        ]

        # Test with batch of samples
        x_binary = Tensor(np.random.randn(10, 4))  # 10 samples, 4 features

        # Forward pass through network
        current = x_binary
        for i, layer in enumerate(binary_net):
            current = layer(current)
            print(f"   Layer {i}: {current.shape}")

        # Verify final output is valid probabilities
        assert current.shape == (10, 1), f"Binary classifier output should be (10, 1), got {current.shape}"
        assert np.all((current.data >= 0) & (current.data <= 1)), "Binary probabilities should be in [0,1]"

        print("✅ Binary classification network: 4→8→4→1 with ReLU/Sigmoid")

        # Architecture 2: Multi-class Classifier
        print("\n📊 Architecture 2: Multi-class Classification Network")
        multiclass_net = [
            Dense(input_size=784, output_size=256),
            ReLU(),
            Dense(input_size=256, output_size=128),
            ReLU(),
            Dense(input_size=128, output_size=10),
            Softmax()
        ]

        # Simulate MNIST-like input
        x_mnist = Tensor(np.random.randn(5, 784))  # 5 images, 784 pixels

        current = x_mnist
        for i, layer in enumerate(multiclass_net):
            current = layer(current)
            print(f"   Layer {i}: {current.shape}")

        # Verify final output is valid probability distribution
        assert current.shape == (5, 10), f"Multi-class output should be (5, 10), got {current.shape}"
        row_sums = np.sum(current.data, axis=1)
        assert np.allclose(row_sums, 1.0), "Each sample should have probabilities summing to 1"

        print("✅ Multi-class classification network: 784→256→128→10 with Softmax")

        # Architecture 3: Deep Network
        print("\n📊 Architecture 3: Deep Network (5 layers)")
        deep_net = [
            Dense(input_size=100, output_size=80),
            ReLU(),
            Dense(input_size=80, output_size=60),
            ReLU(),
            Dense(input_size=60, output_size=40),
            ReLU(),
            Dense(input_size=40, output_size=20),
            ReLU(),
            Dense(input_size=20, output_size=3),
            Softmax()
        ]

        x_deep = Tensor(np.random.randn(8, 100))  # 8 samples, 100 features

        current = x_deep
        for i, layer in enumerate(deep_net):
            current = layer(current)
            if i % 2 == 0:  # Print every other layer to save space
                print(f"   Layer {i}: {current.shape}")

        assert current.shape == (8, 3), f"Deep network output should be (8, 3), got {current.shape}"

        print("✅ Deep network: 100→80→60→40→20→3 with multiple ReLU layers")

        # Test 4: Network with Different Activation Functions
        print("\n📊 Architecture 4: Mixed Activation Functions")
        mixed_net = [
            Dense(input_size=6, output_size=4),
            Tanh(),     # Zero-centered activation
            Dense(input_size=4, output_size=3),
            ReLU(),     # Sparse activation
            Dense(input_size=3, output_size=2),
            Sigmoid()   # Bounded activation
        ]

        x_mixed = Tensor(np.random.randn(3, 6))

        current = x_mixed
        for i, layer in enumerate(mixed_net):
            current = layer(current)
            print(f"   Layer {i}: {current.shape}, range: [{np.min(current.data):.3f}, {np.max(current.data):.3f}]")

        assert current.shape == (3, 2), f"Mixed network output should be (3, 2), got {current.shape}"

        print("✅ Mixed activations network: Tanh→ReLU→Sigmoid combinations")

        # Test 5: Parameter Counting
        print("\n📊 Parameter Analysis")

        def count_parameters(layer):
            """Count trainable parameters in a Dense layer."""
            if isinstance(layer, Dense):
                weight_params = layer.weights.size
                bias_params = layer.bias.size if layer.bias is not None else 0
                return weight_params + bias_params
            return 0

        # Count parameters in binary classifier
        total_params = sum(count_parameters(layer) for layer in binary_net)
        print(f"Binary classifier parameters: {total_params}")

        # Manual verification for first layer: 4*8 + 8 = 40
        first_dense = binary_net[0]
        expected_first = 4 * 8 + 8  # weights + bias
        actual_first = count_parameters(first_dense)
        assert actual_first == expected_first, f"First layer params: expected {expected_first}, got {actual_first}"

        print("✅ Parameter counting: weight and bias parameters calculated correctly")

        # Test 6: Gradient Flow Preparation
        print("\n📊 Gradient Flow Preparation")

        # Test that network can handle different input types
        test_inputs = [
            Tensor(np.zeros((1, 4))),            # All zeros
            Tensor(np.ones((1, 4))),             # All ones
            Tensor(np.random.randn(1, 4)),       # Random
            Tensor(np.random.randn(1, 4) * 10)   # Large values
        ]

        for i, test_input in enumerate(test_inputs):
            current = test_input
            for layer in binary_net:
                current = layer(current)

            # Check for numerical stability
            assert not np.any(np.isnan(current.data)), f"Input {i} produced NaN"
            assert not np.any(np.isinf(current.data)), f"Input {i} produced Inf"

        print("✅ Numerical stability: networks handle various input ranges")

        print("\n🎉 Integration test passed! Your layers work correctly in:")
        print("   • Binary classification networks")
        print("   • Multi-class classification networks")
        print("   • Deep networks with multiple hidden layers")
        print("   • Networks with mixed activation functions")
        print("   • Parameter counting and analysis")
        print("   • Numerical stability across input ranges")
        print("📈 Progress: Layers ready for complete neural networks!")

        return True

    except Exception as e:
        print(f"❌ Integration test failed: {e}")
        print("\n💡 This suggests an issue with:")
        print("   • Layer composition and chaining")
        print("   • Shape compatibility between layers")
        print("   • Activation function integration")
        print("   • Numerical stability in deep networks")
        print("   • Check your Dense layer and matrix multiplication")
        return False

# Run the integration test
success = test_layers_integration() and success

# Print final summary
print(f"\n{'='*60}")
print("🎯 LAYERS MODULE TESTING COMPLETE")
print(f"{'='*60}")

if success:
    print("🎉 CONGRATULATIONS! All layers tests passed!")
    print("\n✅ Your layers module successfully implements:")
    print("   • Matrix multiplication: naive implementation from scratch")
    print("   • Dense layers: y = Wx + b linear transformations")
    print("   • Weight initialization: proper random weight setup")
    print("   • Bias handling: optional bias terms")
    print("   • Batch processing: efficient multi-sample computation")
    print("   • Layer composition: building complete neural networks")
    print("   • Integration: works with all activation functions")
    print("   • Real ML scenarios: MNIST-like classification networks")
    print("\n🚀 You're ready to build complete neural network architectures!")
    print("📈 Final Progress: Layers Module ✓ COMPLETE")
else:
    print("⚠️ Some tests failed. Please review the error messages above.")
    print("\n🔧 To fix issues:")
    print("   1. Check your matrix multiplication implementation")
    print("   2. Verify Dense layer forward pass computation")
    print("   3. Ensure proper weight and bias initialization")
    print("   4. Test shape compatibility between layers")
    print("   5. Verify integration with activation functions")
    print("\n💪 Keep building! These layers are the foundation of all neural networks.")

# %% [markdown]
"""
## 🎯 Module Summary

Congratulations! You've successfully implemented the core building blocks of neural networks:

### What You've Accomplished
✅ **Matrix Multiplication**: Implemented from scratch with triple nested loops
✅ **Dense Layer**: The fundamental linear transformation y = Wx + b
✅ **Weight Initialization**: Xavier/Glorot initialization for stable training
✅ **Layer Composition**: Combining layers with activations
✅ **Flexible Implementation**: Support for both naive and optimized matrix multiplication

### Key Concepts You've Learned
- **Matrix multiplication** is the engine of neural networks
- **Dense layers** perform linear transformations that learn features
- **Weight initialization** is crucial for stable training
- **Layer composition** creates powerful nonlinear functions
- **Batch processing** enables efficient computation

### Mathematical Foundations
- **Linear algebra**: Matrix operations power all neural computations
- **Universal approximation**: Dense layers combined with nonlinear activations can approximate any continuous function
- **Feature learning**: Each neuron learns different patterns
- **Composability**: Simple operations combine to create complex behaviors

### Next Steps
1. **Export your code**: `tito package nbdev --export 03_layers`
2. **Test your implementation**: `tito module test 03_layers`
3. **Use your layers**:
   ```python
   from tinytorch.core.layers import Dense
   from tinytorch.core.activations import ReLU
   layer = Dense(10, 5)
   activation = ReLU()
   ```
4. **Move to Module 4**: Start building complete neural networks!

**Ready for the next challenge?** Let's compose these layers into complete neural network architectures!
"""