diff --git a/MODULE_ANALYSIS_SUMMARY.md b/MODULE_ANALYSIS_SUMMARY.md new file mode 100644 index 00000000..4a6065cc --- /dev/null +++ b/MODULE_ANALYSIS_SUMMARY.md @@ -0,0 +1,152 @@ +# TinyTorch Module Analysis Summary + +## Key Findings + +### ✅ **Excellent Foundation (setup_dev.py)** +- **Perfect structure**: Follows explain → code → test → repeat pattern +- **Rich scaffolding**: Every TODO has step-by-step guidance +- **Immediate feedback**: Tests run after each concept +- **Educational flow**: Concepts build logically with real-world connections + +### ⚠️ **Structural Issues (Modules 01-07)** +- **Content quality**: Excellent mathematical explanations and implementations +- **Testing pattern**: All tests at end instead of progressive testing +- **TODO scaffolding**: Generic `NotImplementedError` without guidance +- **Student experience**: Large amounts of code before getting feedback + +### ❌ **Missing Modules (08-13)** +- **Empty directories**: 6 of the 13 modules (08-13) are completely empty +- **Critical gaps**: Optimizers, training, MLOps missing + +## Immediate Action Items + +### 1. **Fix Testing Pattern (High Priority)** +Transform this poor pattern: +```python +# All implementations +def concept_1(): pass +def concept_2(): pass +def concept_3(): pass + +# All tests at end +def test_everything(): pass +``` + +To this excellent pattern: +```python +# Concept 1 +def concept_1(): pass +def test_concept_1(): pass +print("✅ Concept 1 tests passed!") + +# Concept 2 +def concept_2(): pass +def test_concept_2(): pass +print("✅ Concept 2 tests passed!") +``` + +### 2. **Enhance TODO Blocks (High Priority)** +Replace generic todos: +```python +def add(self, other): + """Add two tensors.""" + raise NotImplementedError("Student implementation required") +``` + +With rich scaffolding: +```python +def add(self, other): + """ + TODO: Implement tensor addition. + + STEP-BY-STEP IMPLEMENTATION: + 1. Get numpy data from both tensors + 2. Use numpy's + operator + 3. 
Create new Tensor with result + 4. Return the new tensor + + EXAMPLE USAGE: + t1 = Tensor([[1, 2], [3, 4]]) + t2 = Tensor([[5, 6], [7, 8]]) + result = t1.add(t2) # [[6, 8], [10, 12]] + + IMPLEMENTATION HINTS: + - Use self._data + other._data + - Wrap result in new Tensor + - NumPy handles broadcasting + """ +``` + +### 3. **Module Priority for Fixes** +1. **01_tensor** (Highest) - Foundation for everything +2. **02_activations** (High) - Used in all networks +3. **03_layers** (High) - Core building blocks +4. **07_autograd** (High) - Enables training +5. **04_networks** (Medium) - Compositions +6. **05_cnn** (Medium) - Specialized operations +7. **06_dataloader** (Medium) - Data handling + +## Implementation Strategy + +### Phase 1: Transform Existing Modules (Weeks 1-2) +For each module (01-07): +1. **Identify breakpoints**: Find natural concept boundaries +2. **Reorganize structure**: Create Step 1, Step 2, etc. with explanations +3. **Add immediate testing**: Test after each major concept +4. **Enhance TODO blocks**: Add step-by-step guidance +5. 
**Include success messages**: Clear progress indicators + +### Phase 2: Create Missing Modules (Weeks 3-4) +Using the improved structure: +- **08_optimizers**: SGD, Adam, learning rate scheduling +- **09_training**: Training loops, loss functions, metrics +- **10_compression**: Pruning, quantization, knowledge distillation +- **11_kernels**: Custom operations, CUDA kernels +- **12_benchmarking**: Performance measurement, profiling +- **13_mlops**: Model deployment, monitoring, versioning + +## Success Metrics + +### Student Experience +- **Immediate feedback**: Results after each concept +- **Clear guidance**: Step-by-step implementation instructions +- **Progressive complexity**: Each step builds on previous success +- **Debugging support**: Clear error messages and examples + +### Educational Quality +- **Consistent structure**: All modules follow same pattern +- **Rich scaffolding**: Every function has detailed guidance +- **Real-world connections**: Theory linked to practice +- **Integration**: Modules work together seamlessly + +## Next Steps + +### Week 1: Start with Tensor Module +1. **Backup current**: Create `tensor_dev_backup.py` +2. **Reorganize structure**: Break into progressive steps +3. **Add immediate testing**: Test after each operation type +4. **Test with students**: Validate improved experience + +### Week 2: Apply to Activations & Layers +1. **Apply same pattern**: Use tensor module as template +2. **Focus on scaffolding**: Rich TODO blocks +3. **Add visualizations**: Where helpful for understanding +4. **Progressive testing**: After each activation/layer type + +### Week 3-4: Complete Missing Modules +1. **Use proven pattern**: Follow successful structure +2. **Real-world focus**: Production-ready implementations +3. **Integration testing**: Ensure modules work together +4. 
**Documentation**: Clear learning outcomes + +## Key Principle + +**Always follow: Explain → Code → Test → Repeat** + +This pattern maximizes student success through: +- Immediate feedback prevents confusion +- Rich scaffolding reduces frustration +- Progressive complexity builds confidence +- Clear connections show the bigger picture + +The goal is to transform TinyTorch from reference material into a guided learning experience that creates deep understanding of ML systems. \ No newline at end of file diff --git a/Module_Improvement_Guide.md b/Module_Improvement_Guide.md new file mode 100644 index 00000000..836bc7e4 --- /dev/null +++ b/Module_Improvement_Guide.md @@ -0,0 +1,592 @@ +# Module Improvement Guide: From Poor to Excellent Structure + +## Example: Transforming 01_tensor Module + +This guide shows how to transform the tensor module from its current structure to follow the **explain → code → test → repeat** pattern exemplified by `setup_dev.py`. + +## Current Problem Structure + +```python +# Current tensor_dev.py structure (POOR) +# Lines 1-300: All explanations +# Lines 300-700: All implementations +# Lines 700-1536: All tests at the end + +class Tensor: + def __init__(self): + raise NotImplementedError("Student implementation required") + + def add(self): + raise NotImplementedError("Student implementation required") + + def multiply(self): + raise NotImplementedError("Student implementation required") + +# Much later... +def test_tensor_creation_comprehensive(): + # Tests everything at once + pass + +def test_tensor_arithmetic_comprehensive(): + # Tests everything at once + pass +``` + +## Improved Structure (EXCELLENT) + +```python +# Improved tensor_dev.py structure (EXCELLENT) +# Following: Explain → Code → Test → Repeat + +# %% [markdown] +""" +## Step 1: What is a Tensor? + +### Definition +A **tensor** is an N-dimensional array with ML-specific operations. 
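Before explaining why tensors matter, one concrete NumPy-only illustration of the rank/shape idea may help (a sketch for intuition; it uses plain `np.ndarray`, not the `Tensor` class being built in this guide):

```python
import numpy as np

# Rank = number of dimensions; shape = size along each dimension
scalar = np.array(5.0)                # rank 0, shape ()
vector = np.array([1, 2, 3])          # rank 1, shape (3,)
matrix = np.array([[1, 2], [3, 4]])   # rank 2, shape (2, 2)

print(scalar.ndim, scalar.shape)  # 0 ()
print(vector.ndim, vector.shape)  # 1 (3,)
print(matrix.ndim, matrix.shape)  # 2 (2, 2)
```

The same idea extends to rank 3 and beyond (e.g. image batches), which is exactly what the `Tensor` class will wrap.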
+ +### Why Tensors Matter +- **Foundation**: Every ML framework uses tensors +- **Efficiency**: Vectorized operations are faster +- **Flexibility**: Same operations work on scalars, vectors, matrices + +### Real-World Examples +```python +# Scalar (0D): A single number +temperature = Tensor(25.0) + +# Vector (1D): A list of numbers +rgb_color = Tensor([255, 128, 0]) + +# Matrix (2D): Image pixels +image = Tensor([[100, 150], [200, 250]]) +``` + +Let's build this step by step! +""" + +# %% [markdown] +""" +## Step 1A: Tensor Creation + +### The Foundation Operation +Creating tensors is the first thing you'll do in any ML system. Our Tensor class needs to: +1. Accept various input types (lists, numpy arrays, scalars) +2. Store data efficiently +3. Track shape and type information +""" + +# %% nbgrader={"grade": false, "grade_id": "tensor-creation", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +class Tensor: + def __init__(self, data: Union[int, float, List, np.ndarray], dtype: Optional[str] = None): + """ + Create a tensor from various input types. + + TODO: Implement tensor creation with proper data handling. + + STEP-BY-STEP IMPLEMENTATION: + 1. Convert input data to numpy array using np.array() + 2. Handle dtype conversion if specified + 3. Store the numpy array in self._data + 4. Validate that data is numeric (not strings, objects, etc.) 
+ + EXAMPLE USAGE: + ```python + # From scalar + t1 = Tensor(5.0) + + # From list + t2 = Tensor([1, 2, 3]) + + # From nested list (matrix) + t3 = Tensor([[1, 2], [3, 4]]) + ``` + + IMPLEMENTATION HINTS: + - Use np.array(data) to convert input + - Check dtype parameter: if provided, use np.array(data, dtype=dtype) + - Validate: ensure data is numeric (int, float, complex) + - Store in self._data for internal use + + LEARNING CONNECTIONS: + - This is like torch.tensor() in PyTorch + - Similar to tf.constant() in TensorFlow + - Foundation for all other tensor operations + """ + ### BEGIN SOLUTION + if dtype is not None: + self._data = np.array(data, dtype=dtype) + else: + self._data = np.array(data) + + # Validate numeric data + if not np.issubdtype(self._data.dtype, np.number): + raise ValueError(f"Tensor data must be numeric, got {self._data.dtype}") + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Test Your Tensor Creation + +Once you implement the `__init__` method above, run this cell to test it: +""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-creation", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +def test_tensor_creation(): + """Test tensor creation with various input types""" + print("Testing tensor creation...") + + # Test scalar creation + t1 = Tensor(5.0) + assert t1._data.shape == (), "Scalar tensor should have empty shape" + assert t1._data.item() == 5.0, "Scalar value should be 5.0" + + # Test list creation + t2 = Tensor([1, 2, 3]) + assert t2._data.shape == (3,), "1D tensor should have shape (3,)" + assert np.array_equal(t2._data, [1, 2, 3]), "1D tensor values should match" + + # Test matrix creation + t3 = Tensor([[1, 2], [3, 4]]) + assert t3._data.shape == (2, 2), "2D tensor should have shape (2, 2)" + assert np.array_equal(t3._data, [[1, 2], [3, 4]]), "2D tensor values should match" + + # Test dtype specification + t4 = Tensor([1, 2, 3], dtype='float32') + assert t4._data.dtype == np.float32, 
"Specified dtype should be respected" + + print("✅ Tensor creation tests passed!") + print(f"✅ Created tensors: scalar, vector, matrix") + print(f"✅ Handled data types correctly") + +# Run the test +test_tensor_creation() + +# %% [markdown] +""" +## Step 1B: Tensor Properties + +### Essential Information Access +Every tensor needs to provide basic information about itself: +- **Shape**: Dimensions of the tensor +- **Size**: Total number of elements +- **Data access**: Get the underlying data + +### Why Properties Matter +- **Debugging**: Quickly see tensor dimensions +- **Validation**: Check compatibility for operations +- **Integration**: Interface with other libraries +""" + +# %% nbgrader={"grade": false, "grade_id": "tensor-properties", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export + @property + def data(self) -> np.ndarray: + """ + Get the underlying numpy array data. + + TODO: Implement data property access. + + STEP-BY-STEP IMPLEMENTATION: + 1. Return self._data directly + 2. This gives users access to the numpy array + + EXAMPLE USAGE: + ```python + t = Tensor([[1, 2], [3, 4]]) + print(t.data) # [[1 2] + # [3 4]] + ``` + + IMPLEMENTATION HINTS: + - Simple property: just return self._data + - No validation needed here + - This is like tensor.numpy() in PyTorch + """ + ### BEGIN SOLUTION + return self._data + ### END SOLUTION + + @property + def shape(self) -> Tuple[int, ...]: + """ + Get the shape (dimensions) of the tensor. + + TODO: Implement shape property. + + STEP-BY-STEP IMPLEMENTATION: + 1. Return self._data.shape + 2. 
This gives the dimensions as a tuple + + EXAMPLE USAGE: + ```python + t = Tensor([[1, 2], [3, 4]]) + print(t.shape) # (2, 2) + ``` + + IMPLEMENTATION HINTS: + - NumPy arrays have a .shape attribute + - Return self._data.shape + - This is like tensor.shape in PyTorch + """ + ### BEGIN SOLUTION + return self._data.shape + ### END SOLUTION + + @property + def size(self) -> int: + """ + Get the total number of elements in the tensor. + + TODO: Implement size property. + + STEP-BY-STEP IMPLEMENTATION: + 1. Return self._data.size + 2. This gives total elements across all dimensions + + EXAMPLE USAGE: + ```python + t = Tensor([[1, 2], [3, 4]]) + print(t.size) # 4 (2×2 = 4 elements) + ``` + + IMPLEMENTATION HINTS: + - NumPy arrays have a .size attribute + - Return self._data.size + - This is like tensor.numel() in PyTorch + """ + ### BEGIN SOLUTION + return self._data.size + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Test Your Tensor Properties + +Once you implement the properties above, run this cell to test them: +""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-properties", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +def test_tensor_properties(): + """Test tensor properties: data, shape, size""" + print("Testing tensor properties...") + + # Test scalar properties + t1 = Tensor(5.0) + assert t1.shape == (), "Scalar shape should be empty tuple" + assert t1.size == 1, "Scalar size should be 1" + assert t1.data.item() == 5.0, "Scalar data should be accessible" + + # Test vector properties + t2 = Tensor([1, 2, 3, 4]) + assert t2.shape == (4,), "Vector shape should be (4,)" + assert t2.size == 4, "Vector size should be 4" + assert np.array_equal(t2.data, [1, 2, 3, 4]), "Vector data should match" + + # Test matrix properties + t3 = Tensor([[1, 2, 3], [4, 5, 6]]) + assert t3.shape == (2, 3), "Matrix shape should be (2, 3)" + assert t3.size == 6, "Matrix size should be 6" + assert np.array_equal(t3.data, [[1, 2, 3], [4, 
5, 6]]), "Matrix data should match" + + print("✅ Tensor properties tests passed!") + print(f"✅ Shape, size, and data access working correctly") + +# Run the test +test_tensor_properties() + +# %% [markdown] +""" +## Step 2: Tensor Arithmetic + +### The Heart of ML: Mathematical Operations +Now we implement the core mathematical operations that make ML possible: +- **Addition**: Element-wise addition of tensors +- **Multiplication**: Element-wise multiplication +- **Subtraction**: Element-wise subtraction +- **Division**: Element-wise division + +### Why Arithmetic Matters +- **Neural networks**: Every layer uses tensor arithmetic +- **Optimization**: Gradient updates use arithmetic +- **Data processing**: Normalization, scaling, transformations +""" + +# %% [markdown] +""" +## Step 2A: Tensor Addition + +### The Foundation Operation +Addition is the most basic and important tensor operation: +- **Element-wise**: Each element adds to corresponding element +- **Broadcasting**: Smaller tensors can add to larger ones +- **Commutative**: a + b = b + a +""" + +# %% nbgrader={"grade": false, "grade_id": "tensor-addition", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export + def add(self, other: 'Tensor') -> 'Tensor': + """ + Add two tensors element-wise. + + TODO: Implement tensor addition. + + STEP-BY-STEP IMPLEMENTATION: + 1. Get the numpy data from both tensors + 2. Use numpy's + operator for element-wise addition + 3. Create a new Tensor with the result + 4. 
Return the new tensor + + EXAMPLE USAGE: + ```python + t1 = Tensor([[1, 2], [3, 4]]) + t2 = Tensor([[5, 6], [7, 8]]) + result = t1.add(t2) + print(result.data) # [[6, 8], [10, 12]] + ``` + + IMPLEMENTATION HINTS: + - Use self._data + other._data for numpy addition + - Wrap result in new Tensor: return Tensor(result) + - NumPy handles broadcasting automatically + - This is like torch.add() in PyTorch + + LEARNING CONNECTIONS: + - This is used in every neural network layer + - Gradient descent updates parameters: params = params - learning_rate * gradients + - Data preprocessing: adding bias, normalization + """ + ### BEGIN SOLUTION + result = self._data + other._data + return Tensor(result) + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Test Your Tensor Addition + +Once you implement the `add` method above, run this cell to test it: +""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-addition", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} +def test_tensor_addition(): + """Test tensor addition with various shapes""" + print("Testing tensor addition...") + + # Test same-shape addition + t1 = Tensor([[1, 2], [3, 4]]) + t2 = Tensor([[5, 6], [7, 8]]) + result = t1.add(t2) + expected = np.array([[6, 8], [10, 12]]) + assert np.array_equal(result.data, expected), "Same-shape addition failed" + + # Test scalar addition (broadcasting) + t3 = Tensor([[1, 2], [3, 4]]) + t4 = Tensor(10) + result = t3.add(t4) + expected = np.array([[11, 12], [13, 14]]) + assert np.array_equal(result.data, expected), "Scalar addition failed" + + # Test vector addition (broadcasting) + t5 = Tensor([[1, 2], [3, 4]]) + t6 = Tensor([10, 20]) + result = t5.add(t6) + expected = np.array([[11, 22], [13, 24]]) + assert np.array_equal(result.data, expected), "Vector addition failed" + + print("✅ Tensor addition tests passed!") + print(f"✅ Same-shape, scalar, and vector addition working") + +# Run the test +test_tensor_addition() + +# %% [markdown] +""" +## 
Step 2B: Tensor Multiplication + +### Scaling and Element-wise Products +Multiplication is crucial for scaling values and computing element-wise products: +- **Element-wise**: Each element multiplies with corresponding element +- **Broadcasting**: Works with different shapes +- **Commutative**: a * b = b * a +""" + +# %% nbgrader={"grade": false, "grade_id": "tensor-multiplication", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export + def multiply(self, other: 'Tensor') -> 'Tensor': + """ + Multiply two tensors element-wise. + + TODO: Implement tensor multiplication. + + STEP-BY-STEP IMPLEMENTATION: + 1. Get the numpy data from both tensors + 2. Use numpy's * operator for element-wise multiplication + 3. Create a new Tensor with the result + 4. Return the new tensor + + EXAMPLE USAGE: + ```python + t1 = Tensor([[1, 2], [3, 4]]) + t2 = Tensor([[2, 3], [4, 5]]) + result = t1.multiply(t2) + print(result.data) # [[2, 6], [12, 20]] + ``` + + IMPLEMENTATION HINTS: + - Use self._data * other._data for numpy multiplication + - Wrap result in new Tensor: return Tensor(result) + - NumPy handles broadcasting automatically + - This is like torch.mul() in PyTorch + + LEARNING CONNECTIONS: + - Used in activation functions: ReLU masks + - Attention mechanisms: attention weights * values + - Scaling: learning_rate * gradients + """ + ### BEGIN SOLUTION + result = self._data * other._data + return Tensor(result) + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Test Your Tensor Multiplication + +Once you implement the `multiply` method above, run this cell to test it: +""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-multiplication", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} +def test_tensor_multiplication(): + """Test tensor multiplication with various shapes""" + print("Testing tensor multiplication...") + + # Test same-shape multiplication + t1 = Tensor([[1, 2], [3, 4]]) + t2 = Tensor([[2, 3], 
[4, 5]]) + result = t1.multiply(t2) + expected = np.array([[2, 6], [12, 20]]) + assert np.array_equal(result.data, expected), "Same-shape multiplication failed" + + # Test scalar multiplication (broadcasting) + t3 = Tensor([[1, 2], [3, 4]]) + t4 = Tensor(2) + result = t3.multiply(t4) + expected = np.array([[2, 4], [6, 8]]) + assert np.array_equal(result.data, expected), "Scalar multiplication failed" + + # Test vector multiplication (broadcasting) + t5 = Tensor([[1, 2], [3, 4]]) + t6 = Tensor([2, 3]) + result = t5.multiply(t6) + expected = np.array([[2, 6], [6, 12]]) + assert np.array_equal(result.data, expected), "Vector multiplication failed" + + print("✅ Tensor multiplication tests passed!") + print(f"✅ Same-shape, scalar, and vector multiplication working") + +# Run the test +test_tensor_multiplication() + +# %% [markdown] +""" +## 🎯 Step 3: Integration Test + +### Putting It All Together +Now let's test that all our tensor operations work together in realistic scenarios: +""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-integration", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} +def test_tensor_integration(): + """Test complete tensor functionality together""" + print("Testing tensor integration...") + + # Create test tensors + t1 = Tensor([[1, 2], [3, 4]]) + t2 = Tensor([[2, 1], [1, 2]]) + scalar = Tensor(0.5) + + # Test chained operations + result = t1.add(t2).multiply(scalar) + expected = np.array([[1.5, 1.5], [2.0, 3.0]]) + assert np.array_equal(result.data, expected), "Chained operations failed" + + # Test properties after operations + assert result.shape == (2, 2), "Result shape should be (2, 2)" + assert result.size == 4, "Result size should be 4" + + # Test with different shapes (broadcasting) + t3 = Tensor([1, 2, 3]) + t4 = Tensor([[1], [2], [3]]) + result = t3.add(t4) + assert result.shape == (3, 3), "Broadcasting result should be (3, 3)" + + print("✅ Tensor integration tests passed!") + print(f"✅ 
All tensor operations work together correctly") + print(f"✅ Ready to build neural networks!") + +# Run the integration test +test_tensor_integration() + +# %% [markdown] +""" +## 🎯 Module Summary: Tensor Mastery Achieved! + +Congratulations! You've successfully implemented the core Tensor class with: + +### ✅ What You've Built +- **Tensor Creation**: Handle various input types (scalars, lists, arrays) +- **Properties**: Access shape, size, and data efficiently +- **Arithmetic**: Add and multiply tensors with broadcasting support +- **Integration**: Operations work together seamlessly + +### ✅ Key Learning Outcomes +- **Understanding**: Tensors as the foundation of ML systems +- **Implementation**: Built tensor operations from scratch +- **Testing**: Comprehensive validation at each step +- **Integration**: Chained operations for complex computations + +### ✅ Ready for Next Steps +Your tensor implementation is now ready to power: +- **Activations**: ReLU, Sigmoid, Tanh will operate on your tensors +- **Layers**: Dense layers will use tensor arithmetic +- **Networks**: Complete neural networks built on your foundation + +**Next Module**: Activations - Adding nonlinearity to enable complex learning! +""" +``` + +## Key Improvements Demonstrated + +### 1. **Progressive Structure** +- Each concept is explained, implemented, and tested before moving on +- Students get immediate feedback after each step +- No overwhelming amount of code without validation + +### 2. **Rich Scaffolding** +- Every TODO has step-by-step implementation guidance +- Example usage shows exactly what the function should do +- Implementation hints provide specific technical guidance +- Learning connections show how concepts fit together + +### 3. **Immediate Testing** +- Each function is tested immediately after implementation +- Tests provide clear success messages and specific achievements +- Integration tests show how concepts work together + +### 4. 
**Educational Flow** +- Concepts build logically from simple to complex +- Real-world motivation before technical implementation +- Visual examples and concrete cases before abstract theory + +## Implementation Steps for Other Modules + +1. **Identify natural breakpoints** in the current module +2. **Reorganize** into Step 1, Step 2, etc. with explanations +3. **Add rich TODO blocks** with step-by-step guidance +4. **Insert immediate testing** after each major concept +5. **Add success messages** and progress indicators +6. **Include learning connections** between concepts + +This transformation turns modules from reference material into guided learning experiences that maximize student success through immediate feedback and clear progression. \ No newline at end of file diff --git a/docs/development/testing-design.md b/docs/development/testing-design.md index 21814ace..fe5a3269 100644 --- a/docs/development/testing-design.md +++ b/docs/development/testing-design.md @@ -283,7 +283,7 @@ class TestBasicMLPipeline: ### Test Organization ``` modules/source/{module}/{module}_dev.py # Implementation + comprehensive inline tests -tests/test_{module}.py # Module tests with mocks (for grading) +tests/test_{module}.py # Package tests for exported functionality tests/integration/ # Cross-module tests with vetted solutions ``` diff --git a/modules/source/00_setup/README.md b/modules/source/00_setup/README.md index bef90efe..8d05c48c 100644 --- a/modules/source/00_setup/README.md +++ b/modules/source/00_setup/README.md @@ -77,7 +77,7 @@ Run the comprehensive test suite using pytest: tito test --module setup # Or directly with pytest -python -m pytest modules/setup/tests/test_setup.py -v +python -m pytest tests/test_setup.py -v ``` ### Test Coverage diff --git a/modules/source/01_tensor/tensor_dev_backup.py b/modules/source/01_tensor/tensor_dev_backup.py new file mode 100644 index 00000000..671aaf3a --- /dev/null +++ b/modules/source/01_tensor/tensor_dev_backup.py @@ -0,0 
+1,1536 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.17.1 +# --- + +# %% [markdown] +""" +# Module 1: Tensor - Core Data Structure + +Welcome to the Tensor module! This is where TinyTorch really begins. You'll implement the fundamental data structure that powers all ML systems. + +## Learning Goals +- Understand tensors as N-dimensional arrays with ML-specific operations +- Implement a complete Tensor class with arithmetic operations +- Handle shape management, data types, and memory layout +- Build the foundation for neural networks and automatic differentiation +- Master the NBGrader workflow with comprehensive testing + +## Build → Use → Understand +1. **Build**: Create the Tensor class with core operations +2. **Use**: Perform tensor arithmetic and transformations +3. **Understand**: How tensors form the foundation of ML systems +""" + +# %% nbgrader={"grade": false, "grade_id": "tensor-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| default_exp core.tensor + +#| export +import numpy as np +import sys +from typing import Union, List, Tuple, Optional, Any + +# %% nbgrader={"grade": false, "grade_id": "tensor-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} +print("🔥 TinyTorch Tensor Module") +print(f"NumPy version: {np.__version__}") +print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") +print("Ready to build tensors!") + +# %% [markdown] +""" +## 📦 Where This Code Lives in the Final Package + +**Learning Side:** You work in `modules/source/01_tensor/tensor_dev.py` +**Building Side:** Code exports to `tinytorch.core.tensor` + +```python +# Final package structure: +from tinytorch.core.tensor import Tensor # The foundation of everything! 
+from tinytorch.core.activations import ReLU, Sigmoid, Tanh +from tinytorch.core.layers import Dense, Conv2D +``` + +**Why this matters:** +- **Learning:** Focused modules for deep understanding +- **Production:** Proper organization like PyTorch's `torch.Tensor` +- **Consistency:** All tensor operations live together in `core.tensor` +- **Foundation:** Every other module depends on Tensor +""" + +# %% [markdown] +""" +## Step 1: What is a Tensor? + +### Definition +A **tensor** is an N-dimensional array with ML-specific operations. Think of it as a container that can hold data in multiple dimensions: + +- **Scalar** (0D): A single number - `5.0` +- **Vector** (1D): A list of numbers - `[1, 2, 3]` +- **Matrix** (2D): A 2D array - `[[1, 2], [3, 4]]` +- **Higher dimensions**: 3D, 4D, etc. for images, video, batches + +### The Mathematical Foundation: From Scalars to Tensors +Understanding tensors requires building from mathematical fundamentals: + +#### **Scalars (Rank 0)** +- **Definition**: A single number with no direction +- **Examples**: Temperature (25°C), mass (5.2 kg), probability (0.7) +- **Operations**: Addition, multiplication, comparison +- **ML Context**: Loss values, learning rates, regularization parameters + +#### **Vectors (Rank 1)** +- **Definition**: An ordered list of numbers with direction and magnitude +- **Examples**: Position [x, y, z], RGB color [255, 128, 0], word embedding [0.1, -0.5, 0.8] +- **Operations**: Dot product, cross product, norm calculation +- **ML Context**: Feature vectors, gradients, model parameters + +#### **Matrices (Rank 2)** +- **Definition**: A 2D array organizing data in rows and columns +- **Examples**: Image (height × width), weight matrix (input × output), covariance matrix +- **Operations**: Matrix multiplication, transpose, inverse, eigendecomposition +- **ML Context**: Linear layer weights, attention matrices, batch data + +#### **Higher-Order Tensors (Rank 3+)** +- **Definition**: Multi-dimensional arrays 
extending matrices +- **Examples**: + - **3D**: Video frames (time × height × width), RGB images (height × width × channels) + - **4D**: Image batches (batch × height × width × channels) + - **5D**: Video batches (batch × time × height × width × channels) +- **Operations**: Tensor products, contractions, decompositions +- **ML Context**: Convolutional features, RNN states, transformer attention + +### Why Tensors Matter in ML: The Computational Foundation + +#### **1. Unified Data Representation** +Tensors provide a consistent way to represent all ML data: +```python +# All of these are tensors with different shapes +scalar_loss = Tensor(0.5) # Shape: () +feature_vector = Tensor([1, 2, 3]) # Shape: (3,) +weight_matrix = Tensor([[1, 2], [3, 4]]) # Shape: (2, 2) +image_batch = Tensor(np.random.rand(32, 224, 224, 3)) # Shape: (32, 224, 224, 3) +``` + +#### **2. Efficient Batch Processing** +ML systems process multiple samples simultaneously: +```python +# Instead of processing one image at a time: +for image in images: + result = model(image) # Slow: 1000 separate operations + +# Process entire batch at once: +batch_result = model(image_batch) # Fast: 1 vectorized operation +``` + +#### **3. Hardware Acceleration** +Modern hardware (GPUs, TPUs) excels at tensor operations: +- **Parallel processing**: Multiple operations simultaneously +- **Vectorization**: SIMD (Single Instruction, Multiple Data) operations +- **Memory optimization**: Contiguous memory layout for cache efficiency + +#### **4. 
Automatic Differentiation** +Tensors enable gradient computation through computational graphs: +```python +# Each tensor operation creates a node in the computation graph +x = Tensor([1, 2, 3]) +y = x * 2 # Node: multiplication +z = y + 1 # Node: addition +loss = z.sum() # Node: summation +# Gradients flow backward through this graph +``` + +### Real-World Examples: Tensors in Action + +#### **Computer Vision** +- **Grayscale image**: 2D tensor `(height, width)` - `(28, 28)` for MNIST +- **Color image**: 3D tensor `(height, width, channels)` - `(224, 224, 3)` for RGB +- **Image batch**: 4D tensor `(batch, height, width, channels)` - `(32, 224, 224, 3)` +- **Video**: 5D tensor `(batch, time, height, width, channels)` + +#### **Natural Language Processing** +- **Word embedding**: 1D tensor `(embedding_dim,)` - `(300,)` for Word2Vec +- **Sentence**: 2D tensor `(sequence_length, embedding_dim)` - `(50, 768)` for BERT +- **Batch of sentences**: 3D tensor `(batch, sequence_length, embedding_dim)` + +#### **Audio Processing** +- **Audio signal**: 1D tensor `(time_steps,)` - `(16000,)` for 1 second at 16kHz +- **Spectrogram**: 2D tensor `(time_frames, frequency_bins)` +- **Batch of audio**: 3D tensor `(batch, time_steps, features)` + +#### **Time Series** +- **Single series**: 2D tensor `(time_steps, features)` +- **Multiple series**: 3D tensor `(batch, time_steps, features)` +- **Multivariate forecasting**: 4D tensor `(batch, time_steps, features, predictions)` + +### Why Not Just Use NumPy? + +While we use NumPy internally, our Tensor class adds ML-specific functionality: + +#### **1. ML-Specific Operations** +- **Gradient tracking**: For automatic differentiation (coming in Module 7) +- **GPU support**: For hardware acceleration (future extension) +- **Broadcasting semantics**: ML-friendly dimension handling + +#### **2. 
Consistent API** +- **Type safety**: Predictable behavior across operations +- **Error checking**: Clear error messages for debugging +- **Integration**: Seamless work with other TinyTorch components + +#### **3. Educational Value** +- **Conceptual clarity**: Understand what tensors really are +- **Implementation insight**: See how frameworks work internally +- **Debugging skills**: Trace through tensor operations step by step + +#### **4. Extensibility** +- **Future features**: Ready for gradients, GPU, distributed computing +- **Customization**: Add domain-specific operations +- **Optimization**: Profile and optimize specific use cases + +### Performance Considerations: Building Efficient Tensors + +#### **Memory Layout** +- **Contiguous arrays**: Better cache locality and performance +- **Data types**: `float32` vs `float64` trade-offs +- **Memory sharing**: Avoid unnecessary copies + +#### **Vectorization** +- **SIMD operations**: Single Instruction, Multiple Data +- **Broadcasting**: Efficient operations on different shapes +- **Batch operations**: Process multiple samples simultaneously + +#### **Numerical Stability** +- **Precision**: Balancing speed and accuracy +- **Overflow/underflow**: Handling extreme values +- **Gradient flow**: Maintaining numerical stability for training + +Let's start building our tensor foundation! 
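Before we do, the precision and memory-layout claims above are easy to verify with plain NumPy (the library our Tensor will wrap). A quick, self-contained check:

```python
import numpy as np

# float32 halves memory relative to float64 -- the precision trade-off above
a64 = np.ones((1000, 1000), dtype=np.float64)
a32 = a64.astype(np.float32)
print(a64.nbytes, a32.nbytes)  # 8000000 4000000

# Contiguous (row-major) layout is what makes cache-friendly access possible
print(a32.flags['C_CONTIGUOUS'])    # True
print(a32.T.flags['C_CONTIGUOUS'])  # False: the transpose is a strided view
```

This is why frameworks default to `float32` for training: half the memory and bandwidth of `float64`, with precision that is almost always sufficient for gradient-based optimization.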
+""" + +# %% [markdown] +""" +## 🧠 The Mathematical Foundation + +### Linear Algebra Refresher +Tensors are generalizations of scalars, vectors, and matrices: + +``` +Scalar (0D): 5 +Vector (1D): [1, 2, 3] +Matrix (2D): [[1, 2], [3, 4]] +Tensor (3D): [[[1, 2], [3, 4]], [[5, 6], [7, 8]]] +``` + +### Why This Matters for Neural Networks +- **Forward Pass**: Matrix multiplication between layers +- **Batch Processing**: Multiple samples processed simultaneously +- **Convolutions**: 3D operations on image data +- **Gradients**: Derivatives computed across all dimensions + +### Connection to Real ML Systems +Every major ML framework uses tensors: +- **PyTorch**: `torch.Tensor` +- **TensorFlow**: `tf.Tensor` +- **JAX**: `jax.numpy.ndarray` +- **TinyTorch**: `tinytorch.core.tensor.Tensor` (what we're building!) + +### Performance Considerations +- **Memory Layout**: Contiguous arrays for cache efficiency +- **Vectorization**: SIMD operations for speed +- **Broadcasting**: Efficient operations on different shapes +- **Type Consistency**: Avoiding unnecessary conversions +""" + +# %% [markdown] +""" +## Step 2: The Tensor Class Foundation + +### Core Concept: Wrapping NumPy with ML Intelligence +Our Tensor class wraps NumPy arrays with ML-specific functionality. This design pattern is used by all major ML frameworks: + +- **PyTorch**: `torch.Tensor` wraps ATen (C++ tensor library) +- **TensorFlow**: `tf.Tensor` wraps Eigen (C++ linear algebra library) +- **JAX**: `jax.numpy.ndarray` wraps XLA (Google's linear algebra compiler) +- **TinyTorch**: `Tensor` wraps NumPy (Python's numerical computing library) + +### Design Requirements Analysis + +#### **1. 
Input Flexibility** +Our tensor must handle diverse input types: +```python +# Scalars (Python numbers) +t1 = Tensor(5) # int → numpy array +t2 = Tensor(3.14) # float → numpy array + +# Lists (Python sequences) +t3 = Tensor([1, 2, 3]) # list → numpy array +t4 = Tensor([[1, 2], [3, 4]]) # nested list → 2D array + +# NumPy arrays (existing arrays) +t5 = Tensor(np.array([1, 2, 3])) # array → tensor wrapper +``` + +#### **2. Type Management** +ML systems need consistent, predictable types: +- **Default behavior**: Auto-detect appropriate types +- **Explicit control**: Allow manual type specification +- **Performance optimization**: Prefer `float32` over `float64` +- **Memory efficiency**: Use appropriate precision + +#### **3. Property Access** +Essential tensor properties for ML operations: +- **Shape**: Dimensions for compatibility checking +- **Size**: Total elements for memory estimation +- **Data type**: For numerical computation planning +- **Data access**: For integration with other libraries + +#### **4. Arithmetic Operations** +Support for mathematical operations: +- **Element-wise**: Addition, multiplication, subtraction, division +- **Broadcasting**: Operations on different shapes +- **Type promotion**: Consistent result types +- **Error handling**: Clear messages for incompatible operations + +### Implementation Strategy + +#### **Memory Management** +- **Copy vs. Reference**: When to copy data vs. 
share memory +- **Type conversion**: Efficient dtype changes +- **Contiguous layout**: Ensure optimal memory access patterns + +#### **Error Handling** +- **Input validation**: Check for valid input types +- **Shape compatibility**: Verify operations are mathematically valid +- **Informative messages**: Help users debug issues quickly + +#### **Performance Optimization** +- **Lazy evaluation**: Defer expensive operations when possible +- **Vectorization**: Use NumPy's optimized operations +- **Memory reuse**: Minimize unnecessary allocations + +### Learning Objectives for Implementation + +By implementing this Tensor class, you'll learn: +1. **Wrapper pattern**: How to extend existing libraries +2. **Type system design**: Managing data types in numerical computing +3. **API design**: Creating intuitive, consistent interfaces +4. **Performance considerations**: Balancing flexibility and speed +5. **Error handling**: Providing helpful feedback to users + +Let's implement our tensor foundation! +""" + +# %% nbgrader={"grade": false, "grade_id": "tensor-class", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +class Tensor: + """ + TinyTorch Tensor: N-dimensional array with ML operations. + + The fundamental data structure for all TinyTorch operations. + Wraps NumPy arrays with ML-specific functionality. + """ + + def __init__(self, data: Union[int, float, List, np.ndarray], dtype: Optional[str] = None): + """ + Create a new tensor from data. + + Args: + data: Input data (scalar, list, or numpy array) + dtype: Data type ('float32', 'int32', etc.). Defaults to auto-detect. + + TODO: Implement tensor creation with proper type handling. + + STEP-BY-STEP: + 1. Check if data is a scalar (int/float) - convert to numpy array + 2. Check if data is a list - convert to numpy array + 3. Check if data is already a numpy array - use as-is + 4. Apply dtype conversion if specified + 5. 
Store the result in self._data + + EXAMPLE: + Tensor(5) → stores np.array(5) + Tensor([1, 2, 3]) → stores np.array([1, 2, 3]) + Tensor(np.array([1, 2, 3])) → stores the array directly + + HINTS: + - Use isinstance() to check data types + - Use np.array() for conversion + - Handle dtype parameter for type conversion + - Store the array in self._data + """ + ### BEGIN SOLUTION + # Convert input to numpy array + if isinstance(data, (int, float, np.number)): + # Handle Python and NumPy scalars + if dtype is None: + # Auto-detect type: int for integers, float32 for floats + if isinstance(data, int) or (isinstance(data, np.number) and np.issubdtype(type(data), np.integer)): + dtype = 'int32' + else: + dtype = 'float32' + self._data = np.array(data, dtype=dtype) + elif isinstance(data, list): + # Let NumPy auto-detect type, then convert if needed + temp_array = np.array(data) + if dtype is None: + # Use NumPy's auto-detected type, but prefer float32 for floats + if temp_array.dtype == np.float64: + dtype = 'float32' + else: + dtype = str(temp_array.dtype) + self._data = np.array(data, dtype=dtype) + elif isinstance(data, np.ndarray): + # Already a numpy array + if dtype is None: + # Keep existing dtype, but prefer float32 for float64 + if data.dtype == np.float64: + dtype = 'float32' + else: + dtype = str(data.dtype) + self._data = data.astype(dtype) if dtype != data.dtype else data.copy() + else: + # Try to convert unknown types + self._data = np.array(data, dtype=dtype) + ### END SOLUTION + + @property + def data(self) -> np.ndarray: + """ + Access underlying numpy array. + + TODO: Return the stored numpy array. + + HINT: Return self._data (the array you stored in __init__) + """ + ### BEGIN SOLUTION + return self._data + ### END SOLUTION + + @property + def shape(self) -> Tuple[int, ...]: + """ + Get tensor shape. + + TODO: Return the shape of the stored numpy array. 
+ + HINT: Use .shape attribute of the numpy array + EXAMPLE: Tensor([1, 2, 3]).shape should return (3,) + """ + ### BEGIN SOLUTION + return self._data.shape + ### END SOLUTION + + @property + def size(self) -> int: + """ + Get total number of elements. + + TODO: Return the total number of elements in the tensor. + + HINT: Use .size attribute of the numpy array + EXAMPLE: Tensor([1, 2, 3]).size should return 3 + """ + ### BEGIN SOLUTION + return self._data.size + ### END SOLUTION + + @property + def dtype(self) -> np.dtype: + """ + Get data type as numpy dtype. + + TODO: Return the data type of the stored numpy array. + + HINT: Use .dtype attribute of the numpy array + EXAMPLE: Tensor([1, 2, 3]).dtype should return dtype('int32') + """ + ### BEGIN SOLUTION + return self._data.dtype + ### END SOLUTION + + def __repr__(self) -> str: + """ + String representation. + + TODO: Create a clear string representation of the tensor. + + APPROACH: + 1. Convert the numpy array to a list for readable output + 2. Include the shape and dtype information + 3. Format: "Tensor([data], shape=shape, dtype=dtype)" + + EXAMPLE: + Tensor([1, 2, 3]) → "Tensor([1, 2, 3], shape=(3,), dtype=int32)" + + HINTS: + - Use .tolist() to convert numpy array to list + - Include shape and dtype information + - Keep format consistent and readable + """ + ### BEGIN SOLUTION + return f"Tensor({self._data.tolist()}, shape={self.shape}, dtype={self.dtype})" + ### END SOLUTION + + def add(self, other: 'Tensor') -> 'Tensor': + """ + Add two tensors element-wise. + + TODO: Implement tensor addition. + + APPROACH: + 1. Add the numpy arrays using + + 2. Return a new Tensor with the result + 3. 
Handle broadcasting automatically + + EXAMPLE: + Tensor([1, 2]) + Tensor([3, 4]) → Tensor([4, 6]) + + HINTS: + - Use self._data + other._data + - Return Tensor(result) + - NumPy handles broadcasting automatically + """ + ### BEGIN SOLUTION + result = self._data + other._data + return Tensor(result) + ### END SOLUTION + + def multiply(self, other: 'Tensor') -> 'Tensor': + """ + Multiply two tensors element-wise. + + TODO: Implement tensor multiplication. + + APPROACH: + 1. Multiply the numpy arrays using * + 2. Return a new Tensor with the result + 3. Handle broadcasting automatically + + EXAMPLE: + Tensor([1, 2]) * Tensor([3, 4]) → Tensor([3, 8]) + + HINTS: + - Use self._data * other._data + - Return Tensor(result) + - This is element-wise, not matrix multiplication + """ + ### BEGIN SOLUTION + result = self._data * other._data + return Tensor(result) + ### END SOLUTION + + def __add__(self, other: Union['Tensor', int, float]) -> 'Tensor': + """ + Addition operator: tensor + other + + TODO: Implement + operator for tensors. + + APPROACH: + 1. If other is a Tensor, use tensor addition + 2. If other is a scalar, convert to Tensor first + 3. Return the result + + EXAMPLE: + Tensor([1, 2]) + Tensor([3, 4]) → Tensor([4, 6]) + Tensor([1, 2]) + 5 → Tensor([6, 7]) + """ + ### BEGIN SOLUTION + if isinstance(other, Tensor): + return self.add(other) + else: + return self.add(Tensor(other)) + ### END SOLUTION + + def __mul__(self, other: Union['Tensor', int, float]) -> 'Tensor': + """ + Multiplication operator: tensor * other + + TODO: Implement * operator for tensors. + + APPROACH: + 1. If other is a Tensor, use tensor multiplication + 2. If other is a scalar, convert to Tensor first + 3. 
Return the result + + EXAMPLE: + Tensor([1, 2]) * Tensor([3, 4]) → Tensor([3, 8]) + Tensor([1, 2]) * 3 → Tensor([3, 6]) + """ + ### BEGIN SOLUTION + if isinstance(other, Tensor): + return self.multiply(other) + else: + return self.multiply(Tensor(other)) + ### END SOLUTION + + def __sub__(self, other: Union['Tensor', int, float]) -> 'Tensor': + """ + Subtraction operator: tensor - other + + TODO: Implement - operator for tensors. + + APPROACH: + 1. Convert other to Tensor if needed + 2. Subtract using numpy arrays + 3. Return new Tensor with result + + EXAMPLE: + Tensor([5, 6]) - Tensor([1, 2]) → Tensor([4, 4]) + Tensor([5, 6]) - 1 → Tensor([4, 5]) + """ + ### BEGIN SOLUTION + if isinstance(other, Tensor): + result = self._data - other._data + else: + result = self._data - other + return Tensor(result) + ### END SOLUTION + + def __truediv__(self, other: Union['Tensor', int, float]) -> 'Tensor': + """ + Division operator: tensor / other + + TODO: Implement / operator for tensors. + + APPROACH: + 1. Convert other to Tensor if needed + 2. Divide using numpy arrays + 3. Return new Tensor with result + + EXAMPLE: + Tensor([6, 8]) / Tensor([2, 4]) → Tensor([3, 2]) + Tensor([6, 8]) / 2 → Tensor([3, 4]) + """ + ### BEGIN SOLUTION + if isinstance(other, Tensor): + result = self._data / other._data + else: + result = self._data / other + return Tensor(result) + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Unit Test: Tensor Creation + +Let's test your tensor creation implementation right away! This gives you immediate feedback on whether your `__init__` method works correctly. + +**This is a unit test** - it tests one specific function (tensor creation) in isolation. 
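The conversion rules your `__init__` leans on come straight from `np.array`. A quick sanity check of the NumPy behavior (illustration only, not graded):

```python
import numpy as np

# Shapes: scalars become 0-d arrays, nested lists become matrices
print(np.array(5).shape)                 # ()
print(np.array([1, 2, 3]).shape)         # (3,)
print(np.array([[1, 2], [3, 4]]).shape)  # (2, 2)

# dtype auto-detection: a single float promotes the whole array
print(np.array([1, 2, 3]).dtype.kind)    # 'i' (integer)
print(np.array([1, 2.5, 3]).dtype)       # float64 -- which Tensor downcasts to float32
```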
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-creation-immediate", "locked": true, "points": 5, "schema_version": 3, "solution": false, "task": false} +# Test tensor creation immediately after implementation +print("🔬 Unit Test: Tensor Creation...") + +# Test basic tensor creation +try: + # Test scalar + scalar = Tensor(5.0) + assert hasattr(scalar, '_data'), "Tensor should have _data attribute" + assert scalar._data.shape == (), f"Scalar should have shape (), got {scalar._data.shape}" + print("✅ Scalar creation works") + + # Test vector + vector = Tensor([1, 2, 3]) + assert vector._data.shape == (3,), f"Vector should have shape (3,), got {vector._data.shape}" + print("✅ Vector creation works") + + # Test matrix + matrix = Tensor([[1, 2], [3, 4]]) + assert matrix._data.shape == (2, 2), f"Matrix should have shape (2, 2), got {matrix._data.shape}" + print("✅ Matrix creation works") + + print("📈 Progress: Tensor Creation ✓") + +except Exception as e: + print(f"❌ Tensor creation test failed: {e}") + raise + +print("🎯 Tensor creation behavior:") +print(" Converts data to NumPy arrays") +print(" Preserves shape and data type") +print(" Stores in _data attribute") + +# %% [markdown] +""" +### 🧪 Unit Test: Tensor Properties + +Now let's test that your tensor properties work correctly. This tests the @property methods you implemented. + +**This is a unit test** - it tests specific properties (shape, size, dtype, data) in isolation. 
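Each property under test is a thin wrapper; the underlying NumPy attributes it delegates to behave like this:

```python
import numpy as np

# The attributes that Tensor.shape / .size / .dtype delegate to
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)   # (2, 3)
print(arr.size)    # 6
print(arr.dtype)   # int64 on most platforms (int32 on Windows)

# Invariant the later consistency test relies on: size is the product of the shape
print(arr.size == int(np.prod(arr.shape)))  # True
```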
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-properties-immediate", "locked": true, "points": 5, "schema_version": 3, "solution": false, "task": false} +# Test tensor properties immediately after implementation +print("🔬 Unit Test: Tensor Properties...") + +# Test properties with simple examples +try: + # Test with a simple matrix + tensor = Tensor([[1, 2, 3], [4, 5, 6]]) + + # Test shape property + assert tensor.shape == (2, 3), f"Shape should be (2, 3), got {tensor.shape}" + print("✅ Shape property works") + + # Test size property + assert tensor.size == 6, f"Size should be 6, got {tensor.size}" + print("✅ Size property works") + + # Test data property + assert np.array_equal(tensor.data, np.array([[1, 2, 3], [4, 5, 6]])), "Data property should return numpy array" + print("✅ Data property works") + + # Test dtype property + assert tensor.dtype in [np.int32, np.int64], f"Dtype should be int32 or int64, got {tensor.dtype}" + print("✅ Dtype property works") + + print("📈 Progress: Tensor Properties ✓") + +except Exception as e: + print(f"❌ Tensor properties test failed: {e}") + raise + +print("🎯 Tensor properties behavior:") +print(" shape: Returns tuple of dimensions") +print(" size: Returns total number of elements") +print(" data: Returns underlying NumPy array") +print(" dtype: Returns NumPy data type") + +# %% [markdown] +""" +### 🧪 Unit Test: Tensor Arithmetic + +Let's test your tensor arithmetic operations. This tests the __add__, __mul__, __sub__, __truediv__ methods. + +**This is a unit test** - it tests specific arithmetic operations in isolation. 
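The scalar cases below work because of NumPy broadcasting, which your operators inherit for free. The rules in miniature:

```python
import numpy as np

# A scalar broadcasts against any shape
v = np.array([1, 2, 3])
print(v + 10)            # [11 12 13]

# Size-1 dimensions stretch to match: (2, 1) with (3,) -> (2, 3)
col = np.array([[1], [2]])
print((col + v).shape)   # (2, 3)
print(col + v)           # [[2 3 4]
                         #  [3 4 5]]
```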
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-arithmetic-immediate", "locked": true, "points": 5, "schema_version": 3, "solution": false, "task": false} +# Test tensor arithmetic immediately after implementation +print("🔬 Unit Test: Tensor Arithmetic...") + +# Test basic arithmetic with simple examples +try: + # Test addition + a = Tensor([1, 2, 3]) + b = Tensor([4, 5, 6]) + result = a + b + expected = np.array([5, 7, 9]) + assert np.array_equal(result.data, expected), f"Addition failed: expected {expected}, got {result.data}" + print("✅ Addition works") + + # Test scalar addition + result_scalar = a + 10 + expected_scalar = np.array([11, 12, 13]) + assert np.array_equal(result_scalar.data, expected_scalar), f"Scalar addition failed: expected {expected_scalar}, got {result_scalar.data}" + print("✅ Scalar addition works") + + # Test multiplication + result_mul = a * b + expected_mul = np.array([4, 10, 18]) + assert np.array_equal(result_mul.data, expected_mul), f"Multiplication failed: expected {expected_mul}, got {result_mul.data}" + print("✅ Multiplication works") + + # Test scalar multiplication + result_scalar_mul = a * 2 + expected_scalar_mul = np.array([2, 4, 6]) + assert np.array_equal(result_scalar_mul.data, expected_scalar_mul), f"Scalar multiplication failed: expected {expected_scalar_mul}, got {result_scalar_mul.data}" + print("✅ Scalar multiplication works") + + print("📈 Progress: Tensor Arithmetic ✓") + +except Exception as e: + print(f"❌ Tensor arithmetic test failed: {e}") + raise + +print("🎯 Tensor arithmetic behavior:") +print(" Element-wise operations on tensors") +print(" Broadcasting with scalars") +print(" Returns new Tensor objects") + +# %% [markdown] +""" +### 🧪 Comprehensive Test: Tensor Creation + +Let's thoroughly test your tensor creation to make sure it handles all the cases you'll encounter in ML. +This tests the foundation of everything else we'll build. 
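Two edge cases worth previewing before you run it, shown with plain NumPy (the behavior your Tensor inherits):

```python
import numpy as np

# Empty input is valid: a length-0 vector, not an error
print(np.array([]).shape)              # (0,)

# Extreme magnitudes and mixed signs round-trip cleanly
print(np.array([1e6, 1e-6]).shape)     # (2,)
print(np.array([-1, 0, 1]).tolist())   # [-1, 0, 1]
```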
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-creation-comprehensive", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} +def test_tensor_creation_comprehensive(): + """Comprehensive test of tensor creation with all data types and shapes.""" + print("🔬 Testing comprehensive tensor creation...") + + tests_passed = 0 + total_tests = 8 + + # Test 1: Scalar creation (0D tensor) + try: + scalar_int = Tensor(42) + scalar_float = Tensor(3.14) + scalar_zero = Tensor(0) + + assert hasattr(scalar_int, '_data'), "Tensor should have _data attribute" + assert scalar_int._data.shape == (), f"Scalar should have shape (), got {scalar_int._data.shape}" + assert scalar_float._data.shape == (), f"Float scalar should have shape (), got {scalar_float._data.shape}" + assert scalar_zero._data.shape == (), f"Zero scalar should have shape (), got {scalar_zero._data.shape}" + + print("✅ Scalar creation: integers, floats, and zero") + tests_passed += 1 + except Exception as e: + print(f"❌ Scalar creation failed: {e}") + + # Test 2: Vector creation (1D tensor) + try: + vector_int = Tensor([1, 2, 3, 4, 5]) + vector_float = Tensor([1.0, 2.5, 3.7]) + vector_single = Tensor([42]) + vector_empty = Tensor([]) + + assert vector_int._data.shape == (5,), f"Int vector should have shape (5,), got {vector_int._data.shape}" + assert vector_float._data.shape == (3,), f"Float vector should have shape (3,), got {vector_float._data.shape}" + assert vector_single._data.shape == (1,), f"Single element vector should have shape (1,), got {vector_single._data.shape}" + assert vector_empty._data.shape == (0,), f"Empty vector should have shape (0,), got {vector_empty._data.shape}" + + print("✅ Vector creation: integers, floats, single element, and empty") + tests_passed += 1 + except Exception as e: + print(f"❌ Vector creation failed: {e}") + + # Test 3: Matrix creation (2D tensor) + try: + matrix_2x2 = Tensor([[1, 2], [3, 4]]) + matrix_3x2 = Tensor([[1, 2], [3, 
4], [5, 6]]) + matrix_1x3 = Tensor([[1, 2, 3]]) + + assert matrix_2x2._data.shape == (2, 2), f"2x2 matrix should have shape (2, 2), got {matrix_2x2._data.shape}" + assert matrix_3x2._data.shape == (3, 2), f"3x2 matrix should have shape (3, 2), got {matrix_3x2._data.shape}" + assert matrix_1x3._data.shape == (1, 3), f"1x3 matrix should have shape (1, 3), got {matrix_1x3._data.shape}" + + print("✅ Matrix creation: 2x2, 3x2, and 1x3 matrices") + tests_passed += 1 + except Exception as e: + print(f"❌ Matrix creation failed: {e}") + + # Test 4: Data type handling + try: + int_tensor = Tensor([1, 2, 3]) + float_tensor = Tensor([1.0, 2.0, 3.0]) + mixed_tensor = Tensor([1, 2.5, 3]) # Should convert to float + + # Check that data types are reasonable + assert int_tensor._data.dtype in [np.int32, np.int64], f"Int tensor has unexpected dtype: {int_tensor._data.dtype}" + assert float_tensor._data.dtype in [np.float32, np.float64], f"Float tensor has unexpected dtype: {float_tensor._data.dtype}" + assert mixed_tensor._data.dtype in [np.float32, np.float64], f"Mixed tensor should be float, got: {mixed_tensor._data.dtype}" + + print("✅ Data type handling: integers, floats, and mixed types") + tests_passed += 1 + except Exception as e: + print(f"❌ Data type handling failed: {e}") + + # Test 5: NumPy array input + try: + np_array = np.array([1, 2, 3, 4]) + tensor_from_np = Tensor(np_array) + + assert tensor_from_np._data.shape == (4,), f"Tensor from NumPy should have shape (4,), got {tensor_from_np._data.shape}" + assert np.array_equal(tensor_from_np._data, np_array), "Tensor from NumPy should preserve data" + + print("✅ NumPy array input: conversion works correctly") + tests_passed += 1 + except Exception as e: + print(f"❌ NumPy array input failed: {e}") + + # Test 6: Large tensor creation + try: + large_tensor = Tensor(list(range(1000))) + assert large_tensor._data.shape == (1000,), f"Large tensor should have shape (1000,), got {large_tensor._data.shape}" + assert 
large_tensor._data[0] == 0, "Large tensor should start with 0" + assert large_tensor._data[-1] == 999, "Large tensor should end with 999" + + print("✅ Large tensor creation: 1000 elements") + tests_passed += 1 + except Exception as e: + print(f"❌ Large tensor creation failed: {e}") + + # Test 7: Negative numbers + try: + negative_tensor = Tensor([-1, -2, -3]) + mixed_signs = Tensor([-1, 0, 1]) + + assert negative_tensor._data.shape == (3,), f"Negative tensor should have shape (3,), got {negative_tensor._data.shape}" + assert np.array_equal(negative_tensor._data, np.array([-1, -2, -3])), "Negative numbers should be preserved" + assert np.array_equal(mixed_signs._data, np.array([-1, 0, 1])), "Mixed signs should be preserved" + + print("✅ Negative numbers: handled correctly") + tests_passed += 1 + except Exception as e: + print(f"❌ Negative numbers failed: {e}") + + # Test 8: Edge cases + try: + # Very large numbers + big_tensor = Tensor([1e6, 1e-6]) + assert big_tensor._data.shape == (2,), "Big numbers tensor should have correct shape" + + # Zero tensor + zero_tensor = Tensor([0, 0, 0]) + assert np.all(zero_tensor._data == 0), "Zero tensor should contain all zeros" + + print("✅ Edge cases: large numbers and zeros") + tests_passed += 1 + except Exception as e: + print(f"❌ Edge cases failed: {e}") + + # Results summary + print(f"\n📊 Tensor Creation Results: {tests_passed}/{total_tests} tests passed") + + if tests_passed == total_tests: + print("🎉 All tensor creation tests passed! Your Tensor class can handle:") + print(" • Scalars, vectors, and matrices") + print(" • Different data types (int, float)") + print(" • NumPy arrays") + print(" • Large tensors and edge cases") + print("📈 Progress: Tensor Creation ✓") + return True + else: + print("⚠️ Some tensor creation tests failed. 
Common issues:") + print(" • Check your __init__ method implementation") + print(" • Make sure you're storing data in self._data") + print(" • Verify NumPy array conversion works correctly") + print(" • Test with different input types (int, float, list, np.array)") + return False + +# Run the comprehensive test +success = test_tensor_creation_comprehensive() + +# %% [markdown] +""" +### 🧪 Comprehensive Test: Tensor Properties + +Now let's test all the properties your tensor should have. These properties are essential for ML operations. +""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-properties-comprehensive", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} +def test_tensor_properties_comprehensive(): + """Comprehensive test of tensor properties (shape, size, dtype, data access).""" + print("🔬 Testing comprehensive tensor properties...") + + tests_passed = 0 + total_tests = 6 + + # Test 1: Shape property + try: + scalar = Tensor(5.0) + vector = Tensor([1, 2, 3]) + matrix = Tensor([[1, 2], [3, 4]]) + tensor_3d = Tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) + + assert scalar.shape == (), f"Scalar shape should be (), got {scalar.shape}" + assert vector.shape == (3,), f"Vector shape should be (3,), got {vector.shape}" + assert matrix.shape == (2, 2), f"Matrix shape should be (2, 2), got {matrix.shape}" + assert tensor_3d.shape == (2, 2, 2), f"3D tensor shape should be (2, 2, 2), got {tensor_3d.shape}" + + print("✅ Shape property: scalar, vector, matrix, and 3D tensor") + tests_passed += 1 + except Exception as e: + print(f"❌ Shape property failed: {e}") + + # Test 2: Size property + try: + scalar = Tensor(5.0) + vector = Tensor([1, 2, 3]) + matrix = Tensor([[1, 2], [3, 4]]) + empty = Tensor([]) + + assert scalar.size == 1, f"Scalar size should be 1, got {scalar.size}" + assert vector.size == 3, f"Vector size should be 3, got {vector.size}" + assert matrix.size == 4, f"Matrix size should be 4, got {matrix.size}" + 
assert empty.size == 0, f"Empty tensor size should be 0, got {empty.size}" + + print("✅ Size property: scalar, vector, matrix, and empty tensor") + tests_passed += 1 + except Exception as e: + print(f"❌ Size property failed: {e}") + + # Test 3: Data type property + try: + int_tensor = Tensor([1, 2, 3]) + float_tensor = Tensor([1.0, 2.0, 3.0]) + + # Check that dtype is accessible and reasonable + assert hasattr(int_tensor, 'dtype'), "Tensor should have dtype property" + assert hasattr(float_tensor, 'dtype'), "Tensor should have dtype property" + + # Data types should be NumPy dtypes + assert isinstance(int_tensor.dtype, np.dtype), f"dtype should be np.dtype, got {type(int_tensor.dtype)}" + assert isinstance(float_tensor.dtype, np.dtype), f"dtype should be np.dtype, got {type(float_tensor.dtype)}" + + print(f"✅ Data type property: int tensor is {int_tensor.dtype}, float tensor is {float_tensor.dtype}") + tests_passed += 1 + except Exception as e: + print(f"❌ Data type property failed: {e}") + + # Test 4: Data access property + try: + scalar = Tensor(5.0) + vector = Tensor([1, 2, 3]) + matrix = Tensor([[1, 2], [3, 4]]) + + # Test data access + assert hasattr(scalar, 'data'), "Tensor should have data property" + assert hasattr(vector, 'data'), "Tensor should have data property" + assert hasattr(matrix, 'data'), "Tensor should have data property" + + # Test data content + assert scalar.data.item() == 5.0, f"Scalar data should be 5.0, got {scalar.data.item()}" + assert np.array_equal(vector.data, np.array([1, 2, 3])), "Vector data mismatch" + assert np.array_equal(matrix.data, np.array([[1, 2], [3, 4]])), "Matrix data mismatch" + + print("✅ Data access: scalar, vector, and matrix data retrieval") + tests_passed += 1 + except Exception as e: + print(f"❌ Data access failed: {e}") + + # Test 5: String representation + try: + scalar = Tensor(5.0) + vector = Tensor([1, 2, 3]) + + # Test that __repr__ works + scalar_str = str(scalar) + vector_str = str(vector) + + assert 
isinstance(scalar_str, str), "Tensor string representation should be a string" + assert isinstance(vector_str, str), "Tensor string representation should be a string" + assert len(scalar_str) > 0, "Tensor string representation should not be empty" + assert len(vector_str) > 0, "Tensor string representation should not be empty" + + print(f"✅ String representation: scalar={scalar_str[:50]}{'...' if len(scalar_str) > 50 else ''}") + tests_passed += 1 + except Exception as e: + print(f"❌ String representation failed: {e}") + + # Test 6: Property consistency + try: + test_cases = [ + Tensor(42), + Tensor([1, 2, 3, 4, 5]), + Tensor([[1, 2, 3], [4, 5, 6]]), + Tensor([]) + ] + + for i, tensor in enumerate(test_cases): + # Size should equal product of shape + expected_size = np.prod(tensor.shape) if tensor.shape else 1 + assert tensor.size == expected_size, f"Test case {i}: size {tensor.size} doesn't match shape {tensor.shape}" + + # Data shape should match tensor shape + assert tensor.data.shape == tensor.shape, f"Test case {i}: data shape {tensor.data.shape} doesn't match tensor shape {tensor.shape}" + + print("✅ Property consistency: size matches shape, data shape matches tensor shape") + tests_passed += 1 + except Exception as e: + print(f"❌ Property consistency failed: {e}") + + # Results summary + print(f"\n📊 Tensor Properties Results: {tests_passed}/{total_tests} tests passed") + + if tests_passed == total_tests: + print("🎉 All tensor property tests passed! Your tensor has:") + print(" • Correct shape property for all dimensions") + print(" • Accurate size calculation") + print(" • Proper data type handling") + print(" • Working data access") + print(" • Good string representation") + print("📈 Progress: Tensor Creation ✓, Properties ✓") + return True + else: + print("⚠️ Some property tests failed. 
Common issues:") + print(" • Check your @property decorators") + print(" • Verify shape returns self._data.shape") + print(" • Make sure size returns self._data.size") + print(" • Ensure dtype returns self._data.dtype") + print(" • Test your __repr__ method") + return False + +# Run the comprehensive test +success = test_tensor_properties_comprehensive() and success + +# %% [markdown] +""" +### 🧪 Comprehensive Test: Tensor Arithmetic + +Let's test all arithmetic operations. These are the foundation of neural network computations! +""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-arithmetic-comprehensive", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} +def test_tensor_arithmetic_comprehensive(): + """Comprehensive test of tensor arithmetic operations.""" + print("🔬 Testing comprehensive tensor arithmetic...") + + tests_passed = 0 + total_tests = 8 + + # Test 1: Basic addition method + try: + a = Tensor([1, 2, 3]) + b = Tensor([4, 5, 6]) + c = a.add(b) + + expected = np.array([5, 7, 9]) + assert np.array_equal(c.data, expected), f"Addition method failed: expected {expected}, got {c.data}" + assert isinstance(c, Tensor), "Addition should return a Tensor" + + print(f"✅ Addition method: {a.data} + {b.data} = {c.data}") + tests_passed += 1 + except Exception as e: + print(f"❌ Addition method failed: {e}") + + # Test 2: Basic multiplication method + try: + a = Tensor([1, 2, 3]) + b = Tensor([4, 5, 6]) + c = a.multiply(b) + + expected = np.array([4, 10, 18]) + assert np.array_equal(c.data, expected), f"Multiplication method failed: expected {expected}, got {c.data}" + assert isinstance(c, Tensor), "Multiplication should return a Tensor" + + print(f"✅ Multiplication method: {a.data} * {b.data} = {c.data}") + tests_passed += 1 + except Exception as e: + print(f"❌ Multiplication method failed: {e}") + + # Test 3: Addition operator (+) + try: + a = Tensor([1, 2, 3]) + b = Tensor([4, 5, 6]) + c = a + b + + expected = 
np.array([5, 7, 9]) + assert np.array_equal(c.data, expected), f"+ operator failed: expected {expected}, got {c.data}" + assert isinstance(c, Tensor), "+ operator should return a Tensor" + + print(f"✅ + operator: {a.data} + {b.data} = {c.data}") + tests_passed += 1 + except Exception as e: + print(f"❌ + operator failed: {e}") + + # Test 4: Multiplication operator (*) + try: + a = Tensor([1, 2, 3]) + b = Tensor([4, 5, 6]) + c = a * b + + expected = np.array([4, 10, 18]) + assert np.array_equal(c.data, expected), f"* operator failed: expected {expected}, got {c.data}" + assert isinstance(c, Tensor), "* operator should return a Tensor" + + print(f"✅ * operator: {a.data} * {b.data} = {c.data}") + tests_passed += 1 + except Exception as e: + print(f"❌ * operator failed: {e}") + + # Test 5: Subtraction operator (-) + try: + a = Tensor([1, 2, 3]) + b = Tensor([4, 5, 6]) + c = b - a + + expected = np.array([3, 3, 3]) + assert np.array_equal(c.data, expected), f"- operator failed: expected {expected}, got {c.data}" + assert isinstance(c, Tensor), "- operator should return a Tensor" + + print(f"✅ - operator: {b.data} - {a.data} = {c.data}") + tests_passed += 1 + except Exception as e: + print(f"❌ - operator failed: {e}") + + # Test 6: Division operator (/) + try: + a = Tensor([1, 2, 4]) + b = Tensor([2, 4, 8]) + c = b / a + + expected = np.array([2.0, 2.0, 2.0]) + assert np.allclose(c.data, expected), f"/ operator failed: expected {expected}, got {c.data}" + assert isinstance(c, Tensor), "/ operator should return a Tensor" + + print(f"✅ / operator: {b.data} / {a.data} = {c.data}") + tests_passed += 1 + except Exception as e: + print(f"❌ / operator failed: {e}") + + # Test 7: Scalar operations + try: + a = Tensor([1, 2, 3]) + + # Addition with scalar + b = a + 10 + expected_add = np.array([11, 12, 13]) + assert np.array_equal(b.data, expected_add), f"Scalar addition failed: expected {expected_add}, got {b.data}" + + # Multiplication with scalar + c = a * 2 + expected_mul = 
np.array([2, 4, 6])
+        assert np.array_equal(c.data, expected_mul), f"Scalar multiplication failed: expected {expected_mul}, got {c.data}"
+
+        # Subtraction with scalar
+        d = a - 1
+        expected_sub = np.array([0, 1, 2])
+        assert np.array_equal(d.data, expected_sub), f"Scalar subtraction failed: expected {expected_sub}, got {d.data}"
+
+        # Division with scalar
+        e = a / 2
+        expected_div = np.array([0.5, 1.0, 1.5])
+        assert np.allclose(e.data, expected_div), f"Scalar division failed: expected {expected_div}, got {e.data}"
+
+        print("✅ Scalar operations: +10, *2, -1, /2 all work correctly")
+        tests_passed += 1
+    except Exception as e:
+        print(f"❌ Scalar operations failed: {e}")
+
+    # Test 8: Matrix operations
+    try:
+        matrix_a = Tensor([[1, 2], [3, 4]])
+        matrix_b = Tensor([[5, 6], [7, 8]])
+
+        # Matrix addition
+        c = matrix_a + matrix_b
+        expected = np.array([[6, 8], [10, 12]])
+        assert np.array_equal(c.data, expected), f"Matrix addition failed: expected {expected}, got {c.data}"
+        assert c.shape == (2, 2), f"Matrix addition should preserve shape, got {c.shape}"
+
+        # Element-wise multiplication (note: * is element-wise, NOT matrix multiplication)
+        d = matrix_a * matrix_b
+        expected_mul = np.array([[5, 12], [21, 32]])
+        assert np.array_equal(d.data, expected_mul), f"Element-wise multiplication failed: expected {expected_mul}, got {d.data}"
+
+        print("✅ Matrix operations: 2x2 matrix addition and element-wise multiplication")
+        tests_passed += 1
+    except Exception as e:
+        print(f"❌ Matrix operations failed: {e}")
+
+    # Results summary
+    print(f"\n📊 Tensor Arithmetic Results: {tests_passed}/{total_tests} tests passed")
+
+    if tests_passed == total_tests:
+        print("🎉 All tensor arithmetic tests passed! 
Your tensor supports:") + print(" • Basic methods: add(), multiply()") + print(" • Python operators: +, -, *, /") + print(" • Scalar operations: tensor + number") + print(" • Matrix operations: element-wise operations") + print("📈 Progress: Tensor Creation ✓, Properties ✓, Arithmetic ✓") + return True + else: + print("⚠️ Some arithmetic tests failed. Common issues:") + print(" • Check your add() and multiply() methods") + print(" • Verify operator overloading (__add__, __mul__, __sub__, __truediv__)") + print(" • Make sure scalar operations work (convert scalar to Tensor)") + print(" • Test with different tensor shapes") + return False + +# Run the comprehensive test +success = test_tensor_arithmetic_comprehensive() and success + +# %% [markdown] +""" +### 🧪 Final Integration Test: Real ML Scenario + +Let's test your tensor with a realistic machine learning scenario to make sure everything works together. +""" + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-integration", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +def test_tensor_integration(): + """Integration test with realistic ML scenario.""" + print("🔬 Testing tensor integration with ML scenario...") + + try: + print("🧠 Simulating a simple neural network forward pass...") + + # Simulate input data (batch of 2 samples, 3 features each) + X = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) + print(f"📊 Input data shape: {X.shape}") + + # Simulate weights (3 input features, 2 output neurons) + W = Tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]) + print(f"🎯 Weights shape: {W.shape}") + + # Simulate bias (2 output neurons) + b = Tensor([0.1, 0.2]) + print(f"⚖️ Bias shape: {b.shape}") + + # Simple linear transformation: y = X * W + b + # Note: This is a simplified version - real matrix multiplication would be different + # But we can test element-wise operations + + # Test that we can do basic operations needed for ML + sample = Tensor([1.0, 2.0, 3.0]) # Single sample + 
weight_col = Tensor([0.1, 0.3, 0.5]) # First column of weights + + # Compute dot product manually using element-wise operations + products = sample * weight_col # Element-wise multiplication + print(f"✅ Element-wise multiplication works: {products.data}") + + # Test addition for bias + result = products + Tensor([0.1, 0.1, 0.1]) + print(f"✅ Bias addition works: {result.data}") + + # Test with different shapes + matrix_a = Tensor([[1, 2], [3, 4]]) + matrix_b = Tensor([[0.1, 0.2], [0.3, 0.4]]) + matrix_result = matrix_a * matrix_b + print(f"✅ Matrix operations work: {matrix_result.data}") + + # Test scalar operations (common in ML) + scaled = sample * 0.5 # Learning rate scaling + print(f"✅ Scalar scaling works: {scaled.data}") + + # Test normalization-like operations + mean_val = Tensor([2.0, 2.0, 2.0]) # Simulate mean + normalized = sample - mean_val + print(f"✅ Mean subtraction works: {normalized.data}") + + print("\n🎉 Integration test passed! Your tensor class can handle:") + print(" • Multi-dimensional data (batches, features)") + print(" • Element-wise operations needed for ML") + print(" • Scalar operations (learning rates, normalization)") + print(" • Matrix operations (weights, transformations)") + print("📈 Progress: All tensor functionality ✓") + print("🚀 Ready for neural network layers!") + + return True + + except Exception as e: + print(f"❌ Integration test failed: {e}") + print("\n💡 This suggests an issue with:") + print(" • Basic tensor operations not working together") + print(" • Shape handling problems") + print(" • Arithmetic operation implementation") + print(" • Check your tensor creation and arithmetic methods") + return False + +# Run the integration test +success = test_tensor_integration() and success + +# Print final summary +print(f"\n{'='*60}") +print("🎯 TENSOR MODULE TESTING COMPLETE") +print(f"{'='*60}") + +if success: + print("🎉 CONGRATULATIONS! 
All tensor tests passed!")
+    print("\n✅ Your Tensor class successfully implements:")
+    print("   • Comprehensive tensor creation (scalars, vectors, matrices)")
+    print("   • All essential properties (shape, size, dtype, data access)")
+    print("   • Complete arithmetic operations (methods and operators)")
+    print("   • Scalar and matrix operations")
+    print("   • Real ML scenario compatibility")
+    print("\n🚀 You're ready to move to the next module!")
+    print("📈 Final Progress: Tensor Module ✓ COMPLETE")
+else:
+    print("⚠️ Some tests failed. Please review the error messages above.")
+    print("\n🔧 To fix issues:")
+    print("   1. Check the specific test that failed")
+    print("   2. Review the error message and hints")
+    print("   3. Fix your implementation")
+    print("   4. Re-run the notebook cells")
+    print("\n💪 Don't give up! Debugging is part of learning.")
+
+# %% [markdown]
+"""
+## Step 3: Tensor Arithmetic Operations
+
+### Why Arithmetic Matters
+Tensor arithmetic is the foundation of all neural network operations:
+- **Forward pass**: Matrix multiplications and additions
+- **Activation functions**: Element-wise operations
+- **Loss computation**: Differences and squares
+- **Gradient computation**: Chain rule applications
+
+### Operations We'll Implement
+- **Addition**: Element-wise addition of tensors
+- **Multiplication**: Element-wise multiplication
+- **Python operators**: `+`, `-`, `*`, `/` for natural syntax
+- **Broadcasting**: Handle different shapes automatically
+"""
+
+# %% [markdown]
+"""
+### Step 3 Recap: Arithmetic Methods
+
+The arithmetic methods are now part of the Tensor class above. Let's test them!
+"""
+
+# %% [markdown]
+"""
+## Step 4: Python Operator Overloading
+
+### Why Operator Overloading?
+Python's magic methods allow us to use natural syntax:
+- `a + b` instead of `a.add(b)`
+- `a * b` instead of `a.multiply(b)`
+- `a - b` for subtraction
+- `a / b` for division
+
+This makes tensor operations feel natural and readable.
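To make the delegation concrete, here is a minimal, self-contained sketch of the pattern. It uses a stand-in `MiniTensor` class (not the real TinyTorch `Tensor`) just to show how the magic methods can forward to the named arithmetic methods:

```python
import numpy as np

class MiniTensor:
    """A tiny stand-in for the full Tensor class, just to illustrate operator overloading."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)

    def add(self, other):
        # Accept either a MiniTensor or a plain scalar/array
        other = other if isinstance(other, MiniTensor) else MiniTensor(other)
        return MiniTensor(self.data + other.data)

    def multiply(self, other):
        other = other if isinstance(other, MiniTensor) else MiniTensor(other)
        return MiniTensor(self.data * other.data)

    # Magic methods simply delegate to the named methods above
    def __add__(self, other):
        return self.add(other)

    def __mul__(self, other):
        return self.multiply(other)

a = MiniTensor([1, 2, 3])
b = MiniTensor([4, 5, 6])
print((a + b).data)  # [5. 7. 9.]
print((a * 2).data)  # [2. 4. 6.]
```

Delegating from `__add__` to `add()` keeps the numeric logic in one place, so fixing a bug in `add()` automatically fixes `+` as well.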
+"""
+
+# %% [markdown]
+"""
+### Step 4 Recap: Operator Overloading
+
+The operator methods (`__add__`, `__mul__`, `__sub__`, `__truediv__`) are now part of the Tensor class above. This enables natural syntax like `a + b` and `a * b`.
+"""
+
+# %% [markdown]
+"""
+### 🧪 Test Your Tensor Implementation
+
+Once you implement the Tensor class above, run these cells to test your implementation:
+"""
+
+# %% nbgrader={"grade": true, "grade_id": "test-tensor-creation", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false}
+# Test tensor creation and properties
+print("Testing tensor creation...")
+
+# Test scalar creation
+scalar = Tensor(5.0)
+assert scalar.shape == (), f"Scalar shape should be (), got {scalar.shape}"
+assert scalar.size == 1, f"Scalar size should be 1, got {scalar.size}"
+assert scalar.data.item() == 5.0, f"Scalar value should be 5.0, got {scalar.data.item()}"
+
+# Test vector creation
+vector = Tensor([1, 2, 3])
+assert vector.shape == (3,), f"Vector shape should be (3,), got {vector.shape}"
+assert vector.size == 3, f"Vector size should be 3, got {vector.size}"
+assert np.array_equal(vector.data, np.array([1, 2, 3])), "Vector data mismatch"
+
+# Test matrix creation
+matrix = Tensor([[1, 2], [3, 4]])
+assert matrix.shape == (2, 2), f"Matrix shape should be (2, 2), got {matrix.shape}"
+assert matrix.size == 4, f"Matrix size should be 4, got {matrix.size}"
+assert np.array_equal(matrix.data, np.array([[1, 2], [3, 4]])), "Matrix data mismatch"
+
+# Test dtype handling
+float_tensor = Tensor([1.0, 2.0, 3.0])
+assert float_tensor.dtype == np.float32, f"Float tensor dtype should be float32, got {float_tensor.dtype}"
+
+int_tensor = Tensor([1, 2, 3])
+# Note: NumPy may default to int64 on some systems, so we check for integer types
+assert int_tensor.dtype in [np.int32, np.int64], f"Int tensor dtype should be int32 or int64, got {int_tensor.dtype}"
+
+print("✅ Tensor creation tests passed!")
+print(f"✅ Scalar: {scalar}")
+print(f"✅ Vector: 
{vector}") +print(f"✅ Matrix: {matrix}") + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-arithmetic", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test tensor arithmetic operations +print("Testing tensor arithmetic...") + +# Test addition +a = Tensor([1, 2, 3]) +b = Tensor([4, 5, 6]) +c = a + b +expected = np.array([5, 7, 9]) +assert np.array_equal(c.data, expected), f"Addition failed: expected {expected}, got {c.data}" + +# Test multiplication +d = a * b +expected = np.array([4, 10, 18]) +assert np.array_equal(d.data, expected), f"Multiplication failed: expected {expected}, got {d.data}" + +# Test subtraction +e = b - a +expected = np.array([3, 3, 3]) +assert np.array_equal(e.data, expected), f"Subtraction failed: expected {expected}, got {e.data}" + +# Test division +f = b / a +expected = np.array([4.0, 2.5, 2.0]) +assert np.allclose(f.data, expected), f"Division failed: expected {expected}, got {f.data}" + +# Test scalar operations +g = a + 10 +expected = np.array([11, 12, 13]) +assert np.array_equal(g.data, expected), f"Scalar addition failed: expected {expected}, got {g.data}" + +h = a * 2 +expected = np.array([2, 4, 6]) +assert np.array_equal(h.data, expected), f"Scalar multiplication failed: expected {expected}, got {h.data}" + +print("✅ Tensor arithmetic tests passed!") +print(f"✅ Addition: {a} + {b} = {c}") +print(f"✅ Multiplication: {a} * {b} = {d}") +print(f"✅ Subtraction: {b} - {a} = {e}") +print(f"✅ Division: {b} / {a} = {f}") + +# %% nbgrader={"grade": true, "grade_id": "test-tensor-broadcasting", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test tensor broadcasting +print("Testing tensor broadcasting...") + +# Test scalar broadcasting +matrix = Tensor([[1, 2], [3, 4]]) +scalar = Tensor(10) +result = matrix + scalar +expected = np.array([[11, 12], [13, 14]]) +assert np.array_equal(result.data, expected), f"Scalar broadcasting failed: expected {expected}, 
got {result.data}"
+
+# Test vector broadcasting
+vector = Tensor([1, 2])
+result = matrix + vector
+expected = np.array([[2, 4], [4, 6]])
+assert np.array_equal(result.data, expected), f"Vector broadcasting failed: expected {expected}, got {result.data}"
+
+# Test different shapes
+a = Tensor([[1], [2], [3]])  # (3, 1)
+b = Tensor([10, 20])         # (2,)
+result = a + b
+expected = np.array([[11, 21], [12, 22], [13, 23]])
+assert np.array_equal(result.data, expected), f"Shape broadcasting failed: expected {expected}, got {result.data}"
+
+print("✅ Tensor broadcasting tests passed!")
+print(f"✅ Shape broadcasting: (3, 1) + (2,) gives shape {result.shape}")
+print("✅ Broadcasting works correctly!")
+
+# %% [markdown]
+"""
+## 🎯 Module Summary
+
+Congratulations! You've successfully implemented the core Tensor class for TinyTorch:
+
+### What You've Accomplished
+✅ **Tensor Creation**: Handle scalars, vectors, matrices, and higher-dimensional arrays
+✅ **Data Types**: Proper dtype handling with auto-detection and conversion
+✅ **Properties**: Shape, size, dtype, and data access
+✅ **Arithmetic**: Addition, multiplication, subtraction, division
+✅ **Operators**: Natural Python syntax with `+`, `-`, `*`, `/`
+✅ **Broadcasting**: Automatic shape compatibility like NumPy
+
+### Key Concepts You've Learned
+- **Tensors** are the fundamental data structure for ML systems
+- **NumPy backend** provides efficient computation with ML-friendly API
+- **Operator overloading** makes tensor operations feel natural
+- **Broadcasting** enables flexible operations between different shapes
+- **Type safety** ensures consistent behavior across operations
+
+### Next Steps
+1. **Export your code**: `tito package nbdev --export 01_tensor`
+2. **Test your implementation**: `tito module test 01_tensor`
+3. **Use your tensors**:
+   ```python
+   from tinytorch.core.tensor import Tensor
+   t = Tensor([1, 2, 3])
+   print(t + 5)  # Your tensor in action!
+   ```
+4. 
**Move to Module 2**: Start building activation functions! + +**Ready for the next challenge?** Let's add the mathematical functions that make neural networks powerful! +""" \ No newline at end of file diff --git a/modules/source/01_tensor/tests/test_tensor.py b/modules/source/01_tensor/tests/test_tensor.py deleted file mode 100644 index 1b182af8..00000000 --- a/modules/source/01_tensor/tests/test_tensor.py +++ /dev/null @@ -1,337 +0,0 @@ -""" -Test suite for the tensor module. -This tests the student implementations to ensure they work correctly. -""" - -import pytest -import numpy as np -import sys -import os - -# Import from the main package (rock solid foundation) -from tinytorch.core.tensor import Tensor - -def safe_numpy(tensor): - """Get numpy array from tensor, using .numpy() if available, otherwise .data""" - if hasattr(tensor, 'numpy'): - return tensor.numpy() - else: - return tensor.data - -def safe_item(tensor): - """Get scalar value from tensor, using .item() if available, otherwise .data""" - if hasattr(tensor, 'item'): - return tensor.item() - else: - return float(tensor.data) - -class TestTensorCreation: - """Test tensor creation from different data types.""" - - def test_scalar_creation(self): - """Test creating tensors from scalars.""" - # Float scalar - t1 = Tensor(5.0) - assert t1.shape == () - assert t1.size == 1 - assert safe_item(t1) == 5.0 - - # Integer scalar - t2 = Tensor(42) - assert t2.shape == () - assert t2.size == 1 - assert safe_item(t2) == 42.0 # Should convert to float32 - - def test_vector_creation(self): - """Test creating 1D tensors.""" - t = Tensor([1, 2, 3, 4]) - assert t.shape == (4,) - assert t.size == 4 - assert t.dtype == np.int32 # Integer list defaults to int32 - np.testing.assert_array_equal(safe_numpy(t), [1, 2, 3, 4]) - - def test_matrix_creation(self): - """Test creating 2D tensors.""" - t = Tensor([[1, 2], [3, 4]]) - assert t.shape == (2, 2) - assert t.size == 4 - expected = np.array([[1.0, 2.0], [3.0, 4.0]], 
dtype='float32') - np.testing.assert_array_equal(safe_numpy(t), expected) - - def test_numpy_array_creation(self): - """Test creating tensors from numpy arrays.""" - arr = np.array([1, 2, 3], dtype='int32') - t = Tensor(arr) - assert t.shape == (3,) - assert t.dtype in ['int32', 'float32'] # May convert - - def test_dtype_specification(self): - """Test explicit dtype specification.""" - t = Tensor([1, 2, 3], dtype='int32') - assert t.dtype == np.int32 - - def test_invalid_data_type(self): - """Test error handling for invalid data types.""" - with pytest.raises(TypeError): - Tensor("invalid") - with pytest.raises(TypeError): - Tensor({"dict": "invalid"}) - -class TestTensorProperties: - """Test tensor properties and methods.""" - - def test_shape_property(self): - """Test shape property for different dimensions.""" - assert Tensor(5).shape == () - assert Tensor([1, 2, 3]).shape == (3,) - assert Tensor([[1, 2], [3, 4]]).shape == (2, 2) - assert Tensor([[[1]]]).shape == (1, 1, 1) - - def test_size_property(self): - """Test size property.""" - assert Tensor(5).size == 1 - assert Tensor([1, 2, 3]).size == 3 - assert Tensor([[1, 2], [3, 4]]).size == 4 - assert Tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]).size == 8 - - def test_dtype_property(self): - """Test dtype property.""" - t1 = Tensor(5.0) - assert t1.dtype == np.float32 - - t2 = Tensor([1, 2, 3], dtype='int32') - assert t2.dtype == np.int32 - - def test_repr(self): - """Test string representation.""" - t = Tensor([1, 2, 3]) - repr_str = repr(t) - assert 'Tensor' in repr_str - assert 'shape=' in repr_str - assert 'dtype=' in repr_str - -class TestArithmeticOperations: - """Test tensor arithmetic operations.""" - - def test_tensor_addition(self): - """Test tensor + tensor addition.""" - a = Tensor([1, 2, 3]) - b = Tensor([4, 5, 6]) - result = a + b - expected = [5.0, 7.0, 9.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - - def test_scalar_addition(self): - """Test tensor + scalar addition.""" - a 
= Tensor([1, 2, 3]) - result = a + 10 - expected = [11.0, 12.0, 13.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - - def test_reverse_addition(self): - """Test scalar + tensor addition.""" - a = Tensor([1, 2, 3]) - result = 10 + a - expected = [11.0, 12.0, 13.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - - def test_tensor_subtraction(self): - """Test tensor - tensor subtraction.""" - a = Tensor([5, 7, 9]) - b = Tensor([1, 2, 3]) - result = a - b - expected = [4.0, 5.0, 6.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - - def test_scalar_subtraction(self): - """Test tensor - scalar subtraction.""" - a = Tensor([10, 20, 30]) - result = a - 5 - expected = [5.0, 15.0, 25.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - - def test_tensor_multiplication(self): - """Test tensor * tensor multiplication.""" - a = Tensor([2, 3, 4]) - b = Tensor([5, 6, 7]) - result = a * b - expected = [10.0, 18.0, 28.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - - def test_scalar_multiplication(self): - """Test tensor * scalar multiplication.""" - a = Tensor([1, 2, 3]) - result = a * 3 - expected = [3.0, 6.0, 9.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - - def test_reverse_multiplication(self): - """Test scalar * tensor multiplication.""" - a = Tensor([1, 2, 3]) - result = 3 * a - expected = [3.0, 6.0, 9.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - - def test_tensor_division(self): - """Test tensor / tensor division.""" - a = Tensor([6, 8, 10]) - b = Tensor([2, 4, 5]) - result = a / b - expected = [3.0, 2.0, 2.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - - def test_scalar_division(self): - """Test tensor / scalar division.""" - a = Tensor([6, 8, 10]) - result = a / 2 - expected = [3.0, 4.0, 5.0] - np.testing.assert_array_equal(safe_numpy(result), expected) - -class TestUtilityMethods: - """Test tensor utility methods (stretch 
goals for students).""" - - def test_reshape(self): - """Test tensor reshaping (if implemented).""" - t = Tensor([[1, 2], [3, 4]]) - if hasattr(t, 'reshape'): - reshaped = t.reshape(4) - assert reshaped.shape == (4,) - expected = [1.0, 2.0, 3.0, 4.0] - np.testing.assert_array_equal(safe_numpy(reshaped), expected) - - # Reshape to 2D - reshaped2 = t.reshape(1, 4) - assert reshaped2.shape == (1, 4) - else: - pytest.skip("reshape method not implemented - stretch goal for students") - - def test_transpose(self): - """Test tensor transpose (if implemented).""" - t = Tensor([[1, 2, 3], [4, 5, 6]]) - if hasattr(t, 'transpose'): - transposed = t.transpose() - assert transposed.shape == (3, 2) - expected = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]] - np.testing.assert_array_equal(safe_numpy(transposed), expected) - else: - pytest.skip("transpose method not implemented - stretch goal for students") - - def test_sum_all(self): - """Test summing all elements (if implemented).""" - t = Tensor([[1, 2], [3, 4]]) - if hasattr(t, 'sum'): - result = t.sum() - expected = 10.0 - assert abs(safe_item(result) - expected) < 1e-6 - else: - pytest.skip("sum method not implemented - stretch goal for students") - - def test_sum_axis(self): - """Test summing along specific axes (if implemented).""" - t = Tensor([[1, 2], [3, 4]]) - if hasattr(t, 'sum'): - # Sum along axis 0 (columns) - sum0 = t.sum(axis=0) - expected0 = [4.0, 6.0] - np.testing.assert_array_equal(safe_numpy(sum0), expected0) - - # Sum along axis 1 (rows) - sum1 = t.sum(axis=1) - expected1 = [3.0, 7.0] - np.testing.assert_array_equal(safe_numpy(sum1), expected1) - else: - pytest.skip("sum method not implemented - stretch goal for students") - - def test_mean(self): - """Test mean calculation (if implemented).""" - t = Tensor([[1, 2], [3, 4]]) - if hasattr(t, 'mean'): - result = t.mean() - expected = 2.5 - assert abs(safe_item(result) - expected) < 1e-6 - else: - pytest.skip("mean method not implemented - stretch goal for students") - 
- def test_max(self): - """Test maximum value (if implemented).""" - t = Tensor([[1, 2], [3, 4]]) - if hasattr(t, 'max'): - result = t.max() - expected = 4.0 - assert abs(safe_item(result) - expected) < 1e-6 - else: - pytest.skip("max method not implemented - stretch goal for students") - - def test_min(self): - """Test minimum value (if implemented).""" - t = Tensor([[1, 2], [3, 4]]) - if hasattr(t, 'min'): - result = t.min() - expected = 1.0 - assert abs(safe_item(result) - expected) < 1e-6 - else: - pytest.skip("min method not implemented - stretch goal for students") - - def test_item_scalar(self): - """Test converting single-element tensor to scalar (if implemented).""" - t = Tensor(42.0) - if hasattr(t, 'item'): - assert t.item() == 42.0 - else: - pytest.skip("item method not implemented - stretch goal for students") - - def test_item_error(self): - """Test item() error for multi-element tensors (if implemented).""" - t = Tensor([1, 2, 3]) - if hasattr(t, 'item'): - with pytest.raises(ValueError): - t.item() - else: - pytest.skip("item method not implemented - stretch goal for students") - - def test_numpy_conversion(self): - """Test converting tensor to numpy array (if implemented).""" - t = Tensor([[1, 2], [3, 4]]) - if hasattr(t, 'numpy'): - arr = t.numpy() - assert isinstance(arr, np.ndarray) - expected = [[1.0, 2.0], [3.0, 4.0]] - np.testing.assert_array_equal(arr, expected) - else: - pytest.skip("numpy method not implemented - stretch goal for students") - -class TestEdgeCases: - """Test edge cases and error handling.""" - - def test_empty_list(self): - """Test creating tensor from empty list.""" - t = Tensor([]) - assert t.shape == (0,) - assert t.size == 0 - - def test_mixed_operations(self): - """Test combining different operations.""" - a = Tensor([[1, 2], [3, 4]]) - b = Tensor([[2, 2], [2, 2]]) - - # Complex expression - result = (a + b) * 2 - 1 - expected = [[5.0, 7.0], [9.0, 11.0]] - np.testing.assert_array_equal(safe_numpy(result), expected) - - 
def test_chained_operations(self):
-        """Test chaining multiple operations (if methods implemented)."""
-        t = Tensor([[1, 2, 3], [4, 5, 6]])
-        if hasattr(t, 'sum') and hasattr(t, 'mean'):
-            result = t.sum(axis=1).mean()
-            expected = 10.5  # (6 + 15) / 2
-            assert abs(safe_item(result) - expected) < 1e-6
-        else:
-            pytest.skip("Advanced methods not implemented - stretch goal for students")
-
-def run_tensor_tests():
-    """Run all tensor tests."""
-    pytest.main([__file__, "-v"])
-
-if __name__ == "__main__":
-    run_tensor_tests()
\ No newline at end of file
diff --git a/modules/source/02_activations/activations_dev.py b/modules/source/02_activations/activations_dev.py
index 517bd559..8acd9f64 100644
--- a/modules/source/02_activations/activations_dev.py
+++ b/modules/source/02_activations/activations_dev.py
@@ -230,11 +230,11 @@ Once you implement the ReLU forward method above, run this cell to test it:
 def test_relu_activation():
     """Test ReLU activation function"""
     print("Testing ReLU activation...")
-    
-    # Create ReLU instance
-    relu = ReLU()
-    
-    # Test with mixed positive/negative values
+
+    # Create ReLU instance
+    relu = ReLU()
+
+    # Test with mixed positive/negative values
     test_input = Tensor([[-2, -1, 0, 1, 2]])
     result = relu(test_input)
     expected = np.array([[0, 0, 0, 1, 2]])
@@ -368,10 +368,10 @@ Once you implement the Sigmoid forward method above, run this cell to test it:
 def test_sigmoid_activation():
     """Test Sigmoid activation function"""
     print("Testing Sigmoid activation...")
-    
-    # Create Sigmoid instance
-    sigmoid = Sigmoid()
-    
+
+    # Create Sigmoid instance
+    sigmoid = Sigmoid()
+
     # Test with known values
     test_input = Tensor([[0]])
     result = sigmoid(test_input)
@@ -514,10 +514,10 @@ Once you implement the Tanh forward method above, run this cell to test it:
 def test_tanh_activation():
     """Test Tanh activation function"""
     print("Testing Tanh activation...")
-    
-    # Create Tanh instance
-    tanh = Tanh()
-    
+
+    # Create Tanh instance
+    tanh = Tanh()
+
     # Test with zero (should
be 0)
 test_input = Tensor([[0]])
 result = tanh(test_input)
@@ -676,10 +676,10 @@ Once you implement the Softmax forward method above, run this cell to test it:
 def test_softmax_activation():
     """Test Softmax activation function"""
     print("Testing Softmax activation...")
-    
-    # Create Softmax instance
-    softmax = Softmax()
-    
+
+    # Create Softmax instance
+    softmax = Softmax()
+
     # Test with simple input
     test_input = Tensor([[1, 2, 3]])
     result = softmax(test_input)
@@ -718,8 +718,8 @@ def test_softmax_activation():
         large_sum = np.sum(large_result.data)
         assert abs(large_sum - 1.0) < 1e-6, "Large values should still sum to 1"
-    
-    # Test shape preservation
+
+    # Test shape preservation
     assert batch_result.shape == batch_input.shape, "Softmax should preserve shape"
 
     print("✅ Softmax activation tests passed!")
@@ -751,9 +751,9 @@ def test_activations_integration():
     print("Testing activation functions integration...")
 
     # Create instances of all activation functions
-    relu = ReLU()
-    sigmoid = Sigmoid()
-    tanh = Tanh()
+    relu = ReLU()
+    sigmoid = Sigmoid()
+    tanh = Tanh()
     softmax = Softmax()
 
     # Test data: simulating neural network layer outputs
@@ -791,7 +791,7 @@ def test_activations_integration():
     # Test Softmax properties
     softmax_sum = np.sum(softmax_result.data)
     assert abs(softmax_sum - 1.0) < 1e-6, "Softmax outputs should sum to 1"
-    
+
     # Test chaining activations (realistic neural network scenario)
     # Hidden layer with ReLU
     hidden_output = relu(test_data)
@@ -815,8 +815,8 @@ def test_activations_integration():
     ])
     batch_softmax = softmax(batch_data)
-    
-    # Each row should sum to 1
+
+    # Each row should sum to 1
     for i in range(batch_data.shape[0]):
         row_sum = np.sum(batch_softmax.data[i])
         assert abs(row_sum - 1.0) < 1e-6, f"Batch row {i} should sum to 1"
diff --git a/modules/source/02_activations/tests/test_activations.py b/modules/source/02_activations/tests/test_activations.py
deleted file mode 100644
index c7b5fbb4..00000000
--- 
a/modules/source/02_activations/tests/test_activations.py +++ /dev/null @@ -1,332 +0,0 @@ -""" -Test suite for the activations module. -This tests the student implementations to ensure they work correctly. -""" - -import pytest -import numpy as np -import sys -import os - -# Import from the main package (rock solid foundation) -from tinytorch.core.tensor import Tensor -from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax - - -class TestReLU: - """Test the ReLU activation function.""" - - def test_relu_basic_functionality(self): - """Test basic ReLU behavior: max(0, x)""" - relu = ReLU() - - # Test mixed positive/negative values - x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]]) - y = relu(x) - expected = np.array([[0.0, 0.0, 0.0, 1.0, 2.0]]) - - assert np.allclose(y.data, expected), f"Expected {expected}, got {y.data}" - - def test_relu_all_positive(self): - """Test ReLU with all positive values (should be unchanged)""" - relu = ReLU() - - x = Tensor([[1.0, 2.5, 3.7, 10.0]]) - y = relu(x) - - assert np.allclose(y.data, x.data), "ReLU should preserve positive values" - - def test_relu_all_negative(self): - """Test ReLU with all negative values (should be zeros)""" - relu = ReLU() - - x = Tensor([[-1.0, -2.5, -3.7, -10.0]]) - y = relu(x) - expected = np.zeros_like(x.data) - - assert np.allclose(y.data, expected), "ReLU should zero out negative values" - - def test_relu_zero_input(self): - """Test ReLU with zero input""" - relu = ReLU() - - x = Tensor([[0.0]]) - y = relu(x) - - assert y.data[0, 0] == 0.0, "ReLU(0) should be 0" - - def test_relu_shape_preservation(self): - """Test that ReLU preserves tensor shape""" - relu = ReLU() - - # Test different shapes - shapes = [(1, 5), (2, 3), (4, 1), (3, 3)] - for shape in shapes: - x = Tensor(np.random.randn(*shape)) - y = relu(x) - assert y.shape == x.shape, f"Shape mismatch: expected {x.shape}, got {y.shape}" - - def test_relu_callable(self): - """Test that ReLU can be called directly""" - relu = ReLU() - x = 
Tensor([[1.0, -1.0]]) - - y1 = relu(x) - y2 = relu.forward(x) - - assert np.allclose(y1.data, y2.data), "Direct call should match forward method" - - -class TestSigmoid: - """Test the Sigmoid activation function.""" - - def test_sigmoid_basic_functionality(self): - """Test basic Sigmoid behavior""" - sigmoid = Sigmoid() - - # Test known values - x = Tensor([[0.0]]) - y = sigmoid(x) - assert abs(y.data[0, 0] - 0.5) < 1e-6, "Sigmoid(0) should be 0.5" - - def test_sigmoid_range(self): - """Test that Sigmoid outputs are in (0, 1)""" - sigmoid = Sigmoid() - - # Test wide range of inputs - x = Tensor([[-10.0, -5.0, -1.0, 0.0, 1.0, 5.0, 10.0]]) - y = sigmoid(x) - - assert np.all(y.data > 0), "Sigmoid outputs should be > 0" - assert np.all(y.data < 1), "Sigmoid outputs should be < 1" - - def test_sigmoid_numerical_stability(self): - """Test Sigmoid with extreme values (numerical stability)""" - sigmoid = Sigmoid() - - # Test extreme values that could cause overflow - x = Tensor([[-100.0, -50.0, 50.0, 100.0]]) - y = sigmoid(x) - - # Should not contain NaN or inf - assert not np.any(np.isnan(y.data)), "Sigmoid should not produce NaN" - assert not np.any(np.isinf(y.data)), "Sigmoid should not produce inf" - - # Should be close to 0 for very negative, close to 1 for very positive - assert y.data[0, 0] < 1e-10, "Sigmoid(-100) should be very close to 0" - assert y.data[0, 1] < 1e-10, "Sigmoid(-50) should be very close to 0" - assert y.data[0, 2] > 1 - 1e-10, "Sigmoid(50) should be very close to 1" - assert y.data[0, 3] > 1 - 1e-10, "Sigmoid(100) should be very close to 1" - - def test_sigmoid_monotonicity(self): - """Test that Sigmoid is monotonically increasing""" - sigmoid = Sigmoid() - - x = Tensor([[-3.0, -1.0, 0.0, 1.0, 3.0]]) - y = sigmoid(x) - - # Check that outputs are increasing - for i in range(len(y.data[0]) - 1): - assert y.data[0, i] < y.data[0, i + 1], "Sigmoid should be monotonically increasing" - - def test_sigmoid_shape_preservation(self): - """Test that Sigmoid 
preserves tensor shape""" - sigmoid = Sigmoid() - - shapes = [(1, 5), (2, 3), (4, 1)] - for shape in shapes: - x = Tensor(np.random.randn(*shape)) - y = sigmoid(x) - assert y.shape == x.shape, f"Shape mismatch: expected {x.shape}, got {y.shape}" - - def test_sigmoid_callable(self): - """Test that Sigmoid can be called directly""" - sigmoid = Sigmoid() - x = Tensor([[1.0, -1.0]]) - - y1 = sigmoid(x) - y2 = sigmoid.forward(x) - - assert np.allclose(y1.data, y2.data), "Direct call should match forward method" - - -class TestTanh: - """Test the Tanh activation function.""" - - def test_tanh_basic_functionality(self): - """Test basic Tanh behavior""" - tanh = Tanh() - - # Test known values - x = Tensor([[0.0]]) - y = tanh(x) - assert abs(y.data[0, 0] - 0.0) < 1e-6, "Tanh(0) should be 0" - - def test_tanh_range(self): - """Test that Tanh outputs are in [-1, 1]""" - tanh = Tanh() - - # Test wide range of inputs - x = Tensor([[-10.0, -5.0, -1.0, 0.0, 1.0, 5.0, 10.0]]) - y = tanh(x) - - assert np.all(y.data >= -1), "Tanh outputs should be >= -1" - assert np.all(y.data <= 1), "Tanh outputs should be <= 1" - - def test_tanh_symmetry(self): - """Test that Tanh is symmetric: tanh(-x) = -tanh(x)""" - tanh = Tanh() - - x = Tensor([[1.0, 2.0, 3.0]]) - x_neg = Tensor([[-1.0, -2.0, -3.0]]) - - y_pos = tanh(x) - y_neg = tanh(x_neg) - - assert np.allclose(y_neg.data, -y_pos.data), "Tanh should be symmetric" - - def test_tanh_monotonicity(self): - """Test that Tanh is monotonically increasing""" - tanh = Tanh() - - x = Tensor([[-3.0, -1.0, 0.0, 1.0, 3.0]]) - y = tanh(x) - - # Check that outputs are increasing - for i in range(len(y.data[0]) - 1): - assert y.data[0, i] < y.data[0, i + 1], "Tanh should be monotonically increasing" - - def test_tanh_extreme_values(self): - """Test Tanh with extreme values""" - tanh = Tanh() - - x = Tensor([[-100.0, 100.0]]) - y = tanh(x) - - # Should be close to -1 and 1 respectively - assert abs(y.data[0, 0] - (-1.0)) < 1e-10, "Tanh(-100) should be very 
close to -1" - assert abs(y.data[0, 1] - 1.0) < 1e-10, "Tanh(100) should be very close to 1" - - def test_tanh_shape_preservation(self): - """Test that Tanh preserves tensor shape""" - tanh = Tanh() - - shapes = [(1, 5), (2, 3), (4, 1)] - for shape in shapes: - x = Tensor(np.random.randn(*shape)) - y = tanh(x) - assert y.shape == x.shape, f"Shape mismatch: expected {x.shape}, got {y.shape}" - - def test_tanh_callable(self): - """Test that Tanh can be called directly""" - tanh = Tanh() - x = Tensor([[1.0, -1.0]]) - - y1 = tanh(x) - y2 = tanh.forward(x) - - assert np.allclose(y1.data, y2.data), "Direct call should match forward method" - - -class TestActivationComparison: - """Test interactions and comparisons between activation functions.""" - - def test_activation_consistency(self): - """Test that all activations work with the same input""" - relu = ReLU() - sigmoid = Sigmoid() - tanh = Tanh() - - x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]]) - - # All should process without error - y_relu = relu(x) - y_sigmoid = sigmoid(x) - y_tanh = tanh(x) - - # All should preserve shape - assert y_relu.shape == x.shape - assert y_sigmoid.shape == x.shape - assert y_tanh.shape == x.shape - - def test_activation_ranges(self): - """Test that activations have expected output ranges""" - relu = ReLU() - sigmoid = Sigmoid() - tanh = Tanh() - - x = Tensor([[-5.0, -2.0, 0.0, 2.0, 5.0]]) - - y_relu = relu(x) - y_sigmoid = sigmoid(x) - y_tanh = tanh(x) - - # ReLU: [0, inf) - assert np.all(y_relu.data >= 0), "ReLU should be non-negative" - - # Sigmoid: (0, 1) - assert np.all(y_sigmoid.data > 0), "Sigmoid should be positive" - assert np.all(y_sigmoid.data < 1), "Sigmoid should be less than 1" - - # Tanh: (-1, 1) - assert np.all(y_tanh.data > -1), "Tanh should be greater than -1" - assert np.all(y_tanh.data < 1), "Tanh should be less than 1" - - -# Integration tests with edge cases -class TestActivationEdgeCases: - """Test edge cases and boundary conditions.""" - - def test_zero_tensor(self): - 
"""Test all activations with zero tensor""" - relu = ReLU() - sigmoid = Sigmoid() - tanh = Tanh() - - x = Tensor([[0.0, 0.0, 0.0]]) - - y_relu = relu(x) - y_sigmoid = sigmoid(x) - y_tanh = tanh(x) - - assert np.allclose(y_relu.data, [0.0, 0.0, 0.0]), "ReLU(0) should be 0" - assert np.allclose(y_sigmoid.data, [0.5, 0.5, 0.5]), "Sigmoid(0) should be 0.5" - assert np.allclose(y_tanh.data, [0.0, 0.0, 0.0]), "Tanh(0) should be 0" - - def test_single_element_tensor(self): - """Test all activations with single element tensor""" - relu = ReLU() - sigmoid = Sigmoid() - tanh = Tanh() - - x = Tensor([[1.0]]) - - y_relu = relu(x) - y_sigmoid = sigmoid(x) - y_tanh = tanh(x) - - assert y_relu.shape == (1, 1) - assert y_sigmoid.shape == (1, 1) - assert y_tanh.shape == (1, 1) - - def test_large_tensor(self): - """Test activations with larger tensors""" - relu = ReLU() - sigmoid = Sigmoid() - tanh = Tanh() - - # Create a 10x10 tensor - x = Tensor(np.random.randn(10, 10)) - - y_relu = relu(x) - y_sigmoid = sigmoid(x) - y_tanh = tanh(x) - - assert y_relu.shape == (10, 10) - assert y_sigmoid.shape == (10, 10) - assert y_tanh.shape == (10, 10) - - -if __name__ == "__main__": - # Run tests with pytest - pytest.main([__file__, "-v"]) \ No newline at end of file diff --git a/modules/source/03_layers/layers_dev.py b/modules/source/03_layers/layers_dev.py index 8a7e362f..bbc89ed5 100644 --- a/modules/source/03_layers/layers_dev.py +++ b/modules/source/03_layers/layers_dev.py @@ -46,8 +46,8 @@ except ImportError: sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations')) try: - from tensor_dev import Tensor - from activations_dev import ReLU, Sigmoid, Tanh, Softmax + from tensor_dev import Tensor + from activations_dev import ReLU, Sigmoid, Tanh, Softmax except ImportError: # If the local modules are not available, use relative imports from ..tensor.tensor_dev import Tensor @@ -188,7 +188,7 @@ 
def matmul_naive(A: np.ndarray, B: np.ndarray) -> np.ndarray: Naive matrix multiplication using explicit for-loops. This helps you understand what matrix multiplication really does! - + TODO: Implement matrix multiplication using three nested for-loops. STEP-BY-STEP IMPLEMENTATION: @@ -259,8 +259,8 @@ Once you implement the `matmul_naive` function above, run this cell to test it: def test_matrix_multiplication(): """Test matrix multiplication implementation""" print("Testing matrix multiplication...") - - # Test simple 2x2 case + + # Test simple 2x2 case A = np.array([[1, 2], [3, 4]], dtype=np.float32) B = np.array([[5, 6], [7, 8]], dtype=np.float32) @@ -272,8 +272,8 @@ def test_matrix_multiplication(): # Compare with NumPy numpy_result = A @ B assert np.allclose(result, numpy_result), f"Doesn't match NumPy: got {result}, expected {numpy_result}" - - # Test different shapes + + # Test different shapes A2 = np.array([[1, 2, 3]], dtype=np.float32) # 1x3 B2 = np.array([[4], [5], [6]], dtype=np.float32) # 3x1 result2 = matmul_naive(A2, B2) @@ -423,7 +423,7 @@ class Dense: else: self.bias = None ### END SOLUTION - + def forward(self, x: Tensor) -> Tensor: """ Forward pass through the Dense layer. 
@@ -472,7 +472,7 @@ class Dense: return Tensor(linear_output) ### END SOLUTION - + def __call__(self, x: Tensor) -> Tensor: """Make the layer callable: layer(x) instead of layer.forward(x)""" return self.forward(x) @@ -509,8 +509,8 @@ def test_dense_layer(): batch_output = layer(batch_input) assert batch_output.shape == (2, 2), f"Batch output shape should be (2, 2), got {batch_output.shape}" - - # Test without bias + + # Test without bias no_bias_layer = Dense(input_size=3, output_size=2, use_bias=False) assert no_bias_layer.bias is None, "Layer without bias should have None bias" @@ -538,7 +538,7 @@ def test_dense_layer(): scaled_output = layer(scaled_input) # Due to bias, this won't be exactly 2*output, but the linear part should scale - print("✅ Dense layer tests passed!") + print("✅ Dense layer tests passed!") print(f"✅ Correct weight and bias initialization") print(f"✅ Forward pass produces correct shapes") print(f"✅ Batch processing works correctly") @@ -582,7 +582,7 @@ def test_layer_activation_integration(): # Create layer and activation functions layer = Dense(input_size=4, output_size=3) - relu = ReLU() + relu = ReLU() sigmoid = Sigmoid() tanh = Tanh() softmax = Softmax() diff --git a/modules/source/03_layers/layers_dev_backup.py b/modules/source/03_layers/layers_dev_backup.py new file mode 100644 index 00000000..576016f5 --- /dev/null +++ b/modules/source/03_layers/layers_dev_backup.py @@ -0,0 +1,1286 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.17.1 +# --- + +# %% [markdown] +""" +# Module 3: Layers - Building Blocks of Neural Networks + +Welcome to the Layers module! This is where we build the fundamental components that stack together to form neural networks. 
+ +## Learning Goals +- Understand how matrix multiplication powers neural networks +- Implement naive matrix multiplication from scratch for deep understanding +- Build the Dense (Linear) layer - the foundation of all neural networks +- Learn weight initialization strategies and their importance +- See how layers compose with activations to create powerful networks + +## Build → Use → Understand +1. **Build**: Matrix multiplication and Dense layers from scratch +2. **Use**: Create and test layers with real data +3. **Understand**: How linear transformations enable feature learning +""" + +# %% nbgrader={"grade": false, "grade_id": "layers-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| default_exp core.layers + +#| export +import numpy as np +import matplotlib.pyplot as plt +import os +import sys +from typing import Union, List, Tuple, Optional + +# Import our dependencies - try from package first, then local modules +try: + from tinytorch.core.tensor import Tensor + from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax +except ImportError: + # For development, import from local modules + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations')) + from tensor_dev import Tensor + from activations_dev import ReLU, Sigmoid, Tanh, Softmax + +# %% nbgrader={"grade": false, "grade_id": "layers-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| hide +#| export +def _should_show_plots(): + """Check if we should show plots (disable during testing)""" + # Check multiple conditions that indicate we're in test mode + is_pytest = ( + 'pytest' in sys.modules or + 'test' in sys.argv or + os.environ.get('PYTEST_CURRENT_TEST') is not None or + any('test' in arg for arg in sys.argv) or + any('pytest' in arg for arg in sys.argv) + ) + + # Show plots in development mode (when not in test mode) + 
return not is_pytest + +# %% nbgrader={"grade": false, "grade_id": "layers-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} +print("🔥 TinyTorch Layers Module") +print(f"NumPy version: {np.__version__}") +print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") +print("Ready to build neural network layers!") + +# %% [markdown] +""" +## 📦 Where This Code Lives in the Final Package + +**Learning Side:** You work in `modules/source/03_layers/layers_dev.py` +**Building Side:** Code exports to `tinytorch.core.layers` + +```python +# Final package structure: +from tinytorch.core.layers import Dense, Conv2D # All layer types together! +from tinytorch.core.tensor import Tensor # The foundation +from tinytorch.core.activations import ReLU, Sigmoid # Nonlinearity +``` + +**Why this matters:** +- **Learning:** Focused modules for deep understanding +- **Production:** Proper organization like PyTorch's `torch.nn.Linear` +- **Consistency:** All layer types live together in `core.layers` +- **Integration:** Works seamlessly with tensors and activations +""" + +# %% [markdown] +""" +## 🧠 The Mathematical Foundation of Neural Layers + +### Linear Algebra at the Heart of ML +Neural networks are fundamentally about **linear transformations** followed by **nonlinear activations**: + +$$\text{Layer: } y = Wx + b \text{ (linear transformation)}$$ +$$\text{Activation: } z = \sigma(y) \text{ (nonlinear transformation)}$$ + +### Matrix Multiplication: The Engine of Deep Learning +Every forward pass in a neural network involves matrix multiplication: +- **Dense layers**: Matrix multiplication between inputs and weights +- **Convolutional layers**: Convolution as matrix multiplication +- **Attention**: Query-key-value matrix operations +- **Transformers**: Self-attention through matrix operations + +### Why Matrix Multiplication Matters +- **Parallel computation**: GPUs excel at matrix operations +- **Batch processing**: Handle multiple 
samples simultaneously +- **Feature learning**: Each row/column learns different patterns +- **Composability**: Layers stack naturally through matrix chains + +### Connection to Real ML Systems +Every framework optimizes matrix multiplication: +- **PyTorch**: `torch.nn.Linear` uses optimized BLAS +- **TensorFlow**: `tf.keras.layers.Dense` uses cuDNN +- **JAX**: `jax.numpy.dot` uses XLA compilation +- **TinyTorch**: `tinytorch.core.layers.Dense` (what we're building!) + +### Performance Considerations +- **Memory layout**: Contiguous arrays for cache efficiency +- **Vectorization**: SIMD operations for speed +- **Parallelization**: Multi-threading and GPU acceleration +- **Numerical stability**: Proper initialization and normalization +""" + +# %% [markdown] +""" +## Step 1: Understanding Matrix Multiplication + +### What is Matrix Multiplication? +Matrix multiplication is the **fundamental operation** that powers neural networks. When we multiply matrices A and B: + +$$C = A \times B$$ + +Each element $C_{i,j}$ is the **dot product** of row $i$ from A and column $j$ from B. 
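
The element-wise definition above can be checked directly: each output element is the dot product of one row of A with one column of B. Here is a minimal standalone NumPy sketch (independent of the TinyTorch code in this diff) that builds C that way and confirms it agrees with NumPy's built-in matmul:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

# Build C element by element: C[i, j] = dot(row i of A, column j of B)
C = np.empty((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        C[i, j] = np.dot(A[i, :], B[:, j])

assert np.allclose(C, A @ B)  # agrees with NumPy's optimized matmul
```

This is the same triple-loop idea the module's `matmul_naive` exercise asks for, with the innermost k-loop folded into `np.dot`.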
+ +### The Mathematical Foundation: Linear Algebra in Neural Networks + +#### **Why Matrix Multiplication in Neural Networks?** +Neural networks are fundamentally about **linear transformations** followed by **nonlinear activations**: + +```python +# The core neural network operation: +linear_output = weights @ input + bias # Linear transformation (matrix multiplication) +activation_output = activation_function(linear_output) # Nonlinear transformation +``` + +#### **The Geometric Interpretation** +Matrix multiplication represents **geometric transformations** in high-dimensional space: + +- **Rotation**: Changing the orientation of data +- **Scaling**: Stretching or compressing along certain dimensions +- **Projection**: Mapping to lower or higher dimensional spaces +- **Translation**: Shifting data (via bias terms) + +#### **Why This Matters for Learning** +Each layer learns to transform the input space to make the final task easier: + +```python +# Example: Image classification +raw_pixels → [Layer 1] → edges → [Layer 2] → shapes → [Layer 3] → objects → [Layer 4] → classes +``` + +### The Computational Perspective + +#### **Batch Processing Power** +Matrix multiplication enables efficient batch processing: + +```python +# Single sample (inefficient): +for sample in batch: + output = weights @ sample + bias # Process one at a time + +# Batch processing (efficient): +batch_output = weights @ batch + bias # Process all samples simultaneously +``` + +#### **Parallelization Benefits** +- **CPU**: Multiple cores can compute different parts simultaneously +- **GPU**: Thousands of cores excel at matrix operations +- **TPU**: Specialized hardware designed for matrix multiplication +- **Memory**: Contiguous memory access patterns improve cache efficiency + +#### **Computational Complexity** +For matrices A(m×n) and B(n×p): +- **Time complexity**: O(mnp) - cubic in the worst case +- **Space complexity**: O(mp) - for the output matrix +- **Optimization**: Modern libraries 
use optimized algorithms (Strassen, etc.) + +### Real-World Applications: Where Matrix Multiplication Shines + +#### **Computer Vision** +```python +# Convolutional layers can be expressed as matrix multiplication: +# Image patches → Matrix A +# Convolutional filters → Matrix B +# Feature maps → Matrix C = A @ B +``` + +#### **Natural Language Processing** +```python +# Transformer attention mechanism: +# Query matrix Q, Key matrix K, Value matrix V +# Attention weights = softmax(Q @ K.T / sqrt(d_k)) +# Output = Attention_weights @ V +``` + +#### **Recommendation Systems** +```python +# Matrix factorization: +# User-item matrix R ≈ User_factors @ Item_factors.T +# Collaborative filtering through matrix operations +``` + +### The Algorithm: Understanding Every Step + +For matrices A(m×n) and B(n×p) → C(m×p): +```python +for i in range(m): # For each row of A + for j in range(p): # For each column of B + for k in range(n): # Compute dot product + C[i,j] += A[i,k] * B[k,j] +``` + +#### **Visual Breakdown** +``` +A = [[1, 2], B = [[5, 6], C = [[19, 22], + [3, 4]] [7, 8]] [43, 50]] + +C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] = 1*5 + 2*7 = 19 +C[0,1] = A[0,0]*B[0,1] + A[0,1]*B[1,1] = 1*6 + 2*8 = 22 +C[1,0] = A[1,0]*B[0,0] + A[1,1]*B[1,0] = 3*5 + 4*7 = 43 +C[1,1] = A[1,0]*B[0,1] + A[1,1]*B[1,1] = 3*6 + 4*8 = 50 +``` + +#### **Memory Access Pattern** +- **Row-major order**: Access elements row by row for cache efficiency +- **Cache locality**: Nearby elements are likely to be accessed together +- **Blocking**: Divide large matrices into blocks for better cache usage + +### Performance Considerations: Making It Fast + +#### **Optimization Strategies** +1. **Vectorization**: Use SIMD instructions for parallel element operations +2. **Blocking**: Divide matrices into cache-friendly blocks +3. **Loop unrolling**: Reduce loop overhead +4. 
**Memory alignment**: Ensure data is aligned for optimal access + +#### **Modern Libraries** +- **BLAS (Basic Linear Algebra Subprograms)**: Optimized matrix operations +- **Intel MKL**: Highly optimized for Intel processors +- **OpenBLAS**: Open-source optimized BLAS +- **cuBLAS**: GPU-accelerated BLAS from NVIDIA + +#### **Why We Implement Naive Version** +Understanding the basic algorithm helps you: +- **Debug performance issues**: Know what's happening under the hood +- **Optimize for specific cases**: Custom implementations for special matrices +- **Understand complexity**: Appreciate the optimizations in modern libraries +- **Educational value**: See the mathematical foundation clearly + +### Connection to Neural Network Architecture + +#### **Layer Composition** +```python +# Each layer is a matrix multiplication: +layer1_output = W1 @ input + b1 +layer2_output = W2 @ layer1_output + b2 +layer3_output = W3 @ layer2_output + b3 + +# This is equivalent to: +final_output = W3 @ (W2 @ (W1 @ input + b1) + b2) + b3 +``` + +#### **Gradient Flow** +During backpropagation, gradients flow through matrix operations: +```python +# Forward: y = W @ x + b +# Backward: +# dW = dy @ x.T +# dx = W.T @ dy +# db = dy.sum(axis=0) +``` + +#### **Weight Initialization** +Matrix multiplication behavior depends on weight initialization: +- **Xavier/Glorot**: Maintains variance across layers +- **He initialization**: Optimized for ReLU activations +- **Orthogonal**: Preserves gradient norms + +Let's implement matrix multiplication to truly understand it! +""" + +# %% nbgrader={"grade": false, "grade_id": "matmul-naive", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def matmul_naive(A: np.ndarray, B: np.ndarray) -> np.ndarray: + """ + Naive matrix multiplication using explicit for-loops. + + This helps you understand what matrix multiplication really does! 
+ + Args: + A: Matrix of shape (m, n) + B: Matrix of shape (n, p) + + Returns: + Matrix of shape (m, p) where C[i,j] = sum(A[i,k] * B[k,j] for k in range(n)) + + TODO: Implement matrix multiplication using three nested for-loops. + + APPROACH: + 1. Get the dimensions: m, n from A and n2, p from B + 2. Check that n == n2 (matrices must be compatible) + 3. Create output matrix C of shape (m, p) filled with zeros + 4. Use three nested loops: + - i loop: rows of A (0 to m-1) + - j loop: columns of B (0 to p-1) + - k loop: shared dimension (0 to n-1) + 5. For each (i,j), compute: C[i,j] += A[i,k] * B[k,j] + + EXAMPLE: + A = [[1, 2], B = [[5, 6], + [3, 4]] [7, 8]] + + C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] = 1*5 + 2*7 = 19 + C[0,1] = A[0,0]*B[0,1] + A[0,1]*B[1,1] = 1*6 + 2*8 = 22 + C[1,0] = A[1,0]*B[0,0] + A[1,1]*B[1,0] = 3*5 + 4*7 = 43 + C[1,1] = A[1,0]*B[0,1] + A[1,1]*B[1,1] = 3*6 + 4*8 = 50 + + HINTS: + - Start with C = np.zeros((m, p)) + - Use three nested for loops: for i in range(m): for j in range(p): for k in range(n): + - Accumulate the sum: C[i,j] += A[i,k] * B[k,j] + """ + ### BEGIN SOLUTION + # Get matrix dimensions + m, n = A.shape + n2, p = B.shape + + # Check compatibility + if n != n2: + raise ValueError(f"Incompatible matrix dimensions: A is {m}x{n}, B is {n2}x{p}") + + # Initialize result matrix + C = np.zeros((m, p)) + + # Triple nested loop for matrix multiplication + for i in range(m): + for j in range(p): + for k in range(n): + C[i, j] += A[i, k] * B[k, j] + + return C + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Unit Test: Matrix Multiplication + +Let's test your matrix multiplication implementation right away! This is the foundation of neural networks. + +**This is a unit test** - it tests one specific function (matmul_naive) in isolation. 
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-matmul-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +# Test matrix multiplication immediately after implementation +print("🔬 Unit Test: Matrix Multiplication...") + +# Test simple 2x2 case +try: + A = np.array([[1, 2], [3, 4]], dtype=np.float32) + B = np.array([[5, 6], [7, 8]], dtype=np.float32) + + result = matmul_naive(A, B) + expected = np.array([[19, 22], [43, 50]], dtype=np.float32) + + assert np.allclose(result, expected), f"Matrix multiplication failed: expected {expected}, got {result}" + print(f"✅ Simple 2x2 test: {A.tolist()} @ {B.tolist()} = {result.tolist()}") + + # Compare with NumPy + numpy_result = A @ B + assert np.allclose(result, numpy_result), f"Doesn't match NumPy: got {result}, expected {numpy_result}" + print("✅ Matches NumPy's result") + +except Exception as e: + print(f"❌ Matrix multiplication test failed: {e}") + raise + +# Test different shapes +try: + A2 = np.array([[1, 2, 3]], dtype=np.float32) # 1x3 + B2 = np.array([[4], [5], [6]], dtype=np.float32) # 3x1 + result2 = matmul_naive(A2, B2) + expected2 = np.array([[32]], dtype=np.float32) # 1*4 + 2*5 + 3*6 = 32 + + assert np.allclose(result2, expected2), f"Different shapes failed: got {result2}, expected {expected2}" + print(f"✅ Different shapes test: {A2.tolist()} @ {B2.tolist()} = {result2.tolist()}") + +except Exception as e: + print(f"❌ Different shapes test failed: {e}") + raise + +# Show the algorithm in action +print("🎯 Matrix multiplication algorithm:") +print(" C[i,j] = Σ(A[i,k] * B[k,j]) for all k") +print(" Triple nested loops compute each element") +print("📈 Progress: Matrix multiplication ✓") + +# %% [markdown] +""" +## Step 2: Building the Dense Layer + +Now let's build the **Dense layer**, the most fundamental building block of neural networks. A Dense layer performs a linear transformation: `y = Wx + b` + +### What is a Dense Layer? 
+- **Linear transformation**: `y = Wx + b` +- **W**: Weight matrix (learnable parameters) +- **x**: Input tensor +- **b**: Bias vector (learnable parameters) +- **y**: Output tensor + +### Why Dense Layers Matter +- **Universal approximation**: Can approximate any function with enough neurons +- **Feature learning**: Each neuron learns a different feature +- **Nonlinearity**: When combined with activation functions, becomes very powerful +- **Foundation**: All other layers build on this concept + +### The Math +For input x of shape (batch_size, input_size): +- **W**: Weight matrix of shape (input_size, output_size) +- **b**: Bias vector of shape (output_size) +- **y**: Output of shape (batch_size, output_size) + +### Visual Example +``` +Input: x = [1, 2, 3] (3 features) +Weights: W = [[0.1, 0.2], Bias: b = [0.1, 0.2] + [0.3, 0.4], + [0.5, 0.6]] + +Step 1: Wx = [0.1*1 + 0.3*2 + 0.5*3, 0.2*1 + 0.4*2 + 0.6*3] + = [2.2, 2.8] + +Step 2: y = Wx + b = [2.2 + 0.1, 2.8 + 0.2] = [2.3, 3.0] +``` + +Let's implement this! +""" + +# %% nbgrader={"grade": false, "grade_id": "dense-class", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +class Dense: + """ + Dense (Linear) Layer: y = Wx + b + + The fundamental building block of neural networks. + Performs linear transformation: matrix multiplication + bias addition. + """ + + def __init__(self, input_size: int, output_size: int, use_bias: bool = True, + use_naive_matmul: bool = False): + """ + Initialize Dense layer with random weights. + + Args: + input_size: Number of input features + output_size: Number of output features + use_bias: Whether to include bias term (default: True) + use_naive_matmul: Whether to use naive matrix multiplication (for learning) + + TODO: Implement Dense layer initialization with proper weight initialization. + + APPROACH: + 1. Store layer parameters (input_size, output_size, use_bias, use_naive_matmul) + 2. Initialize weights with Xavier/Glorot initialization + 3. 
Initialize bias to zeros (if use_bias=True) + 4. Convert to float32 for consistency + + EXAMPLE: + Dense(3, 2) creates: + - weights: shape (3, 2) with small random values + - bias: shape (2,) with zeros + + HINTS: + - Use np.random.randn() for random initialization + - Scale weights by sqrt(2/(input_size + output_size)) for Xavier init + - Use np.zeros() for bias initialization + - Convert to float32 with .astype(np.float32) + """ + ### BEGIN SOLUTION + # Store parameters + self.input_size = input_size + self.output_size = output_size + self.use_bias = use_bias + self.use_naive_matmul = use_naive_matmul + + # Xavier/Glorot initialization + scale = np.sqrt(2.0 / (input_size + output_size)) + self.weights = np.random.randn(input_size, output_size).astype(np.float32) * scale + + # Initialize bias + if use_bias: + self.bias = np.zeros(output_size, dtype=np.float32) + else: + self.bias = None + ### END SOLUTION + + def forward(self, x: Tensor) -> Tensor: + """ + Forward pass: y = Wx + b + + Args: + x: Input tensor of shape (batch_size, input_size) + + Returns: + Output tensor of shape (batch_size, output_size) + + TODO: Implement matrix multiplication and bias addition. + + APPROACH: + 1. Choose matrix multiplication method based on use_naive_matmul flag + 2. Perform matrix multiplication: Wx + 3. Add bias if use_bias=True + 4. 
Return result wrapped in Tensor + + EXAMPLE: + Input x: Tensor([[1, 2, 3]]) # shape (1, 3) + Weights: shape (3, 2) + Output: Tensor([[val1, val2]]) # shape (1, 2) + + HINTS: + - Use self.use_naive_matmul to choose between matmul_naive and @ + - x.data gives you the numpy array + - Use broadcasting for bias addition: result + self.bias + - Return Tensor(result) to wrap the result + """ + ### BEGIN SOLUTION + # Matrix multiplication + if self.use_naive_matmul: + result = matmul_naive(x.data, self.weights) + else: + result = x.data @ self.weights + + # Add bias + if self.use_bias: + result += self.bias + + return Tensor(result) + ### END SOLUTION + + def __call__(self, x: Tensor) -> Tensor: + """Make layer callable: layer(x) same as layer.forward(x)""" + return self.forward(x) + +# %% [markdown] +""" +### 🧪 Unit Test: Dense Layer + +Let's test your Dense layer implementation! This is the fundamental building block of neural networks. + +**This is a unit test** - it tests one specific class (Dense layer) in isolation. 
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-dense-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +# Test Dense layer immediately after implementation +print("🔬 Unit Test: Dense Layer...") + +# Test basic Dense layer +try: + layer = Dense(input_size=3, output_size=2, use_bias=True) + x = Tensor([[1, 2, 3]]) # batch_size=1, input_size=3 + + print(f"Input shape: {x.shape}") + print(f"Layer weights shape: {layer.weights.shape}") + if layer.bias is not None: + print(f"Layer bias shape: {layer.bias.shape}") + + y = layer(x) + print(f"Output shape: {y.shape}") + print(f"Output: {y}") + + # Test shape compatibility + assert y.shape == (1, 2), f"Output shape should be (1, 2), got {y.shape}" + print("✅ Dense layer produces correct output shape") + + # Test weights initialization + assert layer.weights.shape == (3, 2), f"Weights shape should be (3, 2), got {layer.weights.shape}" + if layer.bias is not None: + assert layer.bias.shape == (2,), f"Bias shape should be (2,), got {layer.bias.shape}" + print("✅ Dense layer has correct weight and bias shapes") + + # Test that weights are not all zeros (proper initialization) + assert not np.allclose(layer.weights, 0), "Weights should not be all zeros" + if layer.bias is not None: + assert np.allclose(layer.bias, 0), "Bias should be initialized to zeros" + print("✅ Dense layer has proper weight initialization") + +except Exception as e: + print(f"❌ Dense layer test failed: {e}") + raise + +# Test without bias +try: + layer_no_bias = Dense(input_size=2, output_size=1, use_bias=False) + x2 = Tensor([[1, 2]]) + y2 = layer_no_bias(x2) + + assert y2.shape == (1, 1), f"No bias output shape should be (1, 1), got {y2.shape}" + assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" + print("✅ Dense layer works without bias") + +except Exception as e: + print(f"❌ Dense layer no-bias test failed: {e}") + raise + +# Test naive matrix multiplication +try: + 
layer_naive = Dense(input_size=2, output_size=2, use_naive_matmul=True) + x3 = Tensor([[1, 2]]) + y3 = layer_naive(x3) + + assert y3.shape == (1, 2), f"Naive matmul output shape should be (1, 2), got {y3.shape}" + print("✅ Dense layer works with naive matrix multiplication") + +except Exception as e: + print(f"❌ Dense layer naive matmul test failed: {e}") + raise + +# Show the linear transformation in action +print("🎯 Dense layer behavior:") +print(" y = Wx + b (linear transformation)") +print(" W: learnable weight matrix") +print(" b: learnable bias vector") +print("📈 Progress: Matrix multiplication ✓, Dense layer ✓") + +# %% [markdown] +""" +### 🧪 Test Your Implementations + +Once you implement the functions above, run these cells to test them: +""" + +# %% nbgrader={"grade": true, "grade_id": "test-matmul-naive", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test matrix multiplication +print("Testing matrix multiplication...") + +# Test case 1: Simple 2x2 matrices +A = np.array([[1, 2], [3, 4]], dtype=np.float32) +B = np.array([[5, 6], [7, 8]], dtype=np.float32) + +result = matmul_naive(A, B) +expected = np.array([[19, 22], [43, 50]], dtype=np.float32) + +print(f"Matrix A:\n{A}") +print(f"Matrix B:\n{B}") +print(f"Your result:\n{result}") +print(f"Expected:\n{expected}") + +assert np.allclose(result, expected), f"Result doesn't match expected: got {result}, expected {expected}" + +# Test case 2: Compare with NumPy +numpy_result = A @ B +assert np.allclose(result, numpy_result), f"Doesn't match NumPy result: got {result}, expected {numpy_result}" + +# Test case 3: Different shapes +A2 = np.array([[1, 2, 3]], dtype=np.float32) # 1x3 +B2 = np.array([[4], [5], [6]], dtype=np.float32) # 3x1 +result2 = matmul_naive(A2, B2) +expected2 = np.array([[32]], dtype=np.float32) # 1*4 + 2*5 + 3*6 = 32 +assert np.allclose(result2, expected2), f"Different shapes failed: got {result2}, expected {expected2}" + +print("✅ Matrix 
multiplication tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-dense-layer", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test Dense layer +print("Testing Dense layer...") + +# Test basic Dense layer +layer = Dense(input_size=3, output_size=2, use_bias=True) +x = Tensor([[1, 2, 3]]) # batch_size=1, input_size=3 + +print(f"Input shape: {x.shape}") +print(f"Layer weights shape: {layer.weights.shape}") +if layer.bias is not None: + print(f"Layer bias shape: {layer.bias.shape}") +else: + print("Layer bias: None") + +y = layer(x) +print(f"Output shape: {y.shape}") +print(f"Output: {y}") + +# Test shape compatibility +assert y.shape == (1, 2), f"Output shape should be (1, 2), got {y.shape}" + +# Test without bias +layer_no_bias = Dense(input_size=2, output_size=1, use_bias=False) +x2 = Tensor([[1, 2]]) +y2 = layer_no_bias(x2) +assert y2.shape == (1, 1), f"No bias output shape should be (1, 1), got {y2.shape}" +assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" + +# Test naive matrix multiplication +layer_naive = Dense(input_size=2, output_size=2, use_naive_matmul=True) +x3 = Tensor([[1, 2]]) +y3 = layer_naive(x3) +assert y3.shape == (1, 2), f"Naive matmul output shape should be (1, 2), got {y3.shape}" + +print("✅ Dense layer tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-layer-composition", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test layer composition +print("Testing layer composition...") + +# Create a simple network: Dense → ReLU → Dense +dense1 = Dense(input_size=3, output_size=2) +relu = ReLU() +dense2 = Dense(input_size=2, output_size=1) + +# Test input +x = Tensor([[1, 2, 3]]) +print(f"Input: {x}") + +# Forward pass through the network +h1 = dense1(x) +print(f"After Dense1: {h1}") + +h2 = relu(h1) +print(f"After ReLU: {h2}") + +h3 = dense2(h2) +print(f"After Dense2: {h3}") + +# Test shapes +assert h1.shape == 
(1, 2), f"Dense1 output should be (1, 2), got {h1.shape}" +assert h2.shape == (1, 2), f"ReLU output should be (1, 2), got {h2.shape}" +assert h3.shape == (1, 1), f"Dense2 output should be (1, 1), got {h3.shape}" + +# Test that ReLU actually applied (non-negative values) +assert np.all(h2.data >= 0), "ReLU should produce non-negative values" + +print("✅ Layer composition tests passed!") + +# %% [markdown] +""" +## 🧪 Comprehensive Testing: Matrix Multiplication and Dense Layers + +Let's thoroughly test your implementations to make sure they work correctly in all scenarios. +This comprehensive testing ensures your layers are robust and ready for real neural networks. +""" + +# %% nbgrader={"grade": true, "grade_id": "test-layers-comprehensive", "locked": true, "points": 30, "schema_version": 3, "solution": false, "task": false} +def test_layers_comprehensive(): + """Comprehensive test of matrix multiplication and Dense layers.""" + print("🔬 Testing matrix multiplication and Dense layers comprehensively...") + + tests_passed = 0 + total_tests = 10 + + # Test 1: Matrix Multiplication Basic Cases + try: + # Test 2x2 matrices + A = np.array([[1, 2], [3, 4]], dtype=np.float32) + B = np.array([[5, 6], [7, 8]], dtype=np.float32) + result = matmul_naive(A, B) + expected = np.array([[19, 22], [43, 50]], dtype=np.float32) + + assert np.allclose(result, expected), f"2x2 multiplication failed: expected {expected}, got {result}" + + # Compare with NumPy + numpy_result = A @ B + assert np.allclose(result, numpy_result), f"Doesn't match NumPy: expected {numpy_result}, got {result}" + + print(f"✅ Matrix multiplication 2x2: {A.shape} × {B.shape} = {result.shape}") + tests_passed += 1 + except Exception as e: + print(f"❌ Matrix multiplication basic failed: {e}") + + # Test 2: Matrix Multiplication Different Shapes + try: + # Test 1x3 × 3x1 = 1x1 + A1 = np.array([[1, 2, 3]], dtype=np.float32) + B1 = np.array([[4], [5], [6]], dtype=np.float32) + result1 = matmul_naive(A1, B1) + expected1 
= np.array([[32]], dtype=np.float32) # 1*4 + 2*5 + 3*6 = 32 + assert np.allclose(result1, expected1), f"1x3 × 3x1 failed: expected {expected1}, got {result1}" + + # Test 3x2 × 2x4 = 3x4 + A2 = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32) + B2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.float32) + result2 = matmul_naive(A2, B2) + expected2 = A2 @ B2 + assert np.allclose(result2, expected2), f"3x2 × 2x4 failed: expected {expected2}, got {result2}" + + print(f"✅ Matrix multiplication shapes: (1,3)×(3,1), (3,2)×(2,4)") + tests_passed += 1 + except Exception as e: + print(f"❌ Matrix multiplication shapes failed: {e}") + + # Test 3: Matrix Multiplication Edge Cases + try: + # Test with zeros + A_zero = np.zeros((2, 3), dtype=np.float32) + B_zero = np.zeros((3, 2), dtype=np.float32) + result_zero = matmul_naive(A_zero, B_zero) + expected_zero = np.zeros((2, 2), dtype=np.float32) + assert np.allclose(result_zero, expected_zero), "Zero matrix multiplication failed" + + # Test with identity + A_id = np.array([[1, 2]], dtype=np.float32) + B_id = np.array([[1, 0], [0, 1]], dtype=np.float32) + result_id = matmul_naive(A_id, B_id) + expected_id = np.array([[1, 2]], dtype=np.float32) + assert np.allclose(result_id, expected_id), "Identity matrix multiplication failed" + + # Test with negative values + A_neg = np.array([[-1, 2]], dtype=np.float32) + B_neg = np.array([[3], [-4]], dtype=np.float32) + result_neg = matmul_naive(A_neg, B_neg) + expected_neg = np.array([[-11]], dtype=np.float32) # -1*3 + 2*(-4) = -11 + assert np.allclose(result_neg, expected_neg), "Negative matrix multiplication failed" + + print("✅ Matrix multiplication edge cases: zeros, identity, negatives") + tests_passed += 1 + except Exception as e: + print(f"❌ Matrix multiplication edge cases failed: {e}") + + # Test 4: Dense Layer Initialization + try: + # Test with bias + layer_bias = Dense(input_size=3, output_size=2, use_bias=True) + assert layer_bias.weights.shape == (3, 2), f"Weights shape 
should be (3, 2), got {layer_bias.weights.shape}" + assert layer_bias.bias is not None, "Bias should not be None when use_bias=True" + assert layer_bias.bias.shape == (2,), f"Bias shape should be (2,), got {layer_bias.bias.shape}" + + # Check weight initialization (should not be all zeros) + assert not np.allclose(layer_bias.weights, 0), "Weights should not be all zeros" + assert np.allclose(layer_bias.bias, 0), "Bias should be initialized to zeros" + + # Test without bias + layer_no_bias = Dense(input_size=4, output_size=3, use_bias=False) + assert layer_no_bias.weights.shape == (4, 3), f"No-bias weights shape should be (4, 3), got {layer_no_bias.weights.shape}" + assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" + + print("✅ Dense layer initialization: weights, bias, shapes") + tests_passed += 1 + except Exception as e: + print(f"❌ Dense layer initialization failed: {e}") + + # Test 5: Dense Layer Forward Pass + try: + layer = Dense(input_size=3, output_size=2, use_bias=True) + + # Test single sample + x_single = Tensor([[1, 2, 3]]) # shape: (1, 3) + y_single = layer(x_single) + assert y_single.shape == (1, 2), f"Single sample output should be (1, 2), got {y_single.shape}" + + # Test batch of samples + x_batch = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # shape: (3, 3) + y_batch = layer(x_batch) + assert y_batch.shape == (3, 2), f"Batch output should be (3, 2), got {y_batch.shape}" + + # Verify computation manually for single sample + expected_single = np.dot(x_single.data, layer.weights) + layer.bias + assert np.allclose(y_single.data, expected_single), "Single sample computation incorrect" + + print("✅ Dense layer forward pass: single sample, batch processing") + tests_passed += 1 + except Exception as e: + print(f"❌ Dense layer forward pass failed: {e}") + + # Test 6: Dense Layer Without Bias + try: + layer_no_bias = Dense(input_size=2, output_size=3, use_bias=False) + x = Tensor([[1, 2]]) + y = layer_no_bias(x) + + assert y.shape == 
(1, 3), f"No-bias output should be (1, 3), got {y.shape}" + + # Verify computation (should be just matrix multiplication) + expected = np.dot(x.data, layer_no_bias.weights) + assert np.allclose(y.data, expected), "No-bias computation incorrect" + + print("✅ Dense layer without bias: correct computation") + tests_passed += 1 + except Exception as e: + print(f"❌ Dense layer without bias failed: {e}") + + # Test 7: Dense Layer with Naive Matrix Multiplication + try: + layer_naive = Dense(input_size=2, output_size=2, use_naive_matmul=True) + layer_optimized = Dense(input_size=2, output_size=2, use_naive_matmul=False) + + # Set same weights for comparison + layer_optimized.weights = layer_naive.weights.copy() + layer_optimized.bias = layer_naive.bias.copy() if layer_naive.bias is not None else None + + x = Tensor([[1, 2]]) + y_naive = layer_naive(x) + y_optimized = layer_optimized(x) + + # Both should give same results + assert np.allclose(y_naive.data, y_optimized.data), "Naive and optimized should give same results" + + print("✅ Dense layer naive vs optimized: consistent results") + tests_passed += 1 + except Exception as e: + print(f"❌ Dense layer naive matmul failed: {e}") + + # Test 8: Layer Composition + try: + # Create a simple network: Dense → ReLU → Dense + dense1 = Dense(input_size=3, output_size=4) + relu = ReLU() + dense2 = Dense(input_size=4, output_size=2) + + x = Tensor([[1, -2, 3]]) + + # Forward pass + h1 = dense1(x) + h2 = relu(h1) + h3 = dense2(h2) + + # Check shapes + assert h1.shape == (1, 4), f"Dense1 output should be (1, 4), got {h1.shape}" + assert h2.shape == (1, 4), f"ReLU output should be (1, 4), got {h2.shape}" + assert h3.shape == (1, 2), f"Dense2 output should be (1, 2), got {h3.shape}" + + # Check ReLU effect + assert np.all(h2.data >= 0), "ReLU should produce non-negative values" + + print("✅ Layer composition: Dense → ReLU → Dense pipeline") + tests_passed += 1 + except Exception as e: + print(f"❌ Layer composition failed: {e}") + + # 
Test 9: Different Layer Sizes + try: + # Test various layer sizes + test_configs = [ + (1, 1), # Minimal + (10, 5), # Medium + (100, 50), # Large + (784, 128) # MNIST-like + ] + + for input_size, output_size in test_configs: + layer = Dense(input_size=input_size, output_size=output_size) + + # Test with single sample + x = Tensor(np.random.randn(1, input_size)) + y = layer(x) + + assert y.shape == (1, output_size), f"Size ({input_size}, {output_size}) failed: got {y.shape}" + assert layer.weights.shape == (input_size, output_size), f"Weights shape wrong for ({input_size}, {output_size})" + + print("✅ Different layer sizes: (1,1), (10,5), (100,50), (784,128)") + tests_passed += 1 + except Exception as e: + print(f"❌ Different layer sizes failed: {e}") + + # Test 10: Real Neural Network Scenario + try: + # Simulate MNIST-like scenario: 784 → 128 → 64 → 10 + input_layer = Dense(input_size=784, output_size=128) + hidden_layer = Dense(input_size=128, output_size=64) + output_layer = Dense(input_size=64, output_size=10) + + relu1 = ReLU() + relu2 = ReLU() + softmax = Softmax() + + # Simulate flattened MNIST image + x = Tensor(np.random.randn(32, 784)) # Batch of 32 images + + # Forward pass through network + h1 = input_layer(x) + h1_activated = relu1(h1) + h2 = hidden_layer(h1_activated) + h2_activated = relu2(h2) + logits = output_layer(h2_activated) + probabilities = softmax(logits) + + # Check final output + assert probabilities.shape == (32, 10), f"Final output should be (32, 10), got {probabilities.shape}" + + # Check that probabilities sum to 1 for each sample + row_sums = np.sum(probabilities.data, axis=1) + assert np.allclose(row_sums, 1.0), "Each sample should have probabilities summing to 1" + + # Check that all intermediate shapes are correct + assert h1.shape == (32, 128), f"Hidden 1 shape should be (32, 128), got {h1.shape}" + assert h2.shape == (32, 64), f"Hidden 2 shape should be (32, 64), got {h2.shape}" + assert logits.shape == (32, 10), f"Logits shape 
should be (32, 10), got {logits.shape}" + + print("✅ Real neural network scenario: MNIST-like 784→128→64→10 classification") + tests_passed += 1 + except Exception as e: + print(f"❌ Real neural network scenario failed: {e}") + + # Results summary + print(f"\n📊 Layers Module Results: {tests_passed}/{total_tests} tests passed") + + if tests_passed == total_tests: + print("🎉 All layers tests passed! Your implementations support:") + print(" • Matrix multiplication: naive implementation from scratch") + print(" • Dense layers: linear transformations with learnable parameters") + print(" • Weight initialization: proper random initialization") + print(" • Bias handling: optional bias terms") + print(" • Batch processing: multiple samples at once") + print(" • Layer composition: building complete neural networks") + print(" • Real ML scenarios: MNIST-like classification networks") + print("📈 Progress: All Layer Functionality ✓") + return True + else: + print("⚠️ Some layers tests failed. Common issues:") + print(" • Check matrix multiplication implementation (triple nested loops)") + print(" • Verify Dense layer forward pass (y = Wx + b)") + print(" • Ensure proper weight initialization (not all zeros)") + print(" • Check shape handling for different input/output sizes") + print(" • Verify bias handling when use_bias=False") + return False + +# Run the comprehensive test +success = test_layers_comprehensive() + +# %% [markdown] +""" +### 🧪 Integration Test: Layers in Complete Neural Networks + +Let's test how your layers work in realistic neural network architectures. 
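As a debugging reference, here is a minimal sketch of what a triple-nested-loop `matmul_naive` can look like. This is an illustrative version only — your own implementation in this module is the one the tests actually call, and it may differ in details:

```python
import numpy as np

def matmul_naive(A, B):
    """Naive O(n^3) matrix multiplication: out[i, j] = sum_k A[i, k] * B[k, j]."""
    assert A.shape[1] == B.shape[0], "Inner dimensions must match"
    rows, inner, cols = A.shape[0], A.shape[1], B.shape[1]
    out = np.zeros((rows, cols), dtype=A.dtype)
    for i in range(rows):           # each output row
        for j in range(cols):       # each output column
            for k in range(inner):  # dot product of row i with column j
                out[i, j] += A[i, k] * B[k, j]
    return out

A = np.array([[1, 2], [3, 4]], dtype=np.float32)
B = np.array([[5, 6], [7, 8]], dtype=np.float32)
result = matmul_naive(A, B)  # matches A @ B: [[19, 22], [43, 50]]
```

If your implementation disagrees with `A @ B` on a small example like this, check the loop bounds and which index walks the shared inner dimension.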
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-layers-integration", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} +def test_layers_integration(): + """Integration test with complete neural network architectures.""" + print("🔬 Testing layers in complete neural network architectures...") + + try: + print("🧠 Building and testing different network architectures...") + + # Architecture 1: Simple Binary Classifier + print("\n📊 Architecture 1: Binary Classification Network") + binary_net = [ + Dense(input_size=4, output_size=8), + ReLU(), + Dense(input_size=8, output_size=4), + ReLU(), + Dense(input_size=4, output_size=1), + Sigmoid() + ] + + # Test with batch of samples + x_binary = Tensor(np.random.randn(10, 4)) # 10 samples, 4 features + + # Forward pass through network + current = x_binary + for i, layer in enumerate(binary_net): + current = layer(current) + print(f" Layer {i}: {current.shape}") + + # Verify final output is valid probabilities + assert current.shape == (10, 1), f"Binary classifier output should be (10, 1), got {current.shape}" + assert np.all((current.data >= 0) & (current.data <= 1)), "Binary probabilities should be in [0,1]" + + print("✅ Binary classification network: 4→8→4→1 with ReLU/Sigmoid") + + # Architecture 2: Multi-class Classifier + print("\n📊 Architecture 2: Multi-class Classification Network") + multiclass_net = [ + Dense(input_size=784, output_size=256), + ReLU(), + Dense(input_size=256, output_size=128), + ReLU(), + Dense(input_size=128, output_size=10), + Softmax() + ] + + # Simulate MNIST-like input + x_mnist = Tensor(np.random.randn(5, 784)) # 5 images, 784 pixels + + current = x_mnist + for i, layer in enumerate(multiclass_net): + current = layer(current) + print(f" Layer {i}: {current.shape}") + + # Verify final output is valid probability distribution + assert current.shape == (5, 10), f"Multi-class output should be (5, 10), got {current.shape}" + row_sums = np.sum(current.data, 
axis=1) + assert np.allclose(row_sums, 1.0), "Each sample should have probabilities summing to 1" + + print("✅ Multi-class classification network: 784→256→128→10 with Softmax") + + # Architecture 3: Deep Network + print("\n📊 Architecture 3: Deep Network (5 layers)") + deep_net = [ + Dense(input_size=100, output_size=80), + ReLU(), + Dense(input_size=80, output_size=60), + ReLU(), + Dense(input_size=60, output_size=40), + ReLU(), + Dense(input_size=40, output_size=20), + ReLU(), + Dense(input_size=20, output_size=3), + Softmax() + ] + + x_deep = Tensor(np.random.randn(8, 100)) # 8 samples, 100 features + + current = x_deep + for i, layer in enumerate(deep_net): + current = layer(current) + if i % 2 == 0: # Print every other layer to save space + print(f" Layer {i}: {current.shape}") + + assert current.shape == (8, 3), f"Deep network output should be (8, 3), got {current.shape}" + + print("✅ Deep network: 100→80→60→40→20→3 with multiple ReLU layers") + + # Test 4: Network with Different Activation Functions + print("\n📊 Architecture 4: Mixed Activation Functions") + mixed_net = [ + Dense(input_size=6, output_size=4), + Tanh(), # Zero-centered activation + Dense(input_size=4, output_size=3), + ReLU(), # Sparse activation + Dense(input_size=3, output_size=2), + Sigmoid() # Bounded activation + ] + + x_mixed = Tensor(np.random.randn(3, 6)) + + current = x_mixed + for i, layer in enumerate(mixed_net): + current = layer(current) + print(f" Layer {i}: {current.shape}, range: [{np.min(current.data):.3f}, {np.max(current.data):.3f}]") + + assert current.shape == (3, 2), f"Mixed network output should be (3, 2), got {current.shape}" + + print("✅ Mixed activations network: Tanh→ReLU→Sigmoid combinations") + + # Test 5: Parameter Counting + print("\n📊 Parameter Analysis") + + def count_parameters(layer): + """Count trainable parameters in a Dense layer.""" + if isinstance(layer, Dense): + weight_params = layer.weights.size + bias_params = layer.bias.size if layer.bias is not 
None else 0 + return weight_params + bias_params + return 0 + + # Count parameters in binary classifier + total_params = sum(count_parameters(layer) for layer in binary_net) + print(f"Binary classifier parameters: {total_params}") + + # Manual verification for first layer: 4*8 + 8 = 40 + first_dense = binary_net[0] + expected_first = 4 * 8 + 8 # weights + bias + actual_first = count_parameters(first_dense) + assert actual_first == expected_first, f"First layer params: expected {expected_first}, got {actual_first}" + + print("✅ Parameter counting: weight and bias parameters calculated correctly") + + # Test 6: Gradient Flow Preparation + print("\n📊 Gradient Flow Preparation") + + # Test that network can handle different input types + test_inputs = [ + Tensor(np.zeros((1, 4))), # All zeros + Tensor(np.ones((1, 4))), # All ones + Tensor(np.random.randn(1, 4)), # Random + Tensor(np.random.randn(1, 4) * 10) # Large values + ] + + for i, test_input in enumerate(test_inputs): + current = test_input + for layer in binary_net: + current = layer(current) + + # Check for numerical stability + assert not np.any(np.isnan(current.data)), f"Input {i} produced NaN" + assert not np.any(np.isinf(current.data)), f"Input {i} produced Inf" + + print("✅ Numerical stability: networks handle various input ranges") + + print("\n🎉 Integration test passed! 
Your layers work correctly in:") + print(" • Binary classification networks") + print(" • Multi-class classification networks") + print(" • Deep networks with multiple hidden layers") + print(" • Networks with mixed activation functions") + print(" • Parameter counting and analysis") + print(" • Numerical stability across input ranges") + print("📈 Progress: Layers ready for complete neural networks!") + + return True + + except Exception as e: + print(f"❌ Integration test failed: {e}") + print("\n💡 This suggests an issue with:") + print(" • Layer composition and chaining") + print(" • Shape compatibility between layers") + print(" • Activation function integration") + print(" • Numerical stability in deep networks") + print(" • Check your Dense layer and matrix multiplication") + return False + +# Run the integration test +success = test_layers_integration() and success + +# Print final summary +print(f"\n{'='*60}") +print("🎯 LAYERS MODULE TESTING COMPLETE") +print(f"{'='*60}") + +if success: + print("🎉 CONGRATULATIONS! All layers tests passed!") + print("\n✅ Your layers module successfully implements:") + print(" • Matrix multiplication: naive implementation from scratch") + print(" • Dense layers: y = Wx + b linear transformations") + print(" • Weight initialization: proper random weight setup") + print(" • Bias handling: optional bias terms") + print(" • Batch processing: efficient multi-sample computation") + print(" • Layer composition: building complete neural networks") + print(" • Integration: works with all activation functions") + print(" • Real ML scenarios: MNIST-like classification networks") + print("\n🚀 You're ready to build complete neural network architectures!") + print("📈 Final Progress: Layers Module ✓ COMPLETE") +else: + print("⚠️ Some tests failed. Please review the error messages above.") + print("\n🔧 To fix issues:") + print(" 1. Check your matrix multiplication implementation") + print(" 2. 
Verify Dense layer forward pass computation") + print(" 3. Ensure proper weight and bias initialization") + print(" 4. Test shape compatibility between layers") + print(" 5. Verify integration with activation functions") + print("\n💪 Keep building! These layers are the foundation of all neural networks.") + +# %% [markdown] +""" +## 🎯 Module Summary + +Congratulations! You've successfully implemented the core building blocks of neural networks: + +### What You've Accomplished +✅ **Matrix Multiplication**: Implemented from scratch with triple nested loops +✅ **Dense Layer**: The fundamental linear transformation y = Wx + b +✅ **Weight Initialization**: Xavier/Glorot initialization for stable training +✅ **Layer Composition**: Combining layers with activations +✅ **Flexible Implementation**: Support for both naive and optimized matrix multiplication + +### Key Concepts You've Learned +- **Matrix multiplication** is the engine of neural networks +- **Dense layers** perform linear transformations that learn features +- **Weight initialization** is crucial for stable training +- **Layer composition** creates powerful nonlinear functions +- **Batch processing** enables efficient computation + +### Mathematical Foundations +- **Linear algebra**: Matrix operations power all neural computations +- **Universal approximation**: Dense layers combined with nonlinear activations can approximate any continuous function +- **Feature learning**: Each neuron learns different patterns +- **Composability**: Simple operations combine to create complex behaviors + +### Next Steps +1. **Export your code**: `tito package nbdev --export 03_layers` +2. **Test your implementation**: `tito module test 03_layers` +3. **Use your layers**: + ```python + from tinytorch.core.layers import Dense + from tinytorch.core.activations import ReLU + layer = Dense(10, 5) + activation = ReLU() + ``` +4. **Move to Module 4**: Start building complete neural networks! 
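For reference, the Xavier/Glorot initialization mentioned above is commonly implemented with a uniform distribution. This is a sketch under that assumption — the exact formula your Dense layer uses may differ:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    # Glorot/Xavier uniform: sample from U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)), which keeps the variance of
    # activations roughly stable across layers at initialization
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out)).astype(np.float32)

W = xavier_uniform(784, 128)  # MNIST-sized input layer weights, shape (784, 128)
```

The key property is that the scale shrinks as layers get wider, so large layers start with proportionally smaller weights.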
+ +**Ready for the next challenge?** Let's compose these layers into complete neural network architectures! +""" \ No newline at end of file diff --git a/modules/source/03_layers/tests/test_layers.py b/modules/source/03_layers/tests/test_layers.py deleted file mode 100644 index e2597b8f..00000000 --- a/modules/source/03_layers/tests/test_layers.py +++ /dev/null @@ -1,336 +0,0 @@ -""" -Test suite for the layers module. -This tests the student implementations to ensure they work correctly. -""" - -import pytest -import numpy as np -import sys -import os - -# Import from the main package (rock solid foundation) -from tinytorch.core.tensor import Tensor -from tinytorch.core.layers import Dense -from tinytorch.core.activations import ReLU, Sigmoid, Tanh - -def safe_numpy(tensor): - """Get numpy array from tensor, using .numpy() if available, otherwise .data""" - if hasattr(tensor, 'numpy'): - return tensor.numpy() - else: - return tensor.data - -class TestDenseLayer: - """Test Dense (Linear) layer functionality.""" - - def test_dense_creation(self): - """Test creating Dense layers with different configurations.""" - # Basic dense layer - layer = Dense(input_size=3, output_size=2) - assert layer.input_size == 3 - assert layer.output_size == 2 - assert layer.use_bias == True - assert layer.weights.shape == (3, 2) - assert layer.bias.shape == (2,) - - # Dense layer without bias - layer_no_bias = Dense(input_size=4, output_size=3, use_bias=False) - assert layer_no_bias.use_bias == False - assert layer_no_bias.bias is None - - def test_dense_forward_single(self): - """Test Dense layer forward pass with single input.""" - layer = Dense(input_size=3, output_size=2) - - # Single input - x = Tensor([[1.0, 2.0, 3.0]]) - y = layer(x) - - assert y.shape == (1, 2) - assert isinstance(y, Tensor) - - def test_dense_forward_batch(self): - """Test Dense layer forward pass with batch input.""" - layer = Dense(input_size=3, output_size=2) - - # Batch input - x = Tensor([[1.0, 2.0, 3.0], 
[4.0, 5.0, 6.0]]) - y = layer(x) - - assert y.shape == (2, 2) - assert isinstance(y, Tensor) - - def test_dense_no_bias(self): - """Test Dense layer without bias.""" - layer = Dense(input_size=2, output_size=1, use_bias=False) - - x = Tensor([[1.0, 2.0]]) - y = layer(x) - - assert y.shape == (1, 1) - # Should be just matrix multiplication without bias - expected = safe_numpy(x) @ safe_numpy(layer.weights) - np.testing.assert_array_almost_equal(safe_numpy(y), expected) - - def test_dense_callable(self): - """Test that Dense layer is callable.""" - layer = Dense(input_size=2, output_size=1) - x = Tensor([[1.0, 2.0]]) - - # Both should work - y1 = layer.forward(x) - y2 = layer(x) - - np.testing.assert_array_equal(safe_numpy(y1), safe_numpy(y2)) - -class TestActivationFunctions: - """Test activation function implementations.""" - - def test_relu_basic(self): - """Test ReLU activation function.""" - relu = ReLU() - x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]]) - y = relu(x) - - expected = [[0.0, 0.0, 0.0, 1.0, 2.0]] - np.testing.assert_array_equal(safe_numpy(y), expected) - - def test_relu_callable(self): - """Test that ReLU is callable.""" - relu = ReLU() - x = Tensor([[1.0, -1.0]]) - - y1 = relu.forward(x) - y2 = relu(x) - - np.testing.assert_array_equal(safe_numpy(y1), safe_numpy(y2)) - - def test_sigmoid_basic(self): - """Test Sigmoid activation function.""" - sigmoid = Sigmoid() - x = Tensor([[0.0]]) # sigmoid(0) = 0.5 - y = sigmoid(x) - - np.testing.assert_array_almost_equal(safe_numpy(y), [[0.5]]) - - def test_sigmoid_range(self): - """Test Sigmoid output range.""" - sigmoid = Sigmoid() - x = Tensor([[-10.0, 0.0, 10.0]]) - y = sigmoid(x) - - # Should be in range [0, 1] - use reasonable bounds - assert np.all(safe_numpy(y) >= 0) - assert np.all(safe_numpy(y) <= 1) - # Check that extreme values are close to bounds - assert safe_numpy(y)[0][0] < 0.01 # Very small for -10 - assert safe_numpy(y)[0][2] > 0.99 # Very large for 10 - - def test_tanh_basic(self): - """Test 
Tanh activation function.""" - tanh = Tanh() - x = Tensor([[0.0]]) # tanh(0) = 0 - y = tanh(x) - - np.testing.assert_array_almost_equal(safe_numpy(y), [[0.0]]) - - def test_tanh_range(self): - """Test Tanh output range.""" - tanh = Tanh() - x = Tensor([[-10.0, 0.0, 10.0]]) - y = tanh(x) - - # Should be in range [-1, 1] - use reasonable bounds - assert np.all(safe_numpy(y) >= -1) - assert np.all(safe_numpy(y) <= 1) - # Check that extreme values are close to bounds - assert safe_numpy(y)[0][0] < -0.99 # Very negative for -10 - assert safe_numpy(y)[0][2] > 0.99 # Very positive for 10 - -class TestLayerComposition: - """Test composing layers into neural networks.""" - - def test_simple_network(self): - """Test a simple 2-layer network.""" - # 3 → 4 → 2 network - layer1 = Dense(input_size=3, output_size=4) - relu = ReLU() - layer2 = Dense(input_size=4, output_size=2) - sigmoid = Sigmoid() - - # Forward pass - x = Tensor([[1.0, 2.0, 3.0]]) - h1 = layer1(x) - h1_activated = relu(h1) - h2 = layer2(h1_activated) - output = sigmoid(h2) - - assert h1.shape == (1, 4) - assert h1_activated.shape == (1, 4) - assert h2.shape == (1, 2) - assert output.shape == (1, 2) - - # Output should be in sigmoid range - assert np.all(safe_numpy(output) >= 0) - assert np.all(safe_numpy(output) <= 1) - - def test_batch_network(self): - """Test network with batch processing.""" - layer1 = Dense(input_size=2, output_size=3) - relu = ReLU() - layer2 = Dense(input_size=3, output_size=1) - - # Batch of 4 examples - x = Tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]) - - h1 = layer1(x) - h1_activated = relu(h1) - output = layer2(h1_activated) - - assert output.shape == (4, 1) - - def test_deep_network(self): - """Test deeper network composition.""" - # 5-layer network - layers = [ - Dense(input_size=10, output_size=8), - ReLU(), - Dense(input_size=8, output_size=6), - ReLU(), - Dense(input_size=6, output_size=4), - ReLU(), - Dense(input_size=4, output_size=2), - Sigmoid() - ] - - x = 
Tensor([[1.0] * 10]) # 10 features - - # Forward pass through all layers - current = x - for layer in layers: - current = layer(current) - - assert current.shape == (1, 2) - # Final output should be in sigmoid range - assert np.all(safe_numpy(current) >= 0) - assert np.all(safe_numpy(current) <= 1) - -class TestEdgeCases: - """Test edge cases and error conditions.""" - - def test_zero_input(self): - """Test layers with zero input.""" - layer = Dense(input_size=3, output_size=2) - relu = ReLU() - - x = Tensor([[0.0, 0.0, 0.0]]) - y = layer(x) - y_relu = relu(y) - - assert y.shape == (1, 2) - assert y_relu.shape == (1, 2) - - def test_large_input(self): - """Test layers with large input values.""" - layer = Dense(input_size=2, output_size=1) - sigmoid = Sigmoid() - - x = Tensor([[1000.0, -1000.0]]) - y = layer(x) - y_sigmoid = sigmoid(y) - - # Should not overflow - assert not np.any(np.isnan(safe_numpy(y_sigmoid))) - assert not np.any(np.isinf(safe_numpy(y_sigmoid))) - - def test_single_neuron(self): - """Test single neuron layers.""" - layer = Dense(input_size=1, output_size=1) - x = Tensor([[5.0]]) - y = layer(x) - - assert y.shape == (1, 1) - -# Stretch goal tests (these will be skipped if methods don't exist) -class TestStretchGoals: - """Stretch goal tests for advanced features.""" - - @pytest.mark.skip(reason="Stretch goal: Weight initialization methods") - def test_weight_initialization_methods(self): - """Test different weight initialization strategies.""" - # Xavier initialization - layer_xavier = Dense(input_size=100, output_size=50, init_method='xavier') - weights_xavier = safe_numpy(layer_xavier.weights) - - # He initialization - layer_he = Dense(input_size=100, output_size=50, init_method='he') - weights_he = safe_numpy(layer_he.weights) - - # Check initialization ranges - xavier_limit = np.sqrt(6.0 / (100 + 50)) - assert np.all(np.abs(weights_xavier) <= xavier_limit) - - he_limit = np.sqrt(2.0 / 100) - assert np.std(weights_he) <= he_limit * 1.5 # Some 
tolerance - - @pytest.mark.skip(reason="Stretch goal: Layer parameter access") - def test_layer_parameters(self): - """Test accessing and modifying layer parameters.""" - layer = Dense(input_size=3, output_size=2) - - # Should be able to access parameters - assert hasattr(layer, 'parameters') - params = layer.parameters() - assert len(params) == 2 # weights and bias - - # Should be able to set parameters - new_weights = Tensor(np.ones((3, 2))) - layer.set_weights(new_weights) - np.testing.assert_array_equal(safe_numpy(layer.weights), safe_numpy(new_weights)) - - @pytest.mark.skip(reason="Stretch goal: Additional activation functions") - def test_additional_activations(self): - """Test additional activation functions.""" - # Leaky ReLU - leaky_relu = LeakyReLU(alpha=0.1) - x = Tensor([[-1.0, 0.0, 1.0]]) - y = leaky_relu(x) - expected = [[-0.1, 0.0, 1.0]] - np.testing.assert_array_almost_equal(safe_numpy(y), expected) - - # Softmax - softmax = Softmax() - x = Tensor([[1.0, 2.0, 3.0]]) - y = softmax(x) - # Should sum to 1 - assert np.allclose(np.sum(safe_numpy(y)), 1.0) - - @pytest.mark.skip(reason="Stretch goal: Dropout layer") - def test_dropout_layer(self): - """Test dropout layer implementation.""" - dropout = Dropout(p=0.5) - x = Tensor([[1.0, 2.0, 3.0, 4.0]]) - - # Training mode - dropout.train() - y_train = dropout(x) - - # Inference mode - dropout.eval() - y_eval = dropout(x) - - # In eval mode, should be same as input - np.testing.assert_array_equal(safe_numpy(y_eval), safe_numpy(x)) - - @pytest.mark.skip(reason="Stretch goal: Batch normalization") - def test_batch_normalization(self): - """Test batch normalization layer.""" - bn = BatchNorm1d(num_features=3) - x = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) - y = bn(x) - - # Should normalize across batch dimension - assert y.shape == x.shape - # Mean should be close to 0, std close to 1 - assert np.allclose(np.mean(safe_numpy(y), axis=0), 0.0, atol=1e-6) - assert np.allclose(np.std(safe_numpy(y), axis=0), 
1.0, atol=1e-6) \ No newline at end of file diff --git a/modules/source/04_networks/networks_dev.py b/modules/source/04_networks/networks_dev.py index a256bc9a..ff081724 100644 --- a/modules/source/04_networks/networks_dev.py +++ b/modules/source/04_networks/networks_dev.py @@ -524,19 +524,19 @@ wide = create_mlp(10, [50], 1) - **Efficiency:** Balance between performance and computation ### Different Activation Functions -```python + ```python # ReLU networks (most common) relu_net = create_mlp(10, [20], 1, activation=ReLU) - + # Tanh networks (centered around 0) tanh_net = create_mlp(10, [20], 1, activation=Tanh) - + # Multi-class classification classifier = create_mlp(10, [20], 3, output_activation=Softmax) -``` + ``` Let's test different architectures! -""" +""" # %% [markdown] """ @@ -560,7 +560,7 @@ try: classifier = create_mlp(input_size=3, hidden_sizes=[4], output_size=3, output_activation=Softmax) # Test with sample data - x = Tensor([[1.0, 2.0, 3.0]]) + x = Tensor([[1.0, 2.0, 3.0]]) # Test ReLU network y_relu = relu_net(x) @@ -575,9 +575,9 @@ try: # Test multi-class classifier y_multi = classifier(x) assert y_multi.shape == (1, 3), "Multi-class classifier should work" - - # Check softmax properties - assert abs(np.sum(y_multi.data) - 1.0) < 1e-6, "Softmax outputs should sum to 1" + + # Check softmax properties + assert abs(np.sum(y_multi.data) - 1.0) < 1e-6, "Softmax outputs should sum to 1" print("✅ Multi-class classifier with Softmax works correctly") # Test different architectures @@ -595,7 +595,7 @@ try: print("✅ All network architectures work correctly") -except Exception as e: + except Exception as e: print(f"❌ Architecture test failed: {e}") raise @@ -643,18 +643,18 @@ try: iris_classifier = create_mlp(input_size=4, hidden_sizes=[8, 6], output_size=3, output_activation=Softmax) # Simulate iris features: [sepal_length, sepal_width, petal_length, petal_width] - iris_samples = Tensor([ + iris_samples = Tensor([ [5.1, 3.5, 1.4, 0.2], # Setosa [7.0, 3.2, 
4.7, 1.4], # Versicolor [6.3, 3.3, 6.0, 2.5] # Virginica - ]) - - iris_predictions = iris_classifier(iris_samples) + ]) + + iris_predictions = iris_classifier(iris_samples) assert iris_predictions.shape == (3, 3), "Iris classifier should output 3 classes for 3 samples" - + # Check softmax properties - row_sums = np.sum(iris_predictions.data, axis=1) - assert np.allclose(row_sums, 1.0), "Each prediction should sum to 1" + row_sums = np.sum(iris_predictions.data, axis=1) + assert np.allclose(row_sums, 1.0), "Each prediction should sum to 1" print("✅ Multi-class classification works correctly") # Test 2: Regression Task (Housing prices) @@ -691,38 +691,38 @@ try: # Test 4: Network Composition print("\n4. Network Composition Test:") # Create a feature extractor and classifier separately - feature_extractor = Sequential([ + feature_extractor = Sequential([ Dense(input_size=10, output_size=5), - ReLU(), + ReLU(), Dense(input_size=5, output_size=3), - ReLU() - ]) - - classifier_head = Sequential([ + ReLU() + ]) + + classifier_head = Sequential([ Dense(input_size=3, output_size=2), - Softmax() - ]) - + Softmax() + ]) + # Test composition raw_data = Tensor(np.random.randn(5, 10)) - features = feature_extractor(raw_data) - final_predictions = classifier_head(features) + features = feature_extractor(raw_data) + final_predictions = classifier_head(features) assert features.shape == (5, 3), "Feature extractor should output 3 features" assert final_predictions.shape == (5, 2), "Classifier should output 2 classes" - - row_sums = np.sum(final_predictions.data, axis=1) + + row_sums = np.sum(final_predictions.data, axis=1) assert np.allclose(row_sums, 1.0), "Composed network predictions should be valid" print("✅ Network composition works correctly") - + print("\n🎉 Integration test passed! 
Your networks work correctly for:") - print(" • Multi-class classification (Iris flowers)") - print(" • Regression tasks (housing prices)") + print(" • Multi-class classification (Iris flowers)") + print(" • Regression tasks (housing prices)") print(" • Deep learning architectures") print(" • Network composition and feature extraction") - -except Exception as e: - print(f"❌ Integration test failed: {e}") + +except Exception as e: + print(f"❌ Integration test failed: {e}") raise print("📈 Final Progress: Complete network architectures ready for real ML applications!") diff --git a/modules/source/04_networks/networks_dev_backup.py b/modules/source/04_networks/networks_dev_backup.py new file mode 100644 index 00000000..79f8abbb --- /dev/null +++ b/modules/source/04_networks/networks_dev_backup.py @@ -0,0 +1,1418 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.17.1 +# --- + +# %% [markdown] +""" +# Module 4: Networks - Neural Network Architectures + +Welcome to the Networks module! This is where we compose layers into complete neural network architectures. + +## Learning Goals +- Understand networks as function composition: `f(x) = layer_n(...layer_2(layer_1(x)))` +- Build the Sequential network architecture for composing layers +- Create common network patterns like MLPs (Multi-Layer Perceptrons) +- Visualize network architectures and understand their capabilities +- Master forward pass inference through complete networks + +## Build → Use → Understand +1. **Build**: Sequential networks that compose layers into complete architectures +2. **Use**: Create different network patterns and run inference +3.
**Understand**: How architecture design affects network behavior and capability +""" + +# %% nbgrader={"grade": false, "grade_id": "networks-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| default_exp core.networks + +#| export +import numpy as np +import sys +import os +from typing import List, Union, Optional, Callable +import matplotlib.pyplot as plt +import matplotlib.patches as patches +from matplotlib.patches import FancyBboxPatch, ConnectionPatch +import seaborn as sns + +# Import all the building blocks we need - try package first, then local modules +try: + from tinytorch.core.tensor import Tensor + from tinytorch.core.layers import Dense + from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax +except ImportError: + # For development, import from local modules + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations')) + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers')) + from tensor_dev import Tensor + from activations_dev import ReLU, Sigmoid, Tanh, Softmax + from layers_dev import Dense + +# %% nbgrader={"grade": false, "grade_id": "networks-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| hide +#| export +def _should_show_plots(): + """Check if we should show plots (disable during testing)""" + # Check multiple conditions that indicate we're in test mode + is_pytest = ( + 'pytest' in sys.modules or + 'test' in sys.argv or + os.environ.get('PYTEST_CURRENT_TEST') is not None or + any('test' in arg for arg in sys.argv) or + any('pytest' in arg for arg in sys.argv) + ) + + # Show plots in development mode (when not in test mode) + return not is_pytest + +# %% nbgrader={"grade": false, "grade_id": "networks-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} +print("🔥 TinyTorch Networks Module") +print(f"NumPy 
version: {np.__version__}") +print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") +print("Ready to build neural network architectures!") + +# %% [markdown] +""" +## 📦 Where This Code Lives in the Final Package + +**Learning Side:** You work in `modules/source/04_networks/networks_dev.py` +**Building Side:** Code exports to `tinytorch.core.networks` + +```python +# Final package structure: +from tinytorch.core.networks import Sequential, MLP # Network architectures! +from tinytorch.core.layers import Dense, Conv2D # Building blocks +from tinytorch.core.activations import ReLU, Sigmoid, Tanh # Nonlinearity +from tinytorch.core.tensor import Tensor # Foundation +``` + +**Why this matters:** +- **Learning:** Focused modules for deep understanding +- **Production:** Proper organization like PyTorch's `torch.nn.Sequential` +- **Consistency:** All network architectures live together in `core.networks` +- **Integration:** Works seamlessly with layers, activations, and tensors +""" + +# %% [markdown] +""" +## 🧠 The Mathematical Foundation of Neural Networks + +### Function Composition at Scale +Neural networks are fundamentally about **function composition**: + +$$f(x) = f_n(f_{n-1}(\ldots f_2(f_1(x)) \ldots))$$ + +Each layer is a function, and the network is the composition of all these functions. 
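As a minimal sketch of this idea in plain Python (the toy functions below are hypothetical stand-ins, not TinyTorch layers), sequential composition is just a loop that feeds each function's output into the next:

```python
# Toy sketch of sequential function composition:
# compose(f1, f2, ..., fn)(x) == fn(...f2(f1(x))...)
def compose(*fns):
    def composed(x):
        for fn in fns:  # each function's output becomes the next one's input
            x = fn(x)
        return x
    return composed

double = lambda x: 2 * x
increment = lambda x: x + 1

net = compose(double, increment)  # net(x) = (2 * x) + 1
print(net(3))  # -> 7
```

This is the same pattern the `Sequential.forward` loop uses, with `Tensor`-valued layers in place of the toy functions.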
+ +### Why Function Composition is Powerful +- **Modularity**: Each layer has a specific purpose +- **Composability**: Simple functions combine to create complex behaviors +- **Universal approximation**: Deep compositions can approximate any continuous function +- **Hierarchical learning**: Early layers learn simple features, later layers learn complex patterns + +### The Architecture Design Space +Different arrangements of layers create different capabilities: +- **Depth**: More layers → more complex representations +- **Width**: More neurons per layer → more capacity per layer +- **Connections**: How layers connect affects information flow +- **Activation functions**: Add nonlinearity for complex patterns + +### Connection to Real ML Systems +Every framework uses sequential composition: +- **PyTorch**: `torch.nn.Sequential(layer1, layer2, layer3)` +- **TensorFlow**: `tf.keras.Sequential([layer1, layer2, layer3])` +- **JAX (Flax)**: `flax.linen.Sequential([layer1, layer2, layer3])` +- **TinyTorch**: `tinytorch.core.networks.Sequential([layer1, layer2, layer3])` (what we're building!) + +### Performance and Design Considerations +- **Forward pass efficiency**: Sequential computation through layers +- **Memory management**: Intermediate activations storage +- **Gradient flow**: How information flows backward (for training) +- **Architecture search**: Finding optimal network structures +""" + +# %% [markdown] +""" +## Step 1: What is a Network? + +### Definition +A **network** is a composition of layers that transforms input data into output predictions.
Think of it as a pipeline of transformations: + +``` +Input → Layer1 → Layer2 → Layer3 → Output +``` + +### The Mathematical Foundation: Function Composition Theory + +#### **Function Composition in Mathematics** +In mathematics, function composition combines simple functions to create complex ones: + +$$(f \circ g)(x) = f(g(x))$$ + +Neural network composition: +$$h(x) = f_n(f_{n-1}(\ldots f_2(f_1(x)) \ldots))$$ + +#### **Why Composition is Powerful** +1. **Modularity**: Each layer has a specific, well-defined purpose +2. **Composability**: Simple functions combine to create arbitrarily complex behaviors +3. **Hierarchical learning**: Early layers learn simple features, later layers learn complex patterns +4. **Universal approximation**: Deep compositions can approximate any continuous function + +#### **The Emergence of Intelligence** +Complex behavior emerges from simple layer composition: + +```python +# Example: Image classification +raw_pixels → [Edge detectors] → [Shape detectors] → [Object detectors] → [Class predictor] + ↓ ↓ ↓ ↓ ↓ + [28x28] [64 features] [128 features] [256 features] [10 classes] +``` + +### Architectural Design Principles + +#### **1. Depth vs. Width Trade-offs** +- **Deep networks**: More layers → more complex representations + - **Advantages**: Better feature hierarchies, parameter efficiency + - **Disadvantages**: Harder to train, gradient problems +- **Wide networks**: More neurons per layer → more capacity per layer + - **Advantages**: Easier to train, parallel computation + - **Disadvantages**: More parameters, potential overfitting + +#### **2. Information Flow Patterns** +```python +# Sequential flow (what we're building): +x → layer1 → layer2 → layer3 → output + +# Residual flow (advanced): +x → layer1 → layer2 + x → layer3 → output + +# Attention flow (transformers): +x → attention(x, x, x) → feedforward → output +``` + +#### **3. 
Activation Function Placement** +```python +# Standard pattern: +linear_transformation → nonlinear_activation → next_layer + +# Why this works: +# Linear + Linear = Linear (no increase in expressiveness) +# Linear + Nonlinear + Linear = Nonlinear (exponential increase in expressiveness) +``` + +### Real-World Architecture Examples + +#### **Multi-Layer Perceptron (MLP)** +```python +# Classic feedforward network +input → dense(512) → relu → dense(256) → relu → dense(10) → softmax +``` +- **Use cases**: Tabular data, feature learning, classification +- **Strengths**: Universal approximation, well-understood +- **Weaknesses**: Doesn't exploit spatial/temporal structure + +#### **Convolutional Neural Network (CNN)** +```python +# Exploits spatial structure +input → conv2d → relu → pool → conv2d → relu → pool → dense → softmax +``` +- **Use cases**: Image processing, computer vision +- **Strengths**: Translation invariance, parameter sharing +- **Weaknesses**: Fixed receptive field, not great for sequences + +#### **Recurrent Neural Network (RNN)** +```python +# Processes sequences +input_t → rnn_cell(hidden_{t-1}) → hidden_t → output_t +``` +- **Use cases**: Natural language processing, time series +- **Strengths**: Variable length sequences, memory +- **Weaknesses**: Sequential computation, gradient problems + +#### **Transformer** +```python +# Attention-based processing +input → attention → feedforward → attention → feedforward → output +``` +- **Use cases**: Language models, machine translation +- **Strengths**: Parallelizable, long-range dependencies +- **Weaknesses**: Quadratic complexity, large memory requirements + +### The Network Design Process + +#### **1. Problem Analysis** +- **Data type**: Images, text, tabular, time series? +- **Task type**: Classification, regression, generation? +- **Constraints**: Latency, memory, accuracy requirements? + +#### **2. 
Architecture Selection** +- **Start simple**: Begin with basic MLP +- **Add structure**: Incorporate domain-specific inductive biases +- **Scale up**: Increase depth/width as needed + +#### **3. Component Design** +- **Input layer**: Match data dimensions +- **Hidden layers**: Gradual dimension reduction typical +- **Output layer**: Match task requirements (classes, regression targets) +- **Activation functions**: ReLU for hidden, task-specific for output + +#### **4. Optimization Considerations** +- **Gradient flow**: Ensure gradients can flow through the network +- **Computational efficiency**: Balance expressiveness with speed +- **Memory usage**: Consider intermediate activation storage + +### Performance Characteristics + +#### **Forward Pass Complexity** +For a network with L layers, each with n neurons: +- **Time complexity**: O(L × n²) for dense layers +- **Space complexity**: O(L × n) for activations +- **Parallelization**: Each layer can be parallelized + +#### **Memory Management** +```python +# Memory usage during forward pass: +input_memory = batch_size × input_size +hidden_memory = batch_size × hidden_size × num_layers +output_memory = batch_size × output_size +total_memory = input_memory + hidden_memory + output_memory +``` + +#### **Computational Optimization** +- **Batch processing**: Process multiple samples simultaneously +- **Vectorization**: Use optimized matrix operations +- **Hardware acceleration**: Leverage GPUs/TPUs for parallel computation + +### Connection to Previous Modules + +#### **From Module 1 (Tensor)** +- **Data flow**: Tensors flow through the network +- **Shape management**: Ensure compatible dimensions between layers + +#### **From Module 2 (Activations)** +- **Nonlinearity**: Activation functions between layers enable complex learning +- **Function choice**: Different activations for different purposes + +#### **From Module 3 (Layers)** +- **Building blocks**: Layers are the fundamental components +- **Composition**: Networks 
compose layers into complete architectures + +### Why Networks Matter: The Scaling Laws + +#### **Empirical Observations** +- **More parameters**: Generally better performance (up to a point) +- **More data**: Enables training of larger networks +- **More compute**: Allows exploration of larger architectures + +#### **The Deep Learning Revolution** +```python +# Pre-2012: Shallow networks +input → hidden(100) → output + +# Post-2012: Deep networks +input → hidden(512) → hidden(512) → hidden(512) → ... → output +``` + +The key insight: **Depth enables hierarchical feature learning** + +Let's start building our Sequential network architecture! +""" + +# %% nbgrader={"grade": false, "grade_id": "sequential-class", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +class Sequential: + """ + Sequential Network: Composes layers in sequence + + The most fundamental network architecture. + Applies layers in order: f(x) = layer_n(...layer_2(layer_1(x))) + """ + + def __init__(self, layers: List): + """ + Initialize Sequential network with layers. + + Args: + layers: List of layers to compose in order + + TODO: Store the layers and implement forward pass + + APPROACH: + 1. Store the layers list as an instance variable + 2. This creates the network architecture ready for forward pass + + EXAMPLE: + Sequential([Dense(3,4), ReLU(), Dense(4,2)]) + creates a 3-layer network: Dense → ReLU → Dense + + HINTS: + - Store layers in self.layers + - This is the foundation for all network architectures + """ + ### BEGIN SOLUTION + self.layers = layers + ### END SOLUTION + + def forward(self, x: Tensor) -> Tensor: + """ + Forward pass through all layers in sequence. + + Args: + x: Input tensor + + Returns: + Output tensor after passing through all layers + + TODO: Implement sequential forward pass through all layers + + APPROACH: + 1. Start with the input tensor + 2. Apply each layer in sequence + 3. Each layer's output becomes the next layer's input + 4. 
Return the final output + + EXAMPLE: + Input: Tensor([[1, 2, 3]]) + Layer1 (Dense): Tensor([[1.4, 2.8]]) + Layer2 (ReLU): Tensor([[1.4, 2.8]]) + Layer3 (Dense): Tensor([[0.7]]) + Output: Tensor([[0.7]]) + + HINTS: + - Use a for loop: for layer in self.layers: + - Apply each layer: x = layer(x) + - The output of one layer becomes input to the next + - Return the final result + """ + ### BEGIN SOLUTION + # Apply each layer in sequence + for layer in self.layers: + x = layer(x) + return x + ### END SOLUTION + + def __call__(self, x: Tensor) -> Tensor: + """Make network callable: network(x) same as network.forward(x)""" + return self.forward(x) + +# %% [markdown] +""" +### 🧪 Unit Test: Sequential Network + +Let's test your Sequential network implementation! This is the foundation of all neural network architectures. + +**This is a unit test** - it tests one specific class (Sequential network) in isolation. +""" + +# %% nbgrader={"grade": true, "grade_id": "test-sequential-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +# Test Sequential network immediately after implementation +print("🔬 Unit Test: Sequential Network...") + +# Create a simple 2-layer network: 3 → 4 → 2 +try: + network = Sequential([ + Dense(input_size=3, output_size=4), + ReLU(), + Dense(input_size=4, output_size=2), + Sigmoid() + ]) + + print(f"Network created with {len(network.layers)} layers") + print("✅ Sequential network creation successful") + + # Test with sample data + x = Tensor([[1.0, 2.0, 3.0]]) + print(f"Input: {x}") + + # Forward pass + y = network(x) + print(f"Output: {y}") + print(f"Output shape: {y.shape}") + + # Verify the network works + assert y.shape == (1, 2), f"Expected shape (1, 2), got {y.shape}" + print("✅ Sequential network produces correct output shape") + + # Test that sigmoid output is in valid range + assert np.all(y.data >= 0) and np.all(y.data <= 1), "Sigmoid output should be between 0 and 1" + print("✅ Sequential network 
output is in valid range") + + # Test that layers are stored correctly + assert len(network.layers) == 4, f"Expected 4 layers, got {len(network.layers)}" + print("✅ Sequential network stores layers correctly") + +except Exception as e: + print(f"❌ Sequential network test failed: {e}") + raise + +# Show the network architecture +print("🎯 Sequential network behavior:") +print(" Applies layers in sequence: f(g(h(x)))") +print(" Input flows through each layer in order") +print(" Output of layer i becomes input of layer i+1") +print("📈 Progress: Sequential network ✓") + +# %% [markdown] +""" +## Step 2: Building Multi-Layer Perceptrons (MLPs) + +### What is an MLP? +A **Multi-Layer Perceptron** is the classic neural network architecture: + +``` +Input → Dense → Activation → Dense → Activation → ... → Dense → Output +``` + +### Why MLPs are Important +- **Universal approximation**: Can approximate any continuous function +- **Foundation**: Basis for understanding all neural networks +- **Versatile**: Works for classification, regression, and more +- **Simple**: Easy to understand and implement + +### MLP Architecture Pattern +``` +create_mlp(3, [4, 2], 1) creates: +Dense(3→4) → ReLU → Dense(4→2) → ReLU → Dense(2→1) → Sigmoid +``` + +### Real-World Applications +- **Tabular data**: Customer analytics, financial modeling +- **Feature learning**: Learning representations from raw data +- **Classification**: Spam detection, medical diagnosis +- **Regression**: Price prediction, time series forecasting +""" + +# %% nbgrader={"grade": false, "grade_id": "create-mlp", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def create_mlp(input_size: int, hidden_sizes: List[int], output_size: int, + activation=ReLU, output_activation=Sigmoid) -> Sequential: + """ + Create a Multi-Layer Perceptron (MLP) network. 
+ + Args: + input_size: Number of input features + hidden_sizes: List of hidden layer sizes + output_size: Number of output features + activation: Activation function for hidden layers (default: ReLU) + output_activation: Activation function for output layer (default: Sigmoid) + + Returns: + Sequential network with MLP architecture + + TODO: Implement MLP creation with alternating Dense and activation layers. + + APPROACH: + 1. Start with an empty list of layers + 2. Add layers in this pattern: + - Dense(input_size → first_hidden_size) + - Activation() + - Dense(first_hidden_size → second_hidden_size) + - Activation() + - ... + - Dense(last_hidden_size → output_size) + - Output_activation() + 3. Return Sequential(layers) + + EXAMPLE: + create_mlp(3, [4, 2], 1) creates: + Dense(3→4) → ReLU → Dense(4→2) → ReLU → Dense(2→1) → Sigmoid + + HINTS: + - Start with layers = [] + - Track current_size starting with input_size + - For each hidden_size: add Dense(current_size, hidden_size), then activation + - Finally add Dense(last_hidden_size, output_size), then output_activation + - Return Sequential(layers) + """ + ### BEGIN SOLUTION + layers = [] + current_size = input_size + + # Add hidden layers with activations + for hidden_size in hidden_sizes: + layers.append(Dense(current_size, hidden_size)) + layers.append(activation()) + current_size = hidden_size + + # Add output layer with output activation + layers.append(Dense(current_size, output_size)) + layers.append(output_activation()) + + return Sequential(layers) + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Unit Test: MLP Creation + +Let's test your MLP creation function! This builds complete neural networks with a single function call. + +**This is a unit test** - it tests one specific function (create_mlp) in isolation. 
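The alternating pattern `create_mlp` is expected to build can be sketched with plain strings standing in for the real layer objects (`mlp_layer_plan` is a hypothetical helper, for illustration only):

```python
# Sketch of the layer sequence create_mlp(3, [4, 2], 1) should produce,
# using strings as stand-ins for the real Dense/ReLU/Sigmoid objects.
def mlp_layer_plan(input_size, hidden_sizes, output_size):
    plan, current = [], input_size
    for hidden in hidden_sizes:          # one Dense + one activation per hidden layer
        plan.append(f"Dense({current}->{hidden})")
        plan.append("ReLU")
        current = hidden
    plan.append(f"Dense({current}->{output_size})")  # output layer
    plan.append("Sigmoid")
    return plan

print(mlp_layer_plan(3, [4, 2], 1))
# -> ['Dense(3->4)', 'ReLU', 'Dense(4->2)', 'ReLU', 'Dense(2->1)', 'Sigmoid']
```

Counting the entries explains the `expected_layers = 6` assertion in the test below: three Dense layers plus three activations.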
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-mlp-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +# Test MLP creation immediately after implementation +print("🔬 Unit Test: MLP Creation...") + +# Create a simple MLP: 3 → 4 → 2 → 1 +try: + mlp = create_mlp(input_size=3, hidden_sizes=[4, 2], output_size=1) + + print(f"MLP created with {len(mlp.layers)} layers") + print("✅ MLP creation successful") + + # Test the structure - should have 6 layers: Dense, ReLU, Dense, ReLU, Dense, Sigmoid + expected_layers = 6 # 3 Dense + 2 ReLU + 1 Sigmoid + assert len(mlp.layers) == expected_layers, f"Expected {expected_layers} layers, got {len(mlp.layers)}" + print("✅ MLP has correct number of layers") + + # Test with sample data + x = Tensor([[1.0, 2.0, 3.0]]) + y = mlp(x) + print(f"MLP input: {x}") + print(f"MLP output: {y}") + print(f"MLP output shape: {y.shape}") + + # Verify the output + assert y.shape == (1, 1), f"Expected shape (1, 1), got {y.shape}" + print("✅ MLP produces correct output shape") + + # Test that sigmoid output is in valid range + assert np.all(y.data >= 0) and np.all(y.data <= 1), "Sigmoid output should be between 0 and 1" + print("✅ MLP output is in valid range") + +except Exception as e: + print(f"❌ MLP creation test failed: {e}") + raise + +# Test different architectures +try: + # Test shallow network + shallow_net = create_mlp(input_size=3, hidden_sizes=[4], output_size=1) + assert len(shallow_net.layers) == 4, f"Shallow network should have 4 layers, got {len(shallow_net.layers)}" + + # Test deep network + deep_net = create_mlp(input_size=3, hidden_sizes=[4, 4, 4], output_size=1) + assert len(deep_net.layers) == 8, f"Deep network should have 8 layers, got {len(deep_net.layers)}" + + # Test wide network + wide_net = create_mlp(input_size=3, hidden_sizes=[10], output_size=1) + assert len(wide_net.layers) == 4, f"Wide network should have 4 layers, got {len(wide_net.layers)}" + + print("✅ Different MLP 
architectures work correctly") + +except Exception as e: + print(f"❌ MLP architecture test failed: {e}") + raise + +# Show the MLP pattern +print("🎯 MLP creation pattern:") +print(" Input → Dense → Activation → Dense → Activation → ... → Dense → Output_Activation") +print(" Automatically creates the complete architecture") +print(" Handles any number of hidden layers") +print("📈 Progress: Sequential network ✓, MLP creation ✓") +print("🚀 Complete neural networks ready!") + +# %% [markdown] +""" +### 🧪 Test Your Network Implementations + +Once you implement the functions above, run these cells to test them: +""" + +# %% nbgrader={"grade": true, "grade_id": "test-sequential", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test the Sequential network +print("Testing Sequential network...") + +# Create a simple 2-layer network: 3 → 4 → 2 +network = Sequential([ + Dense(input_size=3, output_size=4), + ReLU(), + Dense(input_size=4, output_size=2), + Sigmoid() +]) + +print(f"Network created with {len(network.layers)} layers") + +# Test with sample data +x = Tensor([[1.0, 2.0, 3.0]]) +print(f"Input: {x}") + +# Forward pass +y = network(x) +print(f"Output: {y}") +print(f"Output shape: {y.shape}") + +# Verify the network works +assert y.shape == (1, 2), f"Expected shape (1, 2), got {y.shape}" +assert np.all(y.data >= 0) and np.all(y.data <= 1), "Sigmoid output should be between 0 and 1" + +print("✅ Sequential network tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-mlp", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test MLP creation +print("Testing MLP creation...") + +# Create a simple MLP: 3 → 4 → 2 → 1 +mlp = create_mlp(input_size=3, hidden_sizes=[4, 2], output_size=1) + +print(f"MLP created with {len(mlp.layers)} layers") + +# Test the structure +expected_layers = [ + Dense, # 3 → 4 + ReLU, # activation + Dense, # 4 → 2 + ReLU, # activation + Dense, # 2 → 1 + Sigmoid # 
output activation +] + +assert len(mlp.layers) == 6, f"Expected 6 layers, got {len(mlp.layers)}" + +# Test with sample data +x = Tensor([[1.0, 2.0, 3.0]]) +y = mlp(x) +print(f"MLP output: {y}") +print(f"MLP output shape: {y.shape}") + +# Verify the output +assert y.shape == (1, 1), f"Expected shape (1, 1), got {y.shape}" +assert np.all(y.data >= 0) and np.all(y.data <= 1), "Sigmoid output should be between 0 and 1" + +print("✅ MLP creation tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-network-comparison", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test different network architectures +print("Testing different network architectures...") + +# Create networks with different architectures +shallow_net = create_mlp(input_size=3, hidden_sizes=[4], output_size=1) +deep_net = create_mlp(input_size=3, hidden_sizes=[4, 4, 4], output_size=1) +wide_net = create_mlp(input_size=3, hidden_sizes=[10], output_size=1) + +# Test input +x = Tensor([[1.0, 2.0, 3.0]]) + +# Test all networks +shallow_out = shallow_net(x) +deep_out = deep_net(x) +wide_out = wide_net(x) + +print(f"Shallow network output: {shallow_out}") +print(f"Deep network output: {deep_out}") +print(f"Wide network output: {wide_out}") + +# Verify all outputs are valid +for name, output in [("Shallow", shallow_out), ("Deep", deep_out), ("Wide", wide_out)]: + assert output.shape == (1, 1), f"{name} network output shape should be (1, 1), got {output.shape}" + assert np.all(output.data >= 0) and np.all(output.data <= 1), f"{name} network output should be between 0 and 1" + +print("✅ Network architecture comparison tests passed!") + +# %% [markdown] +""" +## 🎯 Module Summary + +Congratulations! 
You've successfully implemented complete neural network architectures: + +### What You've Accomplished +✅ **Sequential Networks**: The fundamental architecture for composing layers +✅ **Function Composition**: Understanding how layers combine to create complex behaviors +✅ **MLP Creation**: Building Multi-Layer Perceptrons with flexible architectures +✅ **Architecture Patterns**: Creating shallow, deep, and wide networks +✅ **Forward Pass**: Complete inference through multi-layer networks + +### Key Concepts You've Learned +- **Networks are function composition**: Complex behavior from simple building blocks +- **Sequential architecture**: The foundation of most neural networks +- **MLP patterns**: Dense → Activation → Dense → Activation → Output +- **Architecture design**: How depth and width affect network capability +- **Forward pass**: How data flows through complete networks + +### Mathematical Foundations +- **Function composition**: f(x) = f_n(...f_2(f_1(x))) +- **Universal approximation**: MLPs can approximate any continuous function +- **Hierarchical learning**: Early layers learn simple features, later layers learn complex patterns +- **Nonlinearity**: Activation functions enable complex decision boundaries + +### Real-World Applications +- **Classification**: Image recognition, spam detection, medical diagnosis +- **Regression**: Price prediction, time series forecasting +- **Feature learning**: Extracting meaningful representations from raw data +- **Transfer learning**: Using pre-trained networks for new tasks + +### Next Steps +1. **Export your code**: `tito package nbdev --export 04_networks` +2. **Test your implementation**: `tito module test 04_networks` +3. 
**Use your networks**: + ```python + from tinytorch.core.networks import Sequential, create_mlp + from tinytorch.core.layers import Dense + from tinytorch.core.activations import ReLU + + # Create custom network + network = Sequential([Dense(10, 5), ReLU(), Dense(5, 1)]) + + # Create MLP + mlp = create_mlp(10, [20, 10], 1) + ``` +4. **Move to Module 5**: Start building convolutional networks for images! + +**Ready for the next challenge?** Let's add convolutional layers for image processing and build CNNs! +""" + +# %% [markdown] +""" +## 🧪 Comprehensive Testing: Neural Network Architectures + +Let's thoroughly test your network implementations to ensure they work correctly in all scenarios. +This comprehensive testing ensures your networks are robust and ready for real ML applications. +""" + +# %% nbgrader={"grade": true, "grade_id": "test-networks-comprehensive", "locked": true, "points": 30, "schema_version": 3, "solution": false, "task": false} +def test_networks_comprehensive(): + """Comprehensive test of Sequential networks and MLP creation.""" + print("🔬 Testing neural network architectures comprehensively...") + + tests_passed = 0 + total_tests = 10 + + # Test 1: Sequential Network Creation and Structure + try: + # Create a simple 2-layer network + network = Sequential([ + Dense(input_size=3, output_size=4), + ReLU(), + Dense(input_size=4, output_size=2), + Sigmoid() + ]) + + assert len(network.layers) == 4, f"Expected 4 layers, got {len(network.layers)}" + + # Test layer types + assert isinstance(network.layers[0], Dense), "First layer should be Dense" + assert isinstance(network.layers[1], ReLU), "Second layer should be ReLU" + assert isinstance(network.layers[2], Dense), "Third layer should be Dense" + assert isinstance(network.layers[3], Sigmoid), "Fourth layer should be Sigmoid" + + print("✅ Sequential network creation and structure") + tests_passed += 1 + except Exception as e: + print(f"❌ Sequential network creation failed: {e}") + + # Test 2: 
Sequential Network Forward Pass + try: + network = Sequential([ + Dense(input_size=3, output_size=4), + ReLU(), + Dense(input_size=4, output_size=2), + Sigmoid() + ]) + + # Test single sample + x_single = Tensor([[1.0, 2.0, 3.0]]) + y_single = network(x_single) + + assert y_single.shape == (1, 2), f"Single sample output should be (1, 2), got {y_single.shape}" + assert np.all((y_single.data >= 0) & (y_single.data <= 1)), "Sigmoid output should be in [0,1]" + + # Test batch processing + x_batch = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]) + y_batch = network(x_batch) + + assert y_batch.shape == (3, 2), f"Batch output should be (3, 2), got {y_batch.shape}" + assert np.all((y_batch.data >= 0) & (y_batch.data <= 1)), "All batch outputs should be in [0,1]" + + print("✅ Sequential network forward pass: single and batch") + tests_passed += 1 + except Exception as e: + print(f"❌ Sequential network forward pass failed: {e}") + + # Test 3: MLP Creation Basic Functionality + try: + # Create simple MLP: 3 → 4 → 2 → 1 + mlp = create_mlp(input_size=3, hidden_sizes=[4, 2], output_size=1) + + # Should have 6 layers: Dense, ReLU, Dense, ReLU, Dense, Sigmoid + expected_layers = 6 + assert len(mlp.layers) == expected_layers, f"Expected {expected_layers} layers, got {len(mlp.layers)}" + + # Test layer pattern + layer_types = [type(layer).__name__ for layer in mlp.layers] + expected_pattern = ['Dense', 'ReLU', 'Dense', 'ReLU', 'Dense', 'Sigmoid'] + assert layer_types == expected_pattern, f"Expected pattern {expected_pattern}, got {layer_types}" + + # Test forward pass + x = Tensor([[1.0, 2.0, 3.0]]) + y = mlp(x) + + assert y.shape == (1, 1), f"MLP output should be (1, 1), got {y.shape}" + assert np.all((y.data >= 0) & (y.data <= 1)), "MLP output should be in [0,1]" + + print("✅ MLP creation basic functionality") + tests_passed += 1 + except Exception as e: + print(f"❌ MLP creation basic failed: {e}") + + # Test 4: Different MLP Architectures + try: + # Test shallow 
network (1 hidden layer) + shallow_net = create_mlp(input_size=3, hidden_sizes=[4], output_size=1) + assert len(shallow_net.layers) == 4, f"Shallow network should have 4 layers, got {len(shallow_net.layers)}" + + # Test deep network (3 hidden layers) + deep_net = create_mlp(input_size=3, hidden_sizes=[4, 4, 4], output_size=1) + assert len(deep_net.layers) == 8, f"Deep network should have 8 layers, got {len(deep_net.layers)}" + + # Test wide network (1 large hidden layer) + wide_net = create_mlp(input_size=3, hidden_sizes=[20], output_size=1) + assert len(wide_net.layers) == 4, f"Wide network should have 4 layers, got {len(wide_net.layers)}" + + # Test very deep network + very_deep_net = create_mlp(input_size=3, hidden_sizes=[5, 5, 5, 5, 5], output_size=1) + assert len(very_deep_net.layers) == 12, f"Very deep network should have 12 layers, got {len(very_deep_net.layers)}" + + # Test all networks work + x = Tensor([[1.0, 2.0, 3.0]]) + for name, net in [("Shallow", shallow_net), ("Deep", deep_net), ("Wide", wide_net), ("Very Deep", very_deep_net)]: + y = net(x) + assert y.shape == (1, 1), f"{name} network output shape should be (1, 1), got {y.shape}" + assert np.all((y.data >= 0) & (y.data <= 1)), f"{name} network output should be in [0,1]" + + print("✅ Different MLP architectures: shallow, deep, wide, very deep") + tests_passed += 1 + except Exception as e: + print(f"❌ Different MLP architectures failed: {e}") + + # Test 5: MLP with Different Activation Functions + try: + # Test with Tanh activation + mlp_tanh = create_mlp(input_size=3, hidden_sizes=[4], output_size=1, activation=Tanh, output_activation=Sigmoid) + + # Check layer types + layer_types = [type(layer).__name__ for layer in mlp_tanh.layers] + expected_pattern = ['Dense', 'Tanh', 'Dense', 'Sigmoid'] + assert layer_types == expected_pattern, f"Tanh MLP pattern should be {expected_pattern}, got {layer_types}" + + # Test forward pass + x = Tensor([[1.0, 2.0, 3.0]]) + y = mlp_tanh(x) + assert y.shape == (1, 
1), "Tanh MLP should work correctly"
+        
+        # Test a different output activation: Softmax with ReLU hidden layers
+        mlp_softmax_out = create_mlp(input_size=3, hidden_sizes=[4], output_size=3, activation=ReLU, output_activation=Softmax)
+        y_multi = mlp_softmax_out(x)
+        assert y_multi.shape == (1, 3), "Multi-output MLP should work"
+        
+        # Check softmax properties
+        assert abs(np.sum(y_multi.data) - 1.0) < 1e-6, "Softmax outputs should sum to 1"
+        
+        print("✅ MLP with different activation functions: Tanh, Softmax")
+        tests_passed += 1
+    except Exception as e:
+        print(f"❌ MLP with different activations failed: {e}")
+    
+    # Test 6: Network Layer Composition
+    try:
+        # Test that the network correctly chains layers
+        network = Sequential([
+            Dense(input_size=4, output_size=3),
+            ReLU(),
+            Dense(input_size=3, output_size=2),
+            Tanh(),
+            Dense(input_size=2, output_size=1),
+            Sigmoid()
+        ])
+        
+        x = Tensor([[1.0, -1.0, 2.0, -2.0]])
+        
+        # Manual forward pass to verify composition
+        h1 = network.layers[0](x)   # Dense
+        h2 = network.layers[1](h1)  # ReLU
+        h3 = network.layers[2](h2)  # Dense
+        h4 = network.layers[3](h3)  # Tanh
+        h5 = network.layers[4](h4)  # Dense
+        h6 = network.layers[5](h5)  # Sigmoid
+        
+        # Compare with network forward pass
+        y_network = network(x)
+        
+        assert np.allclose(h6.data, y_network.data), "Manual and network forward pass should match"
+        
+        # Check intermediate shapes
+        assert h1.shape == (1, 3), f"h1 shape should be (1, 3), got {h1.shape}"
+        assert h2.shape == (1, 3), f"h2 shape should be (1, 3), got {h2.shape}"
+        assert h3.shape == (1, 2), f"h3 shape should be (1, 2), got {h3.shape}"
+        assert h4.shape == (1, 2), f"h4 shape should be (1, 2), got {h4.shape}"
+        assert h5.shape == (1, 1), f"h5 shape should be (1, 1), got {h5.shape}"
+        assert h6.shape == (1, 1), f"h6 shape should be (1, 1), got {h6.shape}"
+        
+        # Check activation effects
+        assert np.all(h2.data >= 0), "ReLU should produce non-negative values"
+        assert np.all((h4.data >= -1) & (h4.data <= 1)), "Tanh should produce values in [-1,1]"
+        assert 
np.all((h6.data >= 0) & (h6.data <= 1)), "Sigmoid should produce values in [0,1]" + + print("✅ Network layer composition: correct chaining and shapes") + tests_passed += 1 + except Exception as e: + print(f"❌ Network layer composition failed: {e}") + + # Test 7: Edge Cases and Robustness + try: + # Test with minimal network (1 layer) + minimal_net = Sequential([Dense(input_size=2, output_size=1)]) + x_minimal = Tensor([[1.0, 2.0]]) + y_minimal = minimal_net(x_minimal) + assert y_minimal.shape == (1, 1), "Minimal network should work" + + # Test with single neuron layers + single_neuron_net = create_mlp(input_size=1, hidden_sizes=[1], output_size=1) + x_single = Tensor([[5.0]]) + y_single_neuron = single_neuron_net(x_single) + assert y_single_neuron.shape == (1, 1), "Single neuron network should work" + + # Test with large batch + large_net = create_mlp(input_size=10, hidden_sizes=[5], output_size=1) + x_large_batch = Tensor(np.random.randn(100, 10)) + y_large_batch = large_net(x_large_batch) + assert y_large_batch.shape == (100, 1), "Large batch should work" + assert not np.any(np.isnan(y_large_batch.data)), "Should not produce NaN" + assert not np.any(np.isinf(y_large_batch.data)), "Should not produce Inf" + + print("✅ Edge cases: minimal networks, single neurons, large batches") + tests_passed += 1 + except Exception as e: + print(f"❌ Edge cases failed: {e}") + + # Test 8: Multi-class Classification Networks + try: + # Create multi-class classifier + classifier = create_mlp(input_size=4, hidden_sizes=[8, 6], output_size=3, output_activation=Softmax) + + # Test with batch of samples + x_multi = Tensor(np.random.randn(5, 4)) + y_multi = classifier(x_multi) + + assert y_multi.shape == (5, 3), f"Multi-class output should be (5, 3), got {y_multi.shape}" + + # Check softmax properties for each sample + row_sums = np.sum(y_multi.data, axis=1) + assert np.allclose(row_sums, 1.0), "Each sample should have probabilities summing to 1" + assert np.all(y_multi.data > 0), "All 
probabilities should be positive" + + # Test that argmax gives valid class predictions + predictions = np.argmax(y_multi.data, axis=1) + assert np.all((predictions >= 0) & (predictions < 3)), "Predictions should be valid class indices" + + print("✅ Multi-class classification: softmax probabilities, valid predictions") + tests_passed += 1 + except Exception as e: + print(f"❌ Multi-class classification failed: {e}") + + # Test 9: Real ML Scenarios + try: + # Scenario 1: Binary classification (like spam detection) + spam_classifier = create_mlp(input_size=100, hidden_sizes=[50, 20], output_size=1, output_activation=Sigmoid) + + # Simulate email features + email_features = Tensor(np.random.randn(10, 100)) + spam_probabilities = spam_classifier(email_features) + + assert spam_probabilities.shape == (10, 1), "Spam classifier should output probabilities for each email" + assert np.all((spam_probabilities.data >= 0) & (spam_probabilities.data <= 1)), "Should output valid probabilities" + + # Scenario 2: Image classification (like MNIST) + mnist_classifier = create_mlp(input_size=784, hidden_sizes=[256, 128], output_size=10, output_activation=Softmax) + + # Simulate flattened images + images = Tensor(np.random.randn(32, 784)) # Batch of 32 images + class_probabilities = mnist_classifier(images) + + assert class_probabilities.shape == (32, 10), "MNIST classifier should output 10 class probabilities" + + # Check softmax properties + batch_sums = np.sum(class_probabilities.data, axis=1) + assert np.allclose(batch_sums, 1.0), "Each image should have class probabilities summing to 1" + + # Scenario 3: Regression (like house price prediction) + price_predictor = Sequential([ + Dense(input_size=8, output_size=16), + ReLU(), + Dense(input_size=16, output_size=8), + ReLU(), + Dense(input_size=8, output_size=1) # No activation for regression + ]) + + # Simulate house features + house_features = Tensor(np.random.randn(5, 8)) + predicted_prices = price_predictor(house_features) + + 
assert predicted_prices.shape == (5, 1), "Price predictor should output one price per house"
+        
+        print("✅ Real ML scenarios: spam detection, image classification, price prediction")
+        tests_passed += 1
+    except Exception as e:
+        print(f"❌ Real ML scenarios failed: {e}")
+    
+    # Test 10: Network Comparison and Analysis
+    try:
+        # Compare architectures with similar (but not identical) parameter budgets
+        x_test = Tensor([[1.0, 2.0, 3.0, 4.0]])
+        
+        # Wide network: 4 → 20 → 1 (parameters: 4*20 + 20 + 20*1 + 1 = 121)
+        wide_network = create_mlp(input_size=4, hidden_sizes=[20], output_size=1)
+        
+        # Deep network: 4 → 10 → 10 → 1 (parameters: 4*10 + 10 + 10*10 + 10 + 10*1 + 1 = 171)
+        deep_network = create_mlp(input_size=4, hidden_sizes=[10, 10], output_size=1)
+        
+        # Test both networks
+        wide_output = wide_network(x_test)
+        deep_output = deep_network(x_test)
+        
+        assert wide_output.shape == (1, 1), "Wide network should produce correct output"
+        assert deep_output.shape == (1, 1), "Deep network should produce correct output"
+        
+        # Both should be valid but potentially different
+        assert np.all((wide_output.data >= 0) & (wide_output.data <= 1)), "Wide network output should be valid"
+        assert np.all((deep_output.data >= 0) & (deep_output.data <= 1)), "Deep network output should be valid"
+        
+        # Test network complexity
+        def count_parameters(network):
+            total = 0
+            for layer in network.layers:
+                if isinstance(layer, Dense):
+                    total += layer.weights.size
+                    if layer.bias is not None:
+                        total += layer.bias.size
+            return total
+        
+        wide_params = count_parameters(wide_network)
+        deep_params = count_parameters(deep_network)
+        
+        assert wide_params > 0, "Wide network should have parameters"
+        assert deep_params > 0, "Deep network should have parameters"
+        
+        print(f"✅ Network comparison: wide ({wide_params} params) vs deep ({deep_params} params)")
+        tests_passed += 1
+    except Exception as e:
+        print(f"❌ Network comparison failed: {e}")
+    
+    # Results summary
+    print(f"\n📊 Networks Module 
Results: {tests_passed}/{total_tests} tests passed") + + if tests_passed == total_tests: + print("🎉 All network tests passed! Your implementations support:") + print(" • Sequential networks: layer composition and chaining") + print(" • MLP creation: flexible multi-layer perceptron architectures") + print(" • Different architectures: shallow, deep, wide networks") + print(" • Multiple activation functions: ReLU, Tanh, Sigmoid, Softmax") + print(" • Multi-class classification: softmax probability distributions") + print(" • Real ML scenarios: spam detection, image classification, regression") + print(" • Network analysis: parameter counting and architecture comparison") + print("📈 Progress: All Network Functionality ✓") + return True + else: + print("⚠️ Some network tests failed. Common issues:") + print(" • Check Sequential class layer composition") + print(" • Verify create_mlp function layer creation pattern") + print(" • Ensure proper activation function integration") + print(" • Test forward pass through complete networks") + print(" • Verify shape handling across all layers") + return False + +# Run the comprehensive test +success = test_networks_comprehensive() + +# %% [markdown] +""" +### 🧪 Integration Test: Complete Neural Network Applications + +Let's test your networks in realistic machine learning applications. 
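The "networks are function composition" idea that these integration tests exercise can be sketched without any TinyTorch machinery. The sketch below is illustrative only — `dense`, `relu`, `sigmoid`, and `sequential` are plain-list stand-ins for this module's Tensor-based `Dense`, `ReLU`, `Sigmoid`, and `Sequential`, with hand-picked weights:

```python
import math

def dense(weights, bias):
    # Return a layer function computing W @ x + b for a list input x.
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return layer

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def sequential(layers):
    # Compose layers left to right: network(x) = f_n(...f_2(f_1(x))).
    def network(x):
        for layer in layers:
            x = layer(x)
        return x
    return network

# A tiny 2 -> 2 network with fixed weights
net = sequential([
    dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    relu,
    sigmoid,
])
out = net([2.0, 1.0])
print(out)  # both values lie in (0, 1), as the tests expect of Sigmoid output
```

Here `sequential` plays the role of this module's `Sequential` class: the network is nothing more than each layer's output fed into the next.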
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-networks-integration", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} +def test_networks_integration(): + """Integration test with complete neural network applications.""" + print("🔬 Testing networks in complete ML applications...") + + try: + print("🧠 Building complete ML applications with neural networks...") + + # Application 1: Iris Classification + print("\n🌸 Application 1: Iris Classification (Multi-class)") + iris_classifier = create_mlp( + input_size=4, # 4 flower measurements + hidden_sizes=[8, 6], # Hidden layers + output_size=3, # 3 iris species + output_activation=Softmax + ) + + # Simulate iris data + iris_samples = Tensor([ + [5.1, 3.5, 1.4, 0.2], # Setosa-like + [7.0, 3.2, 4.7, 1.4], # Versicolor-like + [6.3, 3.3, 6.0, 2.5] # Virginica-like + ]) + + iris_predictions = iris_classifier(iris_samples) + + assert iris_predictions.shape == (3, 3), "Should predict 3 classes for 3 samples" + + # Check that predictions are valid probabilities + row_sums = np.sum(iris_predictions.data, axis=1) + assert np.allclose(row_sums, 1.0), "Each prediction should sum to 1" + + # Get predicted classes + predicted_classes = np.argmax(iris_predictions.data, axis=1) + print(f" Predicted classes: {predicted_classes}") + print(f" Confidence scores: {np.max(iris_predictions.data, axis=1)}") + + print("✅ Iris classification: valid multi-class predictions") + + # Application 2: Housing Price Prediction + print("\n🏠 Application 2: Housing Price Prediction (Regression)") + price_predictor = Sequential([ + Dense(input_size=8, output_size=16), # 8 house features + ReLU(), + Dense(input_size=16, output_size=8), + ReLU(), + Dense(input_size=8, output_size=1) # 1 price output (no activation for regression) + ]) + + # Simulate house features: [size, bedrooms, bathrooms, age, location_score, etc.] 
+ house_data = Tensor([ + [2000, 3, 2, 5, 8.5, 1, 0, 1], # Large, new house + [1200, 2, 1, 20, 6.0, 0, 1, 0], # Small, older house + [1800, 3, 2, 10, 7.5, 1, 0, 0] # Medium house + ]) + + predicted_prices = price_predictor(house_data) + + assert predicted_prices.shape == (3, 1), "Should predict 1 price for each house" + assert not np.any(np.isnan(predicted_prices.data)), "Prices should not be NaN" + + print(f" Predicted prices: {predicted_prices.data.flatten()}") + print("✅ Housing price prediction: valid regression outputs") + + # Application 3: Sentiment Analysis + print("\n💭 Application 3: Sentiment Analysis (Binary Classification)") + sentiment_analyzer = create_mlp( + input_size=100, # 100 text features (like TF-IDF) + hidden_sizes=[50, 25], # Deep network for text + output_size=1, # Binary sentiment (positive/negative) + output_activation=Sigmoid + ) + + # Simulate text features for different reviews + review_features = Tensor(np.random.randn(5, 100)) # 5 reviews + sentiment_scores = sentiment_analyzer(review_features) + + assert sentiment_scores.shape == (5, 1), "Should predict sentiment for each review" + assert np.all((sentiment_scores.data >= 0) & (sentiment_scores.data <= 1)), "Sentiment scores should be probabilities" + + # Convert to sentiment labels + sentiment_labels = (sentiment_scores.data > 0.5).astype(int) + print(f" Sentiment predictions: {sentiment_labels.flatten()}") + print(f" Confidence scores: {sentiment_scores.data.flatten()}") + + print("✅ Sentiment analysis: valid binary classification") + + # Application 4: MNIST-like Digit Recognition + print("\n🔢 Application 4: Digit Recognition (Image Classification)") + digit_classifier = create_mlp( + input_size=784, # 28x28 flattened images + hidden_sizes=[256, 128, 64], # Deep network for images + output_size=10, # 10 digits (0-9) + output_activation=Softmax + ) + + # Simulate flattened digit images + digit_images = Tensor(np.random.randn(8, 784)) # 8 digit images + digit_predictions = 
digit_classifier(digit_images) + + assert digit_predictions.shape == (8, 10), "Should predict 10 classes for each image" + + # Check softmax properties + row_sums = np.sum(digit_predictions.data, axis=1) + assert np.allclose(row_sums, 1.0), "Each prediction should sum to 1" + + # Get predicted digits + predicted_digits = np.argmax(digit_predictions.data, axis=1) + confidence_scores = np.max(digit_predictions.data, axis=1) + + print(f" Predicted digits: {predicted_digits}") + print(f" Confidence scores: {confidence_scores}") + + print("✅ Digit recognition: valid multi-class image classification") + + # Application 5: Network Architecture Comparison + print("\n📊 Application 5: Architecture Comparison Study") + + # Create different architectures for same task + architectures = { + "Shallow": create_mlp(4, [16], 3, output_activation=Softmax), + "Medium": create_mlp(4, [12, 8], 3, output_activation=Softmax), + "Deep": create_mlp(4, [8, 8, 8], 3, output_activation=Softmax), + "Wide": create_mlp(4, [24], 3, output_activation=Softmax) + } + + # Test all architectures on same data + test_data = Tensor([[1.0, 2.0, 3.0, 4.0]]) + + for name, network in architectures.items(): + prediction = network(test_data) + assert prediction.shape == (1, 3), f"{name} network should output 3 classes" + assert abs(np.sum(prediction.data) - 1.0) < 1e-6, f"{name} network should output valid probabilities" + + # Count parameters + param_count = sum(layer.weights.size + (layer.bias.size if hasattr(layer, 'bias') and layer.bias is not None else 0) + for layer in network.layers if hasattr(layer, 'weights')) + + print(f" {name} network: {param_count} parameters, prediction: {prediction.data.flatten()}") + + print("✅ Architecture comparison: all networks work with different complexities") + + # Application 6: Transfer Learning Simulation + print("\n🔄 Application 6: Transfer Learning Simulation") + + # Create "pre-trained" feature extractor + feature_extractor = Sequential([ + Dense(input_size=100, 
output_size=50), + ReLU(), + Dense(input_size=50, output_size=25), + ReLU() + ]) + + # Create task-specific classifier + classifier_head = Sequential([ + Dense(input_size=25, output_size=10), + ReLU(), + Dense(input_size=10, output_size=2), + Softmax() + ]) + + # Simulate transfer learning pipeline + raw_data = Tensor(np.random.randn(3, 100)) + + # Extract features + features = feature_extractor(raw_data) + assert features.shape == (3, 25), "Feature extractor should output 25 features" + + # Classify using extracted features + final_predictions = classifier_head(features) + assert final_predictions.shape == (3, 2), "Classifier should output 2 classes" + + row_sums = np.sum(final_predictions.data, axis=1) + assert np.allclose(row_sums, 1.0), "Transfer learning predictions should be valid" + + print("✅ Transfer learning simulation: modular network composition") + + print("\n🎉 Integration test passed! Your networks work correctly in:") + print(" • Multi-class classification (Iris flowers)") + print(" • Regression tasks (housing prices)") + print(" • Binary classification (sentiment analysis)") + print(" • Image classification (digit recognition)") + print(" • Architecture comparison studies") + print(" • Transfer learning scenarios") + print("📈 Progress: Networks ready for real ML applications!") + + return True + + except Exception as e: + print(f"❌ Integration test failed: {e}") + print("\n💡 This suggests an issue with:") + print(" • Network architecture composition") + print(" • Forward pass through complete networks") + print(" • Shape compatibility between layers") + print(" • Activation function integration") + print(" • Check your Sequential and create_mlp implementations") + return False + +# Run the integration test +success = test_networks_integration() and success + +# Print final summary +print(f"\n{'='*60}") +print("🎯 NETWORKS MODULE TESTING COMPLETE") +print(f"{'='*60}") + +if success: + print("🎉 CONGRATULATIONS! 
All network tests passed!") + print("\n✅ Your networks module successfully implements:") + print(" • Sequential networks: flexible layer composition") + print(" • MLP creation: automated multi-layer perceptron building") + print(" • Architecture flexibility: shallow, deep, wide networks") + print(" • Multiple activations: ReLU, Tanh, Sigmoid, Softmax") + print(" • Real ML applications: classification, regression, image recognition") + print(" • Network analysis: parameter counting and architecture comparison") + print(" • Transfer learning: modular network composition") + print("\n🚀 You're ready to tackle any neural network architecture!") + print("📈 Final Progress: Networks Module ✓ COMPLETE") +else: + print("⚠️ Some tests failed. Please review the error messages above.") + print("\n🔧 To fix issues:") + print(" 1. Check your Sequential class implementation") + print(" 2. Verify create_mlp function layer creation") + print(" 3. Ensure proper forward pass through all layers") + print(" 4. Test shape compatibility between layers") + print(" 5. Verify activation function integration") + print("\n💪 Keep building! These networks are the foundation of modern AI.") + +# %% [markdown] +""" +## 🎯 Module Summary + +Congratulations! 
You've successfully implemented complete neural network architectures: + +### What You've Accomplished +✅ **Sequential Networks**: The fundamental architecture for composing layers +✅ **Function Composition**: Understanding how layers combine to create complex behaviors +✅ **MLP Creation**: Building Multi-Layer Perceptrons with flexible architectures +✅ **Architecture Patterns**: Creating shallow, deep, and wide networks +✅ **Forward Pass**: Complete inference through multi-layer networks + +### Key Concepts You've Learned +- **Networks are function composition**: Complex behavior from simple building blocks +- **Sequential architecture**: The foundation of most neural networks +- **MLP patterns**: Dense → Activation → Dense → Activation → Output +- **Architecture design**: How depth and width affect network capability +- **Forward pass**: How data flows through complete networks + +### Mathematical Foundations +- **Function composition**: f(x) = f_n(...f_2(f_1(x))) +- **Universal approximation**: MLPs can approximate any continuous function +- **Hierarchical learning**: Early layers learn simple features, later layers learn complex patterns +- **Nonlinearity**: Activation functions enable complex decision boundaries + +### Real-World Applications +- **Classification**: Image recognition, spam detection, medical diagnosis +- **Regression**: Price prediction, time series forecasting +- **Feature learning**: Extracting meaningful representations from raw data +- **Transfer learning**: Using pre-trained networks for new tasks + +### Next Steps +1. **Export your code**: `tito package nbdev --export 04_networks` +2. **Test your implementation**: `tito module test 04_networks` +3. 
**Use your networks**: + ```python + from tinytorch.core.networks import Sequential, create_mlp + from tinytorch.core.layers import Dense + from tinytorch.core.activations import ReLU + + # Create custom network + network = Sequential([Dense(10, 5), ReLU(), Dense(5, 1)]) + + # Create MLP + mlp = create_mlp(10, [20, 10], 1) + ``` +4. **Move to Module 5**: Start building convolutional networks for images! + +**Ready for the next challenge?** Let's add convolutional layers for image processing and build CNNs! +""" \ No newline at end of file diff --git a/modules/source/04_networks/tests/test_networks.py b/modules/source/04_networks/tests/test_networks.py deleted file mode 100644 index 2ebcbfe0..00000000 --- a/modules/source/04_networks/tests/test_networks.py +++ /dev/null @@ -1,453 +0,0 @@ -""" -Tests for the Networks module. - -Tests network composition, visualization, and practical applications. -""" - -import pytest -import numpy as np -import sys -from pathlib import Path - -# Add the project root to the path -project_root = Path(__file__).parent.parent.parent.parent -sys.path.insert(0, str(project_root)) - -# Import the modules we're testing -from tinytorch.core.tensor import Tensor -from tinytorch.core.layers import Dense -from tinytorch.core.activations import ReLU, Sigmoid, Tanh - -# Import the networks module -try: - # Import from the exported package - from tinytorch.core.networks import ( - Sequential, - create_mlp - ) - # These functions may not be implemented yet - use fallback - try: - from tinytorch.core.networks import ( - create_classification_network, - create_regression_network, - visualize_network_architecture, - visualize_data_flow, - compare_networks, - analyze_network_behavior - ) - except ImportError: - # Create mock functions for missing functionality - def create_classification_network(*args, **kwargs): - """Mock implementation for testing""" - return create_mlp(*args, **kwargs) - - def create_regression_network(*args, **kwargs): - """Mock 
implementation for testing""" - return create_mlp(*args, **kwargs) - - def visualize_network_architecture(*args, **kwargs): - """Mock implementation for testing""" - return "Network visualization placeholder" - - def visualize_data_flow(*args, **kwargs): - """Mock implementation for testing""" - return "Data flow visualization placeholder" - - def compare_networks(*args, **kwargs): - """Mock implementation for testing""" - return "Network comparison placeholder" - - def analyze_network_behavior(*args, **kwargs): - """Mock implementation for testing""" - return "Network behavior analysis placeholder" - -except ImportError: - # Fallback for when module isn't exported yet - sys.path.append(str(project_root / "modules" / "source" / "04_networks")) - from networks_dev import ( - Sequential, - create_mlp, - create_classification_network, - create_regression_network, - visualize_network_architecture, - visualize_data_flow, - compare_networks, - analyze_network_behavior - ) - - -class TestSequentialNetwork: - """Test the Sequential network class.""" - - def test_sequential_initialization(self): - """Test Sequential network initialization.""" - layers = [Dense(3, 4), ReLU(), Dense(4, 2), Sigmoid()] - network = Sequential(layers) - - assert len(network.layers) == 4 - assert isinstance(network.layers[0], Dense) - assert isinstance(network.layers[1], ReLU) - assert isinstance(network.layers[2], Dense) - assert isinstance(network.layers[3], Sigmoid) - - def test_sequential_forward_pass(self): - """Test Sequential network forward pass.""" - network = Sequential([ - Dense(3, 4), - ReLU(), - Dense(4, 2), - Sigmoid() - ]) - - x = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) - output = network(x) - - assert output.shape == (2, 2) - assert isinstance(output, Tensor) - # Sigmoid output should be between 0 and 1 - assert np.all(output.data >= 0) and np.all(output.data <= 1) - - def test_sequential_callable(self): - """Test that Sequential network is callable.""" - network = 
Sequential([Dense(2, 3), ReLU()]) - x = Tensor([[1.0, 2.0]]) - - # Test both forward() and __call__() - output1 = network.forward(x) - output2 = network(x) - - assert np.allclose(output1.data, output2.data) - - def test_empty_sequential(self): - """Test Sequential network with no layers.""" - network = Sequential([]) - x = Tensor([[1.0, 2.0, 3.0]]) - - # Should return input unchanged - output = network(x) - assert np.allclose(output.data, x.data) - - -class TestMLPCreation: - """Test MLP creation functions.""" - - def test_create_mlp_basic(self): - """Test basic MLP creation.""" - mlp = create_mlp(input_size=3, hidden_sizes=[4], output_size=2) - - assert len(mlp.layers) == 4 # Dense + ReLU + Dense + Sigmoid - assert isinstance(mlp.layers[0], Dense) - assert mlp.layers[0].input_size == 3 - assert mlp.layers[0].output_size == 4 - assert isinstance(mlp.layers[1], ReLU) - assert isinstance(mlp.layers[2], Dense) - assert mlp.layers[2].input_size == 4 - assert mlp.layers[2].output_size == 2 - assert isinstance(mlp.layers[3], Sigmoid) - - def test_create_mlp_multiple_hidden(self): - """Test MLP creation with multiple hidden layers.""" - mlp = create_mlp(input_size=10, hidden_sizes=[16, 8, 4], output_size=3) - - assert len(mlp.layers) == 8 # 3 Dense + 3 ReLU + 1 Dense + 1 Sigmoid - - # Check Dense layers - dense_layers = [layer for layer in mlp.layers if isinstance(layer, Dense)] - assert len(dense_layers) == 4 - - assert dense_layers[0].input_size == 10 - assert dense_layers[0].output_size == 16 - assert dense_layers[1].input_size == 16 - assert dense_layers[1].output_size == 8 - assert dense_layers[2].input_size == 8 - assert dense_layers[2].output_size == 4 - assert dense_layers[3].input_size == 4 - assert dense_layers[3].output_size == 3 - - def test_create_mlp_no_hidden(self): - """Test MLP creation with no hidden layers.""" - mlp = create_mlp(input_size=5, hidden_sizes=[], output_size=2) - - assert len(mlp.layers) == 2 # Dense + Sigmoid - assert 
isinstance(mlp.layers[0], Dense) - assert mlp.layers[0].input_size == 5 - assert mlp.layers[0].output_size == 2 - assert isinstance(mlp.layers[1], Sigmoid) - - def test_create_mlp_custom_activation(self): - """Test MLP creation with custom activation functions.""" - mlp = create_mlp( - input_size=3, - hidden_sizes=[4], - output_size=2, - activation=Tanh, - output_activation=Tanh - ) - - assert len(mlp.layers) == 4 - assert isinstance(mlp.layers[1], Tanh) # Hidden activation - assert isinstance(mlp.layers[3], Tanh) # Output activation - - -class TestSpecializedNetworks: - """Test specialized network creation functions.""" - - def test_create_classification_network(self): - """Test classification network creation.""" - classifier = create_classification_network( - input_size=100, - num_classes=5, - hidden_sizes=[32, 16] - ) - - assert len(classifier.layers) == 6 # Dense(100→32) + ReLU + Dense(32→16) + ReLU + Dense(16→5) + Softmax - - # Check output layer - dense_layers = [layer for layer in classifier.layers if isinstance(layer, Dense)] - assert dense_layers[-1].output_size == 5 - # Should use Softmax for multi-class classification - from tinytorch.core.activations import Softmax - assert isinstance(classifier.layers[-1], Softmax) - - def test_create_classification_network_default(self): - """Test classification network with default hidden sizes.""" - classifier = create_classification_network(input_size=50, num_classes=3) - - # Should use default hidden size of input_size // 2 - expected_hidden = 50 // 2 - dense_layers = [layer for layer in classifier.layers if isinstance(layer, Dense)] - assert dense_layers[0].output_size == expected_hidden - assert dense_layers[1].output_size == 3 - - def test_create_regression_network(self): - """Test regression network creation.""" - regressor = create_regression_network( - input_size=13, - output_size=1, - hidden_sizes=[8, 4] - ) - - assert len(regressor.layers) == 6 # Dense(13→8) + ReLU + Dense(8→4) + ReLU + Dense(4→1) + Tanh 
- - # Check output layer - dense_layers = [layer for layer in regressor.layers if isinstance(layer, Dense)] - assert dense_layers[-1].output_size == 1 - assert isinstance(regressor.layers[-1], Tanh) - - def test_create_regression_network_default(self): - """Test regression network with default parameters.""" - regressor = create_regression_network(input_size=20) - - # Should use default output_size=1 and hidden_size=input_size//2 - expected_hidden = 20 // 2 - dense_layers = [layer for layer in regressor.layers if isinstance(layer, Dense)] - assert dense_layers[0].output_size == expected_hidden - assert dense_layers[1].output_size == 1 - - -class TestNetworkBehavior: - """Test network behavior and functionality.""" - - def test_network_shape_transformations(self): - """Test that networks properly transform tensor shapes.""" - network = Sequential([ - Dense(3, 4), - ReLU(), - Dense(4, 2), - Sigmoid() - ]) - - x = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) - output = network(x) - - assert x.shape == (2, 3) - assert output.shape == (2, 2) - - def test_network_activations(self): - """Test that activation functions are properly applied.""" - network = Sequential([ - Dense(2, 3), - ReLU(), - Dense(3, 1), - Sigmoid() - ]) - - x = Tensor([[-1.0, 1.0]]) - output = network(x) - - # ReLU should zero out negative values - # Sigmoid should output values between 0 and 1 - assert np.all(output.data >= 0) and np.all(output.data <= 1) - - def test_network_parameter_count(self): - """Test that networks have the expected number of parameters.""" - network = Sequential([ - Dense(3, 4), # 3*4 + 4 = 16 parameters - ReLU(), - Dense(4, 2), # 4*2 + 2 = 10 parameters - Sigmoid() - ]) - - # Count parameters (weights + biases) - total_params = 0 - for layer in network.layers: - if hasattr(layer, 'weights'): - total_params += layer.weights.data.size - if hasattr(layer, 'bias') and layer.bias is not None: - total_params += layer.bias.data.size - - assert total_params == 26 # 16 + 10 - - -class 
TestVisualizationFunctions: - """Test visualization functions (basic functionality, not visual output).""" - - def test_visualize_network_architecture_exists(self): - """Test that visualization function exists and is callable.""" - network = Sequential([Dense(3, 4), ReLU(), Dense(4, 2), Sigmoid()]) - - # Should not raise an error - try: - visualize_network_architecture(network, "Test Network") - except Exception as e: - pytest.fail(f"visualize_network_architecture raised {e}") - - def test_visualize_data_flow_exists(self): - """Test that data flow visualization function exists and is callable.""" - network = Sequential([Dense(3, 4), ReLU(), Dense(4, 2), Sigmoid()]) - x = Tensor([[1.0, 2.0, 3.0]]) - - # Should not raise an error - try: - visualize_data_flow(network, x, "Test Data Flow") - except Exception as e: - pytest.fail(f"visualize_data_flow raised {e}") - - def test_compare_networks_exists(self): - """Test that network comparison function exists and is callable.""" - network1 = Sequential([Dense(3, 4), ReLU(), Dense(4, 2), Sigmoid()]) - network2 = Sequential([Dense(3, 8), ReLU(), Dense(8, 2), Sigmoid()]) - x = Tensor([[1.0, 2.0, 3.0]]) - - # Should not raise an error - try: - compare_networks([network1, network2], ["Small", "Large"], x, "Test Comparison") - except Exception as e: - pytest.fail(f"compare_networks raised {e}") - - def test_analyze_network_behavior_exists(self): - """Test that behavior analysis function exists and is callable.""" - network = Sequential([Dense(3, 4), ReLU(), Dense(4, 2), Sigmoid()]) - x = Tensor([[1.0, 2.0, 3.0]]) - - # Should not raise an error - try: - analyze_network_behavior(network, x, "Test Behavior") - except Exception as e: - pytest.fail(f"analyze_network_behavior raised {e}") - - -class TestPracticalApplications: - """Test practical network applications.""" - - def test_digit_classification_network(self): - """Test creating a network for digit classification.""" - classifier = create_classification_network( - 
input_size=784, # 28x28 image - num_classes=10, # 10 digits - hidden_sizes=[128, 64] - ) - - # Test with fake image data - fake_image = Tensor(np.random.randn(1, 784).astype(np.float32)) - output = classifier(fake_image) - - assert output.shape == (1, 10) - assert np.all(output.data >= 0) and np.all(output.data <= 1) - # Should sum to approximately 1 (probability distribution) - assert np.abs(np.sum(output.data) - 1.0) < 0.1 - - def test_sentiment_analysis_network(self): - """Test creating a network for sentiment analysis.""" - classifier = create_classification_network( - input_size=100, # 100-dimensional embeddings - num_classes=2, # Positive/Negative - hidden_sizes=[32, 16] - ) - - # Test with fake text embeddings - fake_embeddings = Tensor(np.random.randn(1, 100).astype(np.float32)) - output = classifier(fake_embeddings) - - assert output.shape == (1, 2) - assert np.all(output.data >= 0) and np.all(output.data <= 1) - - def test_house_price_prediction_network(self): - """Test creating a network for house price prediction.""" - regressor = create_regression_network( - input_size=13, # 13 house features - output_size=1, # 1 price prediction - hidden_sizes=[8, 4] - ) - - # Test with fake house features - fake_features = Tensor(np.random.randn(1, 13).astype(np.float32)) - output = regressor(fake_features) - - assert output.shape == (1, 1) - # Tanh output should be between -1 and 1 - assert np.all(output.data >= -1) and np.all(output.data <= 1) - - -class TestNetworkIntegration: - """Test integration with other modules.""" - - def test_network_with_tensor_operations(self): - """Test that networks work with tensor operations.""" - network = Sequential([Dense(3, 4), ReLU(), Dense(4, 2), Sigmoid()]) - - # Create input using tensor operations - x1 = Tensor([[1.0, 2.0, 3.0]]) - x2 = Tensor([[4.0, 5.0, 6.0]]) - x_combined = Tensor(np.vstack([x1.data, x2.data])) - - output = network(x_combined) - assert output.shape == (2, 2) - - def 
test_network_with_activations_module(self): - """Test that networks properly use activations from the activations module.""" - # This test ensures we're using the activations from the activations module - # rather than re-implementing them - network = Sequential([ - Dense(2, 3), - ReLU(), # From activations module - Dense(3, 1), - Sigmoid() # From activations module - ]) - - x = Tensor([[-1.0, 1.0]]) - output = network(x) - - # Test that activations work correctly - assert np.all(output.data >= 0) and np.all(output.data <= 1) - - def test_network_with_layers_module(self): - """Test that networks properly use layers from the layers module.""" - # This test ensures we're using the Dense layers from the layers module - network = Sequential([ - Dense(3, 4), # From layers module - ReLU(), - Dense(4, 2), # From layers module - Sigmoid() - ]) - - x = Tensor([[1.0, 2.0, 3.0]]) - output = network(x) - - # Test that layers work correctly - assert output.shape == (1, 2) - - -if __name__ == "__main__": - # Run the tests - pytest.main([__file__, "-v"]) \ No newline at end of file diff --git a/modules/source/05_cnn/cnn_dev.py b/modules/source/05_cnn/cnn_dev.py index 39004edf..0600f9ff 100644 --- a/modules/source/05_cnn/cnn_dev.py +++ b/modules/source/05_cnn/cnn_dev.py @@ -607,50 +607,50 @@ try: print("\n1. 
Simple CNN Pipeline Test:") # Create pipeline: Conv2D → ReLU → Flatten → Dense - conv = Conv2D(kernel_size=(2, 2)) - relu = ReLU() - dense = Dense(input_size=4, output_size=3) - - # Input image - image = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) - - # Forward pass + conv = Conv2D(kernel_size=(2, 2)) + relu = ReLU() + dense = Dense(input_size=4, output_size=3) + + # Input image + image = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) + + # Forward pass features = conv(image) # (3,3) → (2,2) activated = relu(features) # (2,2) → (2,2) flattened = flatten(activated) # (2,2) → (1,4) output = dense(flattened) # (1,4) → (1,3) - - assert features.shape == (2, 2), f"Conv output shape wrong: {features.shape}" - assert activated.shape == (2, 2), f"ReLU output shape wrong: {activated.shape}" - assert flattened.shape == (1, 4), f"Flatten output shape wrong: {flattened.shape}" - assert output.shape == (1, 3), f"Dense output shape wrong: {output.shape}" - + + assert features.shape == (2, 2), f"Conv output shape wrong: {features.shape}" + assert activated.shape == (2, 2), f"ReLU output shape wrong: {activated.shape}" + assert flattened.shape == (1, 4), f"Flatten output shape wrong: {flattened.shape}" + assert output.shape == (1, 3), f"Dense output shape wrong: {output.shape}" + print("✅ Simple CNN pipeline works correctly") # Test 2: Multi-layer CNN print("\n2. 
Multi-layer CNN Test:") # Create deeper pipeline: Conv2D → ReLU → Conv2D → ReLU → Flatten → Dense - conv1 = Conv2D(kernel_size=(2, 2)) - relu1 = ReLU() - conv2 = Conv2D(kernel_size=(2, 2)) - relu2 = ReLU() + conv1 = Conv2D(kernel_size=(2, 2)) + relu1 = ReLU() + conv2 = Conv2D(kernel_size=(2, 2)) + relu2 = ReLU() dense_multi = Dense(input_size=9, output_size=2) - - # Larger input for multi-layer processing + + # Larger input for multi-layer processing large_image = Tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]]) - - # Forward pass + + # Forward pass h1 = conv1(large_image) # (5,5) → (4,4) h2 = relu1(h1) # (4,4) → (4,4) h3 = conv2(h2) # (4,4) → (3,3) h4 = relu2(h3) # (3,3) → (3,3) h5 = flatten(h4) # (3,3) → (1,9) output_multi = dense_multi(h5) # (1,9) → (1,2) - - assert h1.shape == (4, 4), f"Conv1 output wrong: {h1.shape}" - assert h3.shape == (3, 3), f"Conv2 output wrong: {h3.shape}" - assert h5.shape == (1, 9), f"Flatten output wrong: {h5.shape}" + + assert h1.shape == (4, 4), f"Conv1 output wrong: {h1.shape}" + assert h3.shape == (3, 3), f"Conv2 output wrong: {h3.shape}" + assert h5.shape == (1, 9), f"Flatten output wrong: {h5.shape}" assert output_multi.shape == (1, 2), f"Final output wrong: {output_multi.shape}" print("✅ Multi-layer CNN works correctly") @@ -667,22 +667,22 @@ try: [0, 1, 1, 0, 0, 1, 1, 0], [0, 0, 1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0, 1, 1]]) - - # CNN for digit classification + + # CNN for digit classification feature_extractor = Conv2D(kernel_size=(3, 3)) # (8,8) → (6,6) - activation = ReLU() - classifier = Dense(input_size=36, output_size=10) # 10 digit classes - - # Forward pass - features = feature_extractor(digit_image) - activated_features = activation(features) + activation = ReLU() + classifier = Dense(input_size=36, output_size=10) # 10 digit classes + + # Forward pass + features = feature_extractor(digit_image) + activated_features = activation(features) feature_vector = 
flatten(activated_features) - digit_scores = classifier(feature_vector) - - assert features.shape == (6, 6), f"Feature extraction shape wrong: {features.shape}" - assert feature_vector.shape == (1, 36), f"Feature vector shape wrong: {feature_vector.shape}" - assert digit_scores.shape == (1, 10), f"Digit scores shape wrong: {digit_scores.shape}" - + digit_scores = classifier(feature_vector) + + assert features.shape == (6, 6), f"Feature extraction shape wrong: {features.shape}" + assert feature_vector.shape == (1, 36), f"Feature vector shape wrong: {feature_vector.shape}" + assert digit_scores.shape == (1, 10), f"Digit scores shape wrong: {digit_scores.shape}" + print("✅ Image classification scenario works correctly") # Test 4: Feature Extraction and Composition diff --git a/modules/source/05_cnn/cnn_dev_backup.py b/modules/source/05_cnn/cnn_dev_backup.py new file mode 100644 index 00000000..9ab7314e --- /dev/null +++ b/modules/source/05_cnn/cnn_dev_backup.py @@ -0,0 +1,1173 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.17.1 +# --- + +# %% [markdown] +""" +# Module 5: CNN - Convolutional Neural Networks + +Welcome to the CNN module! Here you'll implement the core building block of modern computer vision: the convolutional layer. + +## Learning Goals +- Understand the convolution operation and its importance in computer vision +- Implement Conv2D with explicit for-loops to understand the sliding window mechanism +- Build convolutional layers that can detect spatial patterns in images +- Compose Conv2D with other layers to build complete convolutional networks +- See how convolution enables parameter sharing and translation invariance + +## Build → Use → Understand +1. **Build**: Conv2D layer using sliding window convolution from scratch +2. **Use**: Transform images and see feature maps emerge +3. 
**Understand**: How CNNs learn hierarchical spatial patterns +""" + +# %% nbgrader={"grade": false, "grade_id": "cnn-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| default_exp core.cnn + +#| export +import numpy as np +import os +import sys +from typing import List, Tuple, Optional +import matplotlib.pyplot as plt + +# Import from the main package - try package first, then local modules +try: + from tinytorch.core.tensor import Tensor + from tinytorch.core.layers import Dense + from tinytorch.core.activations import ReLU +except ImportError: + # For development, import from local modules + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations')) + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers')) + from tensor_dev import Tensor + from activations_dev import ReLU + from layers_dev import Dense + +# %% nbgrader={"grade": false, "grade_id": "cnn-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| hide +#| export +def _should_show_plots(): + """Check if we should show plots (disable during testing)""" + # Check multiple conditions that indicate we're in test mode + is_pytest = ( + 'pytest' in sys.modules or + 'test' in sys.argv or + os.environ.get('PYTEST_CURRENT_TEST') is not None or + any('test' in arg for arg in sys.argv) or + any('pytest' in arg for arg in sys.argv) + ) + + # Show plots in development mode (when not in test mode) + return not is_pytest + +# %% nbgrader={"grade": false, "grade_id": "cnn-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} +print("🔥 TinyTorch CNN Module") +print(f"NumPy version: {np.__version__}") +print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") +print("Ready to build convolutional neural networks!") + +# %% [markdown] +""" +## 📦 Where This Code Lives in the Final Package + 
+
+**Learning Side:** You work in `modules/source/05_cnn/cnn_dev.py`
+**Building Side:** Code exports to `tinytorch.core.cnn`
+
+```python
+# Final package structure:
+from tinytorch.core.cnn import Conv2D, conv2d_naive, flatten  # CNN operations!
+from tinytorch.core.layers import Dense  # Fully connected layers
+from tinytorch.core.activations import ReLU  # Nonlinearity
+from tinytorch.core.tensor import Tensor  # Foundation
+```
+
+**Why this matters:**
+- **Learning:** Focused modules for deep understanding of convolution
+- **Production:** Proper organization like PyTorch's `torch.nn.Conv2d`
+- **Consistency:** All CNN operations live together in `core.cnn`
+- **Integration:** Works seamlessly with other TinyTorch components
+"""
+
+# %% [markdown]
+"""
+## 🧠 The Mathematical Foundation of Convolution
+
+### The Convolution Operation
+Convolution is a mathematical operation that combines two functions to produce a third:
+
+```
+(f * g)(t) = ∫ f(τ)g(t - τ)dτ
+```
+
+In discrete 2D computer vision, this becomes:
+```
+(I * K)[i,j] = ΣΣ I[i+m, j+n] × K[m,n]
+```
+
+(Strictly speaking, this index pattern is cross-correlation: the kernel is not flipped as in the textbook definition of convolution. Every major deep learning framework uses this convention and still calls it "convolution", and so do we.)
+
+### Why Convolution is Perfect for Images
+- **Local connectivity**: Each output depends only on a small region of input
+- **Weight sharing**: Same filter applied everywhere (translation invariance)
+- **Spatial hierarchy**: Multiple layers build increasingly complex features
+- **Parameter efficiency**: Far fewer parameters than fully connected layers
+
+### The Three Core Principles
+1. **Sparse connectivity**: Each neuron connects to only a small region
+2. **Parameter sharing**: Same weights used across all spatial locations
+3. 
**Equivariant representation**: If the input shifts, the output shifts correspondingly
+
+### Connection to Real ML Systems
+Every vision framework uses convolution:
+- **PyTorch**: `torch.nn.Conv2d` with optimized CUDA kernels
+- **TensorFlow**: `tf.keras.layers.Conv2D` with cuDNN acceleration
+- **JAX**: `jax.lax.conv_general_dilated` with XLA compilation
+- **TinyTorch**: `tinytorch.core.cnn.Conv2D` (what we're building!)
+
+### Performance Considerations
+- **Memory layout**: Efficient data access patterns
+- **Vectorization**: SIMD operations for parallel computation
+- **Cache efficiency**: Spatial locality in memory access
+- **Optimization**: im2col, FFT-based convolution, Winograd algorithm
+"""
+
+# %% [markdown]
+"""
+## Step 1: Understanding Convolution
+
+### What is Convolution?
+A **convolutional layer** applies a small filter (kernel) across the input, producing a feature map. This operation captures local patterns and is the foundation of modern vision models.
+
+### Why Convolution Matters in Computer Vision
+- **Local connectivity**: Each output value depends only on a small region of the input
+- **Weight sharing**: The same filter is applied everywhere (translation invariance)
+- **Spatial hierarchy**: Multiple layers build increasingly complex features
+- **Parameter efficiency**: Far fewer parameters than fully connected layers
+
+### The Fundamental Insight
+**Convolution is pattern matching!** The kernel learns to detect specific patterns:
+- **Edge detectors**: Find boundaries between objects
+- **Texture detectors**: Recognize surface patterns
+- **Shape detectors**: Identify geometric forms
+- **Feature detectors**: Combine simple patterns into complex features
+
+### Real-World Examples
+- **Image processing**: Detect edges, blur, sharpen
+- **Computer vision**: Recognize objects, faces, text
+- **Medical imaging**: Detect tumors, analyze scans
+- **Autonomous driving**: Identify traffic signs, pedestrians
+
+### Visual Intuition
+```
+Input 
Image: Kernel: Output Feature Map: +[1, 2, 3] [1, 0] [1*1+2*0+4*0+5*(-1), 2*1+3*0+5*0+6*(-1)] +[4, 5, 6] [0, -1] [4*1+5*0+7*0+8*(-1), 5*1+6*0+8*0+9*(-1)] +[7, 8, 9] +``` + +The kernel slides across the input, computing dot products at each position. + +Let's implement this step by step! +""" + +# %% nbgrader={"grade": false, "grade_id": "conv2d-naive", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def conv2d_naive(input: np.ndarray, kernel: np.ndarray) -> np.ndarray: + """ + Naive 2D convolution (single channel, no stride, no padding). + + Args: + input: 2D input array (H, W) + kernel: 2D filter (kH, kW) + Returns: + 2D output array (H-kH+1, W-kW+1) + + TODO: Implement the sliding window convolution using for-loops. + + APPROACH: + 1. Get input dimensions: H, W = input.shape + 2. Get kernel dimensions: kH, kW = kernel.shape + 3. Calculate output dimensions: out_H = H - kH + 1, out_W = W - kW + 1 + 4. Create output array: np.zeros((out_H, out_W)) + 5. Use nested loops to slide the kernel: + - i loop: output rows (0 to out_H-1) + - j loop: output columns (0 to out_W-1) + - di loop: kernel rows (0 to kH-1) + - dj loop: kernel columns (0 to kW-1) + 6. 
For each (i,j), compute: output[i,j] += input[i+di, j+dj] * kernel[di, dj] + + EXAMPLE: + Input: [[1, 2, 3], Kernel: [[1, 0], + [4, 5, 6], [0, -1]] + [7, 8, 9]] + + Output[0,0] = 1*1 + 2*0 + 4*0 + 5*(-1) = 1 - 5 = -4 + Output[0,1] = 2*1 + 3*0 + 5*0 + 6*(-1) = 2 - 6 = -4 + Output[1,0] = 4*1 + 5*0 + 7*0 + 8*(-1) = 4 - 8 = -4 + Output[1,1] = 5*1 + 6*0 + 8*0 + 9*(-1) = 5 - 9 = -4 + + HINTS: + - Start with output = np.zeros((out_H, out_W)) + - Use four nested loops: for i in range(out_H): for j in range(out_W): for di in range(kH): for dj in range(kW): + - Accumulate the sum: output[i,j] += input[i+di, j+dj] * kernel[di, dj] + """ + ### BEGIN SOLUTION + # Get input and kernel dimensions + H, W = input.shape + kH, kW = kernel.shape + + # Calculate output dimensions + out_H, out_W = H - kH + 1, W - kW + 1 + + # Initialize output array + output = np.zeros((out_H, out_W), dtype=input.dtype) + + # Sliding window convolution with four nested loops + for i in range(out_H): + for j in range(out_W): + for di in range(kH): + for dj in range(kW): + output[i, j] += input[i + di, j + dj] * kernel[di, dj] + + return output + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Quick Test: Convolution Operation + +Let's test your convolution implementation right away! This is the core operation that powers computer vision. 
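If you want to check the worked example from the TODO outside the graded cells, the sketch below does so standalone. The helper names `conv2d_loops` and `conv2d_slices` are illustrative, not part of the module: the first mirrors the four-loop implementation, the second is an equivalent slice-based formulation of the same sliding window, used only as a cross-check.

```python
# Standalone sanity check for the sliding-window convolution above.
# conv2d_loops mirrors conv2d_naive; conv2d_slices is an equivalent
# vectorized formulation used only to cross-check the loop version.
import numpy as np

def conv2d_loops(inp, kernel):
    H, W = inp.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1), dtype=inp.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(inp[i:i + kH, j:j + kW] * kernel)
    return out

def conv2d_slices(inp, kernel):
    kH, kW = kernel.shape
    out_H, out_W = inp.shape[0] - kH + 1, inp.shape[1] - kW + 1
    out = np.zeros((out_H, out_W), dtype=inp.dtype)
    for di in range(kH):
        for dj in range(kW):
            # Each kernel entry weights a shifted view of the input.
            out += kernel[di, dj] * inp[di:di + out_H, dj:dj + out_W]
    return out

img = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
k = np.array([[1, 0], [0, -1]], dtype=np.float32)
result = conv2d_loops(img, k)  # every entry is -4, as in the worked example
assert np.allclose(result, conv2d_slices(img, k))
```

Both formulations agree because they accumulate the same input-times-kernel products, just grouped differently: per output pixel vs. per kernel entry.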
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-conv2d-naive-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +# Test conv2d_naive function immediately after implementation +print("🔬 Testing convolution operation...") + +# Test simple 3x3 input with 2x2 kernel +try: + input_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32) + kernel_array = np.array([[1, 0], [0, 1]], dtype=np.float32) # Identity-like kernel + + result = conv2d_naive(input_array, kernel_array) + expected = np.array([[6, 8], [12, 14]], dtype=np.float32) # 1+5, 2+6, 4+8, 5+9 + + print(f"Input:\n{input_array}") + print(f"Kernel:\n{kernel_array}") + print(f"Result:\n{result}") + print(f"Expected:\n{expected}") + + assert np.allclose(result, expected), f"Convolution failed: expected {expected}, got {result}" + print("✅ Simple convolution test passed") + +except Exception as e: + print(f"❌ Simple convolution test failed: {e}") + raise + +# Test edge detection kernel +try: + input_array = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]], dtype=np.float32) + edge_kernel = np.array([[-1, -1], [-1, 3]], dtype=np.float32) # Edge detection + + result = conv2d_naive(input_array, edge_kernel) + expected = np.array([[0, 0], [0, 0]], dtype=np.float32) # Uniform region = no edges + + assert np.allclose(result, expected), f"Edge detection failed: expected {expected}, got {result}" + print("✅ Edge detection test passed") + +except Exception as e: + print(f"❌ Edge detection test failed: {e}") + raise + +# Test output shape +try: + input_5x5 = np.random.randn(5, 5).astype(np.float32) + kernel_3x3 = np.random.randn(3, 3).astype(np.float32) + + result = conv2d_naive(input_5x5, kernel_3x3) + expected_shape = (3, 3) # 5-3+1 = 3 + + assert result.shape == expected_shape, f"Output shape wrong: expected {expected_shape}, got {result.shape}" + print("✅ Output shape test passed") + +except Exception as e: + print(f"❌ Output shape test failed: {e}") + raise + +# 
Show the convolution process +print("🎯 Convolution behavior:") +print(" Slides kernel across input") +print(" Computes dot product at each position") +print(" Output size = Input size - Kernel size + 1") +print("📈 Progress: Convolution operation ✓") + +# %% [markdown] +""" +## Step 2: Building the Conv2D Layer + +### What is a Conv2D Layer? +A **Conv2D layer** is a learnable convolutional layer that: +- Has learnable kernel weights (initialized randomly) +- Applies convolution to input tensors +- Integrates with the rest of the neural network + +### Why Conv2D Layers Matter +- **Feature learning**: Kernels learn to detect useful patterns +- **Composability**: Can be stacked with other layers +- **Efficiency**: Shared weights reduce parameters dramatically +- **Translation invariance**: Same patterns detected anywhere in the image + +### Real-World Applications +- **Image classification**: Recognize objects in photos +- **Object detection**: Find and locate objects +- **Medical imaging**: Detect anomalies in scans +- **Autonomous driving**: Identify road features + +### Design Decisions +- **Kernel size**: Typically 3×3 or 5×5 for balance of locality and capacity +- **Initialization**: Small random values to break symmetry +- **Integration**: Works with Tensor class and other layers +""" + +# %% nbgrader={"grade": false, "grade_id": "conv2d-class", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +class Conv2D: + """ + 2D Convolutional Layer (single channel, single filter, no stride/pad). + + A learnable convolutional layer that applies a kernel to detect spatial patterns. + Perfect for building the foundation of convolutional neural networks. + """ + + def __init__(self, kernel_size: Tuple[int, int]): + """ + Initialize Conv2D layer with random kernel. + + Args: + kernel_size: (kH, kW) - size of the convolution kernel + + TODO: Initialize a random kernel with small values. + + APPROACH: + 1. 
Store kernel_size as instance variable + 2. Initialize random kernel with small values + 3. Use proper initialization for stable training + + EXAMPLE: + Conv2D((2, 2)) creates: + - kernel: shape (2, 2) with small random values + + HINTS: + - Store kernel_size as self.kernel_size + - Initialize kernel: np.random.randn(kH, kW) * 0.1 (small values) + - Convert to float32 for consistency + """ + ### BEGIN SOLUTION + # Store kernel size + self.kernel_size = kernel_size + kH, kW = kernel_size + + # Initialize random kernel with small values + self.kernel = np.random.randn(kH, kW).astype(np.float32) * 0.1 + ### END SOLUTION + + def forward(self, x: Tensor) -> Tensor: + """ + Forward pass: apply convolution to input tensor. + + Args: + x: Input tensor (2D for simplicity) + + Returns: + Output tensor after convolution + + TODO: Implement forward pass using conv2d_naive function. + + APPROACH: + 1. Extract numpy array from input tensor + 2. Apply conv2d_naive with stored kernel + 3. Return result wrapped in Tensor + + EXAMPLE: + x = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # shape (3, 3) + layer = Conv2D((2, 2)) + y = layer(x) # shape (2, 2) + + HINTS: + - Use x.data to get numpy array + - Use conv2d_naive(x.data, self.kernel) + - Return Tensor(result) to wrap the result + """ + ### BEGIN SOLUTION + # Apply convolution using naive implementation + result = conv2d_naive(x.data, self.kernel) + return Tensor(result) + ### END SOLUTION + + def __call__(self, x: Tensor) -> Tensor: + """Make layer callable: layer(x) same as layer.forward(x)""" + return self.forward(x) + +# %% [markdown] +""" +### 🧪 Quick Test: Conv2D Layer + +Let's test your Conv2D layer implementation! This is a learnable convolutional layer that can be trained. 
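As a quick aside before the graded test: the layer's behavior can be sketched standalone with plain NumPy arrays in place of the module's `Tensor`. The `MiniConv2D` class below is an illustrative stand-in (not part of the package), showing both the small random initialization and how a hand-set kernel acts as a pattern detector.

```python
# Minimal stand-in for the Conv2D layer above, using plain NumPy
# arrays instead of the module's Tensor class so it runs standalone.
import numpy as np

class MiniConv2D:
    def __init__(self, kernel_size, seed=0):
        kH, kW = kernel_size
        rng = np.random.default_rng(seed)
        # Small random weights, as in the module's initializer.
        self.kernel = (rng.standard_normal((kH, kW)) * 0.1).astype(np.float32)

    def __call__(self, x):
        kH, kW = self.kernel.shape
        out_H, out_W = x.shape[0] - kH + 1, x.shape[1] - kW + 1
        out = np.zeros((out_H, out_W), dtype=np.float32)
        for i in range(out_H):
            for j in range(out_W):
                out[i, j] = np.sum(x[i:i + kH, j:j + kW] * self.kernel)
        return out

layer = MiniConv2D((2, 2))
x = np.arange(1, 10, dtype=np.float32).reshape(3, 3)
y = layer(x)
assert y.shape == (2, 2)  # (3,3) input with a (2,2) kernel -> (2,2) output

# Hand-set the kernel to show it acts as a pattern detector:
layer.kernel = np.array([[-1, 1], [-1, 1]], dtype=np.float32)  # vertical edge
edge_img = np.array([[0, 0, 1, 1]] * 3, dtype=np.float32)
assert (layer(edge_img)[:, 1] > 0).all()  # fires on the 0->1 boundary
```

In the real layer the kernel is not hand-set; training will push the random weights toward whatever detector reduces the loss.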
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-conv2d-layer-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +# Test Conv2D layer immediately after implementation +print("🔬 Testing Conv2D layer...") + +# Create a Conv2D layer +try: + layer = Conv2D(kernel_size=(2, 2)) + print(f"Conv2D layer created with kernel size: {layer.kernel_size}") + print(f"Kernel shape: {layer.kernel.shape}") + + # Test that kernel is initialized properly + assert layer.kernel.shape == (2, 2), f"Kernel shape should be (2, 2), got {layer.kernel.shape}" + assert not np.allclose(layer.kernel, 0), "Kernel should not be all zeros" + print("✅ Conv2D layer initialization successful") + + # Test with sample input + x = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) + print(f"Input shape: {x.shape}") + + y = layer(x) + print(f"Output shape: {y.shape}") + print(f"Output: {y}") + + # Verify shapes + assert y.shape == (2, 2), f"Output shape should be (2, 2), got {y.shape}" + assert isinstance(y, Tensor), "Output should be a Tensor" + print("✅ Conv2D layer forward pass successful") + +except Exception as e: + print(f"❌ Conv2D layer test failed: {e}") + raise + +# Test different kernel sizes +try: + layer_3x3 = Conv2D(kernel_size=(3, 3)) + x_5x5 = Tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]]) + y_3x3 = layer_3x3(x_5x5) + + assert y_3x3.shape == (3, 3), f"3x3 kernel output should be (3, 3), got {y_3x3.shape}" + print("✅ Different kernel sizes work correctly") + +except Exception as e: + print(f"❌ Different kernel sizes test failed: {e}") + raise + +# Show the layer behavior +print("🎯 Conv2D layer behavior:") +print(" Learnable kernel weights") +print(" Applies convolution to detect patterns") +print(" Can be trained end-to-end") +print("📈 Progress: Convolution operation ✓, Conv2D layer ✓") + +# %% [markdown] +""" +## Step 3: Flattening for Dense Layers + +### What is Flattening? 
+**Flattening** converts multi-dimensional tensors to 1D vectors, enabling connection between convolutional and dense layers. + +### Why Flattening is Needed +- **Interface compatibility**: Conv2D outputs 2D, Dense expects 1D +- **Network composition**: Connect spatial features to classification +- **Standard practice**: Almost all CNNs use this pattern +- **Dimension management**: Preserve information while changing shape + +### The Pattern +``` +Conv2D → ReLU → Conv2D → ReLU → Flatten → Dense → Output +``` + +### Real-World Usage +- **Classification**: Final layers need 1D input for class probabilities +- **Feature extraction**: Convert spatial features to vector representations +- **Transfer learning**: Extract features from pre-trained CNNs +""" + +# %% nbgrader={"grade": false, "grade_id": "flatten-function", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def flatten(x: Tensor) -> Tensor: + """ + Flatten a 2D tensor to 1D (for connecting to Dense layers). + + Args: + x: Input tensor to flatten + + Returns: + Flattened tensor with batch dimension preserved + + TODO: Implement flattening operation. + + APPROACH: + 1. Get the numpy array from the tensor + 2. Use .flatten() to convert to 1D + 3. Add batch dimension with [None, :] + 4. Return Tensor wrapped around the result + + EXAMPLE: + Input: Tensor([[1, 2], [3, 4]]) # shape (2, 2) + Output: Tensor([[1, 2, 3, 4]]) # shape (1, 4) + + HINTS: + - Use x.data.flatten() to get 1D array + - Add batch dimension: result[None, :] + - Return Tensor(result) + """ + ### BEGIN SOLUTION + # Flatten the tensor and add batch dimension + flattened = x.data.flatten() + result = flattened[None, :] # Add batch dimension + return Tensor(result) + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Quick Test: Flatten Function + +Let's test your flatten function! This connects convolutional layers to dense layers. 
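One practical detail worth making explicit: the `input_size` of the Dense layer that follows a flatten must equal the element count of the conv feature map. A minimal NumPy sketch of that bookkeeping (the local `flatten` mirrors the module's function; the 8x8 image and 3x3 kernel sizes are illustrative):

```python
# Shape bookkeeping for wiring a Conv2D feature map into a Dense layer.
import numpy as np

def flatten(x):
    return x.flatten()[None, :]  # row-major 1D, with a batch dimension

H, W = 8, 8                                # input image size
kH, kW = 3, 3                              # conv kernel size
out_H, out_W = H - kH + 1, W - kW + 1      # 6 x 6 feature map
dense_input_size = out_H * out_W           # 36: what Dense must expect

feature_map = np.random.randn(out_H, out_W).astype(np.float32)
vec = flatten(feature_map)
assert vec.shape == (1, dense_input_size)  # ready for a Dense(36, ...) layer
```

Getting this arithmetic wrong is the most common cause of shape errors when composing CNN pipelines, so it is worth computing `out_H * out_W` explicitly rather than hard-coding it.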
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-flatten-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +# Test flatten function immediately after implementation +print("🔬 Testing flatten function...") + +# Test case 1: 2x2 tensor +try: + x = Tensor([[1, 2], [3, 4]]) + flattened = flatten(x) + + print(f"Input: {x}") + print(f"Flattened: {flattened}") + print(f"Flattened shape: {flattened.shape}") + + # Verify shape and content + assert flattened.shape == (1, 4), f"Flattened shape should be (1, 4), got {flattened.shape}" + expected_data = np.array([[1, 2, 3, 4]]) + assert np.array_equal(flattened.data, expected_data), f"Flattened data should be {expected_data}, got {flattened.data}" + print("✅ 2x2 flatten test passed") + +except Exception as e: + print(f"❌ 2x2 flatten test failed: {e}") + raise + +# Test case 2: 3x3 tensor +try: + x2 = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) + flattened2 = flatten(x2) + + assert flattened2.shape == (1, 9), f"Flattened shape should be (1, 9), got {flattened2.shape}" + expected_data2 = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9]]) + assert np.array_equal(flattened2.data, expected_data2), f"Flattened data should be {expected_data2}, got {flattened2.data}" + print("✅ 3x3 flatten test passed") + +except Exception as e: + print(f"❌ 3x3 flatten test failed: {e}") + raise + +# Test case 3: Different shapes +try: + x3 = Tensor([[1, 2, 3, 4], [5, 6, 7, 8]]) # 2x4 + flattened3 = flatten(x3) + + assert flattened3.shape == (1, 8), f"Flattened shape should be (1, 8), got {flattened3.shape}" + expected_data3 = np.array([[1, 2, 3, 4, 5, 6, 7, 8]]) + assert np.array_equal(flattened3.data, expected_data3), f"Flattened data should be {expected_data3}, got {flattened3.data}" + print("✅ Different shapes flatten test passed") + +except Exception as e: + print(f"❌ Different shapes flatten test failed: {e}") + raise + +# Show the flattening behavior +print("🎯 Flatten behavior:") +print(" Converts 2D 
tensor to 1D") +print(" Preserves batch dimension") +print(" Enables connection to Dense layers") +print("📈 Progress: Convolution operation ✓, Conv2D layer ✓, Flatten ✓") +print("🚀 CNN pipeline ready!") + +# %% [markdown] +""" +## 🧪 Comprehensive CNN Testing Suite + +Let's test all CNN components thoroughly with realistic computer vision scenarios! +""" + +# %% nbgrader={"grade": false, "grade_id": "test-cnn-comprehensive", "locked": false, "schema_version": 3, "solution": false, "task": false} +def test_convolution_operations(): + """Test 1: Comprehensive convolution operations testing""" + print("🔬 Testing Convolution Operations...") + + # Test 1.1: Basic convolution + try: + input_img = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32) + identity_kernel = np.array([[1, 0], [0, 1]], dtype=np.float32) + + result = conv2d_naive(input_img, identity_kernel) + expected = np.array([[6, 8], [12, 14]], dtype=np.float32) + + assert np.allclose(result, expected), f"Identity convolution failed: {result} vs {expected}" + print("✅ Basic convolution test passed") + except Exception as e: + print(f"❌ Basic convolution failed: {e}") + return False + + # Test 1.2: Edge detection kernel + try: + # Vertical edge detection + edge_input = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=np.float32) + vertical_edge = np.array([[-1, 1], [-1, 1]], dtype=np.float32) + + result = conv2d_naive(edge_input, vertical_edge) + # Should detect the vertical edge at position (0,1) and (1,1) + assert result[0, 1] > 0 and result[1, 1] > 0, "Vertical edge not detected" + print("✅ Edge detection test passed") + except Exception as e: + print(f"❌ Edge detection failed: {e}") + return False + + # Test 1.3: Blur kernel + try: + noise_input = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=np.float32) + blur_kernel = np.array([[0.25, 0.25], [0.25, 0.25]], dtype=np.float32) + + result = conv2d_naive(noise_input, blur_kernel) + # Blur should smooth out the noise + assert 
np.all(result >= 0) and np.all(result <= 1), "Blur kernel failed" + print("✅ Blur kernel test passed") + except Exception as e: + print(f"❌ Blur kernel failed: {e}") + return False + + # Test 1.4: Different kernel sizes + try: + large_input = np.random.randn(10, 10).astype(np.float32) + + # Test 3x3 kernel + kernel_3x3 = np.random.randn(3, 3).astype(np.float32) + result_3x3 = conv2d_naive(large_input, kernel_3x3) + assert result_3x3.shape == (8, 8), f"3x3 kernel output shape wrong: {result_3x3.shape}" + + # Test 5x5 kernel + kernel_5x5 = np.random.randn(5, 5).astype(np.float32) + result_5x5 = conv2d_naive(large_input, kernel_5x5) + assert result_5x5.shape == (6, 6), f"5x5 kernel output shape wrong: {result_5x5.shape}" + + print("✅ Different kernel sizes test passed") + except Exception as e: + print(f"❌ Different kernel sizes failed: {e}") + return False + + print("🎯 Convolution operations: All tests passed!") + return True + +def test_conv2d_layer(): + """Test 2: Conv2D layer comprehensive testing""" + print("🔬 Testing Conv2D Layer...") + + # Test 2.1: Layer initialization + try: + layer_2x2 = Conv2D(kernel_size=(2, 2)) + assert layer_2x2.kernel.shape == (2, 2), f"2x2 kernel shape wrong: {layer_2x2.kernel.shape}" + assert not np.allclose(layer_2x2.kernel, 0), "Kernel should not be all zeros" + + layer_3x3 = Conv2D(kernel_size=(3, 3)) + assert layer_3x3.kernel.shape == (3, 3), f"3x3 kernel shape wrong: {layer_3x3.kernel.shape}" + + print("✅ Layer initialization test passed") + except Exception as e: + print(f"❌ Layer initialization failed: {e}") + return False + + # Test 2.2: Forward pass with different inputs + try: + layer = Conv2D(kernel_size=(2, 2)) + + # Small image + small_img = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) + output_small = layer(small_img) + assert output_small.shape == (2, 2), f"Small image output shape wrong: {output_small.shape}" + assert isinstance(output_small, Tensor), "Output should be Tensor" + + # Larger image + large_img = 
Tensor(np.random.randn(8, 8)) + output_large = layer(large_img) + assert output_large.shape == (7, 7), f"Large image output shape wrong: {output_large.shape}" + + print("✅ Forward pass test passed") + except Exception as e: + print(f"❌ Forward pass failed: {e}") + return False + + # Test 2.3: Learnable parameters + try: + layer1 = Conv2D(kernel_size=(2, 2)) + layer2 = Conv2D(kernel_size=(2, 2)) + + # Different layers should have different random kernels + assert not np.allclose(layer1.kernel, layer2.kernel), "Different layers should have different kernels" + + # Test that kernels are reasonable size (not too large) + assert np.max(np.abs(layer1.kernel)) < 1.0, "Kernel values should be small for stable training" + + print("✅ Learnable parameters test passed") + except Exception as e: + print(f"❌ Learnable parameters failed: {e}") + return False + + # Test 2.4: Real computer vision scenario - digit recognition + try: + # Simulate a simple 5x5 digit + digit_5x5 = Tensor([ + [0, 1, 1, 1, 0], + [1, 0, 0, 0, 1], + [1, 0, 1, 0, 1], + [1, 0, 0, 0, 1], + [0, 1, 1, 1, 0] + ]) + + # Edge detection layer + edge_layer = Conv2D(kernel_size=(3, 3)) + edge_layer.kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=np.float32) + + edges = edge_layer(digit_5x5) + assert edges.shape == (3, 3), f"Edge detection output shape wrong: {edges.shape}" + + print("✅ Computer vision scenario test passed") + except Exception as e: + print(f"❌ Computer vision scenario failed: {e}") + return False + + print("🎯 Conv2D layer: All tests passed!") + return True + +def test_flatten_operations(): + """Test 3: Flatten operations comprehensive testing""" + print("🔬 Testing Flatten Operations...") + + # Test 3.1: Basic flattening + try: + # 2x2 tensor + x_2x2 = Tensor([[1, 2], [3, 4]]) + flat_2x2 = flatten(x_2x2) + + assert flat_2x2.shape == (1, 4), f"2x2 flatten shape wrong: {flat_2x2.shape}" + expected = np.array([[1, 2, 3, 4]]) + assert np.array_equal(flat_2x2.data, expected), f"2x2 
flatten data wrong: {flat_2x2.data}" + + # 3x3 tensor + x_3x3 = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) + flat_3x3 = flatten(x_3x3) + + assert flat_3x3.shape == (1, 9), f"3x3 flatten shape wrong: {flat_3x3.shape}" + expected = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9]]) + assert np.array_equal(flat_3x3.data, expected), f"3x3 flatten data wrong: {flat_3x3.data}" + + print("✅ Basic flattening test passed") + except Exception as e: + print(f"❌ Basic flattening failed: {e}") + return False + + # Test 3.2: Different aspect ratios + try: + # Wide tensor + x_wide = Tensor([[1, 2, 3, 4, 5, 6]]) # 1x6 + flat_wide = flatten(x_wide) + assert flat_wide.shape == (1, 6), f"Wide flatten shape wrong: {flat_wide.shape}" + + # Tall tensor + x_tall = Tensor([[1], [2], [3], [4], [5], [6]]) # 6x1 + flat_tall = flatten(x_tall) + assert flat_tall.shape == (1, 6), f"Tall flatten shape wrong: {flat_tall.shape}" + + print("✅ Different aspect ratios test passed") + except Exception as e: + print(f"❌ Different aspect ratios failed: {e}") + return False + + # Test 3.3: Preserve data order + try: + # Test that flattening preserves row-major order + x_ordered = Tensor([[1, 2, 3], [4, 5, 6]]) # 2x3 + flat_ordered = flatten(x_ordered) + + expected_order = np.array([[1, 2, 3, 4, 5, 6]]) + assert np.array_equal(flat_ordered.data, expected_order), "Flatten should preserve row-major order" + + print("✅ Data order preservation test passed") + except Exception as e: + print(f"❌ Data order preservation failed: {e}") + return False + + # Test 3.4: CNN to Dense connection scenario + try: + # Simulate CNN feature map -> Dense layer + feature_map = Tensor([[0.1, 0.2], [0.3, 0.4]]) # 2x2 feature map + flattened_features = flatten(feature_map) + + # Should be ready for Dense layer input + assert flattened_features.shape == (1, 4), "Feature map should flatten to (1, 4)" + assert isinstance(flattened_features, Tensor), "Should remain a Tensor" + + # Test with Dense layer + dense = Dense(input_size=4, 
output_size=2) + output = dense(flattened_features) + assert output.shape == (1, 2), f"Dense output shape wrong: {output.shape}" + + print("✅ CNN to Dense connection test passed") + except Exception as e: + print(f"❌ CNN to Dense connection failed: {e}") + return False + + print("🎯 Flatten operations: All tests passed!") + return True + +def test_cnn_pipelines(): + """Test 4: Complete CNN pipeline testing""" + print("🔬 Testing CNN Pipelines...") + + # Test 4.1: Simple CNN pipeline + try: + # Create pipeline: Conv2D -> ReLU -> Flatten -> Dense + conv = Conv2D(kernel_size=(2, 2)) + relu = ReLU() + dense = Dense(input_size=4, output_size=3) + + # Input image + image = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) + + # Forward pass + features = conv(image) # (3,3) -> (2,2) + activated = relu(features) # (2,2) -> (2,2) + flattened = flatten(activated) # (2,2) -> (1,4) + output = dense(flattened) # (1,4) -> (1,3) + + assert features.shape == (2, 2), f"Conv output shape wrong: {features.shape}" + assert activated.shape == (2, 2), f"ReLU output shape wrong: {activated.shape}" + assert flattened.shape == (1, 4), f"Flatten output shape wrong: {flattened.shape}" + assert output.shape == (1, 3), f"Dense output shape wrong: {output.shape}" + + print("✅ Simple CNN pipeline test passed") + except Exception as e: + print(f"❌ Simple CNN pipeline failed: {e}") + return False + + # Test 4.2: Multi-layer CNN + try: + # Create deeper pipeline: Conv2D -> ReLU -> Conv2D -> ReLU -> Flatten -> Dense + conv1 = Conv2D(kernel_size=(2, 2)) + relu1 = ReLU() + conv2 = Conv2D(kernel_size=(2, 2)) + relu2 = ReLU() + dense = Dense(input_size=1, output_size=2) + + # Larger input for multi-layer processing + large_image = Tensor(np.random.randn(5, 5)) + + # Forward pass + h1 = conv1(large_image) # (5,5) -> (4,4) + h2 = relu1(h1) # (4,4) -> (4,4) + h3 = conv2(h2) # (4,4) -> (3,3) + h4 = relu2(h3) # (3,3) -> (3,3) + h5 = flatten(h4) # (3,3) -> (1,9) + + # Adjust dense layer for correct input size + 
dense_adjusted = Dense(input_size=9, output_size=2) + output = dense_adjusted(h5) # (1,9) -> (1,2) + + assert h1.shape == (4, 4), f"Conv1 output wrong: {h1.shape}" + assert h3.shape == (3, 3), f"Conv2 output wrong: {h3.shape}" + assert h5.shape == (1, 9), f"Flatten output wrong: {h5.shape}" + assert output.shape == (1, 2), f"Final output wrong: {output.shape}" + + print("✅ Multi-layer CNN test passed") + except Exception as e: + print(f"❌ Multi-layer CNN failed: {e}") + return False + + # Test 4.3: Image classification scenario + try: + # Simulate MNIST-like 8x8 digit classification + digit_image = Tensor(np.random.randn(8, 8)) + + # CNN for digit classification + feature_extractor = Conv2D(kernel_size=(3, 3)) # (8,8) -> (6,6) + activation = ReLU() + classifier_prep = flatten # (6,6) -> (1,36) + classifier = Dense(input_size=36, output_size=10) # 10 digit classes + + # Forward pass + features = feature_extractor(digit_image) + activated_features = activation(features) + feature_vector = classifier_prep(activated_features) + digit_scores = classifier(feature_vector) + + assert features.shape == (6, 6), f"Feature extraction shape wrong: {features.shape}" + assert feature_vector.shape == (1, 36), f"Feature vector shape wrong: {feature_vector.shape}" + assert digit_scores.shape == (1, 10), f"Digit scores shape wrong: {digit_scores.shape}" + + print("✅ Image classification scenario test passed") + except Exception as e: + print(f"❌ Image classification scenario failed: {e}") + return False + + # Test 4.4: Real-world CNN architecture pattern + try: + # Simulate LeNet-like architecture pattern + input_img = Tensor(np.random.randn(32, 32)) # 32x32 input image + + # First conv block + conv1 = Conv2D(kernel_size=(5, 5)) # (32,32) -> (28,28) + relu1 = ReLU() + + # Second conv block + conv2 = Conv2D(kernel_size=(5, 5)) # (28,28) -> (24,24) + relu2 = ReLU() + + # Classifier + classifier = Dense(input_size=24*24, output_size=3) # 3 classes + + # Forward pass + h1 = 
relu1(conv1(input_img)) + h2 = relu2(conv2(h1)) + h3 = flatten(h2) + output = classifier(h3) + + assert h1.shape == (28, 28), f"First conv block output wrong: {h1.shape}" + assert h2.shape == (24, 24), f"Second conv block output wrong: {h2.shape}" + assert h3.shape == (1, 576), f"Flattened features wrong: {h3.shape}" # 24*24 = 576 + assert output.shape == (1, 3), f"Classification output wrong: {output.shape}" + + print("✅ Real-world CNN architecture test passed") + except Exception as e: + print(f"❌ Real-world CNN architecture failed: {e}") + return False + + print("🎯 CNN pipelines: All tests passed!") + return True + +# Run all comprehensive tests +def run_comprehensive_cnn_tests(): + """Run all comprehensive CNN tests""" + print("🧪 Running Comprehensive CNN Test Suite...") + print("=" * 50) + + test_results = [] + + # Run all test functions + test_results.append(test_convolution_operations()) + test_results.append(test_conv2d_layer()) + test_results.append(test_flatten_operations()) + test_results.append(test_cnn_pipelines()) + + # Summary + print("=" * 50) + print("📊 Test Results Summary:") + print(f"✅ Convolution Operations: {'PASSED' if test_results[0] else 'FAILED'}") + print(f"✅ Conv2D Layer: {'PASSED' if test_results[1] else 'FAILED'}") + print(f"✅ Flatten Operations: {'PASSED' if test_results[2] else 'FAILED'}") + print(f"✅ CNN Pipelines: {'PASSED' if test_results[3] else 'FAILED'}") + + all_passed = all(test_results) + print(f"\n🎯 Overall Result: {'ALL TESTS PASSED! 
🎉' if all_passed else 'SOME TESTS FAILED ❌'}") + + if all_passed: + print("\n🚀 CNN Module Implementation Complete!") + print(" ✓ Convolution operations working correctly") + print(" ✓ Conv2D layers ready for training") + print(" ✓ Flatten operations connecting conv to dense layers") + print(" ✓ Complete CNN pipelines functional") + print("\n🎓 Ready for real computer vision applications!") + + return all_passed + +# Run the comprehensive test suite +if __name__ == "__main__": + run_comprehensive_cnn_tests() + +# %% [markdown] +""" +### 🧪 Test Your CNN Implementations + +Once you implement the functions above, run these cells to test them: +""" + +# %% nbgrader={"grade": true, "grade_id": "test-conv2d-naive", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test conv2d_naive function +print("Testing conv2d_naive function...") + +# Test case 1: Simple 3x3 input with 2x2 kernel +input_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32) +kernel_array = np.array([[1, 0], [0, -1]], dtype=np.float32) + +result = conv2d_naive(input_array, kernel_array) +expected = np.array([[-4, -4], [-4, -4]], dtype=np.float32) + +print(f"Input:\n{input_array}") +print(f"Kernel:\n{kernel_array}") +print(f"Result:\n{result}") +print(f"Expected:\n{expected}") + +assert np.allclose(result, expected), f"conv2d_naive failed: expected {expected}, got {result}" + +# Test case 2: Different kernel +kernel2 = np.array([[1, 1], [1, 1]], dtype=np.float32) +result2 = conv2d_naive(input_array, kernel2) +expected2 = np.array([[12, 16], [24, 28]], dtype=np.float32) + +assert np.allclose(result2, expected2), f"conv2d_naive failed: expected {expected2}, got {result2}" + +print("✅ conv2d_naive tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-conv2d-layer", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test Conv2D layer +print("Testing Conv2D layer...") + +# Create a Conv2D layer +layer = 
Conv2D(kernel_size=(2, 2)) +print(f"Kernel size: {layer.kernel_size}") +print(f"Kernel shape: {layer.kernel.shape}") + +# Test with sample input +x = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) +print(f"Input shape: {x.shape}") + +y = layer(x) +print(f"Output shape: {y.shape}") +print(f"Output: {y}") + +# Verify shapes +assert y.shape == (2, 2), f"Output shape should be (2, 2), got {y.shape}" +assert isinstance(y, Tensor), "Output should be a Tensor" + +print("✅ Conv2D layer tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-flatten", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test flatten function +print("Testing flatten function...") + +# Test case 1: 2x2 tensor +x = Tensor([[1, 2], [3, 4]]) +flattened = flatten(x) + +print(f"Input: {x}") +print(f"Flattened: {flattened}") +print(f"Flattened shape: {flattened.shape}") + +# Verify shape and content +assert flattened.shape == (1, 4), f"Flattened shape should be (1, 4), got {flattened.shape}" +expected_data = np.array([[1, 2, 3, 4]]) +assert np.array_equal(flattened.data, expected_data), f"Flattened data should be {expected_data}, got {flattened.data}" + +# Test case 2: 3x3 tensor +x2 = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) +flattened2 = flatten(x2) + +assert flattened2.shape == (1, 9), f"Flattened shape should be (1, 9), got {flattened2.shape}" +expected_data2 = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9]]) +assert np.array_equal(flattened2.data, expected_data2), f"Flattened data should be {expected_data2}, got {flattened2.data}" + +print("✅ Flatten tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-cnn-pipeline", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test complete CNN pipeline +print("Testing complete CNN pipeline...") + +# Create a simple CNN pipeline: Conv2D → ReLU → Flatten → Dense +conv_layer = Conv2D(kernel_size=(2, 2)) +relu = ReLU() +dense_layer = Dense(input_size=4, output_size=2) + +# 
Test input (3x3 image) +x = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) +print(f"Input shape: {x.shape}") + +# Forward pass through pipeline +h1 = conv_layer(x) +print(f"After Conv2D: {h1.shape}") + +h2 = relu(h1) +print(f"After ReLU: {h2.shape}") + +h3 = flatten(h2) +print(f"After Flatten: {h3.shape}") + +h4 = dense_layer(h3) +print(f"After Dense: {h4.shape}") + +# Verify pipeline works +assert h1.shape == (2, 2), f"Conv2D output should be (2, 2), got {h1.shape}" +assert h2.shape == (2, 2), f"ReLU output should be (2, 2), got {h2.shape}" +assert h3.shape == (1, 4), f"Flatten output should be (1, 4), got {h3.shape}" +assert h4.shape == (1, 2), f"Dense output should be (1, 2), got {h4.shape}" + +print("✅ CNN pipeline tests passed!") + +# %% [markdown] +""" +## 🎯 Module Summary + +Congratulations! You've successfully implemented the core components of convolutional neural networks: + +### What You've Accomplished +✅ **Convolution Operation**: Implemented conv2d_naive with sliding window from scratch +✅ **Conv2D Layer**: Built a learnable convolutional layer with random kernel initialization +✅ **Flattening**: Created the bridge between convolutional and dense layers +✅ **CNN Pipeline**: Composed Conv2D → ReLU → Flatten → Dense for complete networks +✅ **Spatial Pattern Detection**: Understanding how convolution detects local features + +### Key Concepts You've Learned +- **Convolution is pattern matching**: Kernels detect specific spatial patterns +- **Parameter sharing**: Same kernel applied everywhere for translation invariance +- **Local connectivity**: Each output depends only on a small input region +- **Spatial hierarchy**: Multiple layers build increasingly complex features +- **Dimension management**: Flattening connects spatial and vector representations + +### Mathematical Foundations +- **Convolution operation**: (I * K)[i,j] = ΣΣ I[i+m, j+n] × K[m,n] +- **Sliding window**: Kernel moves across input computing dot products +- **Feature maps**: Convolution 
outputs that highlight detected patterns +- **Translation invariance**: Same pattern detected regardless of position + +### Real-World Applications +- **Computer vision**: Object recognition, face detection, medical imaging +- **Image processing**: Edge detection, noise reduction, enhancement +- **Autonomous systems**: Traffic sign recognition, obstacle detection +- **Scientific imaging**: Satellite imagery, microscopy, astronomy + +### Next Steps +1. **Export your code**: `tito package nbdev --export 05_cnn` +2. **Test your implementation**: `tito module test 05_cnn` +3. **Use your CNN components**: + ```python + from tinytorch.core.cnn import Conv2D, conv2d_naive, flatten + from tinytorch.core.layers import Dense + from tinytorch.core.activations import ReLU + + # Create CNN pipeline + conv = Conv2D((3, 3)) + relu = ReLU() + dense = Dense(16, 10) + + # Process image + features = conv(image) + activated = relu(features) + flattened = flatten(activated) + output = dense(flattened) + ``` +4. **Move to Module 6**: Start building data loading and preprocessing pipelines! + +**Ready for the next challenge?** Let's build efficient data loading systems to feed our networks! +""" \ No newline at end of file diff --git a/modules/source/05_cnn/tests/test_cnn.py b/modules/source/05_cnn/tests/test_cnn.py deleted file mode 100644 index 93a7c4f1..00000000 --- a/modules/source/05_cnn/tests/test_cnn.py +++ /dev/null @@ -1,368 +0,0 @@ -""" -Test suite for the CNN module. -This tests the CNN implementations to ensure they work correctly. 
-""" - -import pytest -import numpy as np -import sys -from pathlib import Path - -# Add the CNN module to the path -sys.path.append(str(Path(__file__).parent.parent)) - -try: - # Import from the exported package - from tinytorch.core.cnn import conv2d_naive, Conv2D, flatten -except ImportError: - # Fallback for when module isn't exported yet - from cnn_dev import conv2d_naive, Conv2D, flatten - -from tinytorch.core.tensor import Tensor - -def safe_numpy(tensor): - """Get numpy array from tensor, using .data attribute""" - return tensor.data - - -class TestConv2DNaive: - """Test the naive convolution implementation.""" - - def test_conv2d_naive_small(self): - """Test basic convolution with small matrices.""" - input = np.array([ - [1, 2, 3], - [4, 5, 6], - [7, 8, 9] - ], dtype=np.float32) - kernel = np.array([ - [1, 0], - [0, -1] - ], dtype=np.float32) - expected = np.array([ - [1*1+2*0+4*0+5*(-1), 2*1+3*0+5*0+6*(-1)], - [4*1+5*0+7*0+8*(-1), 5*1+6*0+8*0+9*(-1)] - ], dtype=np.float32) - output = conv2d_naive(input, kernel) - assert np.allclose(output, expected), f"conv2d_naive output incorrect!\nExpected:\n{expected}\nGot:\n{output}" - - def test_conv2d_naive_edge_detection(self): - """Test convolution with edge detection kernel.""" - input = np.array([ - [0, 0, 0, 0, 0], - [0, 1, 1, 1, 0], - [0, 1, 1, 1, 0], - [0, 1, 1, 1, 0], - [0, 0, 0, 0, 0] - ], dtype=np.float32) - - # Vertical edge detection kernel - kernel = np.array([ - [-1, 0, 1], - [-2, 0, 2], - [-1, 0, 1] - ], dtype=np.float32) - - output = conv2d_naive(input, kernel) - assert output.shape == (3, 3), f"Expected shape (3, 3), got {output.shape}" - - # Should detect vertical edges - assert np.abs(output[1, 0]) > 0, "Should detect left edge" - assert np.abs(output[1, 2]) > 0, "Should detect right edge" - assert np.abs(output[1, 1]) < 1, "Should be small in center" - - def test_conv2d_naive_identity_kernel(self): - """Test convolution with identity kernel.""" - input = np.array([ - [1, 2, 3], - [4, 5, 6], - 
[7, 8, 9] - ], dtype=np.float32) - - # Identity kernel - kernel = np.array([ - [0, 0, 0], - [0, 1, 0], - [0, 0, 0] - ], dtype=np.float32) - - output = conv2d_naive(input, kernel) - expected = np.array([[5]], dtype=np.float32) # Only center value - assert np.allclose(output, expected), f"Identity kernel failed: got {output}, expected {expected}" - - def test_conv2d_naive_different_sizes(self): - """Test convolution with different input and kernel sizes.""" - # 4x4 input, 2x2 kernel - input = np.array([ - [1, 2, 3, 4], - [5, 6, 7, 8], - [9, 10, 11, 12], - [13, 14, 15, 16] - ], dtype=np.float32) - - kernel = np.array([ - [1, 1], - [1, 1] - ], dtype=np.float32) - - output = conv2d_naive(input, kernel) - assert output.shape == (3, 3), f"Expected shape (3, 3), got {output.shape}" - - # Check first element: 1+2+5+6 = 14 - assert np.isclose(output[0, 0], 14), f"First element should be 14, got {output[0, 0]}" - - def test_conv2d_naive_single_pixel(self): - """Test convolution with single pixel input.""" - input = np.array([[5]], dtype=np.float32) - kernel = np.array([[2]], dtype=np.float32) - - output = conv2d_naive(input, kernel) - expected = np.array([[10]], dtype=np.float32) - assert np.allclose(output, expected), f"Single pixel convolution failed: got {output}, expected {expected}" - - -class TestConv2DLayer: - """Test the Conv2D layer implementation.""" - - def test_conv2d_layer_creation(self): - """Test Conv2D layer creation.""" - conv = Conv2D((3, 3)) - assert conv.kernel_size == (3, 3), f"Kernel size should be (3, 3), got {conv.kernel_size}" - assert conv.kernel.shape == (3, 3), f"Kernel shape should be (3, 3), got {conv.kernel.shape}" - - def test_conv2d_layer_forward_pass(self): - """Test Conv2D layer forward pass.""" - conv = Conv2D((2, 2)) - x = Tensor(np.ones((4, 4), dtype=np.float32)) - - output = conv(x) - assert output.shape == (3, 3), f"Expected output shape (3, 3), got {output.shape}" - assert hasattr(output, 'data'), "Output should be a Tensor with data 
attribute" - - def test_conv2d_layer_different_sizes(self): - """Test Conv2D layer with different input sizes.""" - conv = Conv2D((2, 2)) - - # Test with 3x3 input - x1 = Tensor(np.ones((3, 3), dtype=np.float32)) - out1 = conv(x1) - assert out1.shape == (2, 2), f"3x3 input should give (2, 2) output, got {out1.shape}" - - # Test with 5x5 input - x2 = Tensor(np.ones((5, 5), dtype=np.float32)) - out2 = conv(x2) - assert out2.shape == (4, 4), f"5x5 input should give (4, 4) output, got {out2.shape}" - - def test_conv2d_layer_kernel_initialization(self): - """Test that Conv2D layer initializes kernel properly.""" - conv = Conv2D((3, 3)) - - # Kernel should not be all zeros - assert not np.allclose(conv.kernel, 0), "Kernel should not be all zeros" - - # Kernel should be reasonable size (not too large) - assert np.abs(conv.kernel).max() < 10, "Kernel values should be reasonable" - - def test_conv2d_layer_reproducibility(self): - """Test that Conv2D layer gives consistent results.""" - conv = Conv2D((2, 2)) - x = Tensor(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)) - - # Multiple forward passes should give same result - out1 = conv(x) - out2 = conv(x) - - assert np.allclose(safe_numpy(out1), safe_numpy(out2)), "Conv2D should be deterministic" - - -class TestFlattenFunction: - """Test the flatten function implementation.""" - - def test_flatten_2d_matrix(self): - """Test flattening a 2D matrix.""" - x = Tensor(np.array([[1, 2], [3, 4]], dtype=np.float32)) - flattened = flatten(x) - - expected = np.array([[1, 2, 3, 4]], dtype=np.float32) - assert np.array_equal(safe_numpy(flattened), expected), f"Flatten failed: got {safe_numpy(flattened)}, expected {expected}" - assert flattened.shape == (1, 4), f"Expected shape (1, 4), got {flattened.shape}" - - def test_flatten_3d_tensor(self): - """Test flattening a 3D tensor.""" - x = Tensor(np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]], dtype=np.float32)) - flattened = flatten(x) - - expected = np.array([[1, 2, 3, 4, 5, 
6, 7, 8]], dtype=np.float32) - assert np.array_equal(safe_numpy(flattened), expected), f"3D flatten failed: got {safe_numpy(flattened)}, expected {expected}" - assert flattened.shape == (1, 8), f"Expected shape (1, 8), got {flattened.shape}" - - def test_flatten_1d_tensor(self): - """Test flattening a 1D tensor.""" - x = Tensor(np.array([1, 2, 3, 4], dtype=np.float32)) - flattened = flatten(x) - - expected = np.array([[1, 2, 3, 4]], dtype=np.float32) - assert np.array_equal(safe_numpy(flattened), expected), f"1D flatten failed: got {safe_numpy(flattened)}, expected {expected}" - assert flattened.shape == (1, 4), f"Expected shape (1, 4), got {flattened.shape}" - - def test_flatten_single_element(self): - """Test flattening a single element tensor.""" - x = Tensor(np.array([[[[5]]]], dtype=np.float32)) - flattened = flatten(x) - - expected = np.array([[5]], dtype=np.float32) - assert np.array_equal(safe_numpy(flattened), expected), f"Single element flatten failed: got {safe_numpy(flattened)}, expected {expected}" - assert flattened.shape == (1, 1), f"Expected shape (1, 1), got {flattened.shape}" - - def test_flatten_preserves_data_type(self): - """Test that flatten preserves data type.""" - x = Tensor(np.array([[1, 2], [3, 4]], dtype=np.float32)) - flattened = flatten(x) - - assert safe_numpy(flattened).dtype == np.float32, f"Data type should be preserved: got {safe_numpy(flattened).dtype}" - - -class TestCNNIntegration: - """Test integration between CNN components.""" - - def test_conv_then_flatten(self): - """Test convolution followed by flatten (typical CNN pattern).""" - # Create a simple input - x = Tensor(np.array([ - [1, 2, 3, 4], - [5, 6, 7, 8], - [9, 10, 11, 12], - [13, 14, 15, 16] - ], dtype=np.float32)) - - # Apply convolution - conv = Conv2D((2, 2)) - conv_out = conv(x) - assert conv_out.shape == (3, 3), f"Conv output should be (3, 3), got {conv_out.shape}" - - # Apply flatten - flat_out = flatten(conv_out) - assert flat_out.shape == (1, 9), f"Flatten 
output should be (1, 9), got {flat_out.shape}" - - # Check that data is preserved - assert safe_numpy(flat_out).size == 9, "Should have 9 elements after flatten" - - def test_multiple_conv_layers(self): - """Test multiple convolution layers (deeper CNN).""" - x = Tensor(np.ones((5, 5), dtype=np.float32)) - - # First conv layer - conv1 = Conv2D((2, 2)) - out1 = conv1(x) - assert out1.shape == (4, 4), f"First conv should give (4, 4), got {out1.shape}" - - # Second conv layer - conv2 = Conv2D((2, 2)) - out2 = conv2(out1) - assert out2.shape == (3, 3), f"Second conv should give (3, 3), got {out2.shape}" - - # Final flatten - final = flatten(out2) - assert final.shape == (1, 9), f"Final flatten should give (1, 9), got {final.shape}" - - def test_conv_output_range(self): - """Test that convolution outputs are in reasonable range.""" - # Create input with known range - x = Tensor(np.random.rand(4, 4).astype(np.float32)) # Values 0-1 - - conv = Conv2D((2, 2)) - output = conv(x) - - # Output should be finite - assert np.all(np.isfinite(safe_numpy(output))), "Conv output should be finite" - - # Output should not be extremely large - assert np.abs(safe_numpy(output)).max() < 100, "Conv output should not be extremely large" - - -class TestCNNEdgeCases: - """Test edge cases and error conditions.""" - - def test_conv2d_naive_minimum_size(self): - """Test convolution with minimum possible sizes.""" - # 1x1 input, 1x1 kernel - input = np.array([[1]], dtype=np.float32) - kernel = np.array([[2]], dtype=np.float32) - - output = conv2d_naive(input, kernel) - expected = np.array([[2]], dtype=np.float32) - assert np.allclose(output, expected), f"Minimum size convolution failed: got {output}, expected {expected}" - - def test_conv2d_layer_minimum_size(self): - """Test Conv2D layer with minimum input size.""" - conv = Conv2D((1, 1)) - x = Tensor(np.array([[5]], dtype=np.float32)) - - output = conv(x) - assert output.shape == (1, 1), f"Minimum size layer should give (1, 1), got 
{output.shape}" - - def test_flatten_empty_handling(self): - """Test flatten with various edge cases.""" - # Very small tensor - x = Tensor(np.array([1], dtype=np.float32)) - flattened = flatten(x) - assert flattened.shape == (1, 1), f"Single element should give (1, 1), got {flattened.shape}" - - def test_conv_with_zeros(self): - """Test convolution with zero inputs.""" - # All zeros input - x = Tensor(np.zeros((3, 3), dtype=np.float32)) - conv = Conv2D((2, 2)) - output = conv(x) - - # Should not crash and should produce valid output - assert output.shape == (2, 2), f"Zero input should give (2, 2), got {output.shape}" - assert np.all(np.isfinite(safe_numpy(output))), "Zero input should produce finite output" - - def test_conv_with_negative_values(self): - """Test convolution with negative inputs.""" - x = Tensor(np.array([[-1, -2], [-3, -4]], dtype=np.float32)) - conv = Conv2D((2, 2)) - output = conv(x) - - # Should handle negative values properly - assert output.shape == (1, 1), f"Negative input should give (1, 1), got {output.shape}" - assert np.all(np.isfinite(safe_numpy(output))), "Negative input should produce finite output" - - -class TestCNNPerformance: - """Test performance characteristics of CNN operations.""" - - def test_conv_reasonable_speed(self): - """Test that convolution completes in reasonable time.""" - import time - - # Medium-sized input - x = Tensor(np.random.rand(10, 10).astype(np.float32)) - conv = Conv2D((3, 3)) - - start_time = time.time() - output = conv(x) - end_time = time.time() - - # Should complete quickly (less than 1 second) - assert end_time - start_time < 1.0, "Convolution should complete quickly" - assert output.shape == (8, 8), f"Expected (8, 8), got {output.shape}" - - def test_flatten_preserves_size(self): - """Test that flatten preserves total number of elements.""" - shapes = [(2, 3), (4, 4), (1, 10), (5, 2, 3)] - - for shape in shapes: - x = Tensor(np.random.rand(*shape).astype(np.float32)) - flattened = flatten(x) - - 
original_size = np.prod(shape) - flattened_size = flattened.shape[1] # Second dimension since flatten returns (1, N) - - assert original_size == flattened_size, f"Size mismatch for shape {shape}: {original_size} != {flattened_size}" - - -if __name__ == "__main__": - # Run the tests - pytest.main([__file__, "-v"]) \ No newline at end of file diff --git a/modules/source/06_dataloader/dataloader_dev.py b/modules/source/06_dataloader/dataloader_dev.py index b73c51fe..e213d8cc 100644 --- a/modules/source/06_dataloader/dataloader_dev.py +++ b/modules/source/06_dataloader/dataloader_dev.py @@ -753,22 +753,22 @@ try: dataset = SimpleDataset(size=20, num_features=5, num_classes=4) print(f"Dataset created: size={len(dataset)}, features={dataset.num_features}, classes={dataset.get_num_classes()}") - - # Test basic properties + + # Test basic properties assert len(dataset) == 20, f"Dataset length should be 20, got {len(dataset)}" assert dataset.get_num_classes() == 4, f"Should have 4 classes, got {dataset.get_num_classes()}" print("✅ SimpleDataset basic properties work correctly") - + # Test sample access - data, label = dataset[0] - assert isinstance(data, Tensor), "Data should be a Tensor" - assert isinstance(label, Tensor), "Label should be a Tensor" + data, label = dataset[0] + assert isinstance(data, Tensor), "Data should be a Tensor" + assert isinstance(label, Tensor), "Label should be a Tensor" assert data.shape == (5,), f"Data shape should be (5,), got {data.shape}" assert label.shape == (), f"Label shape should be (), got {label.shape}" print("✅ SimpleDataset sample access works correctly") - + # Test sample shape - sample_shape = dataset.get_sample_shape() + sample_shape = dataset.get_sample_shape() assert sample_shape == (5,), f"Sample shape should be (5,), got {sample_shape}" print("✅ SimpleDataset get_sample_shape works correctly") @@ -787,7 +787,7 @@ try: assert np.array_equal(label1.data, label2.data), "Labels should be deterministic" print("✅ SimpleDataset data 
is deterministic") -except Exception as e: + except Exception as e: print(f"❌ SimpleDataset test failed: {e}") raise @@ -861,9 +861,9 @@ try: # Verify batch properties assert batch_data.shape[1] == 8, f"Features should be 8, got {batch_data.shape[1]}" assert len(batch_labels.shape) == 1, f"Labels should be 1D, got shape {batch_labels.shape}" - assert isinstance(batch_data, Tensor), "Batch data should be Tensor" - assert isinstance(batch_labels, Tensor), "Batch labels should be Tensor" - + assert isinstance(batch_data, Tensor), "Batch data should be Tensor" + assert isinstance(batch_labels, Tensor), "Batch labels should be Tensor" + assert epoch_samples == 100, f"Should process 100 samples, got {epoch_samples}" expected_batches = (100 + 16 - 1) // 16 assert epoch_batches == expected_batches, f"Should have {expected_batches} batches, got {epoch_batches}" @@ -943,11 +943,11 @@ try: dataset = SimpleDataset(size=60, num_features=6, num_classes=3) loader = DataLoader(dataset, batch_size=20, shuffle=True) - for epoch in range(3): - epoch_samples = 0 + for epoch in range(3): + epoch_samples = 0 for batch_data, batch_labels in loader: - epoch_samples += batch_data.shape[0] - + epoch_samples += batch_data.shape[0] + # Verify shapes remain consistent across epochs assert batch_data.shape[1] == 6, f"Features should be 6 in epoch {epoch}" assert len(batch_labels.shape) == 1, f"Labels should be 1D in epoch {epoch}" @@ -963,7 +963,7 @@ try: print(" • Memory-efficient processing") print(" • Multi-epoch training scenarios") -except Exception as e: + except Exception as e: print(f"❌ Integration test failed: {e}") raise @@ -1038,7 +1038,7 @@ Congratulations! You've successfully implemented the core components of data loa for epoch in range(num_epochs): for batch_data, batch_labels in loader: # Train model - pass + pass ``` 4. **Explore advanced topics**: Data augmentation, distributed loading, streaming datasets! 
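The hunks above adjust tests built around the Dataset/DataLoader contract: a loader over N samples with batch size B yields `(N + B - 1) // B` batches per epoch, the last batch possibly short, with shapes consistent across epochs. A minimal, framework-free sketch of that contract — the class names echo the module's `SimpleDataset` and `DataLoader`, but this NumPy-only version is an illustration, not the actual TinyTorch implementation:

```python
import numpy as np

class SimpleDataset:
    """Synthetic dataset: fixed random features and integer class labels."""
    def __init__(self, size, num_features, num_classes, seed=0):
        rng = np.random.default_rng(seed)  # fixed seed -> deterministic samples
        self.data = rng.standard_normal((size, num_features)).astype(np.float32)
        self.labels = rng.integers(0, num_classes, size=size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

class DataLoader:
    """Iterate a dataset in (optionally shuffled) fixed-size batches."""
    def __init__(self, dataset, batch_size, shuffle=True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        indices = np.arange(len(self.dataset))
        if self.shuffle:
            np.random.shuffle(indices)  # reshuffled each epoch
        # Yields ceil(N / batch_size) batches; the final batch may be smaller.
        for start in range(0, len(indices), self.batch_size):
            batch_idx = indices[start:start + self.batch_size]
            data = np.stack([self.dataset[i][0] for i in batch_idx])
            labels = np.array([self.dataset[i][1] for i in batch_idx])
            yield data, labels

dataset = SimpleDataset(size=100, num_features=8, num_classes=4)
loader = DataLoader(dataset, batch_size=16)
num_batches = sum(1 for _ in loader)
print(num_batches)  # (100 + 16 - 1) // 16 = 7
```

This mirrors what the edited tests assert: 100 samples at batch size 16 give 7 batches (six full, one of 4), every batch keeps 8 feature columns, and labels stay 1-D.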
diff --git a/modules/source/06_dataloader/dataloader_dev_backup.py b/modules/source/06_dataloader/dataloader_dev_backup.py new file mode 100644 index 00000000..bfc1f080 --- /dev/null +++ b/modules/source/06_dataloader/dataloader_dev_backup.py @@ -0,0 +1,1368 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.17.1 +# --- + +# %% [markdown] +""" +# Module 6: DataLoader - Data Loading and Preprocessing + +Welcome to the DataLoader module! This is where you'll learn how to efficiently load, process, and manage data for machine learning systems. + +## Learning Goals +- Understand data pipelines as the foundation of ML systems +- Implement efficient data loading with memory management and batching +- Build reusable dataset abstractions for different data types +- Master the Dataset and DataLoader pattern used in all ML frameworks +- Learn systems thinking for data engineering and I/O optimization + +## Build → Use → Understand +1. **Build**: Create dataset classes and data loaders from scratch +2. **Use**: Load real datasets and feed them to neural networks +3. 
**Understand**: How data engineering affects system performance and scalability +""" + +# %% nbgrader={"grade": false, "grade_id": "dataloader-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| default_exp core.dataloader + +#| export +import numpy as np +import sys +import os +import pickle +import struct +from typing import List, Tuple, Optional, Union, Iterator +import matplotlib.pyplot as plt +import urllib.request +import tarfile + +# Import our building blocks - try package first, then local modules +try: + from tinytorch.core.tensor import Tensor +except ImportError: + # For development, import from local modules + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) + from tensor_dev import Tensor + +# %% nbgrader={"grade": false, "grade_id": "dataloader-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| hide +#| export +def _should_show_plots(): + """Check if we should show plots (disable during testing)""" + # Check multiple conditions that indicate we're in test mode + is_pytest = ( + 'pytest' in sys.modules or + 'test' in sys.argv or + os.environ.get('PYTEST_CURRENT_TEST') is not None or + any('test' in arg for arg in sys.argv) or + any('pytest' in arg for arg in sys.argv) + ) + + # Show plots in development mode (when not in test mode) + return not is_pytest + +# %% nbgrader={"grade": false, "grade_id": "dataloader-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} +print("🔥 TinyTorch DataLoader Module") +print(f"NumPy version: {np.__version__}") +print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") +print("Ready to build data pipelines!") + +# %% [markdown] +""" +## 📦 Where This Code Lives in the Final Package + +**Learning Side:** You work in `modules/source/06_dataloader/dataloader_dev.py` +**Building Side:** Code exports to `tinytorch.core.dataloader` + +```python +# Final package structure: +from 
tinytorch.core.dataloader import Dataset, DataLoader # Data loading utilities! +from tinytorch.core.tensor import Tensor # Foundation +from tinytorch.core.networks import Sequential # Models to train +``` + +**Why this matters:** +- **Learning:** Focused modules for deep understanding of data pipelines +- **Production:** Proper organization like PyTorch's `torch.utils.data` +- **Consistency:** All data loading utilities live together in `core.dataloader` +- **Integration:** Works seamlessly with tensors and networks +""" + +# %% [markdown] +""" +## 🧠 The Mathematical Foundation of Data Engineering + +### The Data Pipeline Equation +Every machine learning system follows this fundamental equation: + +``` +Model Performance = f(Data Quality × Data Quantity × Data Efficiency) +``` + +### Why Data Engineering is Critical +- **Data is the fuel**: Without proper data pipelines, nothing else works +- **I/O bottlenecks**: Data loading is often the biggest performance bottleneck +- **Memory management**: How you handle data affects everything else +- **Production reality**: Data pipelines are critical in real ML systems + +### The Three Pillars of Data Engineering +1. **Abstraction**: Clean interfaces that hide complexity +2. **Efficiency**: Minimize I/O and memory overhead +3. **Scalability**: Handle datasets larger than memory + +### Connection to Real ML Systems +Every framework uses the Dataset/DataLoader pattern: +- **PyTorch**: `torch.utils.data.Dataset` and `torch.utils.data.DataLoader` +- **TensorFlow**: `tf.data.Dataset` with efficient data pipelines +- **JAX**: Custom data loading with `jax.numpy` integration +- **TinyTorch**: `tinytorch.core.dataloader.Dataset` and `DataLoader` (what we're building!) 
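The interface shared by all of these frameworks is small: a map-style dataset is just an object with `__getitem__` and `__len__`. A simplified, framework-agnostic sketch of that protocol (the `RangeDataset` toy class is hypothetical, not part of any framework):

```python
import numpy as np

class Dataset:
    """Minimal sketch of the map-style dataset protocol."""
    def __getitem__(self, index):
        raise NotImplementedError  # subclasses return a (data, label) pair
    def __len__(self):
        raise NotImplementedError  # subclasses return the sample count

class RangeDataset(Dataset):
    """Toy dataset: sample i is the feature vector [i, 2*i] with label i % 2."""
    def __init__(self, size):
        self.size = size
    def __getitem__(self, index):
        return np.array([index, 2 * index], dtype=np.float32), index % 2
    def __len__(self):
        return self.size

ds = RangeDataset(4)
print(len(ds))  # 4
data, label = ds[1]  # feature vector [1., 2.], label 1
```

Because a DataLoader only ever calls these two methods, any object implementing them — backed by files, a database, or generated on the fly — plugs into the same training loop.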
+ +### Performance Considerations +- **Memory efficiency**: Handle datasets larger than RAM +- **I/O optimization**: Read from disk efficiently with batching +- **Caching strategies**: When to cache vs recompute +- **Parallel processing**: Multi-threaded data loading +""" + +# %% [markdown] +""" +## Step 1: Understanding Data Engineering + +### What is Data Engineering? +**Data engineering** is the foundation of all machine learning systems. It involves loading, processing, and managing data efficiently so that models can learn from it. + +### The Fundamental Insight +**Data engineering is about managing the flow of information through your system:** +``` +Raw Data → Load → Preprocess → Batch → Feed to Model +``` + +### Real-World Examples +- **Image datasets**: CIFAR-10, ImageNet, MNIST +- **Text datasets**: Wikipedia, books, social media +- **Tabular data**: CSV files, databases, spreadsheets +- **Audio data**: Speech recordings, music files + +### Systems Thinking +- **Memory efficiency**: Handle datasets larger than RAM +- **I/O optimization**: Read from disk efficiently +- **Batching strategies**: Trade-offs between memory and speed +- **Caching**: When to cache vs recompute + +### Visual Intuition +``` +Raw Files: [image1.jpg, image2.jpg, image3.jpg, ...] +Load: [Tensor(32x32x3), Tensor(32x32x3), Tensor(32x32x3), ...] +Batch: [Tensor(32, 32, 32, 3)] # 32 images at once +Model: Process batch efficiently +``` + +Let's start by building the most fundamental component: **Dataset**. +""" + +# %% nbgrader={"grade": false, "grade_id": "dataset-class", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +class Dataset: + """ + Base Dataset class: Abstract interface for all datasets. + + The fundamental abstraction for data loading in TinyTorch. + Students implement concrete datasets by inheriting from this class. + """ + + def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]: + """ + Get a single sample and label by index. 
+ + Args: + index: Index of the sample to retrieve + + Returns: + Tuple of (data, label) tensors + + TODO: Implement abstract method for getting samples. + + APPROACH: + 1. This is an abstract method - subclasses will implement it + 2. Return a tuple of (data, label) tensors + 3. Data should be the input features, label should be the target + + EXAMPLE: + dataset[0] should return (Tensor(image_data), Tensor(label)) + + HINTS: + - This is an abstract method that subclasses must override + - Always return a tuple of (data, label) tensors + - Data contains the input features, label contains the target + """ + ### BEGIN SOLUTION + # This is an abstract method - subclasses must implement it + raise NotImplementedError("Subclasses must implement __getitem__") + ### END SOLUTION + + def __len__(self) -> int: + """ + Get the total number of samples in the dataset. + + TODO: Implement abstract method for getting dataset size. + + APPROACH: + 1. This is an abstract method - subclasses will implement it + 2. Return the total number of samples in the dataset + + EXAMPLE: + len(dataset) should return 50000 for CIFAR-10 training set + + HINTS: + - This is an abstract method that subclasses must override + - Return an integer representing the total number of samples + """ + ### BEGIN SOLUTION + # This is an abstract method - subclasses must implement it + raise NotImplementedError("Subclasses must implement __len__") + ### END SOLUTION + + def get_sample_shape(self) -> Tuple[int, ...]: + """ + Get the shape of a single data sample. + + TODO: Implement method to get sample shape. + + APPROACH: + 1. Get the first sample using self[0] + 2. Extract the data part (first element of tuple) + 3. 
Return the shape of the data tensor + + EXAMPLE: + For CIFAR-10: returns (3, 32, 32) for RGB images + + HINTS: + - Use self[0] to get the first sample + - Extract data from the (data, label) tuple + - Return data.shape + """ + ### BEGIN SOLUTION + # Get the first sample to determine shape + data, _ = self[0] + return data.shape + ### END SOLUTION + + def get_num_classes(self) -> int: + """ + Get the number of classes in the dataset. + + TODO: Implement abstract method for getting number of classes. + + APPROACH: + 1. This is an abstract method - subclasses will implement it + 2. Return the number of unique classes in the dataset + + EXAMPLE: + For CIFAR-10: returns 10 (classes 0-9) + + HINTS: + - This is an abstract method that subclasses must override + - Return the number of unique classes/categories + """ + ### BEGIN SOLUTION + # This is an abstract method - subclasses must implement it + raise NotImplementedError("Subclasses must implement get_num_classes") + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Quick Test: Dataset Base Class + +Let's understand the Dataset interface! While we can't test the abstract class directly, we'll create a simple test dataset. 
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-dataset-interface-immediate", "locked": true, "points": 5, "schema_version": 3, "solution": false, "task": false} +# Test Dataset interface with a simple implementation +print("🔬 Testing Dataset interface...") + +# Create a minimal test dataset +class TestDataset(Dataset): + def __init__(self, size=5): + self.size = size + + def __getitem__(self, index): + # Simple test data: features are [index, index*2], label is index % 2 + data = Tensor([index, index * 2]) + label = Tensor([index % 2]) + return data, label + + def __len__(self): + return self.size + + def get_num_classes(self): + return 2 + +# Test the interface +try: + test_dataset = TestDataset(size=5) + print(f"Dataset created with size: {len(test_dataset)}") + + # Test __getitem__ + data, label = test_dataset[0] + print(f"Sample 0: data={data}, label={label}") + assert isinstance(data, Tensor), "Data should be a Tensor" + assert isinstance(label, Tensor), "Label should be a Tensor" + print("✅ Dataset __getitem__ works correctly") + + # Test __len__ + assert len(test_dataset) == 5, f"Dataset length should be 5, got {len(test_dataset)}" + print("✅ Dataset __len__ works correctly") + + # Test get_num_classes + assert test_dataset.get_num_classes() == 2, f"Should have 2 classes, got {test_dataset.get_num_classes()}" + print("✅ Dataset get_num_classes works correctly") + + # Test multiple samples + for i in range(3): + data, label = test_dataset[i] + expected_data = [i, i * 2] + expected_label = [i % 2] + assert np.array_equal(data.data, expected_data), f"Data mismatch at index {i}" + assert np.array_equal(label.data, expected_label), f"Label mismatch at index {i}" + print("✅ Dataset produces correct data for multiple samples") + +except Exception as e: + print(f"❌ Dataset interface test failed: {e}") + raise + +# Show the dataset pattern +print("🎯 Dataset interface pattern:") +print(" __getitem__: Returns (data, label) tuple") +print(" __len__: Returns 
dataset size") +print(" get_num_classes: Returns number of classes") +print("📈 Progress: Dataset interface ✓") + +# %% [markdown] +""" +## Step 2: Building the DataLoader + +### What is a DataLoader? +A **DataLoader** efficiently batches and iterates through datasets. It's the bridge between individual samples and the batched data that neural networks expect. + +### Why DataLoaders Matter +- **Batching**: Groups samples for efficient GPU computation +- **Shuffling**: Randomizes data order to prevent overfitting +- **Memory efficiency**: Loads data on-demand rather than all at once +- **Iteration**: Provides clean interface for training loops + +### The DataLoader Pattern +``` +DataLoader(dataset, batch_size=32, shuffle=True) +for batch_data, batch_labels in dataloader: + # batch_data.shape: (32, ...) + # batch_labels.shape: (32,) + # Train on batch +``` + +### Real-World Applications +- **Training loops**: Feed batches to neural networks +- **Validation**: Evaluate models on held-out data +- **Inference**: Process large datasets efficiently +- **Data analysis**: Explore datasets systematically + +### Systems Thinking +- **Batch size**: Trade-off between memory and speed +- **Shuffling**: Prevents overfitting to data order +- **Iteration**: Efficient looping through data +- **Memory**: Manage large datasets that don't fit in RAM +""" + +# %% nbgrader={"grade": false, "grade_id": "dataloader-class", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +class DataLoader: + """ + DataLoader: Efficiently batch and iterate through datasets. + + Provides batching, shuffling, and efficient iteration over datasets. + Essential for training neural networks efficiently. + """ + + def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True): + """ + Initialize DataLoader. 
+ + Args: + dataset: Dataset to load from + batch_size: Number of samples per batch + shuffle: Whether to shuffle data each epoch + + TODO: Store configuration and dataset. + + APPROACH: + 1. Store dataset as self.dataset + 2. Store batch_size as self.batch_size + 3. Store shuffle as self.shuffle + + EXAMPLE: + DataLoader(dataset, batch_size=32, shuffle=True) + + HINTS: + - Store all parameters as instance variables + - These will be used in __iter__ for batching + """ + ### BEGIN SOLUTION + self.dataset = dataset + self.batch_size = batch_size + self.shuffle = shuffle + ### END SOLUTION + + def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]: + """ + Iterate through dataset in batches. + + Returns: + Iterator yielding (batch_data, batch_labels) tuples + + TODO: Implement batching and shuffling logic. + + APPROACH: + 1. Create indices list: list(range(len(dataset))) + 2. Shuffle indices if self.shuffle is True + 3. Loop through indices in batch_size chunks + 4. For each batch: collect samples, stack them, yield batch + + EXAMPLE: + for batch_data, batch_labels in dataloader: + # batch_data.shape: (batch_size, ...) 
+ # batch_labels.shape: (batch_size,) + + HINTS: + - Use list(range(len(self.dataset))) for indices + - Use np.random.shuffle() if self.shuffle is True + - Loop in chunks of self.batch_size + - Collect samples and stack with np.stack() + """ + ### BEGIN SOLUTION + # Create indices for all samples + indices = list(range(len(self.dataset))) + + # Shuffle if requested + if self.shuffle: + np.random.shuffle(indices) + + # Iterate through indices in batches + for i in range(0, len(indices), self.batch_size): + batch_indices = indices[i:i + self.batch_size] + + # Collect samples for this batch + batch_data = [] + batch_labels = [] + + for idx in batch_indices: + data, label = self.dataset[idx] + batch_data.append(data.data) + batch_labels.append(label.data) + + # Stack into batch tensors + batch_data_array = np.stack(batch_data, axis=0) + batch_labels_array = np.stack(batch_labels, axis=0) + + yield Tensor(batch_data_array), Tensor(batch_labels_array) + ### END SOLUTION + + def __len__(self) -> int: + """ + Get the number of batches per epoch. + + TODO: Calculate number of batches. + + APPROACH: + 1. Get dataset size: len(self.dataset) + 2. Divide by batch_size and round up + 3. Use ceiling division: (n + batch_size - 1) // batch_size + + EXAMPLE: + Dataset size 100, batch size 32 → 4 batches + + HINTS: + - Use len(self.dataset) for dataset size + - Use ceiling division for exact batch count + - Formula: (dataset_size + batch_size - 1) // batch_size + """ + ### BEGIN SOLUTION + # Calculate number of batches using ceiling division + dataset_size = len(self.dataset) + return (dataset_size + self.batch_size - 1) // self.batch_size + ### END SOLUTION + +# %% [markdown] +""" +### 🧪 Quick Test: DataLoader + +Let's test your DataLoader implementation! This is the heart of efficient data loading for neural networks. 
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-dataloader-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} +# Test DataLoader immediately after implementation +print("🔬 Testing DataLoader...") + +# Use the test dataset from before +class TestDataset(Dataset): + def __init__(self, size=10): + self.size = size + + def __getitem__(self, index): + data = Tensor([index, index * 2]) + label = Tensor([index % 3]) # 3 classes + return data, label + + def __len__(self): + return self.size + + def get_num_classes(self): + return 3 + +# Test basic DataLoader functionality +try: + dataset = TestDataset(size=10) + dataloader = DataLoader(dataset, batch_size=3, shuffle=False) + + print(f"DataLoader created: batch_size={dataloader.batch_size}, shuffle={dataloader.shuffle}") + print(f"Number of batches: {len(dataloader)}") + + # Test __len__ + expected_batches = (10 + 3 - 1) // 3 # Ceiling division: 4 batches + assert len(dataloader) == expected_batches, f"Should have {expected_batches} batches, got {len(dataloader)}" + print("✅ DataLoader __len__ works correctly") + + # Test iteration + batch_count = 0 + total_samples = 0 + + for batch_data, batch_labels in dataloader: + batch_count += 1 + batch_size = batch_data.shape[0] + total_samples += batch_size + + print(f"Batch {batch_count}: data shape {batch_data.shape}, labels shape {batch_labels.shape}") + + # Verify batch dimensions + assert len(batch_data.shape) == 2, f"Batch data should be 2D, got {batch_data.shape}" + assert len(batch_labels.shape) == 2, f"Batch labels should be 2D, got {batch_labels.shape}" + assert batch_data.shape[1] == 2, f"Each sample should have 2 features, got {batch_data.shape[1]}" + assert batch_labels.shape[1] == 1, f"Each label should have 1 element, got {batch_labels.shape[1]}" + + assert batch_count == expected_batches, f"Should iterate {expected_batches} times, got {batch_count}" + assert total_samples == 10, f"Should process 10 total samples, 
got {total_samples}" + print("✅ DataLoader iteration works correctly") + +except Exception as e: + print(f"❌ DataLoader test failed: {e}") + raise + +# Test shuffling +try: + dataloader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True) + dataloader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False) + + # Get first batch from each + batch1_shuffle = next(iter(dataloader_shuffle)) + batch1_no_shuffle = next(iter(dataloader_no_shuffle)) + + print("✅ DataLoader shuffling parameter works") + +except Exception as e: + print(f"❌ DataLoader shuffling test failed: {e}") + raise + +# Test different batch sizes +try: + small_loader = DataLoader(dataset, batch_size=2, shuffle=False) + large_loader = DataLoader(dataset, batch_size=8, shuffle=False) + + assert len(small_loader) == 5, f"Small loader should have 5 batches, got {len(small_loader)}" + assert len(large_loader) == 2, f"Large loader should have 2 batches, got {len(large_loader)}" + print("✅ DataLoader handles different batch sizes correctly") + +except Exception as e: + print(f"❌ DataLoader batch size test failed: {e}") + raise + +# Show the DataLoader behavior +print("🎯 DataLoader behavior:") +print(" Batches data for efficient processing") +print(" Handles shuffling and iteration") +print(" Provides clean interface for training loops") +print("📈 Progress: Dataset interface ✓, DataLoader ✓") + +# %% [markdown] +""" +## Step 3: Creating a Simple Dataset Example + +### Why We Need Concrete Examples +Abstract classes are great for interfaces, but we need concrete implementations to understand how they work. Let's create a simple dataset for testing. 
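The two pieces of DataLoader mechanics described above — ceiling division for the batch count, and chunked (optionally shuffled) index iteration — can be sketched independently of the class. This is a standalone re-implementation of the logic described in Step 2, not the module's exact code; the seeded `Generator` is an assumption added to keep the example reproducible:

```python
import numpy as np

def num_batches(dataset_size, batch_size):
    # Ceiling division: the last batch may be smaller than batch_size.
    return (dataset_size + batch_size - 1) // batch_size

def iter_index_batches(dataset_size, batch_size, shuffle=False, seed=0):
    indices = list(range(dataset_size))
    if shuffle:
        np.random.default_rng(seed).shuffle(indices)
    # Step through the index list in batch_size chunks.
    for start in range(0, dataset_size, batch_size):
        yield indices[start:start + batch_size]

print(num_batches(100, 32))  # 4  (3 full batches of 32, plus one of 4)
print(list(iter_index_batches(10, 3)))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Note that shuffling only reorders the indices; every sample still appears exactly once per epoch, which is why the sample-count assertions in the tests above hold regardless of the `shuffle` flag.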
+ +### Design Principles +- **Simple**: Easy to understand and debug +- **Configurable**: Adjustable size and properties +- **Predictable**: Deterministic data for testing +- **Educational**: Shows the Dataset pattern clearly +""" + +# %% nbgrader={"grade": false, "grade_id": "simple-dataset", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +class SimpleDataset(Dataset): + """ + Simple dataset for testing and demonstration. + + Generates synthetic data with configurable size and properties. + Perfect for understanding the Dataset pattern. + """ + + def __init__(self, size: int = 100, num_features: int = 4, num_classes: int = 3): + """ + Initialize SimpleDataset. + + Args: + size: Number of samples in the dataset + num_features: Number of features per sample + num_classes: Number of classes + + TODO: Initialize the dataset with synthetic data. + + APPROACH: + 1. Store the configuration parameters + 2. Generate synthetic data and labels + 3. Make data deterministic for testing + + EXAMPLE: + SimpleDataset(size=100, num_features=4, num_classes=3) + creates 100 samples with 4 features each, 3 classes + + HINTS: + - Store size, num_features, num_classes as instance variables + - Use np.random.seed() for reproducible data + - Generate random data with np.random.randn() + - Generate random labels with np.random.randint() + """ + ### BEGIN SOLUTION + self.size = size + self.num_features = num_features + self.num_classes = num_classes + + # Set seed for reproducible data + np.random.seed(42) + + # Generate synthetic data + self.data = np.random.randn(size, num_features).astype(np.float32) + self.labels = np.random.randint(0, num_classes, size=size) + ### END SOLUTION + + def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]: + """ + Get a single sample and label by index. + + Args: + index: Index of the sample to retrieve + + Returns: + Tuple of (data, label) tensors + + TODO: Return the sample and label at the given index. 
+ + APPROACH: + 1. Get data at index from self.data + 2. Get label at index from self.labels + 3. Convert to tensors and return as tuple + + EXAMPLE: + dataset[0] returns (Tensor([1.2, -0.5, 0.8, 0.1]), Tensor(2)) + + HINTS: + - Use self.data[index] and self.labels[index] + - Convert to Tensor objects + - Return as tuple (data, label) + """ + ### BEGIN SOLUTION + data = Tensor(self.data[index]) + label = Tensor(self.labels[index]) + return data, label + ### END SOLUTION + + def __len__(self) -> int: + """ + Get the total number of samples in the dataset. + + TODO: Return the dataset size. + + HINTS: + - Return self.size + """ + ### BEGIN SOLUTION + return self.size + ### END SOLUTION + + def get_num_classes(self) -> int: + """ + Get the number of classes in the dataset. + + TODO: Return the number of classes. + + HINTS: + - Return self.num_classes + """ + ### BEGIN SOLUTION + return self.num_classes + ### END SOLUTION + +# %% [markdown] +""" +## 🧪 Comprehensive DataLoader Testing Suite + +Let's test all data loading components thoroughly with realistic ML data scenarios! 
+""" + +# %% nbgrader={"grade": false, "grade_id": "test-dataloader-comprehensive", "locked": false, "schema_version": 3, "solution": false, "task": false} +def test_dataset_interface(): + """Test 1: Dataset interface comprehensive testing""" + print("🔬 Testing Dataset Interface...") + + # Test 1.1: Abstract base class behavior + try: + # Test that we can't instantiate abstract Dataset + try: + base_dataset = Dataset() + base_dataset[0] # Should raise NotImplementedError + assert False, "Should not be able to call abstract methods" + except NotImplementedError: + print("✅ Abstract Dataset correctly raises NotImplementedError") + except Exception as e: + print(f"❌ Abstract Dataset test failed: {e}") + return False + + # Test 1.2: SimpleDataset implementation + try: + dataset = SimpleDataset(size=50, num_features=4, num_classes=3) + + # Test basic properties + assert len(dataset) == 50, f"Dataset length should be 50, got {len(dataset)}" + assert dataset.get_num_classes() == 3, f"Should have 3 classes, got {dataset.get_num_classes()}" + + # Test sample retrieval + data, label = dataset[0] + assert isinstance(data, Tensor), "Data should be a Tensor" + assert isinstance(label, Tensor), "Label should be a Tensor" + assert data.shape == (4,), f"Data shape should be (4,), got {data.shape}" + + # Test sample shape method + sample_shape = dataset.get_sample_shape() + assert sample_shape == (4,), f"Sample shape should be (4,), got {sample_shape}" + + print("✅ SimpleDataset implementation test passed") + except Exception as e: + print(f"❌ SimpleDataset implementation failed: {e}") + return False + + # Test 1.3: Different dataset configurations + try: + # Small dataset + small_dataset = SimpleDataset(size=5, num_features=2, num_classes=2) + assert len(small_dataset) == 5, "Small dataset length wrong" + assert small_dataset.get_num_classes() == 2, "Small dataset classes wrong" + + # Large dataset + large_dataset = SimpleDataset(size=1000, num_features=10, num_classes=5) + assert 
len(large_dataset) == 1000, "Large dataset length wrong" + assert large_dataset.get_num_classes() == 5, "Large dataset classes wrong" + + # Test data consistency (seeded random) + data1, _ = small_dataset[0] + data2, _ = small_dataset[0] + assert np.allclose(data1.data, data2.data), "Dataset should be deterministic" + + print("✅ Different dataset configurations test passed") + except Exception as e: + print(f"❌ Different dataset configurations failed: {e}") + return False + + # Test 1.4: Edge cases and robustness + try: + # Test edge case: single sample + single_dataset = SimpleDataset(size=1, num_features=1, num_classes=1) + data, label = single_dataset[0] + assert data.shape == (1,), "Single sample data shape wrong" + assert isinstance(label.data, (int, np.integer)) or label.data.shape == (), "Single sample label wrong" + + # Test boundary indices + dataset = SimpleDataset(size=10, num_features=3, num_classes=2) + first_data, first_label = dataset[0] + last_data, last_label = dataset[9] + assert first_data.shape == (3,), "First sample shape wrong" + assert last_data.shape == (3,), "Last sample shape wrong" + + print("✅ Edge cases and robustness test passed") + except Exception as e: + print(f"❌ Edge cases and robustness failed: {e}") + return False + + print("🎯 Dataset interface: All tests passed!") + return True + +def test_dataloader_functionality(): + """Test 2: DataLoader functionality comprehensive testing""" + print("🔬 Testing DataLoader Functionality...") + + # Test 2.1: Basic DataLoader operations + try: + dataset = SimpleDataset(size=32, num_features=4, num_classes=2) + dataloader = DataLoader(dataset, batch_size=8, shuffle=False) + + # Test initialization + assert dataloader.batch_size == 8, f"Batch size should be 8, got {dataloader.batch_size}" + assert dataloader.shuffle == False, f"Shuffle should be False, got {dataloader.shuffle}" + + # Test length calculation + expected_batches = (32 + 8 - 1) // 8 # Ceiling division: 4 batches + assert 
len(dataloader) == expected_batches, f"Should have {expected_batches} batches, got {len(dataloader)}" + + print("✅ Basic DataLoader operations test passed") + except Exception as e: + print(f"❌ Basic DataLoader operations failed: {e}") + return False + + # Test 2.2: Batch iteration and shapes + try: + dataset = SimpleDataset(size=25, num_features=3, num_classes=2) + dataloader = DataLoader(dataset, batch_size=10, shuffle=False) + + batch_count = 0 + total_samples = 0 + + for batch_data, batch_labels in dataloader: + batch_count += 1 + batch_size = batch_data.shape[0] + total_samples += batch_size + + # Check batch shapes + assert len(batch_data.shape) == 2, f"Batch data should be 2D, got {batch_data.shape}" + assert batch_data.shape[1] == 3, f"Should have 3 features, got {batch_data.shape[1]}" + assert batch_labels.shape[0] == batch_size, f"Labels should match batch size" + + # Check data types + assert isinstance(batch_data, Tensor), "Batch data should be Tensor" + assert isinstance(batch_labels, Tensor), "Batch labels should be Tensor" + + # Verify complete iteration + assert total_samples == 25, f"Should process 25 samples, got {total_samples}" + assert batch_count == 3, f"Should have 3 batches, got {batch_count}" # 25/10 = 3 batches + + print("✅ Batch iteration and shapes test passed") + except Exception as e: + print(f"❌ Batch iteration and shapes failed: {e}") + return False + + # Test 2.3: Different batch sizes + try: + dataset = SimpleDataset(size=100, num_features=5, num_classes=3) + + # Small batches + small_loader = DataLoader(dataset, batch_size=7, shuffle=False) + assert len(small_loader) == 15, f"Small loader should have 15 batches, got {len(small_loader)}" # 100/7 = 15 + + # Large batches + large_loader = DataLoader(dataset, batch_size=30, shuffle=False) + assert len(large_loader) == 4, f"Large loader should have 4 batches, got {len(large_loader)}" # 100/30 = 4 + + # Single sample batches + single_loader = DataLoader(dataset, batch_size=1, 
shuffle=False) + assert len(single_loader) == 100, f"Single loader should have 100 batches, got {len(single_loader)}" + + print("✅ Different batch sizes test passed") + except Exception as e: + print(f"❌ Different batch sizes failed: {e}") + return False + + # Test 2.4: Shuffling behavior + try: + dataset = SimpleDataset(size=20, num_features=2, num_classes=2) + + # Test with shuffling + loader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True) + loader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False) + + # Get multiple batches to test shuffling + shuffle_batches = list(loader_shuffle) + no_shuffle_batches = list(loader_no_shuffle) + + assert len(shuffle_batches) == len(no_shuffle_batches), "Should have same number of batches" + + # Test that all original samples are present (just reordered) + shuffle_all_data = np.concatenate([batch[0].data for batch in shuffle_batches]) + no_shuffle_all_data = np.concatenate([batch[0].data for batch in no_shuffle_batches]) + + assert shuffle_all_data.shape == no_shuffle_all_data.shape, "Should have same total data shape" + + print("✅ Shuffling behavior test passed") + except Exception as e: + print(f"❌ Shuffling behavior failed: {e}") + return False + + print("🎯 DataLoader functionality: All tests passed!") + return True + +def test_data_pipeline_scenarios(): + """Test 3: Real-world data pipeline scenarios""" + print("🔬 Testing Data Pipeline Scenarios...") + + # Test 3.1: Image classification scenario + try: + # Simulate CIFAR-10 like dataset: 32x32 RGB images, 10 classes + image_dataset = SimpleDataset(size=1000, num_features=32*32*3, num_classes=10) + image_loader = DataLoader(image_dataset, batch_size=64, shuffle=True) + + # Test one epoch of training + epoch_samples = 0 + for batch_data, batch_labels in image_loader: + epoch_samples += batch_data.shape[0] + + # Verify image batch properties + assert batch_data.shape[1] == 32*32*3, f"Should have 3072 features (32x32x3), got {batch_data.shape[1]}" + assert 
batch_data.shape[0] <= 64, f"Batch size should be <= 64, got {batch_data.shape[0]}" + + # Simulate forward pass + batch_size = batch_data.shape[0] + assert batch_labels.shape[0] == batch_size, "Labels should match batch size" + + assert epoch_samples == 1000, f"Should process 1000 samples, got {epoch_samples}" + print("✅ Image classification scenario test passed") + except Exception as e: + print(f"❌ Image classification scenario failed: {e}") + return False + + # Test 3.2: Text classification scenario + try: + # Simulate text classification: 512 token embeddings, 5 sentiment classes + text_dataset = SimpleDataset(size=500, num_features=512, num_classes=5) + text_loader = DataLoader(text_dataset, batch_size=32, shuffle=True) + + # Test batch processing + for batch_data, batch_labels in text_loader: + # Verify text batch properties + assert batch_data.shape[1] == 512, f"Should have 512 features, got {batch_data.shape[1]}" + + # Simulate text processing + batch_size = batch_data.shape[0] + assert batch_size <= 32, f"Batch size should be <= 32, got {batch_size}" + break # Just test first batch + + print("✅ Text classification scenario test passed") + except Exception as e: + print(f"❌ Text classification scenario failed: {e}") + return False + + # Test 3.3: Tabular data scenario + try: + # Simulate tabular data: house prices with 20 features, 3 price ranges + tabular_dataset = SimpleDataset(size=200, num_features=20, num_classes=3) + tabular_loader = DataLoader(tabular_dataset, batch_size=16, shuffle=False) + + # Test systematic processing (no shuffling for tabular data) + batch_count = 0 + for batch_data, batch_labels in tabular_loader: + batch_count += 1 + + # Verify tabular batch properties + assert batch_data.shape[1] == 20, f"Should have 20 features, got {batch_data.shape[1]}" + + # Simulate tabular processing + batch_size = batch_data.shape[0] + assert batch_size <= 16, f"Batch size should be <= 16, got {batch_size}" + + expected_batches = (200 + 16 - 1) // 16 # 
13 batches + assert batch_count == expected_batches, f"Should have {expected_batches} batches, got {batch_count}" + + print("✅ Tabular data scenario test passed") + except Exception as e: + print(f"❌ Tabular data scenario failed: {e}") + return False + + # Test 3.4: Small dataset scenario + try: + # Simulate small research dataset + small_dataset = SimpleDataset(size=50, num_features=10, num_classes=2) + small_loader = DataLoader(small_dataset, batch_size=8, shuffle=True) + + # Test multiple epochs + for epoch in range(3): + epoch_samples = 0 + for batch_data, batch_labels in small_loader: + epoch_samples += batch_data.shape[0] + + # Verify small dataset properties + assert batch_data.shape[1] == 10, f"Should have 10 features, got {batch_data.shape[1]}" + + assert epoch_samples == 50, f"Epoch {epoch}: should process 50 samples, got {epoch_samples}" + + print("✅ Small dataset scenario test passed") + except Exception as e: + print(f"❌ Small dataset scenario failed: {e}") + return False + + print("🎯 Data pipeline scenarios: All tests passed!") + return True + +def test_integration_with_ml_workflow(): + """Test 4: Integration with ML workflow""" + print("🔬 Testing Integration with ML Workflow...") + + # Test 4.1: Training loop integration + try: + # Create dataset for training + train_dataset = SimpleDataset(size=100, num_features=8, num_classes=3) + train_loader = DataLoader(train_dataset, batch_size=20, shuffle=True) + + # Simulate training loop + for epoch in range(2): + epoch_loss = 0 + batch_count = 0 + + for batch_data, batch_labels in train_loader: + batch_count += 1 + + # Simulate forward pass + batch_size = batch_data.shape[0] + assert batch_data.shape == (batch_size, 8), f"Batch data shape wrong: {batch_data.shape}" + assert batch_labels.shape[0] == batch_size, f"Batch labels shape wrong: {batch_labels.shape}" + + # Simulate loss computation + mock_loss = np.random.random() + epoch_loss += mock_loss + + # Verify we can iterate through all batches + assert 
batch_count <= 5, f"Too many batches: {batch_count}" # 100/20 = 5 + + assert batch_count == 5, f"Should have 5 batches per epoch, got {batch_count}" + + print("✅ Training loop integration test passed") + except Exception as e: + print(f"❌ Training loop integration failed: {e}") + return False + + # Test 4.2: Validation loop integration + try: + # Create dataset for validation + val_dataset = SimpleDataset(size=50, num_features=8, num_classes=3) + val_loader = DataLoader(val_dataset, batch_size=10, shuffle=False) # No shuffle for validation + + # Simulate validation loop + total_correct = 0 + total_samples = 0 + + for batch_data, batch_labels in val_loader: + batch_size = batch_data.shape[0] + total_samples += batch_size + + # Simulate prediction + mock_predictions = np.random.randint(0, 3, size=batch_size) + mock_correct = np.random.randint(0, batch_size + 1) + total_correct += mock_correct + + # Verify batch properties + assert batch_data.shape[1] == 8, f"Features should be 8, got {batch_data.shape[1]}" + assert batch_labels.shape[0] == batch_size, f"Labels should match batch size" + + assert total_samples == 50, f"Should validate 50 samples, got {total_samples}" + + print("✅ Validation loop integration test passed") + except Exception as e: + print(f"❌ Validation loop integration failed: {e}") + return False + + # Test 4.3: Model inference integration + try: + # Create dataset for inference + test_dataset = SimpleDataset(size=30, num_features=5, num_classes=2) + test_loader = DataLoader(test_dataset, batch_size=5, shuffle=False) + + # Simulate inference + all_predictions = [] + + for batch_data, batch_labels in test_loader: + batch_size = batch_data.shape[0] + + # Simulate model inference + mock_predictions = np.random.random((batch_size, 2)) # 2 classes + all_predictions.append(mock_predictions) + + # Verify inference batch properties + assert batch_data.shape[1] == 5, f"Features should be 5, got {batch_data.shape[1]}" + assert batch_size <= 5, f"Batch size 
should be <= 5, got {batch_size}" + + # Verify all predictions collected + total_predictions = np.concatenate(all_predictions, axis=0) + assert total_predictions.shape == (30, 2), f"Predictions shape should be (30, 2), got {total_predictions.shape}" + + print("✅ Model inference integration test passed") + except Exception as e: + print(f"❌ Model inference integration failed: {e}") + return False + + # Test 4.4: Cross-validation scenario + try: + # Create dataset for cross-validation + full_dataset = SimpleDataset(size=100, num_features=6, num_classes=4) + + # Simulate 5-fold cross-validation + fold_size = 20 + + for fold in range(5): + # Simulate a fold split (fresh datasets stand in for a real index split of full_dataset) + train_size = len(full_dataset) - fold_size # 4 folds for training + val_size = fold_size # 1 fold for validation + + train_dataset = SimpleDataset(size=train_size, num_features=6, num_classes=4) + val_dataset = SimpleDataset(size=val_size, num_features=6, num_classes=4) + + train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True) + val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False) + + # Verify fold setup + assert len(train_dataset) == train_size, f"Train size wrong for fold {fold}" + assert len(val_dataset) == val_size, f"Val size wrong for fold {fold}" + + # Test one iteration of each + train_batch = next(iter(train_loader)) + val_batch = next(iter(val_loader)) + + assert train_batch[0].shape[1] == 6, f"Train features wrong for fold {fold}" + assert val_batch[0].shape[1] == 6, f"Val features wrong for fold {fold}" + + print("✅ Cross-validation scenario test passed") + except Exception as e: + print(f"❌ Cross-validation scenario failed: {e}") + return False + + print("🎯 ML workflow integration: All tests passed!") + return True + +# Run all comprehensive tests +def run_comprehensive_dataloader_tests(): + """Run all comprehensive DataLoader tests""" + print("🧪 Running Comprehensive DataLoader Test Suite...") + print("=" * 60) + + test_results = [] + + # Run all test functions + 
test_results.append(test_dataset_interface()) + test_results.append(test_dataloader_functionality()) + test_results.append(test_data_pipeline_scenarios()) + test_results.append(test_integration_with_ml_workflow()) + + # Summary + print("=" * 60) + print("📊 Test Results Summary:") + print(f"{'✅' if test_results[0] else '❌'} Dataset Interface: {'PASSED' if test_results[0] else 'FAILED'}") + print(f"{'✅' if test_results[1] else '❌'} DataLoader Functionality: {'PASSED' if test_results[1] else 'FAILED'}") + print(f"{'✅' if test_results[2] else '❌'} Data Pipeline Scenarios: {'PASSED' if test_results[2] else 'FAILED'}") + print(f"{'✅' if test_results[3] else '❌'} ML Workflow Integration: {'PASSED' if test_results[3] else 'FAILED'}") + + all_passed = all(test_results) + print(f"\n🎯 Overall Result: {'ALL TESTS PASSED! 🎉' if all_passed else 'SOME TESTS FAILED ❌'}") + + if all_passed: + print("\n🚀 DataLoader Module Implementation Complete!") + print(" ✓ Dataset interface working correctly") + print(" ✓ DataLoader batching and iteration functional") + print(" ✓ Real-world data pipeline scenarios tested") + print(" ✓ ML workflow integration verified") + print("\n🎓 Ready for production ML data pipelines!") + + return all_passed + +# Run the comprehensive test suite +if __name__ == "__main__": + run_comprehensive_dataloader_tests() + +# %% [markdown] +""" +### 🧪 Test Your Data Loading Implementations + +Once you implement the classes above, run these cells to test them: +""" + +# %% nbgrader={"grade": true, "grade_id": "test-dataset", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test Dataset abstract class +print("Testing Dataset abstract class...") + +# Create a simple dataset +dataset = SimpleDataset(size=10, num_features=3, num_classes=2) + +# Test basic functionality +assert len(dataset) == 10, f"Dataset length should be 10, got {len(dataset)}" +assert dataset.get_num_classes() == 2, f"Number of classes should be 2, got {dataset.get_num_classes()}" + +# Test sample retrieval +data, label = dataset[0] +assert isinstance(data, Tensor), "Data should be a Tensor"
+assert isinstance(label, Tensor), "Label should be a Tensor" +assert data.shape == (3,), f"Data shape should be (3,), got {data.shape}" +assert label.shape == (), f"Label shape should be (), got {label.shape}" + +# Test sample shape +sample_shape = dataset.get_sample_shape() +assert sample_shape == (3,), f"Sample shape should be (3,), got {sample_shape}" + +print("✅ Dataset tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-dataloader", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test DataLoader +print("Testing DataLoader...") + +# Create dataset and dataloader +dataset = SimpleDataset(size=50, num_features=4, num_classes=3) +dataloader = DataLoader(dataset, batch_size=8, shuffle=True) + +# Test dataloader length +expected_batches = (50 + 8 - 1) // 8 # Ceiling division +assert len(dataloader) == expected_batches, f"DataLoader length should be {expected_batches}, got {len(dataloader)}" + +# Test batch iteration +batch_count = 0 +total_samples = 0 + +for batch_data, batch_labels in dataloader: + batch_count += 1 + batch_size = batch_data.shape[0] + total_samples += batch_size + + # Check batch shapes + assert batch_data.shape[1] == 4, f"Batch data should have 4 features, got {batch_data.shape[1]}" + assert batch_labels.shape[0] == batch_size, f"Batch labels should match batch size, got {batch_labels.shape[0]}" + + # Check that we don't exceed expected batches + assert batch_count <= expected_batches, f"Too many batches: {batch_count} > {expected_batches}" + +# Verify we processed all samples +assert total_samples == 50, f"Should process 50 samples total, got {total_samples}" +assert batch_count == expected_batches, f"Should have {expected_batches} batches, got {batch_count}" + +print("✅ DataLoader tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-dataloader-shuffle", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test DataLoader shuffling 
+print("Testing DataLoader shuffling...") + +# Create dataset +dataset = SimpleDataset(size=20, num_features=2, num_classes=2) + +# Test with shuffling +dataloader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True) +dataloader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False) + +# Get first batch from each +batch_shuffle = next(iter(dataloader_shuffle)) +batch_no_shuffle = next(iter(dataloader_no_shuffle)) + +# With shuffling the first batches will usually differ between the two loaders, but +# comparing their contents would make this test flaky, so we only verify shapes here +shuffle_data = batch_shuffle[0].data +no_shuffle_data = batch_no_shuffle[0].data + +# Check that shapes are correct +assert shuffle_data.shape == (5, 2), f"Shuffled batch shape should be (5, 2), got {shuffle_data.shape}" +assert no_shuffle_data.shape == (5, 2), f"No-shuffle batch shape should be (5, 2), got {no_shuffle_data.shape}" + +print("✅ DataLoader shuffling tests passed!") + +# %% nbgrader={"grade": true, "grade_id": "test-integration", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +# Test complete data pipeline integration +print("Testing complete data pipeline integration...") + +# Create a larger dataset +dataset = SimpleDataset(size=100, num_features=8, num_classes=5) +dataloader = DataLoader(dataset, batch_size=16, shuffle=True) + +# Simulate training loop +epoch_samples = 0 +epoch_batches = 0 + +for batch_data, batch_labels in dataloader: + epoch_batches += 1 + epoch_samples += batch_data.shape[0] + + # Verify batch properties + assert batch_data.shape[1] == 8, f"Features should be 8, got {batch_data.shape[1]}" + assert len(batch_labels.shape) == 1, f"Labels should be 1D, got shape {batch_labels.shape}" + + # Verify data types + assert isinstance(batch_data, Tensor), "Batch data should be Tensor" + assert isinstance(batch_labels, Tensor), "Batch labels should be Tensor" + +# Verify we processed all data +assert epoch_samples == 100, f"Should process 100 samples, got 
{epoch_samples}" +expected_batches = (100 + 16 - 1) // 16 +assert epoch_batches == expected_batches, f"Should have {expected_batches} batches, got {epoch_batches}" + +print("✅ Complete data pipeline integration tests passed!") + +# %% [markdown] +""" +## 🎯 Module Summary + +Congratulations! You've successfully implemented the core components of data loading systems: + +### What You've Accomplished +✅ **Dataset Abstract Class**: The foundation interface for all data loading +✅ **DataLoader Implementation**: Efficient batching and iteration over datasets +✅ **SimpleDataset Example**: Concrete implementation showing the Dataset pattern +✅ **Complete Data Pipeline**: End-to-end data loading for neural network training +✅ **Systems Thinking**: Understanding memory efficiency, batching, and I/O optimization + +### Key Concepts You've Learned +- **Dataset pattern**: Abstract interface for consistent data access +- **DataLoader pattern**: Efficient batching and iteration for training +- **Memory efficiency**: Loading data on-demand rather than all at once +- **Batching strategies**: Grouping samples for efficient GPU computation +- **Shuffling**: Randomizing sample order each epoch so training doesn't depend on how the data happens to be stored + +### Mathematical Foundations +- **Batch processing**: Vectorized operations on multiple samples +- **Memory management**: Handling datasets larger than available RAM +- **I/O optimization**: Minimizing disk reads and memory allocation +- **Stochastic sampling**: Random shuffling for better generalization + +### Real-World Applications +- **Computer vision**: Loading image datasets like CIFAR-10, ImageNet +- **Natural language processing**: Loading text datasets with tokenization +- **Tabular data**: Loading CSV files and database records +- **Audio processing**: Loading and preprocessing audio files +- **Time series**: Loading sequential data with proper windowing + +### Connection to Production Systems +- **PyTorch**: Your Dataset and DataLoader mirror `torch.utils.data` +- 
**TensorFlow**: Similar concepts in `tf.data.Dataset` +- **JAX**: Custom data loading with efficient batching +- **MLOps**: Data pipelines are critical for production ML systems + +### Next Steps +1. **Export your code**: `tito package nbdev --export 06_dataloader` +2. **Test your implementation**: `tito module test 06_dataloader` +3. **Use your data loading**: + ```python + from tinytorch.core.dataloader import Dataset, DataLoader, SimpleDataset + + # Create dataset and dataloader + dataset = SimpleDataset(size=1000, num_features=10, num_classes=3) + dataloader = DataLoader(dataset, batch_size=32, shuffle=True) + + # Training loop + for batch_data, batch_labels in dataloader: + # Train your network on batch_data, batch_labels + pass + ``` +4. **Build real datasets**: Extend Dataset for your specific data types +5. **Optimize performance**: Add caching, parallel loading, and preprocessing + +**Ready for the next challenge?** You now have all the core components to build complete machine learning systems: tensors, activations, layers, networks, and data loading. The next modules will focus on training (autograd, optimizers) and advanced topics! +""" \ No newline at end of file diff --git a/modules/source/06_dataloader/tests/generate_test_dataloader.py b/modules/source/06_dataloader/tests/generate_test_dataloader.py deleted file mode 100644 index b9ced235..00000000 --- a/modules/source/06_dataloader/tests/generate_test_dataloader.py +++ /dev/null @@ -1,80 +0,0 @@ -#!/usr/bin/env python3 -""" -Generate small test data for data module testing. - -This creates a small mock dataset that mimics CIFAR-10 structure but is tiny -and doesn't require downloading anything. 
-""" - -import numpy as np -import pickle -import os -from pathlib import Path - -def generate_test_cifar10_data(): - """Generate small test data that mimics CIFAR-10 structure.""" - - # CIFAR-10 class names - class_names = [ - 'airplane', 'automobile', 'bird', 'cat', 'deer', - 'dog', 'frog', 'horse', 'ship', 'truck' - ] - - # Create small test dataset - train_size = 50 # Small training set - test_size = 20 # Small test set - - # Generate random image data (3x32x32, values 0-255) - train_data = np.random.randint(0, 256, size=(train_size, 3, 32, 32), dtype=np.uint8) - train_labels = np.random.randint(0, 10, size=(train_size,), dtype=np.uint8) - - test_data = np.random.randint(0, 256, size=(test_size, 3, 32, 32), dtype=np.uint8) - test_labels = np.random.randint(0, 10, size=(test_size,), dtype=np.uint8) - - # Create the data directory - data_dir = Path(__file__).parent / "test_data" - data_dir.mkdir(exist_ok=True) - - # Save training data (mimics CIFAR-10 format) - train_dict = { - b'data': train_data.reshape(train_size, -1), # Flatten to (N, 3072) - b'labels': train_labels.tolist(), - b'batch_label': b'training batch 1 of 1', - b'filenames': [f'train_image_{i}.png'.encode() for i in range(train_size)] - } - - with open(data_dir / "data_batch_1", "wb") as f: - pickle.dump(train_dict, f) - - # Save test data - test_dict = { - b'data': test_data.reshape(test_size, -1), # Flatten to (N, 3072) - b'labels': test_labels.tolist(), - b'batch_label': b'testing batch 1 of 1', - b'filenames': [f'test_image_{i}.png'.encode() for i in range(test_size)] - } - - with open(data_dir / "test_batch", "wb") as f: - pickle.dump(test_dict, f) - - # Save metadata - meta_dict = { - b'label_names': [name.encode() for name in class_names], - b'num_cases_per_batch': [train_size], - b'num_vis': 3072 # 32*32*3 - } - - with open(data_dir / "batches.meta", "wb") as f: - pickle.dump(meta_dict, f) - - print(f"✅ Generated test data:") - print(f" - Training samples: {train_size}") - print(f" - Test 
samples: {test_size}") - print(f" - Image shape: (3, 32, 32)") - print(f" - Classes: {len(class_names)}") - print(f" - Saved to: {data_dir}") - - return data_dir - -if __name__ == "__main__": - generate_test_cifar10_data() \ No newline at end of file diff --git a/modules/source/06_dataloader/tests/test_data/batches.meta b/modules/source/06_dataloader/tests/test_data/batches.meta deleted file mode 100644 index 9b3ddceb..00000000 Binary files a/modules/source/06_dataloader/tests/test_data/batches.meta and /dev/null differ diff --git a/modules/source/06_dataloader/tests/test_data/data_batch_1 b/modules/source/06_dataloader/tests/test_data/data_batch_1 deleted file mode 100644 index 3e2c33ed..00000000 Binary files a/modules/source/06_dataloader/tests/test_data/data_batch_1 and /dev/null differ diff --git a/modules/source/06_dataloader/tests/test_data/test_batch b/modules/source/06_dataloader/tests/test_data/test_batch deleted file mode 100644 index 7d3e1546..00000000 Binary files a/modules/source/06_dataloader/tests/test_data/test_batch and /dev/null differ diff --git a/modules/source/06_dataloader/tests/test_dataloader.py b/modules/source/06_dataloader/tests/test_dataloader.py deleted file mode 100644 index f6064744..00000000 --- a/modules/source/06_dataloader/tests/test_dataloader.py +++ /dev/null @@ -1,460 +0,0 @@ -""" -Test suite for the dataloader module. -This tests the student implementations to ensure they work correctly. 
-""" - -import pytest -import numpy as np -import sys -import os -import tempfile -import shutil -import pickle -from pathlib import Path -from unittest.mock import patch, MagicMock - -# Import from the main package (rock solid foundation) -try: - from tinytorch.core.dataloader import Dataset, DataLoader, SimpleDataset - # These may not be implemented yet - use fallback - try: - from tinytorch.core.dataloader import CIFAR10Dataset, Normalizer, create_data_pipeline - except ImportError: - # Create mock classes for missing functionality - class CIFAR10Dataset: - """Mock implementation for testing""" - def __init__(self, *args, **kwargs): - pass - def __len__(self): - return 100 - def __getitem__(self, idx): - return ([0.5] * 32 * 32 * 3, 1) - - class Normalizer: - """Mock implementation for testing""" - def __init__(self, *args, **kwargs): - pass - def __call__(self, x): - return x - - def create_data_pipeline(*args, **kwargs): - """Mock implementation for testing""" - return SimpleDataset([([0.5] * 10, 1)] * 100) - -except ImportError: - # Fallback for when module isn't exported yet - project_root = Path(__file__).parent.parent.parent - sys.path.append(str(project_root / "modules" / "source" / "06_dataloader")) - from dataloader_dev import Dataset, DataLoader, CIFAR10Dataset, Normalizer, create_data_pipeline - -from tinytorch.core.tensor import Tensor - -def safe_numpy(tensor): - """Get numpy array from tensor, using .data attribute""" - return tensor.data - -def safe_item(tensor): - """Get scalar value from tensor""" - return float(tensor.data) - -class TestCIFAR10Dataset(Dataset): - """Test dataset that uses local test data instead of downloading CIFAR-10.""" - - def __init__(self, root_dir: str, train: bool = True, download: bool = True): - """Initialize with local test data.""" - self.root_dir = root_dir - self.train = train - self.download = download - - # Use local test data - test_data_dir = Path(__file__).parent / "test_data" - if not test_data_dir.exists(): 
- raise FileNotFoundError(f"Test data not found at {test_data_dir}") - - self._load_test_data(test_data_dir) - - def _load_test_data(self, data_dir): - """Load the small test dataset.""" - # Load metadata - with open(data_dir / "batches.meta", "rb") as f: - meta_dict = pickle.load(f) - - self.class_names = [name.decode() for name in meta_dict[b'label_names']] - - # Load training or test data - if self.train: - with open(data_dir / "data_batch_1", "rb") as f: - data_dict = pickle.load(f) - else: - with open(data_dir / "test_batch", "rb") as f: - data_dict = pickle.load(f) - - # Reshape data from (N, 3072) to (N, 3, 32, 32) - self.data = data_dict[b'data'].reshape(-1, 3, 32, 32) - self.labels = data_dict[b'labels'] - - def __getitem__(self, index: int): - """Get a single sample and label.""" - image = self.data[index] - label = self.labels[index] - - return Tensor(image.astype(np.float32)), Tensor(np.array(label)) - - def __len__(self) -> int: - """Get the total number of samples.""" - return len(self.data) - - def get_num_classes(self) -> int: - """Get the number of classes.""" - return len(self.class_names) - -class TestDatasetInterface: - """Test the base Dataset class interface (abstract class behavior).""" - - def test_dataset_is_abstract(self): - """Test that Dataset base class is abstract.""" - dataset = Dataset() - - # Should raise NotImplementedError for abstract methods - with pytest.raises(NotImplementedError): - dataset[0] - - with pytest.raises(NotImplementedError): - len(dataset) - - with pytest.raises(NotImplementedError): - dataset.get_num_classes() - - def test_concrete_dataset_implementation(self): - """Test that concrete datasets work properly.""" - class TestDataset(Dataset): - def __init__(self, size=10): - self.size = size - self.data = [np.random.randn(3, 32, 32) for _ in range(size)] - self.labels = [i % 3 for i in range(size)] - - def __getitem__(self, index): - return Tensor(self.data[index]), Tensor(np.array(self.labels[index])) - - def 
__len__(self): - return self.size - - def get_num_classes(self): - return 3 - - dataset = TestDataset(5) - - # Test basic functionality - assert len(dataset) == 5 - assert dataset.get_num_classes() == 3 - - # Test indexing - sample, label = dataset[0] - assert sample.shape == (3, 32, 32) - assert label.shape == () - - # Test get_sample_shape - assert dataset.get_sample_shape() == (3, 32, 32) - -class TestLocalCIFAR10Dataset: - """Test CIFAR-10 dataset with local test data.""" - - def test_cifar10_train_set_load(self): - """Test loading training set from local test data.""" - with tempfile.TemporaryDirectory() as temp_dir: - # Use local test data - dataset = TestCIFAR10Dataset(temp_dir, train=True, download=True) - - # Verify basic properties - assert len(dataset) == 50 # Our test training set size - assert dataset.get_num_classes() == 10 - - # Test sample access - image, label = dataset[0] - assert image.shape == (3, 32, 32) # CIFAR-10 image shape - assert 0 <= safe_item(label) < 10 # Valid class label - - # Test class names - assert len(dataset.class_names) == 10 - assert 'airplane' in dataset.class_names - assert 'truck' in dataset.class_names - - def test_cifar10_test_set_load(self): - """Test loading test set from local test data.""" - with tempfile.TemporaryDirectory() as temp_dir: - # Use local test data - dataset = TestCIFAR10Dataset(temp_dir, train=False, download=True) - - # Verify test set properties - assert len(dataset) == 20 # Our test test set size - assert dataset.get_num_classes() == 10 - - # Test sample access - image, label = dataset[0] - assert image.shape == (3, 32, 32) - assert 0 <= safe_item(label) < 10 - - def test_cifar10_data_types(self): - """Test that test data has correct types and ranges.""" - with tempfile.TemporaryDirectory() as temp_dir: - dataset = TestCIFAR10Dataset(temp_dir, train=True, download=True) - - # Test first few samples - for i in range(5): - image, label = dataset[i] - - # Check data types - assert isinstance(image, 
Tensor) - assert isinstance(label, Tensor) - - # Check value ranges (our test data uses 0-255 range) - assert 0 <= safe_numpy(image).min() <= 255 - assert 0 <= safe_numpy(image).max() <= 255 - - # Check label is valid class - assert 0 <= safe_item(label) < 10 - -class TestDataLoader: - """Test DataLoader with local test data.""" - - def setup_method(self): - """Set up local test dataset for DataLoader tests.""" - self.temp_dir = tempfile.mkdtemp() - # Use local test data - self.dataset = TestCIFAR10Dataset(self.temp_dir, train=True, download=True) - - def teardown_method(self): - """Clean up temporary directory.""" - shutil.rmtree(self.temp_dir, ignore_errors=True) - - def test_dataloader_creation(self): - """Test DataLoader creation with local test data.""" - # Test with default parameters - loader = DataLoader(self.dataset, batch_size=16) - assert len(loader) == 4 # 50 samples / 16 batch_size = 4 batches (rounded up) - - # Test with custom batch size - loader = DataLoader(self.dataset, batch_size=10) - assert len(loader) == 5 # 50 samples / 10 batch_size = 5 batches - - def test_dataloader_iteration_test_data(self): - """Test DataLoader iteration with local test data.""" - loader = DataLoader(self.dataset, batch_size=8, shuffle=True) - - batch_count = 0 - total_samples = 0 - - for batch_data, batch_labels in loader: - batch_count += 1 - batch_size = batch_data.shape[0] - total_samples += batch_size - - # Check batch shapes - assert batch_data.shape[1:] == (3, 32, 32) # CIFAR-10 image shape - assert batch_labels.shape == (batch_size,) - - # Check data types - assert isinstance(batch_data, Tensor) - assert isinstance(batch_labels, Tensor) - - # Check test data properties - assert 0 <= safe_numpy(batch_data).min() <= 255 - assert 0 <= safe_numpy(batch_data).max() <= 255 - assert 0 <= safe_numpy(batch_labels).min() < 10 - assert 0 <= safe_numpy(batch_labels).max() < 10 - - # Check batch size - assert batch_size <= 8 - - if batch_count >= 3: # Test first few batches - 
break - - assert batch_count > 0 - assert total_samples <= len(self.dataset) - - def test_dataloader_shuffling_test_data(self): - """Test that shuffling works with test data.""" - loader1 = DataLoader(self.dataset, batch_size=10, shuffle=True) - loader2 = DataLoader(self.dataset, batch_size=10, shuffle=True) - - # Get first batch from each loader - batch1_data, batch1_labels = next(iter(loader1)) - batch2_data, batch2_labels = next(iter(loader2)) - - # With shuffling, batches should likely be different - # (This test might occasionally fail due to randomness, but very unlikely) - different = not np.array_equal(safe_numpy(batch1_labels), safe_numpy(batch2_labels)) - # Note: We don't assert this because random shuffling might occasionally produce same order - - def test_dataloader_no_shuffle_test_data(self): - """Test DataLoader without shuffling uses test data in order.""" - loader = DataLoader(self.dataset, batch_size=10, shuffle=False) - - # Get first batch - batch_data, batch_labels = next(iter(loader)) - - # Without shuffling, should get first 10 samples in order - expected_samples = [self.dataset[i] for i in range(10)] - expected_labels = [safe_item(sample[1]) for sample in expected_samples] - - np.testing.assert_array_equal(safe_numpy(batch_labels), expected_labels) - -class TestNormalizer: - """Test Normalizer with local test data.""" - - def setup_method(self): - """Set up local test data for normalization tests.""" - self.temp_dir = tempfile.mkdtemp() - dataset = TestCIFAR10Dataset(self.temp_dir, train=True, download=True) - - # Get first 20 samples for testing - self.test_data = [] - for i in range(20): - image, _ = dataset[i] - self.test_data.append(image) - - def teardown_method(self): - """Clean up temporary directory.""" - shutil.rmtree(self.temp_dir, ignore_errors=True) - - def test_normalizer_fit_test_data(self): - """Test Normalizer fit with local test data.""" - normalizer = Normalizer() - normalizer.fit(self.test_data) - - # Check computed 
statistics - assert normalizer.mean is not None - assert normalizer.std is not None - - # Our test data has pixel values 0-255, so mean should be reasonable - assert 0 <= normalizer.mean <= 255 - assert normalizer.std > 0 # Should have some variation - - def test_normalizer_transform_test_data(self): - """Test Normalizer transform with local test data.""" - normalizer = Normalizer() - normalizer.fit(self.test_data) - - # Transform single sample - sample = self.test_data[0] - normalized = normalizer.transform(sample) - - # Check that normalization changes the values - assert not np.allclose(safe_numpy(sample), safe_numpy(normalized)) - - # Check that normalized data has different statistics - original_mean = np.mean(safe_numpy(sample)) - normalized_mean = np.mean(safe_numpy(normalized)) - assert abs(normalized_mean) < abs(original_mean) # Should be closer to 0 - - def test_normalizer_transform_batch_test_data(self): - """Test Normalizer with batch of test data.""" - normalizer = Normalizer() - normalizer.fit(self.test_data) - - # Transform batch - batch = self.test_data[:5] - normalized_batch = normalizer.transform(batch) - - # Check that we get same number of samples - assert len(normalized_batch) == len(batch) - - # Check that each sample is normalized - for original, normalized in zip(batch, normalized_batch): - assert not np.allclose(safe_numpy(original), safe_numpy(normalized)) - -class TestDataPipeline: - """Test complete data pipeline with local test data.""" - - def test_create_data_pipeline_test_data(self): - """Test creating data pipeline with local test data.""" - with tempfile.TemporaryDirectory() as temp_dir: - # Copy test data to temp directory - test_data_dir = Path(__file__).parent / "test_data" - import shutil - shutil.copytree(test_data_dir, temp_dir + "/test_data") - - # Create pipeline (this would normally download CIFAR-10) - # For testing, we'll create a simple pipeline manually - dataset = TestCIFAR10Dataset(temp_dir, train=True, 
download=True) - dataloader = DataLoader(dataset, batch_size=8, shuffle=True) - - # Test pipeline components - assert len(dataset) == 50 # Our test training set - assert len(dataloader) == 7 # 50 samples / 8 batch_size = 7 batches - - # Test that we can iterate through the pipeline - batch_count = 0 - for batch_data, batch_labels in dataloader: - batch_count += 1 - assert batch_data.shape[1:] == (3, 32, 32) - assert batch_labels.shape[0] <= 8 - - if batch_count >= 3: # Test first few batches - break - - assert batch_count > 0 - - def test_pipeline_normalization_test_data(self): - """Test pipeline with normalization using local test data.""" - with tempfile.TemporaryDirectory() as temp_dir: - dataset = TestCIFAR10Dataset(temp_dir, train=True, download=True) - - # Get some samples for normalization - samples = [dataset[i][0] for i in range(10)] - - # Create and fit normalizer - normalizer = Normalizer() - normalizer.fit(samples) - - # Test that normalization works - normalized = normalizer.transform(samples[0]) - assert not np.allclose(safe_numpy(samples[0]), safe_numpy(normalized)) - - # Test with dataloader - dataloader = DataLoader(dataset, batch_size=5, shuffle=False) - batch_data, batch_labels = next(iter(dataloader)) - - # Normalize batch - normalized_batch = [] - for i in range(batch_data.shape[0]): - sample = Tensor(batch_data.data[i]) - normalized_sample = normalizer.transform(sample) - normalized_batch.append(normalized_sample.data) - - normalized_batch = Tensor(np.stack(normalized_batch)) - - # Check that batch normalization works - assert normalized_batch.shape == batch_data.shape - assert not np.allclose(safe_numpy(batch_data), safe_numpy(normalized_batch)) - -class TestEdgeCases: - """Test edge cases with local test data.""" - - def test_small_batch_size_test_data(self): - """Test with very small batch size using local test data.""" - with tempfile.TemporaryDirectory() as temp_dir: - # Create small dataset - dataset = TestCIFAR10Dataset(temp_dir, 
train=True, download=True) - - # Use batch size of 1 - loader = DataLoader(dataset, batch_size=1, shuffle=False) - - # Test first few batches - batch_count = 0 - for batch_data, batch_labels in loader: - assert batch_data.shape == (1, 3, 32, 32) - assert batch_labels.shape == (1,) - - batch_count += 1 - if batch_count >= 5: - break - - assert batch_count == 5 - -def run_data_tests(): - """Run all data tests.""" - pytest.main([__file__, "-v"]) - -if __name__ == "__main__": - run_data_tests() \ No newline at end of file diff --git a/modules/source/07_autograd/autograd_dev.py b/modules/source/07_autograd/autograd_dev.py index d8ae01c3..7e1dae31 100644 --- a/modules/source/07_autograd/autograd_dev.py +++ b/modules/source/07_autograd/autograd_dev.py @@ -38,7 +38,7 @@ from collections import defaultdict # Import our existing components try: - from tinytorch.core.tensor import Tensor +    from tinytorch.core.tensor import Tensor except ImportError: # For development, import from local modules import os @@ -123,7 +123,7 @@ Let's build the engine that powers modern AI! ### What is a Variable? A **Variable** wraps a Tensor and tracks: - **Data**: The actual values (forward pass) -- **Gradient**: The computed gradients (backward pass) +- **Gradient**: The computed gradients (backward pass) - **Computation history**: How this Variable was created - **Backward function**: How to compute gradients @@ -167,7 +167,7 @@ class Variable: requires_grad: bool = True, grad_fn: Optional[Callable] = None): """ Create a Variable with gradient tracking. - + TODO: Implement Variable initialization with gradient tracking. 
STEP-BY-STEP IMPLEMENTATION: @@ -275,33 +275,33 @@ class Variable: if self.requires_grad: if self.grad is None: self.grad = gradient - else: + else: # Accumulate gradients self.grad = Variable(self.grad.data.data + gradient.data.data) - if self.grad_fn is not None: - self.grad_fn(gradient) + if self.grad_fn is not None: + self.grad_fn(gradient) ### END SOLUTION - + def zero_grad(self) -> None: """Reset gradients to zero.""" self.grad = None - + def __add__(self, other: Union['Variable', float, int]) -> 'Variable': """Addition operator: self + other""" return add(self, other) - + def __mul__(self, other: Union['Variable', float, int]) -> 'Variable': """Multiplication operator: self * other""" return multiply(self, other) - + def __sub__(self, other: Union['Variable', float, int]) -> 'Variable': """Subtraction operator: self - other""" return subtract(self, other) - + def __truediv__(self, other: Union['Variable', float, int]) -> 'Variable': """Division operator: self / other""" - return divide(self, other) + return divide(self, other) # %% [markdown] """ @@ -817,12 +817,12 @@ Let's see how autograd enables neural network training: 4. 
**Parameter update**: Update weights using gradients ### Example: Simple Linear Regression -```python +```python # Model: y = wx + b w = Variable(0.5, requires_grad=True) b = Variable(0.1, requires_grad=True) -# Forward pass +# Forward pass prediction = w * x + b # Loss: mean squared error @@ -870,7 +870,7 @@ def test_neural_network_training(): x = Variable(x_val, requires_grad=False) target = Variable(y_val, requires_grad=False) - # Forward pass + # Forward pass prediction = add(multiply(w, x), b) # wx + b # Loss: squared error diff --git a/modules/source/07_autograd/autograd_dev_backup.py b/modules/source/07_autograd/autograd_dev_backup.py new file mode 100644 index 00000000..471939b6 --- /dev/null +++ b/modules/source/07_autograd/autograd_dev_backup.py @@ -0,0 +1,1672 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.17.1 +# --- + +# %% [markdown] +""" +# Module 7: Autograd - Automatic Differentiation Engine + +Welcome to the Autograd module! This is where TinyTorch becomes truly powerful. You'll implement the automatic differentiation engine that makes neural network training possible. + +## Learning Goals +- Understand how automatic differentiation works through computational graphs +- Implement the Variable class that tracks gradients and operations +- Build backward propagation for gradient computation +- Create the foundation for neural network training +- Master the mathematical concepts behind backpropagation + +## Build → Use → Analyze +1. **Build**: Create the Variable class and gradient computation system +2. **Use**: Perform automatic differentiation on complex expressions +3. 
**Analyze**: Understand how gradients flow through computational graphs +""" + +# %% nbgrader={"grade": false, "grade_id": "autograd-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} +#| default_exp core.autograd + +#| export +import numpy as np +import sys +from typing import Union, List, Tuple, Optional, Any, Callable +from collections import defaultdict + +# Import our existing components +from tinytorch.core.tensor import Tensor + +# %% nbgrader={"grade": false, "grade_id": "autograd-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} +print("🔥 TinyTorch Autograd Module") +print(f"NumPy version: {np.__version__}") +print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") +print("Ready to build automatic differentiation!") + +# %% [markdown] +""" +## 📦 Where This Code Lives in the Final Package + +**Learning Side:** You work in `modules/source/07_autograd/autograd_dev.py` +**Building Side:** Code exports to `tinytorch.core.autograd` + +```python +# Final package structure: +from tinytorch.core.autograd import Variable, backward # The gradient engine! +from tinytorch.core.tensor import Tensor +from tinytorch.core.activations import ReLU, Sigmoid, Tanh +``` + +**Why this matters:** +- **Learning:** Focused module for understanding gradients +- **Production:** Proper organization like PyTorch's `torch.autograd` +- **Consistency:** All gradient operations live together in `core.autograd` +- **Foundation:** Enables training for all neural networks +""" + +# %% [markdown] +""" +## Step 1: What is Automatic Differentiation? + +### Definition +**Automatic differentiation (autograd)** is a technique that automatically computes derivatives of functions represented as computational graphs. It's the magic that makes neural network training possible. + +### The Fundamental Challenge: Computing Gradients at Scale + +#### **The Problem** +Neural networks have millions or billions of parameters. 
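To get a feel for that scale, here is a back-of-the-envelope parameter count for a small fully connected classifier (the layer widths below are purely illustrative, not from this module):

```python
# Illustrative layer widths: 32x32x3 input flattened, two hidden layers, 10 classes
layer_sizes = [3072, 1024, 512, 10]

def count_parameters(sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        total += fan_in * fan_out  # weight matrix entries
        total += fan_out           # bias vector entries
    return total

print(count_parameters(layer_sizes))  # 3676682 -- every one needs a gradient
```

Even this toy network has over 3.6 million parameters, and each one needs its own partial derivative of the loss on every training step.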
To train them, we need to compute the gradient of the loss function with respect to every single parameter: + +```python +# For a neural network with parameters θ = [w1, w2, ..., wn, b1, b2, ..., bm] +# We need to compute: ∇θ L = [∂L/∂w1, ∂L/∂w2, ..., ∂L/∂wn, ∂L/∂b1, ∂L/∂b2, ..., ∂L/∂bm] +``` + +#### **Why Manual Differentiation Fails** +- **Complexity**: Neural networks are compositions of thousands of operations +- **Error-prone**: Manual computation is extremely difficult and error-prone +- **Inflexible**: Every architecture change requires re-deriving gradients +- **Inefficient**: Manual computation doesn't exploit computational structure + +#### **Why Numerical Differentiation is Inadequate** +```python +# Numerical differentiation: f'(x) ≈ (f(x + h) - f(x)) / h +def numerical_gradient(f, x, h=1e-5): + return (f(x + h) - f(x)) / h +``` + +Problems: +- **Slow**: Requires 2 function evaluations per parameter +- **Imprecise**: Numerical errors accumulate +- **Unstable**: Sensitive to choice of h +- **Expensive**: O(n) cost for n parameters + +### The Solution: Computational Graphs + +#### **Key Insight: Every Computation is a Graph** +Any mathematical expression can be represented as a directed acyclic graph (DAG): + +```python +# Expression: f(x, y) = (x + y) * (x - y) +# Graph representation: +# x ──┐ ┌── add ──┐ +# │ │ │ +# ├─────┤ ├── multiply ── output +# │ │ │ +# y ──┘ └── sub ──┘ +``` + +#### **Forward Pass: Computing Values** +Traverse the graph from inputs to outputs, computing values at each node: + +```python +# Forward pass for f(x, y) = (x + y) * (x - y) +x = 3, y = 2 +add_result = x + y = 5 +sub_result = x - y = 1 +output = add_result * sub_result = 5 +``` + +#### **Backward Pass: Computing Gradients** +Traverse the graph from outputs to inputs, computing gradients using the chain rule: + +For $f(x, y) = (x + y) \cdot (x - y)$ with $x = 3, y = 2$: + +$$\frac{\partial \text{output}}{\partial \text{multiply}} = 1$$ + +$$\frac{\partial 
\text{output}}{\partial \text{add}} = \frac{\partial \text{output}}{\partial \text{multiply}} \cdot \frac{\partial \text{multiply}}{\partial \text{add}} = 1 \cdot \text{sub\_result} = 1$$ + +$$\frac{\partial \text{output}}{\partial \text{sub}} = \frac{\partial \text{output}}{\partial \text{multiply}} \cdot \frac{\partial \text{multiply}}{\partial \text{sub}} = 1 \cdot \text{add\_result} = 5$$ + +$$\frac{\partial \text{output}}{\partial x} = \frac{\partial \text{output}}{\partial \text{add}} \cdot \frac{\partial \text{add}}{\partial x} + \frac{\partial \text{output}}{\partial \text{sub}} \cdot \frac{\partial \text{sub}}{\partial x} = 1 \cdot 1 + 5 \cdot 1 = 6$$ + +$$\frac{\partial \text{output}}{\partial y} = \frac{\partial \text{output}}{\partial \text{add}} \cdot \frac{\partial \text{add}}{\partial y} + \frac{\partial \text{output}}{\partial \text{sub}} \cdot \frac{\partial \text{sub}}{\partial y} = 1 \cdot 1 + 5 \cdot (-1) = -4$$ + +### Mathematical Foundation: The Chain Rule + +#### **Single Variable Chain Rule** +For composite functions: If $z = f(g(x))$, then: + +$$\frac{dz}{dx} = \frac{dz}{dg} \cdot \frac{dg}{dx}$$ + +#### **Multivariable Chain Rule** +For functions of multiple variables: If $z = f(x, y)$ where $x = g(t)$ and $y = h(t)$, then: + +$$\frac{dz}{dt} = \frac{\partial z}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial z}{\partial y} \cdot \frac{dy}{dt}$$ + +#### **Chain Rule in Computational Graphs** +For any path from input to output through intermediate nodes: + +$$\frac{\partial \text{output}}{\partial \text{input}} = \prod_{i} \frac{\partial \text{node}_{i+1}}{\partial \text{node}_i}$$ + +### Automatic Differentiation Modes + +#### **Forward Mode (Forward Accumulation)** +- **Process**: Compute derivatives alongside forward pass +- **Efficiency**: Efficient when #inputs << #outputs +- **Use case**: Jacobian-vector products, sensitivity analysis + +#### **Reverse Mode (Backpropagation)** +- **Process**: Compute derivatives in reverse pass 
after forward pass +- **Efficiency**: Efficient when #outputs << #inputs +- **Use case**: Neural network training (many parameters, few outputs) + +#### **Why Reverse Mode Dominates ML** +Neural networks typically have: +- **Many inputs**: Millions of parameters +- **Few outputs**: Single loss value or small output vector +- **Reverse mode**: O(1) cost per parameter vs O(n) for forward mode + +### The Computational Graph Abstraction + +#### **Nodes: Operations and Variables** +- **Variable nodes**: Store values and gradients +- **Operation nodes**: Define how to compute forward and backward passes + +#### **Edges: Data Dependencies** +- **Forward edges**: Data flow from inputs to outputs +- **Backward edges**: Gradient flow from outputs to inputs + +#### **Dynamic vs Static Graphs** +- **Static graphs**: Define once, execute many times (TensorFlow 1.x) +- **Dynamic graphs**: Build graph during execution (PyTorch, TensorFlow 2.x) + +### Real-World Impact: What Autograd Enables + +#### **Deep Learning Revolution** +```python +# Before autograd: Manual gradient computation +def manual_gradient(x, y, w1, w2, b1, b2): + # Forward pass + z1 = w1 * x + b1 + a1 = sigmoid(z1) + z2 = w2 * a1 + b2 + a2 = sigmoid(z2) + loss = (a2 - y) ** 2 + + # Backward pass (manual) + dloss_da2 = 2 * (a2 - y) + da2_dz2 = sigmoid_derivative(z2) + dz2_dw2 = a1 + dz2_db2 = 1 + dz2_da1 = w2 + da1_dz1 = sigmoid_derivative(z1) + dz1_dw1 = x + dz1_db1 = 1 + + # Chain rule application + dloss_dw2 = dloss_da2 * da2_dz2 * dz2_dw2 + dloss_db2 = dloss_da2 * da2_dz2 * dz2_db2 + dloss_dw1 = dloss_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1 + dloss_db1 = dloss_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_db1 + + return dloss_dw1, dloss_db1, dloss_dw2, dloss_db2 + +# With autograd: Automatic gradient computation +def autograd_gradient(x, y, w1, w2, b1, b2): + # Forward pass with gradient tracking + z1 = w1 * x + b1 + a1 = sigmoid(z1) + z2 = w2 * a1 + b2 + a2 = sigmoid(z2) + loss = (a2 - y) ** 2 + + # Backward 
pass (automatic) + loss.backward() + + return w1.grad, b1.grad, w2.grad, b2.grad +``` + +#### **Scientific Computing** +- **Optimization**: Gradient-based optimization algorithms +- **Inverse problems**: Parameter estimation from observations +- **Sensitivity analysis**: How outputs change with input perturbations + +#### **Modern AI Applications** +- **Neural architecture search**: Differentiable architecture optimization +- **Meta-learning**: Learning to learn with gradient-based meta-algorithms +- **Differentiable programming**: Entire programs as differentiable functions + +### Performance Considerations + +#### **Memory Management** +- **Intermediate storage**: Must store forward pass results for backward pass +- **Memory optimization**: Checkpointing, gradient accumulation +- **Trade-offs**: Memory vs computation time + +#### **Computational Efficiency** +- **Graph optimization**: Fuse operations, eliminate redundancy +- **Parallelization**: Compute independent gradients simultaneously +- **Hardware acceleration**: Specialized gradient computation on GPUs/TPUs + +#### **Numerical Stability** +- **Gradient clipping**: Prevent exploding gradients +- **Numerical precision**: Balance between float16 and float32 +- **Accumulation order**: Minimize numerical errors + +### Connection to Neural Network Training + +#### **The Training Loop** +```python +for epoch in range(num_epochs): + for batch in dataloader: + # Forward pass + predictions = model(batch.inputs) + loss = criterion(predictions, batch.targets) + + # Backward pass (autograd) + loss.backward() + + # Parameter update + optimizer.step() + optimizer.zero_grad() +``` + +#### **Gradient-Based Optimization** +- **Stochastic Gradient Descent**: Use gradients to update parameters +- **Adaptive methods**: Adam, RMSprop use gradient statistics +- **Second-order methods**: Use gradient and Hessian information + +### Why Autograd is Revolutionary + +#### **Democratization of Deep Learning** +- **Research 
acceleration**: Focus on architecture, not gradient computation +- **Experimentation**: Easy to try new ideas and architectures +- **Accessibility**: Researchers don't need to be differentiation experts + +#### **Scalability** +- **Large models**: Handle millions/billions of parameters automatically +- **Complex architectures**: Support arbitrary computational graphs +- **Distributed training**: Coordinate gradients across multiple devices + +Let's implement the Variable class that makes this magic possible! +""" + +# %% [markdown] +""" +## Step 2: The Variable Class + +### Core Concept +A **Variable** wraps a Tensor and tracks: +- **Data**: The actual values (forward pass) +- **Gradient**: The computed gradients (backward pass) +- **Computation history**: How this Variable was created +- **Backward function**: How to compute gradients + +### Design Principles +- **Transparency**: Works seamlessly with existing Tensor operations +- **Efficiency**: Minimal overhead for forward pass +- **Flexibility**: Supports any differentiable operation +- **Correctness**: Implements the chain rule precisely +""" + +# %% nbgrader={"grade": false, "grade_id": "variable-class", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +class Variable: + """ + Variable: Tensor wrapper with automatic differentiation capabilities. + + The fundamental class for gradient computation in TinyTorch. + Wraps Tensor objects and tracks computational history for backpropagation. + """ + + def __init__(self, data: Union[Tensor, np.ndarray, list, float, int], + requires_grad: bool = True, grad_fn: Optional[Callable] = None): + """ + Create a Variable with gradient tracking. + + Args: + data: The data to wrap (will be converted to Tensor) + requires_grad: Whether to compute gradients for this Variable + grad_fn: Function to compute gradients (None for leaf nodes) + + TODO: Implement Variable initialization with gradient tracking. + + APPROACH: + 1. 
Convert data to Tensor if it's not already + 2. Store the tensor data + 3. Set gradient tracking flag + 4. Initialize gradient to None (will be computed later) + 5. Store the gradient function for backward pass + 6. Track if this is a leaf node (no grad_fn) + + EXAMPLE: + Variable(5.0) → Variable wrapping Tensor(5.0) + Variable([1, 2, 3]) → Variable wrapping Tensor([1, 2, 3]) + + HINTS: + - Use isinstance() to check if data is already a Tensor + - Store requires_grad, grad_fn, and is_leaf flags + - Initialize self.grad to None + - A leaf node has grad_fn=None + """ + ### BEGIN SOLUTION + # Convert data to Tensor if needed + if isinstance(data, Tensor): + self.data = data + else: + self.data = Tensor(data) + + # Set gradient tracking + self.requires_grad = requires_grad + self.grad = None # Will be initialized when needed + self.grad_fn = grad_fn + self.is_leaf = grad_fn is None + + # For computational graph + self._backward_hooks = [] + ### END SOLUTION + + @property + def shape(self) -> Tuple[int, ...]: + """Get the shape of the underlying tensor.""" + return self.data.shape + + @property + def size(self) -> int: + """Get the total number of elements.""" + return self.data.size + + def __repr__(self) -> str: + """String representation of the Variable.""" + grad_str = f", grad_fn={self.grad_fn.__name__}" if self.grad_fn else "" + return f"Variable({self.data.data.tolist()}, requires_grad={self.requires_grad}{grad_str})" + + def backward(self, gradient: Optional['Variable'] = None) -> None: + """ + Compute gradients using backpropagation. + + Args: + gradient: The gradient to backpropagate (defaults to ones) + + TODO: Implement backward propagation. + + APPROACH: + 1. If gradient is None, create a gradient of ones with same shape + 2. If this Variable doesn't require gradients, return early + 3. If this is a leaf node, accumulate the gradient + 4. 
If this has a grad_fn, call it to propagate gradients + + EXAMPLE: + x = Variable(5.0) + y = x * 2 + y.backward() # Computes x.grad = 2.0 + + HINTS: + - Use np.ones_like() to create default gradient + - Accumulate gradients with += for leaf nodes + - Call self.grad_fn(gradient) for non-leaf nodes + """ + ### BEGIN SOLUTION + # Default gradient is ones + if gradient is None: + gradient = Variable(np.ones_like(self.data.data)) + + # Skip if gradients not required + if not self.requires_grad: + return + + # Accumulate gradient for leaf nodes + if self.is_leaf: + if self.grad is None: + self.grad = Variable(np.zeros_like(self.data.data)) + self.grad.data._data += gradient.data.data + else: + # Propagate gradients through grad_fn + if self.grad_fn is not None: + self.grad_fn(gradient) + ### END SOLUTION + + def zero_grad(self) -> None: + """Zero out the gradient.""" + if self.grad is not None: + self.grad.data._data.fill(0) + + # Arithmetic operations with gradient tracking + def __add__(self, other: Union['Variable', float, int]) -> 'Variable': + """Addition with gradient tracking.""" + return add(self, other) + + def __mul__(self, other: Union['Variable', float, int]) -> 'Variable': + """Multiplication with gradient tracking.""" + return multiply(self, other) + + def __sub__(self, other: Union['Variable', float, int]) -> 'Variable': + """Subtraction with gradient tracking.""" + return subtract(self, other) + + def __truediv__(self, other: Union['Variable', float, int]) -> 'Variable': + """Division with gradient tracking.""" + return divide(self, other) + +# %% [markdown] +""" +## Step 3: Basic Operations with Gradients + +### The Pattern +Every differentiable operation follows the same pattern: +1. **Forward pass**: Compute the result +2. **Create grad_fn**: Function that knows how to compute gradients +3. 
**Return Variable**: With the result and grad_fn + +### Mathematical Rules +- **Addition**: $\frac{d(x + y)}{dx} = 1$, $\frac{d(x + y)}{dy} = 1$ +- **Multiplication**: $\frac{d(x \cdot y)}{dx} = y$, $\frac{d(x \cdot y)}{dy} = x$ +- **Subtraction**: $\frac{d(x - y)}{dx} = 1$, $\frac{d(x - y)}{dy} = -1$ +- **Division**: $\frac{d(x / y)}{dx} = \frac{1}{y}$, $\frac{d(x / y)}{dy} = -\frac{x}{y^2}$ + +### Implementation Strategy +Each operation creates a closure that captures the input variables and implements the gradient computation rule. +""" + +# %% nbgrader={"grade": false, "grade_id": "add-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: + """ + Addition operation with gradient tracking. + + Args: + a: First operand + b: Second operand + + Returns: + Variable with sum and gradient function + + TODO: Implement addition with gradient computation. + + APPROACH: + 1. Convert inputs to Variables if needed + 2. Compute forward pass: result = a + b + 3. Create gradient function that distributes gradients + 4. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = x + y, then dz/dx = 1, dz/dy = 1 + + EXAMPLE: + x = Variable(2.0), y = Variable(3.0) + z = add(x, y) # z.data = 5.0 + z.backward() # x.grad = 1.0, y.grad = 1.0 + + HINTS: + - Use isinstance() to check if inputs are Variables + - Create a closure that captures a and b + - In grad_fn, call a.backward() and b.backward() with appropriate gradients + """ + ### BEGIN SOLUTION + # Convert to Variables if needed + if not isinstance(a, Variable): + a = Variable(a, requires_grad=False) + if not isinstance(b, Variable): + b = Variable(b, requires_grad=False) + + # Forward pass + result_data = a.data + b.data + + # Create gradient function + def grad_fn(grad_output): + # Addition distributes gradients equally + if a.requires_grad: + a.backward(grad_output) + if b.requires_grad: + b.backward(grad_output) + + # Determine if result requires gradients + requires_grad = a.requires_grad or b.requires_grad + + return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "multiply-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: + """ + Multiplication operation with gradient tracking. + + Args: + a: First operand + b: Second operand + + Returns: + Variable with product and gradient function + + TODO: Implement multiplication with gradient computation. + + APPROACH: + 1. Convert inputs to Variables if needed + 2. Compute forward pass: result = a * b + 3. Create gradient function using product rule + 4. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = x * y, then dz/dx = y, dz/dy = x + + EXAMPLE: + x = Variable(2.0), y = Variable(3.0) + z = multiply(x, y) # z.data = 6.0 + z.backward() # x.grad = 3.0, y.grad = 2.0 + + HINTS: + - Store a.data and b.data for gradient computation + - In grad_fn, multiply incoming gradient by the other operand + - Handle broadcasting if shapes are different + """ + ### BEGIN SOLUTION + # Convert to Variables if needed + if not isinstance(a, Variable): + a = Variable(a, requires_grad=False) + if not isinstance(b, Variable): + b = Variable(b, requires_grad=False) + + # Forward pass + result_data = a.data * b.data + + # Create gradient function + def grad_fn(grad_output): + # Product rule: d(xy)/dx = y, d(xy)/dy = x + if a.requires_grad: + a_grad = Variable(grad_output.data * b.data) + a.backward(a_grad) + if b.requires_grad: + b_grad = Variable(grad_output.data * a.data) + b.backward(b_grad) + + # Determine if result requires gradients + requires_grad = a.requires_grad or b.requires_grad + + return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "subtract-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: + """ + Subtraction operation with gradient tracking. + + Args: + a: First operand (minuend) + b: Second operand (subtrahend) + + Returns: + Variable with difference and gradient function + + TODO: Implement subtraction with gradient computation. + + APPROACH: + 1. Convert inputs to Variables if needed + 2. Compute forward pass: result = a - b + 3. Create gradient function with correct signs + 4. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = x - y, then dz/dx = 1, dz/dy = -1 + + EXAMPLE: + x = Variable(5.0), y = Variable(3.0) + z = subtract(x, y) # z.data = 2.0 + z.backward() # x.grad = 1.0, y.grad = -1.0 + + HINTS: + - Forward pass is straightforward: a - b + - Gradient for a is positive, for b is negative + - Remember to negate the gradient for b + """ + ### BEGIN SOLUTION + # Convert to Variables if needed + if not isinstance(a, Variable): + a = Variable(a, requires_grad=False) + if not isinstance(b, Variable): + b = Variable(b, requires_grad=False) + + # Forward pass + result_data = a.data - b.data + + # Create gradient function + def grad_fn(grad_output): + # Subtraction rule: d(x-y)/dx = 1, d(x-y)/dy = -1 + if a.requires_grad: + a.backward(grad_output) + if b.requires_grad: + b_grad = Variable(-grad_output.data.data) + b.backward(b_grad) + + # Determine if result requires gradients + requires_grad = a.requires_grad or b.requires_grad + + return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "divide-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def divide(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: + """ + Division operation with gradient tracking. + + Args: + a: Numerator + b: Denominator + + Returns: + Variable with quotient and gradient function + + TODO: Implement division with gradient computation. + + APPROACH: + 1. Convert inputs to Variables if needed + 2. Compute forward pass: result = a / b + 3. Create gradient function using quotient rule + 4. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = x / y, then dz/dx = 1/y, dz/dy = -x/y² + + EXAMPLE: + x = Variable(6.0), y = Variable(2.0) + z = divide(x, y) # z.data = 3.0 + z.backward() # x.grad = 0.5, y.grad = -1.5 + + HINTS: + - Forward pass: a.data / b.data + - Gradient for a: grad_output / b.data + - Gradient for b: -grad_output * a.data / (b.data ** 2) + - Be careful with numerical stability + """ + ### BEGIN SOLUTION + # Convert to Variables if needed + if not isinstance(a, Variable): + a = Variable(a, requires_grad=False) + if not isinstance(b, Variable): + b = Variable(b, requires_grad=False) + + # Forward pass + result_data = a.data / b.data + + # Create gradient function + def grad_fn(grad_output): + # Quotient rule: d(x/y)/dx = 1/y, d(x/y)/dy = -x/y² + if a.requires_grad: + a_grad = Variable(grad_output.data.data / b.data.data) + a.backward(a_grad) + if b.requires_grad: + b_grad = Variable(-grad_output.data.data * a.data.data / (b.data.data ** 2)) + b.backward(b_grad) + + # Determine if result requires gradients + requires_grad = a.requires_grad or b.requires_grad + + return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% [markdown] +""" +## Step 4: Testing Basic Operations + +Let's test our basic operations to ensure they compute gradients correctly. 
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-basic-operations", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +def test_basic_operations(): + """Test basic operations with gradient computation.""" + print("🔬 Testing basic operations...") + + # Test addition + print("📊 Testing addition...") + x = Variable(2.0, requires_grad=True) + y = Variable(3.0, requires_grad=True) + z = add(x, y) + + assert abs(z.data.data.item() - 5.0) < 1e-6, f"Addition failed: expected 5.0, got {z.data.data.item()}" + + z.backward() + assert abs(x.grad.data.data.item() - 1.0) < 1e-6, f"Addition gradient for x failed: expected 1.0, got {x.grad.data.data.item()}" + assert abs(y.grad.data.data.item() - 1.0) < 1e-6, f"Addition gradient for y failed: expected 1.0, got {y.grad.data.data.item()}" + print("✅ Addition test passed!") + + # Test multiplication + print("📊 Testing multiplication...") + x = Variable(2.0, requires_grad=True) + y = Variable(3.0, requires_grad=True) + z = multiply(x, y) + + assert abs(z.data.data.item() - 6.0) < 1e-6, f"Multiplication failed: expected 6.0, got {z.data.data.item()}" + + z.backward() + assert abs(x.grad.data.data.item() - 3.0) < 1e-6, f"Multiplication gradient for x failed: expected 3.0, got {x.grad.data.data.item()}" + assert abs(y.grad.data.data.item() - 2.0) < 1e-6, f"Multiplication gradient for y failed: expected 2.0, got {y.grad.data.data.item()}" + print("✅ Multiplication test passed!") + + # Test subtraction + print("📊 Testing subtraction...") + x = Variable(5.0, requires_grad=True) + y = Variable(3.0, requires_grad=True) + z = subtract(x, y) + + assert abs(z.data.data.item() - 2.0) < 1e-6, f"Subtraction failed: expected 2.0, got {z.data.data.item()}" + + z.backward() + assert abs(x.grad.data.data.item() - 1.0) < 1e-6, f"Subtraction gradient for x failed: expected 1.0, got {x.grad.data.data.item()}" + assert abs(y.grad.data.data.item() - (-1.0)) < 1e-6, f"Subtraction gradient for y failed: expected -1.0, 
got {y.grad.data.data.item()}" + print("✅ Subtraction test passed!") + + # Test division + print("📊 Testing division...") + x = Variable(6.0, requires_grad=True) + y = Variable(2.0, requires_grad=True) + z = divide(x, y) + + assert abs(z.data.data.item() - 3.0) < 1e-6, f"Division failed: expected 3.0, got {z.data.data.item()}" + + z.backward() + assert abs(x.grad.data.data.item() - 0.5) < 1e-6, f"Division gradient for x failed: expected 0.5, got {x.grad.data.data.item()}" + assert abs(y.grad.data.data.item() - (-1.5)) < 1e-6, f"Division gradient for y failed: expected -1.5, got {y.grad.data.data.item()}" + print("✅ Division test passed!") + + print("🎉 All basic operation tests passed!") + return True + +# Run the test +success = test_basic_operations() + +# %% [markdown] +""" +## Step 5: Chain Rule Testing + +Let's test more complex expressions to ensure the chain rule works correctly. +""" + +# %% nbgrader={"grade": true, "grade_id": "test-chain-rule", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +def test_chain_rule(): + """Test chain rule with complex expressions.""" + print("🔬 Testing chain rule...") + + # Test: f(x, y) = (x + y) * (x - y) = x² - y² + print("📊 Testing f(x, y) = (x + y) * (x - y)...") + x = Variable(3.0, requires_grad=True) + y = Variable(2.0, requires_grad=True) + + # Forward pass + sum_xy = add(x, y) # x + y = 5 + diff_xy = subtract(x, y) # x - y = 1 + result = multiply(sum_xy, diff_xy) # (x + y) * (x - y) = 5 + + assert abs(result.data.data.item() - 5.0) < 1e-6, f"Chain rule forward failed: expected 5.0, got {result.data.data.item()}" + + # Backward pass + result.backward() + + # Analytical gradients: df/dx = 2x = 6, df/dy = -2y = -4 + expected_x_grad = 2 * 3.0 # 6.0 + expected_y_grad = -2 * 2.0 # -4.0 + + assert abs(x.grad.data.data.item() - expected_x_grad) < 1e-6, f"Chain rule x gradient failed: expected {expected_x_grad}, got {x.grad.data.data.item()}" + assert abs(y.grad.data.data.item() - 
expected_y_grad) < 1e-6, f"Chain rule y gradient failed: expected {expected_y_grad}, got {y.grad.data.data.item()}" + print("✅ Chain rule test passed!") + + # Test: f(x) = x * x * x (x³) + print("📊 Testing f(x) = x³...") + x = Variable(2.0, requires_grad=True) + + # Forward pass + x_squared = multiply(x, x) # x² + x_cubed = multiply(x_squared, x) # x³ + + assert abs(x_cubed.data.data.item() - 8.0) < 1e-6, f"x³ forward failed: expected 8.0, got {x_cubed.data.data.item()}" + + # Backward pass + x_cubed.backward() + + # Analytical gradient: df/dx = 3x² = 12 + expected_grad = 3 * (2.0 ** 2) # 12.0 + + assert abs(x.grad.data.data.item() - expected_grad) < 1e-6, f"x³ gradient failed: expected {expected_grad}, got {x.grad.data.data.item()}" + print("✅ x³ test passed!") + + print("🎉 All chain rule tests passed!") + return True + +# Run the test +success = test_chain_rule() + +# %% [markdown] +""" +## Step 6: Activation Function Gradients + +Now let's implement gradients for activation functions to integrate with our existing modules. +""" + +# %% nbgrader={"grade": false, "grade_id": "relu-gradient", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def relu_with_grad(x: Variable) -> Variable: + """ + ReLU activation with gradient tracking. + + Args: + x: Input Variable + + Returns: + Variable with ReLU applied and gradient function + + TODO: Implement ReLU with gradient computation. + + APPROACH: + 1. Compute forward pass: max(0, x) + 2. Create gradient function using ReLU derivative + 3. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + f(x) = max(0, x) + f'(x) = 1 if x > 0, else 0 + + EXAMPLE: + x = Variable([-1.0, 0.0, 1.0]) + y = relu_with_grad(x) # y.data = [0.0, 0.0, 1.0] + y.backward() # x.grad = [0.0, 0.0, 1.0] + + HINTS: + - Use np.maximum(0, x.data.data) for forward pass + - Use (x.data.data > 0) for gradient mask + - Only propagate gradients where input was positive + """ + ### BEGIN SOLUTION + # Forward pass + result_data = Tensor(np.maximum(0, x.data.data)) + + # Create gradient function + def grad_fn(grad_output): + if x.requires_grad: + # ReLU derivative: 1 if x > 0, else 0 + mask = (x.data.data > 0).astype(np.float32) + x_grad = Variable(grad_output.data.data * mask) + x.backward(x_grad) + + return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "sigmoid-gradient", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def sigmoid_with_grad(x: Variable) -> Variable: + """ + Sigmoid activation with gradient tracking. + + Args: + x: Input Variable + + Returns: + Variable with sigmoid applied and gradient function + + TODO: Implement sigmoid with gradient computation. + + APPROACH: + 1. Compute forward pass: 1 / (1 + exp(-x)) + 2. Create gradient function using sigmoid derivative + 3. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + f(x) = 1 / (1 + exp(-x)) + f'(x) = f(x) * (1 - f(x)) + + EXAMPLE: + x = Variable(0.0) + y = sigmoid_with_grad(x) # y.data = 0.5 + y.backward() # x.grad = 0.25 + + HINTS: + - Use np.clip for numerical stability + - Store sigmoid output for gradient computation + - Gradient is sigmoid * (1 - sigmoid) + """ + ### BEGIN SOLUTION + # Forward pass with numerical stability + clipped = np.clip(x.data.data, -500, 500) + sigmoid_output = 1.0 / (1.0 + np.exp(-clipped)) + result_data = Tensor(sigmoid_output) + + # Create gradient function + def grad_fn(grad_output): + if x.requires_grad: + # Sigmoid derivative: sigmoid * (1 - sigmoid) + sigmoid_grad = sigmoid_output * (1.0 - sigmoid_output) + x_grad = Variable(grad_output.data.data * sigmoid_grad) + x.backward(x_grad) + + return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% [markdown] +""" +## Step 7: Integration Testing + +Let's test our autograd system with a simple neural network scenario. 
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-integration", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} +def test_integration(): + """Test autograd integration with neural network scenario.""" + print("🔬 Testing autograd integration...") + + # Simple neural network: input -> linear -> ReLU -> output + print("📊 Testing simple neural network...") + + # Input + x = Variable(2.0, requires_grad=True) + + # Weights and bias + w1 = Variable(0.5, requires_grad=True) + b1 = Variable(0.1, requires_grad=True) + w2 = Variable(1.5, requires_grad=True) + + # Forward pass + linear1 = add(multiply(x, w1), b1) # x * w1 + b1 = 2*0.5 + 0.1 = 1.1 + activation1 = relu_with_grad(linear1) # ReLU(1.1) = 1.1 + output = multiply(activation1, w2) # 1.1 * 1.5 = 1.65 + + # Check forward pass + expected_output = 1.65 + assert abs(output.data.data.item() - expected_output) < 1e-6, f"Integration forward failed: expected {expected_output}, got {output.data.data.item()}" + + # Backward pass + output.backward() + + # Check gradients + # dL/dx = dL/doutput * doutput/dactivation1 * dactivation1/dlinear1 * dlinear1/dx + # = 1 * w2 * 1 * w1 = 1.5 * 0.5 = 0.75 + expected_x_grad = 0.75 + assert abs(x.grad.data.data.item() - expected_x_grad) < 1e-6, f"Integration x gradient failed: expected {expected_x_grad}, got {x.grad.data.data.item()}" + + # dL/dw1 = dL/doutput * doutput/dactivation1 * dactivation1/dlinear1 * dlinear1/dw1 + # = 1 * w2 * 1 * x = 1.5 * 2.0 = 3.0 + expected_w1_grad = 3.0 + assert abs(w1.grad.data.data.item() - expected_w1_grad) < 1e-6, f"Integration w1 gradient failed: expected {expected_w1_grad}, got {w1.grad.data.data.item()}" + + # dL/db1 = dL/doutput * doutput/dactivation1 * dactivation1/dlinear1 * dlinear1/db1 + # = 1 * w2 * 1 * 1 = 1.5 + expected_b1_grad = 1.5 + assert abs(b1.grad.data.data.item() - expected_b1_grad) < 1e-6, f"Integration b1 gradient failed: expected {expected_b1_grad}, got {b1.grad.data.data.item()}" + + # dL/dw2 = 
dL/doutput * doutput/dw2 = 1 * activation1 = 1.1 + expected_w2_grad = 1.1 + assert abs(w2.grad.data.data.item() - expected_w2_grad) < 1e-6, f"Integration w2 gradient failed: expected {expected_w2_grad}, got {w2.grad.data.data.item()}" + + print("✅ Integration test passed!") + print("🎉 All autograd tests passed!") + return True + +# Run the test +success = test_integration() + +# %% [markdown] +""" +## 🎯 Module Summary + +Congratulations! You've successfully implemented automatic differentiation for TinyTorch: + +### What You've Accomplished +✅ **Variable Class**: Tensor wrapper with gradient tracking and computational graph +✅ **Basic Operations**: Addition, multiplication, subtraction, division with gradients +✅ **Chain Rule**: Automatic gradient computation through complex expressions +✅ **Activation Functions**: ReLU and Sigmoid with proper gradient computation +✅ **Integration**: Works seamlessly with neural network scenarios + +### Key Concepts You've Learned +- **Computational graphs** represent mathematical expressions as directed graphs +- **Forward pass** computes function values following the graph +- **Backward pass** computes gradients using the chain rule in reverse +- **Gradient functions** capture how to compute gradients for each operation +- **Variable tracking** enables automatic differentiation of any expression + +### Mathematical Foundations +- **Chain rule**: The fundamental principle behind backpropagation +- **Partial derivatives**: How gradients flow through operations +- **Computational efficiency**: Reusing forward pass results in backward pass +- **Numerical stability**: Handling edge cases in gradient computation + +### Real-World Applications +- **Neural network training**: Backpropagation through layers +- **Optimization**: Gradient descent and advanced optimizers +- **Scientific computing**: Sensitivity analysis and inverse problems +- **Machine learning**: Any gradient-based learning algorithm + +### Next Steps +1. 
**Export your code**: `tito package nbdev --export 07_autograd` +2. **Test your implementation**: `tito module test 07_autograd` +3. **Use your autograd**: + ```python + from tinytorch.core.autograd import Variable + + x = Variable(2.0, requires_grad=True) + y = x**2 + 3*x + 1 + y.backward() + print(x.grad) # Your gradients in action! + ``` +4. **Move to Module 8**: Start building training loops and optimizers! + +**Ready for the next challenge?** Let's use your autograd system to build complete training pipelines! +""" + +# %% [markdown] +""" +## Step 8: Performance Optimizations and Advanced Features + +### Memory Management +- **Gradient Accumulation**: Efficient in-place gradient updates +- **Computational Graph Cleanup**: Release intermediate values when possible +- **Lazy Evaluation**: Compute gradients only when needed + +### Numerical Stability +- **Gradient Clipping**: Prevent exploding gradients +- **Numerical Precision**: Handle edge cases gracefully +- **Overflow Protection**: Clip extreme values + +### Advanced Features +- **Higher-Order Gradients**: Gradients of gradients +- **Gradient Checkpointing**: Memory-efficient backpropagation +- **Custom Operations**: Framework for user-defined differentiable functions +""" + +# %% nbgrader={"grade": false, "grade_id": "advanced-features", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def power(base: Variable, exponent: Union[float, int]) -> Variable: + """ + Power operation with gradient tracking: base^exponent. + + Args: + base: Base Variable + exponent: Exponent (scalar) + + Returns: + Variable with power applied and gradient function + + TODO: Implement power operation with gradient computation. + + APPROACH: + 1. Compute forward pass: base^exponent + 2. Create gradient function using power rule + 3. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = x^n, then dz/dx = n * x^(n-1) + + EXAMPLE: + x = Variable(2.0) + y = power(x, 3) # y.data = 8.0 + y.backward() # x.grad = 3 * 2^2 = 12.0 + + HINTS: + - Use np.power() for forward pass + - Power rule: gradient = exponent * base^(exponent-1) + - Handle edge cases like exponent=0 or base=0 + """ + ### BEGIN SOLUTION + # Forward pass + result_data = Tensor(np.power(base.data.data, exponent)) + + # Create gradient function + def grad_fn(grad_output): + if base.requires_grad: + # Power rule: d(x^n)/dx = n * x^(n-1) + if exponent == 0: + # Special case: derivative of constant is 0 + base_grad = Variable(np.zeros_like(base.data.data)) + else: + base_grad_data = exponent * np.power(base.data.data, exponent - 1) + base_grad = Variable(grad_output.data.data * base_grad_data) + base.backward(base_grad) + + return Variable(result_data, requires_grad=base.requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "exp-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def exp(x: Variable) -> Variable: + """ + Exponential operation with gradient tracking: e^x. + + Args: + x: Input Variable + + Returns: + Variable with exponential applied and gradient function + + TODO: Implement exponential operation with gradient computation. + + APPROACH: + 1. Compute forward pass: e^x + 2. Create gradient function using exponential derivative + 3. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = e^x, then dz/dx = e^x + + EXAMPLE: + x = Variable(1.0) + y = exp(x) # y.data = e^1 ≈ 2.718 + y.backward() # x.grad = e^1 ≈ 2.718 + + HINTS: + - Use np.exp() for forward pass + - Exponential derivative is itself: d(e^x)/dx = e^x + - Store result for gradient computation + """ + ### BEGIN SOLUTION + # Forward pass + exp_result = np.exp(x.data.data) + result_data = Tensor(exp_result) + + # Create gradient function + def grad_fn(grad_output): + if x.requires_grad: + # Exponential derivative: d(e^x)/dx = e^x + x_grad = Variable(grad_output.data.data * exp_result) + x.backward(x_grad) + + return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "log-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def log(x: Variable) -> Variable: + """ + Natural logarithm operation with gradient tracking: ln(x). + + Args: + x: Input Variable + + Returns: + Variable with logarithm applied and gradient function + + TODO: Implement logarithm operation with gradient computation. + + APPROACH: + 1. Compute forward pass: ln(x) + 2. Create gradient function using logarithm derivative + 3. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = ln(x), then dz/dx = 1/x + + EXAMPLE: + x = Variable(2.0) + y = log(x) # y.data = ln(2) ≈ 0.693 + y.backward() # x.grad = 1/2 = 0.5 + + HINTS: + - Use np.log() for forward pass + - Logarithm derivative: d(ln(x))/dx = 1/x + - Handle numerical stability for small x + """ + ### BEGIN SOLUTION + # Forward pass with numerical stability + clipped_x = np.clip(x.data.data, 1e-8, np.inf) # Avoid log(0) + result_data = Tensor(np.log(clipped_x)) + + # Create gradient function + def grad_fn(grad_output): + if x.requires_grad: + # Logarithm derivative: d(ln(x))/dx = 1/x + x_grad = Variable(grad_output.data.data / clipped_x) + x.backward(x_grad) + + return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "sum-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def sum_all(x: Variable) -> Variable: + """ + Sum all elements operation with gradient tracking. + + Args: + x: Input Variable + + Returns: + Variable with sum and gradient function + + TODO: Implement sum operation with gradient computation. + + APPROACH: + 1. Compute forward pass: sum of all elements + 2. Create gradient function that broadcasts gradient back + 3. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = sum(x), then dz/dx_i = 1 for all i + + EXAMPLE: + x = Variable([[1, 2], [3, 4]]) + y = sum_all(x) # y.data = 10 + y.backward() # x.grad = [[1, 1], [1, 1]] + + HINTS: + - Use np.sum() for forward pass + - Gradient is ones with same shape as input + - This is used for loss computation + """ + ### BEGIN SOLUTION + # Forward pass + result_data = Tensor(np.sum(x.data.data)) + + # Create gradient function + def grad_fn(grad_output): + if x.requires_grad: + # Sum gradient: broadcasts to all elements + x_grad = Variable(grad_output.data.data * np.ones_like(x.data.data)) + x.backward(x_grad) + + return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "mean-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def mean(x: Variable) -> Variable: + """ + Mean operation with gradient tracking. + + Args: + x: Input Variable + + Returns: + Variable with mean and gradient function + + TODO: Implement mean operation with gradient computation. + + APPROACH: + 1. Compute forward pass: mean of all elements + 2. Create gradient function that distributes gradient evenly + 3. 
Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = mean(x), then dz/dx_i = 1/n for all i (where n is number of elements) + + EXAMPLE: + x = Variable([[1, 2], [3, 4]]) + y = mean(x) # y.data = 2.5 + y.backward() # x.grad = [[0.25, 0.25], [0.25, 0.25]] + + HINTS: + - Use np.mean() for forward pass + - Gradient is 1/n for each element + - This is commonly used for loss computation + """ + ### BEGIN SOLUTION + # Forward pass + result_data = Tensor(np.mean(x.data.data)) + + # Create gradient function + def grad_fn(grad_output): + if x.requires_grad: + # Mean gradient: 1/n for each element + n = x.data.size + x_grad = Variable(grad_output.data.data * np.ones_like(x.data.data) / n) + x.backward(x_grad) + + return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn) + ### END SOLUTION + +# %% [markdown] +""" +## Step 9: Gradient Utilities and Helper Functions + +### Gradient Management +- **Gradient Clipping**: Prevent exploding gradients +- **Gradient Checking**: Verify gradient correctness +- **Parameter Collection**: Gather all parameters for optimization + +### Debugging Tools +- **Gradient Visualization**: Inspect gradient flow +- **Computational Graph**: Visualize the computation graph +- **Gradient Statistics**: Monitor gradient magnitudes +""" + +# %% nbgrader={"grade": false, "grade_id": "gradient-utilities", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def clip_gradients(variables: List[Variable], max_norm: float = 1.0) -> None: + """ + Clip gradients to prevent exploding gradients. + + Args: + variables: List of Variables to clip gradients for + max_norm: Maximum gradient norm allowed + + TODO: Implement gradient clipping. + + APPROACH: + 1. Compute total gradient norm across all variables + 2. If norm exceeds max_norm, scale all gradients down + 3. 
Modify gradients in-place + + MATHEMATICAL RULE: + If ||g|| > max_norm, then g := g * (max_norm / ||g||) + + EXAMPLE: + variables = [w1, w2, b1, b2] + clip_gradients(variables, max_norm=1.0) + + HINTS: + - Compute L2 norm of all gradients combined + - Scale factor = max_norm / total_norm + - Only clip if total_norm > max_norm + """ + ### BEGIN SOLUTION + # Compute total gradient norm + total_norm = 0.0 + for var in variables: + if var.grad is not None: + total_norm += np.sum(var.grad.data.data ** 2) + total_norm = np.sqrt(total_norm) + + # Clip if necessary + if total_norm > max_norm: + scale_factor = max_norm / total_norm + for var in variables: + if var.grad is not None: + var.grad.data._data *= scale_factor + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "collect-parameters", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def collect_parameters(*modules) -> List[Variable]: + """ + Collect all parameters from modules for optimization. + + Args: + *modules: Variable number of modules/objects with parameters + + Returns: + List of all Variables that require gradients + + TODO: Implement parameter collection. + + APPROACH: + 1. Iterate through all provided modules + 2. Find all Variable attributes that require gradients + 3. 
Return list of all such Variables + + EXAMPLE: + layer1 = SomeLayer() + layer2 = SomeLayer() + params = collect_parameters(layer1, layer2) + + HINTS: + - Use hasattr() and getattr() to find Variable attributes + - Check if attribute is Variable and requires_grad + - Handle different module types gracefully + """ + ### BEGIN SOLUTION + parameters = [] + for module in modules: + if hasattr(module, '__dict__'): + for attr_name, attr_value in module.__dict__.items(): + if isinstance(attr_value, Variable) and attr_value.requires_grad: + parameters.append(attr_value) + return parameters + ### END SOLUTION + +# %% nbgrader={"grade": false, "grade_id": "zero-gradients", "locked": false, "schema_version": 3, "solution": true, "task": false} +#| export +def zero_gradients(variables: List[Variable]) -> None: + """ + Zero out gradients for all variables. + + Args: + variables: List of Variables to zero gradients for + + TODO: Implement gradient zeroing. + + APPROACH: + 1. Iterate through all variables + 2. Call zero_grad() on each variable + 3. Handle None gradients gracefully + + EXAMPLE: + parameters = [w1, w2, b1, b2] + zero_gradients(parameters) + + HINTS: + - Use the zero_grad() method on each Variable + - Check if variable has gradients before zeroing + - This is typically called before each training step + """ + ### BEGIN SOLUTION + for var in variables: + if var.grad is not None: + var.zero_grad() + ### END SOLUTION + +# %% [markdown] +""" +## Step 10: Advanced Testing + +Let's test our advanced features and optimizations. 
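A useful companion to these tests is the gradient check mentioned in Step 9: compare each analytic rule against a central finite difference. The helper below is a standalone NumPy sketch (the name `numerical_gradient` is ours, not part of the module's API) covering the power, exponential, and logarithm rules implemented above.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx for a scalar function f at point x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Verify the analytic derivatives used by power(), exp(), and log()
checks = [
    (lambda v: v ** 3, lambda v: 3 * v ** 2, 2.0),  # power rule: d(x^3)/dx = 3x^2
    (np.exp,           np.exp,               1.0),  # d(e^x)/dx = e^x
    (np.log,           lambda v: 1.0 / v,    2.0),  # d(ln x)/dx = 1/x
]

for f, analytic, point in checks:
    estimate = numerical_gradient(f, point)
    assert abs(estimate - analytic(point)) < 1e-4, (estimate, analytic(point))
```

If an analytic gradient ever drifts from its numerical estimate by more than about `1e-4`, the corresponding `grad_fn` is the first place to look.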
+""" + +# %% nbgrader={"grade": true, "grade_id": "test-advanced-operations", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} +def test_advanced_operations(): + """Test advanced mathematical operations.""" + print("🔬 Testing advanced operations...") + + # Test power operation + print("📊 Testing power operation...") + x = Variable(2.0, requires_grad=True) + y = power(x, 3) # x^3 + + assert abs(y.data.data.item() - 8.0) < 1e-6, f"Power forward failed: expected 8.0, got {y.data.data.item()}" + + y.backward() + # Gradient: d(x^3)/dx = 3x^2 = 3 * 4 = 12 + assert abs(x.grad.data.data.item() - 12.0) < 1e-6, f"Power gradient failed: expected 12.0, got {x.grad.data.data.item()}" + print("✅ Power operation test passed!") + + # Test exponential operation + print("📊 Testing exponential operation...") + x = Variable(1.0, requires_grad=True) + y = exp(x) # e^x + + expected_exp = np.exp(1.0) + assert abs(y.data.data.item() - expected_exp) < 1e-6, f"Exp forward failed: expected {expected_exp}, got {y.data.data.item()}" + + y.backward() + # Gradient: d(e^x)/dx = e^x + assert abs(x.grad.data.data.item() - expected_exp) < 1e-6, f"Exp gradient failed: expected {expected_exp}, got {x.grad.data.data.item()}" + print("✅ Exponential operation test passed!") + + # Test logarithm operation + print("📊 Testing logarithm operation...") + x = Variable(2.0, requires_grad=True) + y = log(x) # ln(x) + + expected_log = np.log(2.0) + assert abs(y.data.data.item() - expected_log) < 1e-6, f"Log forward failed: expected {expected_log}, got {y.data.data.item()}" + + y.backward() + # Gradient: d(ln(x))/dx = 1/x = 1/2 = 0.5 + assert abs(x.grad.data.data.item() - 0.5) < 1e-6, f"Log gradient failed: expected 0.5, got {x.grad.data.data.item()}" + print("✅ Logarithm operation test passed!") + + # Test sum operation + print("📊 Testing sum operation...") + x = Variable([[1.0, 2.0], [3.0, 4.0]], requires_grad=True) + y = sum_all(x) # sum of all elements + + assert 
abs(y.data.data.item() - 10.0) < 1e-6, f"Sum forward failed: expected 10.0, got {y.data.data.item()}" + + y.backward() + # Gradient: all elements should be 1 + expected_grad = np.ones((2, 2)) + np.testing.assert_array_almost_equal(x.grad.data.data, expected_grad) + print("✅ Sum operation test passed!") + + # Test mean operation + print("📊 Testing mean operation...") + x = Variable([[1.0, 2.0], [3.0, 4.0]], requires_grad=True) + y = mean(x) # mean of all elements + + assert abs(y.data.data.item() - 2.5) < 1e-6, f"Mean forward failed: expected 2.5, got {y.data.data.item()}" + + y.backward() + # Gradient: all elements should be 1/4 = 0.25 + expected_grad = np.ones((2, 2)) * 0.25 + np.testing.assert_array_almost_equal(x.grad.data.data, expected_grad) + print("✅ Mean operation test passed!") + + print("🎉 All advanced operation tests passed!") + return True + +# Run the test +success = test_advanced_operations() + +# %% nbgrader={"grade": true, "grade_id": "test-gradient-utilities", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} +def test_gradient_utilities(): + """Test gradient utility functions.""" + print("🔬 Testing gradient utilities...") + + # Test gradient clipping + print("📊 Testing gradient clipping...") + x = Variable(1.0, requires_grad=True) + y = Variable(1.0, requires_grad=True) + + # Create large gradients + z = multiply(x, 10.0) # Large gradient for x + w = multiply(y, 10.0) # Large gradient for y + loss = add(z, w) + loss.backward() + + # Check gradients are large before clipping + assert abs(x.grad.data.data.item() - 10.0) < 1e-6 + assert abs(y.grad.data.data.item() - 10.0) < 1e-6 + + # Clip gradients + clip_gradients([x, y], max_norm=1.0) + + # Check gradients are clipped + total_norm = np.sqrt(x.grad.data.data.item()**2 + y.grad.data.data.item()**2) + assert abs(total_norm - 1.0) < 1e-6, f"Gradient clipping failed: total norm {total_norm}, expected 1.0" + print("✅ Gradient clipping test passed!") + + # Test zero 
gradients + print("📊 Testing zero gradients...") + # Gradients should be non-zero before zeroing + assert abs(x.grad.data.data.item()) > 1e-6 + assert abs(y.grad.data.data.item()) > 1e-6 + + # Zero gradients + zero_gradients([x, y]) + + # Check gradients are zero + assert abs(x.grad.data.data.item()) < 1e-6 + assert abs(y.grad.data.data.item()) < 1e-6 + print("✅ Zero gradients test passed!") + + print("🎉 All gradient utility tests passed!") + return True + +# Run the test +success = test_gradient_utilities() + +# %% [markdown] +""" +## Step 11: Complete ML Pipeline Example + +Let's demonstrate a complete machine learning pipeline using our autograd system. +""" + +# %% nbgrader={"grade": true, "grade_id": "test-complete-pipeline", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} +def test_complete_ml_pipeline(): + """Test complete ML pipeline with autograd.""" + print("🔬 Testing complete ML pipeline...") + + # Create a simple regression problem: y = 2x + 1 + noise + print("📊 Setting up regression problem...") + + # Training data + x_data = [1.0, 2.0, 3.0, 4.0, 5.0] + y_data = [3.1, 4.9, 7.2, 9.1, 10.8] # Approximately 2x + 1 with noise + + # Model parameters + w = Variable(0.1, requires_grad=True) # Weight + b = Variable(0.0, requires_grad=True) # Bias + + # Training loop + learning_rate = 0.01 + num_epochs = 100 + + print("📊 Training model...") + for epoch in range(num_epochs): + total_loss = Variable(0.0, requires_grad=False) + + # Forward pass for all data points + for x_val, y_val in zip(x_data, y_data): + x = Variable(x_val, requires_grad=False) + y_target = Variable(y_val, requires_grad=False) + + # Prediction: y_pred = w * x + b + y_pred = add(multiply(w, x), b) + + # Loss: MSE = (y_pred - y_target)^2 + diff = subtract(y_pred, y_target) + loss = multiply(diff, diff) + + # Accumulate loss + total_loss = add(total_loss, loss) + + # Backward pass + total_loss.backward() + + # Update parameters + w.data._data -= learning_rate 
* w.grad.data.data + b.data._data -= learning_rate * b.grad.data.data + + # Zero gradients for next iteration + zero_gradients([w, b]) + + # Print progress + if epoch % 20 == 0: + print(f" Epoch {epoch}: Loss = {total_loss.data.data.item():.4f}, w = {w.data.data.item():.4f}, b = {b.data.data.item():.4f}") + + # Check final parameters + print("📊 Checking final parameters...") + final_w = w.data.data.item() + final_b = b.data.data.item() + + # Should be close to true values: w=2, b=1 + assert abs(final_w - 2.0) < 0.5, f"Weight not learned correctly: expected ~2.0, got {final_w}" + assert abs(final_b - 1.0) < 0.5, f"Bias not learned correctly: expected ~1.0, got {final_b}" + + print(f"✅ Model learned: w = {final_w:.3f}, b = {final_b:.3f}") + print("✅ Complete ML pipeline test passed!") + + # Test prediction on new data + print("📊 Testing prediction on new data...") + x_test = Variable(6.0, requires_grad=False) + y_pred = add(multiply(w, x_test), b) + expected_pred = 2.0 * 6.0 + 1.0 # True function value + + print(f" Prediction for x=6: {y_pred.data.data.item():.3f} (expected ~{expected_pred})") + assert abs(y_pred.data.data.item() - expected_pred) < 1.0, "Prediction accuracy insufficient" + + print("🎉 Complete ML pipeline test passed!") + return True + +# Run the test +success = test_complete_ml_pipeline() \ No newline at end of file diff --git a/modules/source/07_autograd/tests/test_autograd.py b/modules/source/07_autograd/tests/test_autograd.py deleted file mode 100644 index b1f49810..00000000 --- a/modules/source/07_autograd/tests/test_autograd.py +++ /dev/null @@ -1,698 +0,0 @@ -""" -Test suite for the autograd module. -This tests the autograd implementations using mock classes to avoid cross-module dependencies. 
-""" - -import pytest -import numpy as np -import sys -import os - -# Add the module path for testing -sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..')) - -# Import the autograd module directly -from autograd_dev import Variable, add, multiply, subtract, divide, relu_with_grad, sigmoid_with_grad - - -class MockTensor: - """Mock Tensor class for testing autograd without dependencies.""" - - def __init__(self, data): - if isinstance(data, (int, float)): - self._data = np.array(data, dtype=np.float32) - elif isinstance(data, list): - self._data = np.array(data, dtype=np.float32) - elif isinstance(data, np.ndarray): - self._data = data.astype(np.float32) - else: - self._data = np.array(data, dtype=np.float32) - - @property - def data(self): - return self._data - - @property - def shape(self): - return self._data.shape - - @property - def size(self): - return self._data.size - - def __add__(self, other): - if isinstance(other, MockTensor): - return MockTensor(self._data + other._data) - else: - return MockTensor(self._data + other) - - def __mul__(self, other): - if isinstance(other, MockTensor): - return MockTensor(self._data * other._data) - else: - return MockTensor(self._data * other) - - def __sub__(self, other): - if isinstance(other, MockTensor): - return MockTensor(self._data - other._data) - else: - return MockTensor(self._data - other) - - def __truediv__(self, other): - if isinstance(other, MockTensor): - return MockTensor(self._data / other._data) - else: - return MockTensor(self._data / other) - - def item(self): - return self._data.item() - - -class TestVariableCreation: - """Test Variable creation and basic properties.""" - - def test_variable_from_scalar(self): - """Test creating Variable from scalar values.""" - # Float scalar - v1 = Variable(5.0) - assert v1.shape == () - assert v1.size == 1 - assert v1.requires_grad == True - assert v1.is_leaf == True - assert v1.grad is None - - # Integer scalar - v2 = Variable(42) - assert v2.shape 
== ()
-        assert v2.size == 1
-        assert abs(v2.data.data.item() - 42.0) < 1e-6
-
-    def test_variable_from_list(self):
-        """Test creating Variable from list."""
-        v = Variable([1.0, 2.0, 3.0])
-        assert v.shape == (3,)
-        assert v.size == 3
-        assert v.requires_grad == True
-        assert v.is_leaf == True
-        np.testing.assert_array_almost_equal(v.data.data, [1.0, 2.0, 3.0])
-
-    def test_variable_from_numpy(self):
-        """Test creating Variable from numpy array."""
-        arr = np.array([[1.0, 2.0], [3.0, 4.0]])
-        v = Variable(arr)
-        assert v.shape == (2, 2)
-        assert v.size == 4
-        np.testing.assert_array_almost_equal(v.data.data, arr)
-
-    def test_variable_requires_grad_flag(self):
-        """Test requires_grad flag functionality."""
-        v1 = Variable(5.0, requires_grad=True)
-        assert v1.requires_grad == True
-
-        v2 = Variable(5.0, requires_grad=False)
-        assert v2.requires_grad == False
-
-    def test_variable_with_grad_fn(self):
-        """Test Variable with gradient function (non-leaf)."""
-        def dummy_grad_fn(grad):
-            pass
-
-        v = Variable(5.0, requires_grad=True, grad_fn=dummy_grad_fn)
-        assert v.requires_grad == True
-        assert v.is_leaf == False
-        assert v.grad_fn == dummy_grad_fn
-
-    def test_variable_repr(self):
-        """Test string representation of Variable."""
-        v = Variable(5.0)
-        repr_str = repr(v)
-        assert 'Variable' in repr_str
-        assert 'requires_grad' in repr_str
-
-
-class TestBasicOperations:
-    """Test basic arithmetic operations with gradient tracking."""
-
-    def test_addition_operation(self):
-        """Test addition operation and gradients."""
-        x = Variable(2.0, requires_grad=True)
-        y = Variable(3.0, requires_grad=True)
-        z = add(x, y)
-
-        # Test forward pass
-        assert abs(z.data.data.item() - 5.0) < 1e-6
-        assert z.requires_grad == True
-        assert z.is_leaf == False
-
-        # Test backward pass
-        z.backward()
-        assert abs(x.grad.data.data.item() - 1.0) < 1e-6
-        assert abs(y.grad.data.data.item() - 1.0) < 1e-6
-
-    def test_multiplication_operation(self):
-        """Test multiplication operation and gradients."""
-        x = Variable(2.0, requires_grad=True)
-        y = Variable(3.0, requires_grad=True)
-        z = multiply(x, y)
-
-        # Test forward pass
-        assert abs(z.data.data.item() - 6.0) < 1e-6
-        assert z.requires_grad == True
-        assert z.is_leaf == False
-
-        # Test backward pass
-        z.backward()
-        assert abs(x.grad.data.data.item() - 3.0) < 1e-6  # dy/dx = y = 3
-        assert abs(y.grad.data.data.item() - 2.0) < 1e-6  # dy/dy = x = 2
-
-    def test_subtraction_operation(self):
-        """Test subtraction operation and gradients."""
-        x = Variable(5.0, requires_grad=True)
-        y = Variable(3.0, requires_grad=True)
-        z = subtract(x, y)
-
-        # Test forward pass
-        assert abs(z.data.data.item() - 2.0) < 1e-6
-        assert z.requires_grad == True
-        assert z.is_leaf == False
-
-        # Test backward pass
-        z.backward()
-        assert abs(x.grad.data.data.item() - 1.0) < 1e-6  # dz/dx = 1
-        assert abs(y.grad.data.data.item() - (-1.0)) < 1e-6  # dz/dy = -1
-
-    def test_division_operation(self):
-        """Test division operation and gradients."""
-        x = Variable(6.0, requires_grad=True)
-        y = Variable(2.0, requires_grad=True)
-        z = divide(x, y)
-
-        # Test forward pass
-        assert abs(z.data.data.item() - 3.0) < 1e-6
-        assert z.requires_grad == True
-        assert z.is_leaf == False
-
-        # Test backward pass
-        z.backward()
-        assert abs(x.grad.data.data.item() - 0.5) < 1e-6  # dz/dx = 1/y = 1/2
-        assert abs(y.grad.data.data.item() - (-1.5)) < 1e-6  # dz/dy = -x/y² = -6/4
-
-    def test_operations_with_constants(self):
-        """Test operations with constant values."""
-        x = Variable(2.0, requires_grad=True)
-
-        # Addition with constant
-        z1 = add(x, 3.0)
-        assert abs(z1.data.data.item() - 5.0) < 1e-6
-        z1.backward()
-        assert abs(x.grad.data.data.item() - 1.0) < 1e-6
-
-        # Reset gradient
-        x.zero_grad()
-
-        # Multiplication with constant
-        z2 = multiply(x, 4.0)
-        assert abs(z2.data.data.item() - 8.0) < 1e-6
-        z2.backward()
-        assert abs(x.grad.data.data.item() - 4.0) < 1e-6
-
-    def test_no_grad_propagation(self):
-        """Test that gradients don't propagate when requires_grad=False."""
-        x = Variable(2.0, requires_grad=False)
-        y = Variable(3.0, requires_grad=True)
-        z = add(x, y)
-
-        z.backward()
-        assert x.grad is None  # No gradient for x
-        assert abs(y.grad.data.data.item() - 1.0) < 1e-6
-
-
-class TestChainRule:
-    """Test chain rule implementation with complex expressions."""
-
-    def test_simple_chain_rule(self):
-        """Test f(x, y) = (x + y) * (x - y) = x² - y²."""
-        x = Variable(3.0, requires_grad=True)
-        y = Variable(2.0, requires_grad=True)
-
-        # Forward pass
-        sum_xy = add(x, y)
-        diff_xy = subtract(x, y)
-        result = multiply(sum_xy, diff_xy)
-
-        # Check forward pass
-        assert abs(result.data.data.item() - 5.0) < 1e-6  # (3+2)*(3-2) = 5
-
-        # Backward pass
-        result.backward()
-
-        # Check gradients: df/dx = 2x = 6, df/dy = -2y = -4
-        assert abs(x.grad.data.data.item() - 6.0) < 1e-6
-        assert abs(y.grad.data.data.item() - (-4.0)) < 1e-6
-
-    def test_cubic_function(self):
-        """Test f(x) = x³ using x * x * x."""
-        x = Variable(2.0, requires_grad=True)
-
-        # Forward pass
-        x_squared = multiply(x, x)
-        x_cubed = multiply(x_squared, x)
-
-        # Check forward pass
-        assert abs(x_cubed.data.data.item() - 8.0) < 1e-6  # 2³ = 8
-
-        # Backward pass
-        x_cubed.backward()
-
-        # Check gradient: df/dx = 3x² = 12
-        assert abs(x.grad.data.data.item() - 12.0) < 1e-6
-
-    def test_complex_expression(self):
-        """Test f(x, y) = (x * y) + (x / y)."""
-        x = Variable(4.0, requires_grad=True)
-        y = Variable(2.0, requires_grad=True)
-
-        # Forward pass
-        product = multiply(x, y)
-        quotient = divide(x, y)
-        result = add(product, quotient)
-
-        # Check forward pass: (4*2) + (4/2) = 8 + 2 = 10
-        assert abs(result.data.data.item() - 10.0) < 1e-6
-
-        # Backward pass
-        result.backward()
-
-        # Check gradients: df/dx = y + 1/y = 2 + 0.5 = 2.5
-        # df/dy = x - x/y² = 4 - 4/4 = 3
-        assert abs(x.grad.data.data.item() - 2.5) < 1e-6
-        assert abs(y.grad.data.data.item() - 3.0) < 1e-6
-
-    def test_gradient_accumulation(self):
-        """Test that gradients accumulate correctly."""
-        x = Variable(2.0, requires_grad=True)
-
-        # First computation
-        y1 = multiply(x, 3.0)
-        y1.backward()
-        first_grad = x.grad.data.data.item()
-
-        # Second computation (should accumulate)
-        y2 = multiply(x, 4.0)
-        y2.backward()
-        second_grad = x.grad.data.data.item()
-
-        # Gradient should accumulate: 3 + 4 = 7
-        assert abs(second_grad - 7.0) < 1e-6
-
-    def test_zero_grad_functionality(self):
-        """Test zero_grad functionality."""
-        x = Variable(2.0, requires_grad=True)
-        y = multiply(x, 3.0)
-        y.backward()
-
-        # Check gradient exists
-        assert x.grad is not None
-        assert abs(x.grad.data.data.item() - 3.0) < 1e-6
-
-        # Zero the gradient
-        x.zero_grad()
-        assert abs(x.grad.data.data.item() - 0.0) < 1e-6
-
-
-class TestActivationGradients:
-    """Test activation functions with gradient computation."""
-
-    def test_relu_activation(self):
-        """Test ReLU activation and its gradient."""
-        # Test positive input
-        x1 = Variable(2.0, requires_grad=True)
-        y1 = relu_with_grad(x1)
-
-        assert abs(y1.data.data.item() - 2.0) < 1e-6  # ReLU(2) = 2
-        y1.backward()
-        assert abs(x1.grad.data.data.item() - 1.0) < 1e-6  # gradient = 1 for x > 0
-
-        # Test negative input
-        x2 = Variable(-1.0, requires_grad=True)
-        y2 = relu_with_grad(x2)
-
-        assert abs(y2.data.data.item() - 0.0) < 1e-6  # ReLU(-1) = 0
-        y2.backward()
-        assert abs(x2.grad.data.data.item() - 0.0) < 1e-6  # gradient = 0 for x < 0
-
-        # Test zero input
-        x3 = Variable(0.0, requires_grad=True)
-        y3 = relu_with_grad(x3)
-
-        assert abs(y3.data.data.item() - 0.0) < 1e-6  # ReLU(0) = 0
-        y3.backward()
-        assert abs(x3.grad.data.data.item() - 0.0) < 1e-6  # gradient = 0 for x = 0
-
-    def test_sigmoid_activation(self):
-        """Test Sigmoid activation and its gradient."""
-        # Test zero input
-        x1 = Variable(0.0, requires_grad=True)
-        y1 = sigmoid_with_grad(x1)
-
-        assert abs(y1.data.data.item() - 0.5) < 1e-6  # sigmoid(0) = 0.5
-        y1.backward()
-        assert abs(x1.grad.data.data.item() - 0.25) < 1e-6  # gradient = 0.5 * 0.5 = 0.25
-
-        # Test positive input
-        x2 = Variable(2.0, requires_grad=True)
-        y2 = sigmoid_with_grad(x2)
-
-        expected_sigmoid = 1.0 / (1.0 + np.exp(-2.0))
-        assert abs(y2.data.data.item() - expected_sigmoid) < 1e-6
-
-        y2.backward()
-        expected_grad = expected_sigmoid * (1.0 - expected_sigmoid)
-        assert abs(x2.grad.data.data.item() - expected_grad) < 1e-6
-
-        # Test negative input
-        x3 = Variable(-1.0, requires_grad=True)
-        y3 = sigmoid_with_grad(x3)
-
-        expected_sigmoid = 1.0 / (1.0 + np.exp(1.0))
-        assert abs(y3.data.data.item() - expected_sigmoid) < 1e-6
-
-        y3.backward()
-        expected_grad = expected_sigmoid * (1.0 - expected_sigmoid)
-        assert abs(x3.grad.data.data.item() - expected_grad) < 1e-6
-
-    def test_activation_chaining(self):
-        """Test chaining activation functions."""
-        x = Variable(1.0, requires_grad=True)
-
-        # Chain: x -> ReLU -> Sigmoid
-        relu_out = relu_with_grad(x)
-        sigmoid_out = sigmoid_with_grad(relu_out)
-
-        # Forward pass
-        expected_relu = 1.0  # ReLU(1) = 1
-        expected_sigmoid = 1.0 / (1.0 + np.exp(-1.0))  # sigmoid(1)
-
-        assert abs(relu_out.data.data.item() - expected_relu) < 1e-6
-        assert abs(sigmoid_out.data.data.item() - expected_sigmoid) < 1e-6
-
-        # Backward pass
-        sigmoid_out.backward()
-
-        # Check that gradient flows through both activations
-        assert x.grad is not None
-        assert abs(x.grad.data.data.item()) > 1e-6  # Should have non-zero gradient
-
-
-class TestNeuralNetworkScenarios:
-    """Test autograd in realistic neural network scenarios."""
-
-    def test_simple_linear_layer(self):
-        """Test simple linear transformation: y = Wx + b."""
-        # Input
-        x = Variable(2.0, requires_grad=True)
-
-        # Parameters
-        w = Variable(0.5, requires_grad=True)
-        b = Variable(0.1, requires_grad=True)
-
-        # Forward pass
-        linear_out = add(multiply(x, w), b)  # y = x*w + b = 2*0.5 + 0.1 = 1.1
-
-        assert abs(linear_out.data.data.item() - 1.1) < 1e-6
-
-        # Backward pass
-        linear_out.backward()
-
-        # Check gradients
-        assert abs(x.grad.data.data.item() - 0.5) < 1e-6  # dy/dx = w = 0.5
-        assert abs(w.grad.data.data.item() - 2.0) < 1e-6  # dy/dw = x = 2.0
-        assert abs(b.grad.data.data.item() - 1.0) < 1e-6  # dy/db = 1 = 1.0
-
-    def test_two_layer_network(self):
-        """Test two-layer neural network."""
-        # Input
-        x = Variable(1.0, requires_grad=True)
-
-        # Layer 1 parameters
-        w1 = Variable(2.0, requires_grad=True)
-        b1 = Variable(0.5, requires_grad=True)
-
-        # Layer 2 parameters
-        w2 = Variable(1.5, requires_grad=True)
-        b2 = Variable(0.2, requires_grad=True)
-
-        # Forward pass
-        # Layer 1: h = x*w1 + b1 = 1*2 + 0.5 = 2.5
-        h = add(multiply(x, w1), b1)
-        # ReLU activation
-        h_relu = relu_with_grad(h)  # ReLU(2.5) = 2.5
-        # Layer 2: y = h*w2 + b2 = 2.5*1.5 + 0.2 = 3.95
-        y = add(multiply(h_relu, w2), b2)
-
-        assert abs(y.data.data.item() - 3.95) < 1e-6
-
-        # Backward pass
-        y.backward()
-
-        # Check that all parameters have gradients
-        assert x.grad is not None
-        assert w1.grad is not None
-        assert b1.grad is not None
-        assert w2.grad is not None
-        assert b2.grad is not None
-
-        # Check specific gradient values
-        assert abs(b2.grad.data.data.item() - 1.0) < 1e-6  # dy/db2 = 1
-        assert abs(w2.grad.data.data.item() - 2.5) < 1e-6  # dy/dw2 = h_relu = 2.5
-        assert abs(b1.grad.data.data.item() - 1.5) < 1e-6  # dy/db1 = w2 = 1.5
-        assert abs(w1.grad.data.data.item() - 1.5) < 1e-6  # dy/dw1 = x * w2 = 1 * 1.5
-        assert abs(x.grad.data.data.item() - 3.0) < 1e-6  # dy/dx = w1 * w2 = 2 * 1.5
-
-    def test_loss_computation(self):
-        """Test loss computation with gradients."""
-        # Prediction and target
-        pred = Variable(3.0, requires_grad=True)
-        target = Variable(2.0, requires_grad=False)
-
-        # Mean squared error: loss = (pred - target)²
-        diff = subtract(pred, target)  # 3 - 2 = 1
-        loss = multiply(diff, diff)  # 1² = 1
-
-        assert abs(loss.data.data.item() - 1.0) < 1e-6
-
-        # Backward pass
-        loss.backward()
-
-        # Check gradient: d_loss/d_pred = 2 * (pred - target) = 2 * 1 = 2
-        assert abs(pred.grad.data.data.item() - 2.0) < 1e-6
-        assert target.grad is None  # No gradient for target
-
-    def test_batch_processing_simulation(self):
-        """Test simulation of batch processing."""
-        # Simulate batch of 3 samples
-        x1 = Variable(1.0, requires_grad=True)
-        x2 = Variable(2.0, requires_grad=True)
-        x3 = Variable(3.0, requires_grad=True)
-
-        # Shared parameters
-        w = Variable(0.5, requires_grad=True)
-        b = Variable(0.1, requires_grad=True)
-
-        # Forward pass for each sample
-        y1 = add(multiply(x1, w), b)  # 1*0.5 + 0.1 = 0.6
-        y2 = add(multiply(x2, w), b)  # 2*0.5 + 0.1 = 1.1
-        y3 = add(multiply(x3, w), b)  # 3*0.5 + 0.1 = 1.6
-
-        # Compute batch loss (sum of individual losses)
-        loss1 = multiply(y1, y1)  # 0.6² = 0.36
-        loss2 = multiply(y2, y2)  # 1.1² = 1.21
-        loss3 = multiply(y3, y3)  # 1.6² = 2.56
-
-        batch_loss = add(add(loss1, loss2), loss3)  # 0.36 + 1.21 + 2.56 = 4.13
-
-        assert abs(batch_loss.data.data.item() - 4.13) < 1e-6
-
-        # Backward pass
-        batch_loss.backward()
-
-        # Check that gradients accumulated for shared parameters
-        assert w.grad is not None
-        assert b.grad is not None
-
-        # w gradient should be sum of individual contributions
-        # dL/dw = 2*y1*x1 + 2*y2*x2 + 2*y3*x3 = 2*(0.6*1 + 1.1*2 + 1.6*3) = 2*7.6 = 15.2
-        expected_w_grad = 2 * (0.6*1 + 1.1*2 + 1.6*3)
-        assert abs(w.grad.data.data.item() - expected_w_grad) < 1e-6
-
-        # b gradient should be sum of individual contributions
-        # dL/db = 2*y1 + 2*y2 + 2*y3 = 2*(0.6 + 1.1 + 1.6) = 2*3.3 = 6.6
-        expected_b_grad = 2 * (0.6 + 1.1 + 1.6)
-        assert abs(b.grad.data.data.item() - expected_b_grad) < 1e-6
-
-
-class TestEdgeCases:
-    """Test edge cases and error conditions."""
-
-    def test_zero_division_handling(self):
-        """Test division by zero handling."""
-        x = Variable(1.0, requires_grad=True)
-        y = Variable(0.0, requires_grad=True)
-
-        # This should not crash but may produce inf/nan
-        z = divide(x, y)
-
-        # Check that the operation completes
-        assert z.data.data.item() == np.inf or np.isnan(z.data.data.item())
-
-    def test_large_gradient_values(self):
-        """Test handling of large gradient values."""
-        x = Variable(100.0, requires_grad=True)
-        y = Variable(100.0, requires_grad=True)
-
-        # Large multiplication
-        z = multiply(x, y)  # 100 * 100 = 10000
-        z.backward()
-
-        # Gradients should be large but finite
-        assert np.isfinite(x.grad.data.data.item())
-        assert np.isfinite(y.grad.data.data.item())
-        assert abs(x.grad.data.data.item() - 100.0) < 1e-6
-        assert abs(y.grad.data.data.item() - 100.0) < 1e-6
-
-    def test_very_small_values(self):
-        """Test handling of very small values."""
-        x = Variable(1e-10, requires_grad=True)
-        y = Variable(2e-10, requires_grad=True)
-
-        z = add(x, y)
-        z.backward()
-
-        # Gradients should still be computed correctly
-        assert abs(x.grad.data.data.item() - 1.0) < 1e-6
-        assert abs(y.grad.data.data.item() - 1.0) < 1e-6
-
-    def test_mixed_requires_grad(self):
-        """Test operations with mixed requires_grad settings."""
-        x = Variable(2.0, requires_grad=True)
-        y = Variable(3.0, requires_grad=False)
-
-        z = multiply(x, y)
-
-        # Result should require gradients
-        assert z.requires_grad == True
-
-        z.backward()
-
-        # Only x should have gradients
-        assert x.grad is not None
-        assert y.grad is None
-        assert abs(x.grad.data.data.item() - 3.0) < 1e-6
-
-
-# Integration tests that combine multiple concepts
-class TestIntegration:
-    """Integration tests combining multiple autograd concepts."""
-
-    def test_complete_training_step(self):
-        """Test a complete training step simulation."""
-        # Model parameters
-        w1 = Variable(0.1, requires_grad=True)
-        b1 = Variable(0.0, requires_grad=True)
-        w2 = Variable(0.2, requires_grad=True)
-        b2 = Variable(0.0, requires_grad=True)
-
-        # Training data
-        x = Variable(1.5, requires_grad=False)
-        target = Variable(2.0, requires_grad=False)
-
-        # Forward pass
-        h1 = add(multiply(x, w1), b1)  # Linear layer 1
-        h1_relu = relu_with_grad(h1)  # ReLU activation
-        output = add(multiply(h1_relu, w2), b2)  # Linear layer 2
-
-        # Loss computation (MSE)
-        diff = subtract(output, target)
-        loss = multiply(diff, diff)
-
-        # Backward pass
-        loss.backward()
-
-        # Check that all parameters have gradients
-        assert w1.grad is not None
-        assert b1.grad is not None
-        assert w2.grad is not None
-        assert b2.grad is not None
-
-        # Simulate parameter update (gradient descent)
-        learning_rate = 0.01
-
-        # Save old parameter values
-        old_w1 = w1.data.data.item()
-        old_b1 = b1.data.data.item()
-        old_w2 = w2.data.data.item()
-        old_b2 = b2.data.data.item()
-
-        # Update parameters: param = param - lr * grad
-        w1.data._data -= learning_rate * w1.grad.data.data
-        b1.data._data -= learning_rate * b1.grad.data.data
-        w2.data._data -= learning_rate * w2.grad.data.data
-        b2.data._data -= learning_rate * b2.grad.data.data
-
-        # Check that parameters actually changed
-        assert abs(w1.data.data.item() - old_w1) > 1e-6
-        assert abs(b1.data.data.item() - old_b1) > 1e-6
-        assert abs(w2.data.data.item() - old_w2) > 1e-6
-        assert abs(b2.data.data.item() - old_b2) > 1e-6
-
-    def test_multi_output_gradients(self):
-        """Test gradients when multiple outputs depend on same input."""
-        x = Variable(2.0, requires_grad=True)
-
-        # Create multiple outputs from same input
-        y1 = multiply(x, 3.0)  # y1 = 3x
-        y2 = multiply(x, x)  # y2 = x²
-
-        # Combine outputs
-        combined = add(y1, y2)  # combined = 3x + x²
-
-        combined.backward()
-
-        # Gradient should be sum of individual contributions
-        # d(combined)/dx = d(3x)/dx + d(x²)/dx = 3 + 2x = 3 + 2*2 = 7
-        assert abs(x.grad.data.data.item() - 7.0) < 1e-6
-
-    def test_gradient_flow_through_complex_network(self):
-        """Test gradient flow through a more complex network."""
-        # Input
-        x = Variable(1.0, requires_grad=True)
-
-        # Create a diamond-shaped computation graph
-        #     x
-        #    / \
-        #   a   b
-        #    \ /
-        #     c
-
-        a = multiply(x, 2.0)  # a = 2x
-        b = add(x, 1.0)  # b = x + 1
-        c = multiply(a, b)  # c = a * b = 2x * (x + 1) = 2x² + 2x
-
-        # Expected: c = 2x² + 2x, so dc/dx = 4x + 2 = 4*1 + 2 = 6
-        c.backward()
-
-        assert abs(x.grad.data.data.item() - 6.0) < 1e-6
-
-    def test_nested_function_composition(self):
-        """Test deeply nested function composition."""
-        x = Variable(2.0, requires_grad=True)
-
-        # Create nested composition: f(g(h(x)))
-        h = multiply(x, 2.0)  # h(x) = 2x
-        g = add(h, 1.0)  # g(h(x)) = 2x + 1
-        f = multiply(g, g)  # f(g(h(x))) = (2x + 1)²
-
-        # Expected: f = (2x + 1)², so df/dx = 2(2x + 1) * 2 = 4(2x + 1) = 4(2*2 + 1) = 20
-        f.backward()
-
-        assert abs(x.grad.data.data.item() - 20.0) < 1e-6
\ No newline at end of file
diff --git a/modules/source/08_optimizers/optimizers_dev.py b/modules/source/08_optimizers/optimizers_dev.py
new file mode 100644
index 00000000..0519ecba
--- /dev/null
+++ b/modules/source/08_optimizers/optimizers_dev.py
@@ -0,0 +1 @@
+ 
\ No newline at end of file
diff --git a/pyproject.toml b/pyproject.toml
index db23b06d..3f9f5bf3 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -84,7 +84,6 @@ addopts = [
 ]
 testpaths = [
     "tests",
-    "modules/*/tests",
 ]
 python_files = ["test_*.py"]
 python_classes = ["Test*"]
diff --git a/tito/commands/status.py b/tito/commands/status.py
index 2051a708..de7d3056 100644
--- a/tito/commands/status.py
+++ b/tito/commands/status.py
@@ -52,9 +52,9 @@ class StatusCommand(BaseCommand):
         console = self.console
 
         # Scan modules directory
-        modules_dir = Path("modules")
+        modules_dir = Path("modules/source")
         if not modules_dir.exists():
-            console.print(Panel("[red]❌ modules/ directory not found[/red]",
+            console.print(Panel("[red]❌ modules/source/ directory not found[/red]",
                           title="Error", border_style="red"))
             return 1
 
@@ -150,14 +150,21 @@
         # Check for required files
         dev_file = module_dir / f"{module_name}_dev.py"
-        tests_dir = module_dir / "tests"
-        test_file = tests_dir / f"test_{module_name}.py"
         readme_file = module_dir / "README.md"
         metadata_file = module_dir / "module.yaml"
 
+        # Check for tests in main tests directory
+        # Extract short name from module directory name (e.g., "01_tensor" -> "tensor")
+        if module_name.startswith(tuple(f"{i:02d}_" for i in range(100))):
+            short_name = module_name[3:]  # Remove "00_" prefix
+        else:
+            short_name = module_name
+
+        main_test_file = Path("tests") / f"test_{short_name}.py"
+
         status = {
             'dev_file': dev_file.exists(),
-            'tests': test_file.exists(),
+            'tests': main_test_file.exists(),
             'readme': readme_file.exists(),
             'metadata_file': metadata_file.exists(),
         }
@@ -187,7 +194,13 @@
             return 'in_progress'
 
         # If tests exist, run them to determine status
-        test_file = f"modules/{module_name}/tests/test_{module_name}.py"
+        # Extract short name from module directory name (e.g., "01_tensor" -> "tensor")
+        if module_name.startswith(tuple(f"{i:02d}_" for i in range(100))):
+            short_name = module_name[3:]  # Remove "00_" prefix
+        else:
+            short_name = module_name
+
+        test_file = f"tests/test_{short_name}.py"
         try:
             # Run pytest quietly to check if tests pass
             result = subprocess.run(