diff --git a/modules/05_autograd/autograd_dev.py b/modules/05_autograd/autograd_dev.py
index 33617407..1d59bec3 100644
--- a/modules/05_autograd/autograd_dev.py
+++ b/modules/05_autograd/autograd_dev.py
@@ -12,35 +12,37 @@
 """
 # Autograd - Automatic Differentiation Engine
 
-Welcome to Autograd! You'll implement the automatic differentiation engine that makes neural network training possible by automatically computing gradients through computational graphs.
+Welcome to Autograd! You'll build automatic differentiation step by step, giving your Tensor class the ability to compute gradients automatically for neural network training.
 
 ## 🔗 Building on Previous Learning
 **What You Built Before**:
-- Module 01 (Tensor): Pure data structures with ZERO gradient contamination
-- Module 02-04: Built on pure tensors with clean mathematical operations
+- Module 01 (Setup): Development environment ready
+- Module 02 (Tensor): Complete tensor operations with math
+- Module 03 (Activations): Functions that add intelligence to networks
+- Module 04 (Losses): Functions that measure learning progress
 
-**What's Working**: You have a complete pure tensor system with arithmetic operations!
+**What's Working**: Your tensors can do math, activations, and loss calculations perfectly!
 
-**The Gap**: Your tensors are "gradient-blind" - they can't track gradients for training.
+**The Gap**: Your tensors can't learn - they have no memory of how gradients flow backward through computations.
 
-**This Module's Solution**: Use Python's decorator pattern to enhance your existing Tensor class with gradient tracking, WITHOUT breaking any existing code.
+**This Module's Solution**: Enhance your existing Tensor class with gradient tracking abilities, step by step.
 
 **Connection Map**:
 ```
-Pure Tensors → Enhanced Tensors → Training
-(Module 01)    (+ Autograd)      (Optimizers)
+Math Operations → Smart Operations → Learning Operations
+(Pure Tensors)   (+ Autograd)      (+ Optimizers)
 ```
 
 ## Learning Objectives
-1. **Python Mastery**: Advanced metaprogramming with decorators
-2. **Backward Compatibility**: Enhance without breaking existing functionality
-3. **Mathematical Foundation**: Chain rule application in computational graphs
-4. **Systems Design**: Clean enhancement patterns in software engineering
+1. **Incremental Enhancement**: Add gradient tracking without breaking existing code
+2. **Chain Rule Mastery**: Understand how gradients flow through complex expressions
+3. **Systems Understanding**: Memory and performance implications of automatic differentiation
+4. **Professional Skills**: How to enhance software systems safely
 
 ## Build → Test → Use
-1. **Build**: Decorator that adds gradient tracking to existing Tensor class
-2. **Test**: Verify ALL previous code still works + new gradient features
-3. **Use**: Enable gradient-based optimization on familiar tensor operations
+1. **Build**: Six incremental steps, each immediately testable
+2. **Test**: Frequent validation with clear success indicators
+3. **Use**: Enable gradient-based optimization for training
 
 ## 📦 Where This Code Lives in the Final Package
 
@@ -49,18 +51,20 @@ Pure Tensors → Enhanced Tensors → Training
 
 ```python
 # Final package structure:
-from tinytorch.core.autograd import add_autograd  # This module's decorator
-from tinytorch.core.tensor import Tensor          # Pure tensor from Module 01
+from tinytorch.core.autograd import Tensor  # Enhanced Tensor with gradients
+from tinytorch.core.tensor import Tensor as BaseTensor  # Your original pure Tensor (backup), aliased so it doesn't shadow the enhanced one
 
-# Apply enhancement:
-Tensor = add_autograd(Tensor)  # Now your Tensor has gradient capabilities!
+# Your enhanced Tensor can do everything:
+x = Tensor([2.0], requires_grad=True)       # New gradient capability (single element, so backward() needs no argument)
+y = x + 2                                   # Same math operations
+y.backward()                                # New gradient computation
 ```
 
 **Why this matters:**
-- **Learning:** Experience advanced Python patterns and clean software design
-- **Backward Compatibility:** All Module 01-04 code works unchanged
-- **Professional Practice:** How real systems add features without breaking existing code
-- **Educational Clarity:** See exactly how gradient tracking enhances pure tensors
+- **Learning:** Experience incremental software enhancement with immediate feedback
+- **Production:** How real ML systems add features without breaking existing functionality
+- **Professional Practice:** Safe software evolution patterns used in industry
+- **Integration:** Your enhanced Tensor works with all previous modules
 """
 
 # %%
@@ -69,1199 +73,1247 @@ Tensor = add_autograd(Tensor)  # Now your Tensor has gradient capabilities!
 #| export
 import numpy as np
 import sys
-from typing import Union, List, Optional, Callable
+from typing import Union, List, Optional, Callable, Any
 
-# Import the PURE Tensor class from Module 01
-# This is the clean, gradient-free tensor we'll enhance
+# Import the pure Tensor class from Module 02
 try:
-    from tinytorch.core.tensor import Tensor
+    from tinytorch.core.tensor import Tensor as BaseTensor
 except ImportError:
     # For development, import from local modules
     import os
-    sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
-    from tensor_dev import Tensor
+    sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '02_tensor'))
+    from tensor_dev import Tensor as BaseTensor
 
 # %%
 print("🔥 TinyTorch Autograd Module")
 print(f"NumPy version: {np.__version__}")
 print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
-print("Ready to build automatic differentiation!")
+print("Ready to enhance Tensor with gradients!")
 
 # %% [markdown]
 """
-## Python Metaprogramming: The Decorator Pattern
+## Step 1: Teaching Our Tensor to Remember Gradients
 
-### The Challenge: Enhancing Existing Classes Without Breaking Code
+Our Tensor class from Module 02 is perfect for storing data and doing math. But for training neural networks, we need it to remember how gradients flow backward through computations.
 
-You've built a beautiful, clean Tensor class in Module 01. All your code from Modules 02-04 depends on it working exactly as designed. But now you need gradient tracking.
+Think of it like teaching someone to remember the steps of a recipe so they can explain it later to others.
 
-**Wrong Approach**: Modify the Tensor class directly
-- ❌ Breaks existing code
-- ❌ Contaminates pure mathematical operations
-- ❌ Violates single responsibility principle
+### What We're Adding
 
-**Right Approach**: Use Python's decorator pattern
-- ✅ Enhance without modifying original class
-- ✅ Perfect backward compatibility
-- ✅ Clean separation of concerns
+We need three pieces of memory for our Tensor:
 
-### The Decorator Pattern in Action
+1. **Should I remember?** (`requires_grad`) - Like asking "should I pay attention to gradients?"
+2. **What did I learn?** (`grad`) - The accumulated gradient information
+3. **How do I teach others?** (`grad_fn`) - The function that passes gradients back to the tensors this one was computed from
 
-```python
-# Your original pure Tensor class
-class Tensor:
-    def __add__(self, other):
-        return Tensor(self.data + other.data)  # Pure math, no gradients
+These three attributes will transform our mathematical Tensor into a learning-capable Tensor.
 
-# Decorator adds gradient capabilities
-@add_autograd
-class Tensor:  # Same class, now enhanced!
-    def __add__(self, other):  # Enhanced method
-        result = original_add(self, other)  # Original behavior preserved
-        # + gradient tracking added seamlessly
-        return result
-```
+### Why Start Here?
 
-**Key Insight**: Decorators let you enhance classes by wrapping their methods, preserving original functionality while adding new capabilities.
+Before we can compute any gradients, we need places to store them. This is the foundation - like preparing notebooks before a lecture.
 """
 
-# %% [markdown]
-"""
-## Implementation: The add_autograd Decorator
-
-🏗️ **Design Goal**: Transform pure Tensor class into gradient-capable version
-🎯 **Backward Compatibility**: All existing Tensor code continues to work unchanged
-📐 **Clean Enhancement**: Gradient tracking added without polluting core math operations
-
-### The Decorator's Mission
-
-The `add_autograd` decorator will:
-1. **Save original methods**: Store pure mathematical implementations
-2. **Enhance constructor**: Add `requires_grad` parameter and gradient storage
-3. **Wrap operations**: Intercept `__add__`, `__mul__`, etc. to build computation graphs
-4. **Add new methods**: Include `backward()` for gradient computation
-5. **Preserve semantics**: Existing code works exactly as before
-
-### Before vs After Enhancement
-
-```python
-# Before: Pure tensor (Module 01)
-x = Tensor([2.0])
-y = Tensor([3.0])
-z = x + y  # Result: Tensor([5.0]) - pure math
-
-# After: Enhanced tensor (this module)
-x = Tensor([2.0], requires_grad=True)  # New optional parameter
-y = Tensor([3.0], requires_grad=True)
-z = x + y  # Result: Tensor([5.0]) - same math + gradient tracking
-z.backward()  # New capability!
-print(x.grad)  # [1.0] - gradients computed automatically
-```
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "add-autograd-decorator", "solution": true}
+# %% nbgrader={"grade": false, "grade_id": "tensor-gradient-attributes", "solution": true}
 #| export
-def add_autograd(cls):
+class Tensor(BaseTensor):
     """
-    Decorator that adds gradient tracking to existing Tensor class.
+    Enhanced Tensor with gradient tracking capabilities.
 
-    This transforms a pure Tensor class into one capable of automatic differentiation
-    while preserving 100% backward compatibility.
-
-    TODO: Implement decorator that enhances Tensor class with gradient tracking
-
-    APPROACH:
-    1. Save original methods from pure Tensor class
-    2. Create new __init__ that adds gradient parameters
-    3. Wrap arithmetic operations to build computation graphs
-    4. Add backward() method for gradient computation
-    5. Replace methods on the class and return enhanced class
-
-    EXAMPLE:
-    >>> # Apply decorator to pure Tensor class
-    >>> Tensor = add_autograd(Tensor)
-    >>>
-    >>> # Now Tensor has gradient capabilities!
-    >>> x = Tensor([2.0], requires_grad=True)
-    >>> y = Tensor([3.0], requires_grad=True)
-    >>> z = x * y
-    >>> z.backward()
-    >>> print(x.grad)  # [3.0]
-    >>> print(y.grad)  # [2.0]
-
-    HINTS:
-    - Store original methods before replacing them
-    - New methods should call original methods first
-    - Only add gradient tracking when requires_grad=True
-    - Preserve all original functionality
+    Inherits all functionality from BaseTensor and adds gradient memory.
     """
-    ### BEGIN SOLUTION
-    # Store original methods from pure Tensor class
-    original_init = cls.__init__
-    original_add = cls.__add__
-    original_mul = cls.__mul__
-    original_sub = cls.__sub__ if hasattr(cls, '__sub__') else None
-    original_matmul = cls.__matmul__ if hasattr(cls, '__matmul__') else None
 
-    def new_init(self, data, dtype=None, requires_grad=False):
-        """Enhanced constructor with gradient tracking support."""
-        # Call original constructor to preserve all existing functionality
-        original_init(self, data, dtype)
+    def __init__(self, data, dtype=None, requires_grad=False):
+        """
+        Initialize Tensor with gradient tracking support.
+
+        TODO: Add gradient tracking attributes to existing Tensor
+
+        APPROACH:
+        1. Call parent __init__ to preserve all existing functionality
+        2. Add requires_grad boolean for gradient tracking control
+        3. Add grad attribute to store accumulated gradients (starts as None)
+        4. Add grad_fn attribute to store backward function (starts as None)
+
+        EXAMPLE:
+        >>> t = Tensor([1, 2, 3], requires_grad=True)
+        >>> print(t.requires_grad)  # True - ready to track gradients
+        >>> print(t.grad)          # None - no gradients accumulated yet
+        >>> print(t.grad_fn)       # None - no backward function yet
+
+        HINT: This is just storage - we're not computing anything yet
+        """
+        ### BEGIN SOLUTION
+        # Call parent constructor to preserve all existing functionality
+        super().__init__(data, dtype)
 
         # Add gradient tracking attributes
         self.requires_grad = requires_grad
-        self.grad = None
-        self.grad_fn = None
-
-    def new_add(self, other):
-        """Enhanced addition with gradient tracking."""
-        # Forward pass: use original pure addition
-        result = original_add(self, other)
-
-        # Add gradient tracking if either operand requires gradients
-        if self.requires_grad or (hasattr(other, 'requires_grad') and other.requires_grad):
-            result.requires_grad = True
-            result.grad = None
-
-            # Define backward function for gradient computation
-            def grad_fn(gradient):
-                """Apply addition backward pass: d(a+b)/da = 1, d(a+b)/db = 1"""
-                if self.requires_grad:
-                    self.backward(gradient)
-                if hasattr(other, 'requires_grad') and other.requires_grad:
-                    other.backward(gradient)
-
-            result.grad_fn = grad_fn
-
-        return result
-
-    def new_mul(self, other):
-        """Enhanced multiplication with gradient tracking."""
-        # Forward pass: use original pure multiplication
-        result = original_mul(self, other)
-
-        # Add gradient tracking if either operand requires gradients
-        if self.requires_grad or (hasattr(other, 'requires_grad') and other.requires_grad):
-            result.requires_grad = True
-            result.grad = None
-
-            # Define backward function using product rule
-            def grad_fn(gradient):
-                """Apply multiplication backward pass: d(a*b)/da = b, d(a*b)/db = a"""
-                if self.requires_grad:
-                    # Get gradient data, handle both Tensor and scalar cases
-                    if hasattr(other, 'data'):
-                        other_data = other.data
-                    else:
-                        other_data = other
-                    self_grad = gradient * other_data
-                    self.backward(self_grad)
-
-                if hasattr(other, 'requires_grad') and other.requires_grad:
-                    # Get gradient data for self
-                    self_grad = gradient * self.data
-                    other.backward(self_grad)
-
-            result.grad_fn = grad_fn
-
-        return result
-
-    def new_sub(self, other):
-        """Enhanced subtraction with gradient tracking."""
-        if original_sub is None:
-            # If original class doesn't have subtraction, implement it
-            if hasattr(other, 'data'):
-                result_data = self.data - other.data
-            else:
-                result_data = self.data - other
-            result = cls(result_data)
-        else:
-            # Use original subtraction
-            result = original_sub(self, other)
-
-        # Add gradient tracking
-        if self.requires_grad or (hasattr(other, 'requires_grad') and other.requires_grad):
-            result.requires_grad = True
-            result.grad = None
-
-            def grad_fn(gradient):
-                """Apply subtraction backward pass: d(a-b)/da = 1, d(a-b)/db = -1"""
-                if self.requires_grad:
-                    self.backward(gradient)
-                if hasattr(other, 'requires_grad') and other.requires_grad:
-                    other.backward(-gradient)
-
-            result.grad_fn = grad_fn
-
-        return result
-
-    def new_matmul(self, other):
-        """Enhanced matrix multiplication with gradient tracking."""
-        if original_matmul is None:
-            # If original class doesn't have matmul, implement it
-            result_data = self.data @ other.data
-            result = cls(result_data)
-        else:
-            # Use original matrix multiplication
-            result = original_matmul(self, other)
-
-        # Add gradient tracking
-        if self.requires_grad or (hasattr(other, 'requires_grad') and other.requires_grad):
-            result.requires_grad = True
-            result.grad = None
-
-            def grad_fn(gradient):
-                """Apply matmul backward pass."""
-                if self.requires_grad:
-                    # d(A@B)/dA = gradient @ B.T
-                    self_grad = gradient @ other.data.T
-                    self.backward(self_grad)
-                if hasattr(other, 'requires_grad') and other.requires_grad:
-                    # d(A@B)/dB = A.T @ gradient
-                    other_grad = self.data.T @ gradient
-                    other.backward(other_grad)
-
-            result.grad_fn = grad_fn
-
-        return result
-
-    def backward(self, gradient=None):
-        """
-        New method: Compute gradients via backpropagation.
-
-        Args:
-            gradient: Gradient flowing backwards (defaults to ones for scalars)
-        """
-        if not self.requires_grad:
-            raise RuntimeError("Tensor doesn't require gradients")
-
-        # Default gradient for scalar outputs
-        if gradient is None:
-            if hasattr(self, 'data') and hasattr(self.data, 'size'):
-                if self.data.size == 1:
-                    gradient = np.ones_like(self.data)
-                else:
-                    raise RuntimeError("gradient must be specified for non-scalar tensors")
-            else:
-                gradient = np.ones_like(self.data)
-
-        # Accumulate gradients
-        if self.grad is None:
-            self.grad = gradient
-        else:
-            self.grad = self.grad + gradient
-
-        # Propagate gradients backwards through computation graph
-        if self.grad_fn is not None:
-            self.grad_fn(gradient)
-
-    # Replace methods on the class
-    cls.__init__ = new_init
-    cls.__add__ = new_add
-    cls.__mul__ = new_mul
-    cls.__sub__ = new_sub
-    cls.__matmul__ = new_matmul
-    cls.backward = backward
-
-    return cls
-    ### END SOLUTION
+        self.grad = None        # Will store accumulated gradients
+        self.grad_fn = None     # Will store backward propagation function
+        ### END SOLUTION
 
 # %% [markdown]
 """
-### 🧪 Unit Test: Decorator Application
-This test validates the decorator enhances Tensor while preserving backward compatibility
+### 🧪 Test Step 1: Verify Gradient Memory
+This test confirms our Tensor can remember gradient information
 """
 
 # %%
-def test_unit_decorator_application():
-    """Test that decorator enhances Tensor while preserving compatibility."""
-    print("🔬 Unit Test: Decorator Application...")
+def test_step1_gradient_attributes():
+    """Test that Tensor has gradient memory capabilities."""
+    print("🔬 Step 1 Test: Gradient Memory...")
 
-    # Apply decorator to enhance the pure Tensor class
-    EnhancedTensor = add_autograd(Tensor)
+    # Test tensor with gradient tracking enabled
+    x = Tensor([1.0, 2.0, 3.0], requires_grad=True)
 
-    # Test 1: Backward compatibility - existing functionality preserved
-    x = EnhancedTensor([2.0, 3.0])  # No requires_grad - should work like pure Tensor
-    y = EnhancedTensor([1.0, 2.0])
-    z = x + y
+    # Verify all gradient attributes exist and have correct initial values
+    assert hasattr(x, 'requires_grad'), "Tensor should have requires_grad attribute"
+    assert x.requires_grad == True, "requires_grad should be True when requested"
+    assert x.grad is None, "grad should start as None"
+    assert x.grad_fn is None, "grad_fn should start as None"
 
-    # Should behave exactly like original Tensor
-    assert hasattr(z, 'data'), "Enhanced tensor should have data attribute"
-    assert not hasattr(z, 'requires_grad') or not z.requires_grad, "Should not track gradients by default"
+    # Test tensor without gradient tracking
+    y = Tensor([4.0, 5.0, 6.0])  # requires_grad omitted to exercise the default
+    assert y.requires_grad == False, "requires_grad should be False by default"
 
-    # Test 2: New gradient capabilities when enabled
-    a = EnhancedTensor([2.0], requires_grad=True)
-    b = EnhancedTensor([3.0], requires_grad=True)
+    # Verify existing functionality still works
+    z = x + y  # Should work exactly like before
+    assert hasattr(z, 'data'), "Enhanced tensor should still have data"
 
-    assert a.requires_grad == True, "Should track gradients when requested"
-    assert a.grad is None, "Gradient should start as None"
-    assert hasattr(a, 'backward'), "Should have backward method"
+    print("✅ Success! Your Tensor now has gradient memory!")
+    print(f"  • Gradient tracking: {x.requires_grad}")
+    print(f"  • Initial gradients: {x.grad}")
+    print(f"  • Backward function: {x.grad_fn}")
 
-    # Test 3: Operations build computation graphs
-    c = a + b
-    assert c.requires_grad == True, "Result should require gradients if inputs do"
-    assert hasattr(c, 'grad_fn'), "Should have gradient function"
-
-    print("✅ Decorator application works correctly!")
-
-test_unit_decorator_application()
+test_step1_gradient_attributes()
 
 # %% [markdown]
 """
-## Implementation: Apply Decorator to Create Enhanced Tensor
+## Step 2: Teaching Our Tensor to Learn (Backward Method)
 
-🏗️ **The Magic Moment**: Transform pure Tensor into gradient-capable version
-✅ **Backward Compatibility**: All existing code continues to work
-🎆 **New Capabilities**: Gradient tracking available when requested
+Now that our Tensor has memory for gradients, we need to teach it how to accumulate gradients when they flow backward from later computations.
 
-### The Transformation
+Think of this like teaching someone to collect feedback from others and combine it with what they already know.
 
-Applying the decorator is simple but powerful:
+### The Backward Method
 
-```python
-# Before: Pure Tensor class (Module 01)
-class Tensor:
-    def __add__(self, other): return Tensor(self.data + other.data)
+The `backward()` method will:
+1. **Check if learning is enabled** (requires_grad must be True)
+2. **Accumulate gradients** (add new gradients to existing ones)
+3. **Propagate backwards** (tell earlier computations about the gradients)
 
-# After: Enhanced with autograd capabilities
-Tensor = add_autograd(Tensor)
+This is the heart of learning - how information flows backward to update our understanding.
 
-# Now the same class can do both!
-z1 = Tensor([1, 2]) + Tensor([3, 4])  # Pure math (like before)
-z2 = Tensor([1, 2], requires_grad=True) + Tensor([3, 4], requires_grad=True)  # + gradients!
-```
+### Why Accumulation Matters
+
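+A tensor often feeds into more than one later computation, for example `x` in `x + x`, or a parameter that contributes to several loss terms. Each use sends back its own gradient, and by the chain rule the total is their sum. We need to collect ALL the gradients, not just the last one.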
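+
+A minimal sketch of this accumulation rule, assuming a hypothetical stand-in class `Leaf` (not part of TinyTorch) that only stores a gradient:
+
+```python
+import numpy as np
+
+class Leaf:
+    # Hypothetical stand-in for a tensor: data plus an accumulated gradient.
+    def __init__(self, value):
+        self.data = np.array(value, dtype=float)
+        self.grad = None
+
+    def backward(self, gradient):
+        gradient = np.array(gradient, dtype=float)
+        # The first contribution is stored, later ones are added to it.
+        self.grad = gradient if self.grad is None else self.grad + gradient
+
+x = Leaf([3.0])
+upstream = np.array([1.0])  # gradient arriving at z = x + x
+x.backward(upstream)        # contribution from the first use of x
+x.backward(upstream)        # contribution from the second use of x
+print(x.grad)               # [2.], matching d(x + x)/dx = 2
+```
+
+Each use of `x` contributes its own gradient, and the stored total is their sum.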
+"""
+
+# %% nbgrader={"grade": false, "grade_id": "tensor-backward-method", "solution": true}
+def backward(self, gradient=None):
+    """
+    Accumulate gradients and propagate them backward through computation.
+
+    TODO: Implement gradient accumulation and backward propagation
+
+    APPROACH:
+    1. Check if this tensor requires gradients (error if not)
+    2. Default the gradient to ones_like(self.data) for single-element outputs; raise an error for larger tensors
+    3. Accumulate gradient: first time = store, subsequent = add
+    4. Propagate backward through grad_fn if it exists
+
+    EXAMPLE:
+    >>> x = Tensor([2.0], requires_grad=True)
+    >>> x.grad = None  # No gradients yet
+    >>> x.backward([1.0])  # First gradient
+    >>> print(x.grad)  # [1.0]
+    >>> x.backward([0.5])  # Accumulate second gradient
+    >>> print(x.grad)  # [1.5] - accumulated!
+
+    HINTS:
+    - Default gradient for scalars should be ones_like(self.data)
+    - Accumulate by adding the new gradient to the existing one, but handle the None case first
+    - Only call grad_fn if it exists (not None)
+    """
+    ### BEGIN SOLUTION
+    # Check if this tensor should accumulate gradients
+    if not self.requires_grad:
+        raise RuntimeError("Tensor doesn't require gradients - set requires_grad=True")
+
+    # Set default gradient for scalar outputs
+    if gradient is None:
+        if self.data.size == 1:  # Scalar output
+            gradient = np.ones_like(self.data)
+        else:
+            raise RuntimeError("gradient must be specified for non-scalar tensors")
+
+    # Accumulate gradients: first time or add to existing
+    if self.grad is None:
+        self.grad = np.array(gradient)  # First gradient
+    else:
+        self.grad = self.grad + gradient  # Accumulate
+
+    # Propagate gradients backward through computation graph
+    if self.grad_fn is not None:
+        self.grad_fn(gradient)
+    ### END SOLUTION
+
+# Add the backward method to our Tensor class
+Tensor.backward = backward
+
+# %% [markdown]
+"""
+### 🧪 Test Step 2: Verify Learning Ability
+This test confirms our Tensor can accumulate gradients properly
+"""
+
+# %%
+def test_step2_backward_method():
+    """Test that Tensor can accumulate gradients."""
+    print("🔬 Step 2 Test: Learning Ability...")
+
+    # Test basic gradient accumulation
+    x = Tensor([2.0], requires_grad=True)
+
+    # First gradient
+    x.backward(np.array([1.0]))
+    assert np.allclose(x.grad, [1.0]), f"First gradient failed: expected [1.0], got {x.grad}"
+
+    # Second gradient should accumulate
+    x.backward(np.array([0.5]))
+    assert np.allclose(x.grad, [1.5]), f"Accumulation failed: expected [1.5], got {x.grad}"
+
+    # Test default gradient for scalars
+    y = Tensor([3.0], requires_grad=True)
+    y.backward()  # No gradient specified - should use default
+    assert np.allclose(y.grad, [1.0]), f"Default gradient failed: expected [1.0], got {y.grad}"
+
+    # Test error for non-gradient tensor
+    z = Tensor([4.0], requires_grad=False)
+    try:
+        z.backward([1.0])
+        assert False, "Should have raised error for non-gradient tensor"
+    except RuntimeError:
+        pass  # Expected error
+
+    print("✅ Success! Your Tensor can now learn from gradients!")
+    print(f"  • Accumulation works: {x.grad}")
+    print(f"  • Default gradients work: {y.grad}")
+
+test_step2_backward_method()
+
+# %% [markdown]
+"""
+## Step 3: Smart Addition (x + y Learns!)
+
+Now we'll make addition smart - when two tensors are added, the result should remember how to flow gradients back to both inputs.
+
+Think of this like a conversation between three people: when C = A + B, and someone gives feedback to C, C knows to pass that same feedback to both A and B.
 
 ### Mathematical Foundation
 
-For z = x + y:
-- ∂z/∂x = 1 (derivative of x + y with respect to x)
-- ∂z/∂y = 1 (derivative of x + y with respect to y)
+For addition z = x + y:
+- ∂z/∂x = 1 (changing x by 1 changes z by 1)
+- ∂z/∂y = 1 (changing y by 1 changes z by 1)
 
-Chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × 1 = ∂L/∂z
+So gradients flow unchanged to both inputs: grad_x = grad_z, grad_y = grad_z
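+
+One way to sanity-check these derivatives is a finite-difference estimate. The sketch below is illustrative only; `numerical_grad` is a hypothetical helper, not a TinyTorch function:
+
+```python
+def numerical_grad(f, x, eps=1e-6):
+    # Central-difference estimate of df/dx for a scalar function f.
+    return (f(x + eps) - f(x - eps)) / (2 * eps)
+
+x0, y0 = 2.0, 3.0
+dz_dx = numerical_grad(lambda x: x + y0, x0)  # vary x, hold y fixed
+dz_dy = numerical_grad(lambda y: x0 + y, y0)  # vary y, hold x fixed
+print(round(dz_dx, 6), round(dz_dy, 6))       # 1.0 1.0
+```
+
+Both estimates come out at 1, which is why the backward pass for addition forwards the incoming gradient to each input without scaling it.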
+
+### Why Enhancement, Not Replacement
+
+We wrap the existing `__add__` method rather than rewriting it. The forward math still runs through the original method, and gradient tracking is layered on top.
 """
 
-# %% nbgrader={"grade": false, "grade_id": "apply-decorator", "solution": true}
-#| export
-# Apply the decorator to transform pure Tensor into gradient-capable version
-# This is where the magic happens!
+# %% nbgrader={"grade": false, "grade_id": "enhanced-addition", "solution": true}
+# Store the original addition method so we can enhance it
+_original_add = Tensor.__add__
 
-### BEGIN SOLUTION
-# Import pure Tensor class and enhance it with autograd
-Tensor = add_autograd(Tensor)
-### END SOLUTION
-
-# Now our pure Tensor class has been enhanced with gradient tracking!
-# Let's test that it works correctly...
-
-# %% [markdown]
-"""
-### 🧪 Unit Test: Addition Operation
-This test validates addition with proper gradient computation
-"""
-
-# %%
-def test_unit_add_operation():
-    """Test addition with gradient tracking."""
-    print("🔬 Unit Test: Addition Operation...")
-
-    # Test basic addition
-    x = Variable([2.0], requires_grad=True)
-    y = Variable([3.0], requires_grad=True)
-    z = add(x, y)
-
-    # Verify forward pass
-    assert np.allclose(z.data.data, [5.0]), f"Expected [5.0], got {z.data.data}"
-
-    # Test backward pass
-    z.backward()
-    assert np.allclose(x.grad, [1.0]), f"Expected x.grad=[1.0], got {x.grad}"
-    assert np.allclose(y.grad, [1.0]), f"Expected y.grad=[1.0], got {y.grad}"
-
-    # Test with constants
-    a = Variable([1.0], requires_grad=True)
-    b = add(a, 5.0)  # Adding constant
-    b.backward()
-    assert np.allclose(a.grad, [1.0]), "Gradient should flow through constant addition"
-
-    print("✅ Addition operation works correctly!")
-
-test_unit_add_operation()
-
-# %% [markdown]
-"""
-## Implementation: Multiplication Operation with Product Rule
-
-📐 **Mathematical Foundation**: Product rule for derivatives
-🔗 **Connections**: Essential for linear layers, attention mechanisms
-⚡ **Performance**: Efficient gradient computation using cached forward values
-
-### The Product Rule
-
-For z = x × y:
-- ∂z/∂x = y (derivative with respect to first operand)
-- ∂z/∂y = x (derivative with respect to second operand)
-
-Chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × y
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "multiply-operation", "solution": true}
-#| export
-def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
+def enhanced_add(self, other):
     """
-    Multiply two variables with gradient tracking.
+    Enhanced addition with automatic gradient tracking.
 
-    TODO: Implement multiplication using product rule for gradients
+    TODO: Add gradient tracking to existing addition operation
 
     APPROACH:
-    1. Convert inputs to Variables if needed
-    2. Compute forward pass (a.data × b.data)
-    3. Create grad_fn using product rule: ∂(a×b)/∂a = b, ∂(a×b)/∂b = a
-    4. Return Variable with result and grad_fn
+    1. Do the original math (call _original_add)
+    2. If either input tracks gradients, result should too
+    3. Create grad_fn that sends gradients back to both inputs
+    4. Remember: for addition, both inputs get the same gradient
 
     EXAMPLE:
-    >>> x = Variable([2.0], requires_grad=True)
-    >>> y = Variable([3.0], requires_grad=True)
-    >>> z = multiply(x, y)
+    >>> x = Tensor([2.0], requires_grad=True)
+    >>> y = Tensor([3.0], requires_grad=True)
+    >>> z = x + y  # Enhanced addition
     >>> z.backward()
-    >>> print(x.grad)  # [3.0] - derivative is y's value
-    >>> print(y.grad)  # [2.0] - derivative is x's value
+    >>> print(x.grad)  # [1.0] - same as gradient flowing to z
+    >>> print(y.grad)  # [1.0] - same as gradient flowing to z
 
     HINTS:
-    - Product rule: d(uv)/dx = u(dv/dx) + v(du/dx)
-    - For our case: ∂(a×b)/∂a = b, ∂(a×b)/∂b = a
-    - Store original values for use in backward pass
+    - Use _original_add for the math computation
+    - Check if other has requires_grad attribute (might be scalar)
+    - Addition rule: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1
     """
     ### BEGIN SOLUTION
-    # Ensure both inputs are Variables
-    a = _ensure_variable(a)
-    b = _ensure_variable(b)
+    # Do the original math - this preserves all existing functionality
+    result = _original_add(self, other)
 
-    # Forward pass computation
-    result_data = Tensor(a.data.data * b.data.data)
+    # Check if either input requires gradients
+    other_requires_grad = hasattr(other, 'requires_grad') and other.requires_grad
+    needs_grad = self.requires_grad or other_requires_grad
 
-    # Determine if result requires gradients
-    requires_grad = a.requires_grad or b.requires_grad
+    if needs_grad:
+        # Result should track gradients
+        result.requires_grad = True
 
-    # Define backward function for gradient propagation
-    def grad_fn(gradient):
-        """Propagate gradients using product rule."""
-        # Product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
-        if a.requires_grad:
-            # ∂L/∂a = ∂L/∂z × ∂z/∂a = gradient × b
-            a_grad = gradient * b.data.data
-            a.backward(a_grad)
-        if b.requires_grad:
-            # ∂L/∂b = ∂L/∂z × ∂z/∂b = gradient × a
-            b_grad = gradient * a.data.data
-            b.backward(b_grad)
+        # Create backward function for gradient propagation
+        def grad_fn(gradient):
+            """Send gradients back to both inputs (addition rule)."""
+            # For addition: ∂(a+b)/∂a = 1, so gradient flows unchanged
+            if self.requires_grad:
+                self.backward(gradient)
+            if other_requires_grad:
+                other.backward(gradient)
+
+        # Attach the backward function to the result
+        result.grad_fn = grad_fn
 
-    # Create result variable with gradient function
-    result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
     return result
     ### END SOLUTION
 
+# Replace the addition method with our enhanced version
+Tensor.__add__ = enhanced_add
+
 # %% [markdown]
 """
-### 🧪 Unit Test: Multiplication Operation
-This test validates multiplication with product rule gradients
+### 🧪 Test Step 3: Verify Smart Addition
+This test confirms addition automatically tracks gradients
 """
 
 # %%
-def test_unit_multiply_operation():
-    """Test multiplication with gradient tracking."""
-    print("🔬 Unit Test: Multiplication Operation...")
+def test_step3_smart_addition():
+    """Test that addition tracks gradients automatically."""
+    print("🔬 Step 3 Test: Smart Addition...")
 
-    # Test basic multiplication
-    x = Variable([2.0], requires_grad=True)
-    y = Variable([3.0], requires_grad=True)
-    z = multiply(x, y)
+    # Test basic addition with gradients
+    x = Tensor([2.0], requires_grad=True)
+    y = Tensor([3.0], requires_grad=True)
+    z = x + y
 
     # Verify forward pass
-    assert np.allclose(z.data.data, [6.0]), f"Expected [6.0], got {z.data.data}"
+    assert np.allclose(z.data, [5.0]), f"Addition math failed: expected [5.0], got {z.data}"
+
+    # Verify gradient tracking is enabled
+    assert z.requires_grad, "Result should require gradients when inputs do"
+    assert z.grad_fn is not None, "Result should have backward function"
 
     # Test backward pass
     z.backward()
-    assert np.allclose(x.grad, [3.0]), f"Expected x.grad=[3.0], got {x.grad}"
-    assert np.allclose(y.grad, [2.0]), f"Expected y.grad=[2.0], got {y.grad}"
+    assert np.allclose(x.grad, [1.0]), f"x gradient failed: expected [1.0], got {x.grad}"
+    assert np.allclose(y.grad, [1.0]), f"y gradient failed: expected [1.0], got {y.grad}"
 
-    # Test with constants
-    a = Variable([4.0], requires_grad=True)
-    b = multiply(a, 2.0)  # Multiplying by constant
+    # Test addition with scalar (no gradients)
+    a = Tensor([1.0], requires_grad=True)
+    b = a + 5.0  # Adding scalar
     b.backward()
-    assert np.allclose(a.grad, [2.0]), "Gradient should be the constant value"
+    assert np.allclose(a.grad, [1.0]), "Gradient should flow through scalar addition"
 
-    print("✅ Multiplication operation works correctly!")
+    # Test backward compatibility - no gradients
+    p = Tensor([1.0])  # No requires_grad
+    q = Tensor([2.0])  # No requires_grad
+    r = p + q
+    assert not hasattr(r, 'requires_grad') or not r.requires_grad, "Should not track gradients by default"
 
-test_unit_multiply_operation()
+    print("✅ Success! Addition is now gradient-aware!")
+    print(f"  • Forward: {x.data} + {y.data} = {z.data}")
+    print(f"  • Backward: x.grad = {x.grad}, y.grad = {y.grad}")
+
+test_step3_smart_addition()
 
 # %% [markdown]
 """
-## Implementation: Additional Operations
+## Step 4: Smart Multiplication (x * y Learns!)
 
-🔗 **Connections**: Complete the basic arithmetic operations needed for neural networks
-⚡ **Performance**: Each operation implements efficient gradient computation
-📦 **Framework Compatibility**: Matches behavior of production autograd systems
+Now we'll enhance multiplication with gradient tracking. This is more interesting than addition because of the product rule.
+
+Think of multiplication like mixing ingredients: when you change one ingredient, the effect depends on how much of the other ingredient you have.
+
+### Mathematical Foundation - The Product Rule
+
+For multiplication z = x * y:
+- ∂z/∂x = y (a small change in x is scaled by y's current value)
+- ∂z/∂y = x (a small change in y is scaled by x's current value)
+
+This means we need to remember the input values to compute gradients correctly.
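
The product rule above can be sanity-checked numerically. The following is a minimal sketch using central finite differences on plain Python floats; `numerical_grad` and `mul` are hypothetical helper names introduced only for this illustration and are not part of the module.

```python
def numerical_grad(f, x, y, eps=1e-6):
    """Central-difference estimate of (df/dx, df/dy) at the point (x, y)."""
    df_dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    df_dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return df_dx, df_dy


def mul(x, y):
    """The operation under test: z = x * y."""
    return x * y


# Same values as the docstring example: x = 2.0, y = 3.0
gx, gy = numerical_grad(mul, 2.0, 3.0)

# The product rule predicts dz/dx = y = 3.0 and dz/dy = x = 2.0
assert abs(gx - 3.0) < 1e-6
assert abs(gy - 2.0) < 1e-6
```

A check like this is a common way to validate a hand-written `grad_fn`: if the analytic gradient and the finite-difference estimate disagree, the backward rule is probably wrong.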
+
+### Why This Matters
+
+Multiplication is everywhere in neural networks:
+- Linear layers: each output is a sum of input * weight products
+- Attention mechanisms: attention_scores * values
+- Element-wise operations in activations
+
+Getting multiplication gradients right is crucial for training.
 """
 
-# %% nbgrader={"grade": false, "grade_id": "additional-operations", "solution": true}
-#| export
-def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
-    """
-    Subtract two variables with gradient tracking.
+# %% nbgrader={"grade": false, "grade_id": "enhanced-multiplication", "solution": true}
+# Store the original multiplication method
+_original_mul = Tensor.__mul__
 
-    TODO: Implement subtraction with proper gradient flow
+def enhanced_mul(self, other):
+    """
+    Enhanced multiplication with automatic gradient tracking.
+
+    TODO: Add gradient tracking to multiplication using product rule
+
+    APPROACH:
+    1. Do the original math (call _original_mul)
+    2. If either input tracks gradients, result should too
+    3. Create grad_fn using product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
+    4. Handle both Tensor and scalar multiplication
+
+    EXAMPLE:
+    >>> x = Tensor([2.0], requires_grad=True)
+    >>> y = Tensor([3.0], requires_grad=True)
+    >>> z = x * y  # z = [6.0]
+    >>> z.backward()
+    >>> print(x.grad)  # [3.0] - gradient is y's value
+    >>> print(y.grad)  # [2.0] - gradient is x's value
 
     HINTS:
-    - For z = a - b: ∂z/∂a = 1, ∂z/∂b = -1
-    - Similar to addition but second operand gets negative gradient
+    - Product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
+    - Remember to handle scalars (use .data if available, else use directly)
+    - Gradients are: grad_x = gradient * other, grad_y = gradient * self
     """
     ### BEGIN SOLUTION
-    # Ensure both inputs are Variables
-    a = _ensure_variable(a)
-    b = _ensure_variable(b)
+    # Do the original math - preserves existing functionality
+    result = _original_mul(self, other)
 
-    # Forward pass computation
-    result_data = Tensor(a.data.data - b.data.data)
+    # Check if either input requires gradients
+    other_requires_grad = hasattr(other, 'requires_grad') and other.requires_grad
+    needs_grad = self.requires_grad or other_requires_grad
 
-    # Determine if result requires gradients
-    requires_grad = a.requires_grad or b.requires_grad
+    if needs_grad:
+        # Result should track gradients
+        result.requires_grad = True
 
-    # Define backward function for gradient propagation
-    def grad_fn(gradient):
-        """Propagate gradients for subtraction."""
-        # Subtraction: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1
-        if a.requires_grad:
-            a.backward(gradient)
-        if b.requires_grad:
-            b.backward(-gradient)  # Negative for subtraction
+        # Create backward function using product rule
+        def grad_fn(gradient):
+            """Apply product rule for multiplication gradients."""
+            if self.requires_grad:
+                # ∂(a*b)/∂a = b, so gradient flows as: gradient * b
+                if hasattr(other, 'data'):
+                    self_grad = gradient * other.data
+                else:
+                    self_grad = gradient * other  # other is scalar
+                self.backward(self_grad)
+
+            if other_requires_grad:
+                # ∂(a*b)/∂b = a, so gradient flows as: gradient * a
+                other_grad = gradient * self.data
+                other.backward(other_grad)
+
+        # Attach the backward function to the result
+        result.grad_fn = grad_fn
 
-    # Create result variable with gradient function
-    result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
     return result
     ### END SOLUTION
 
-#| export
-def matmul(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
-    """
-    Matrix multiplication with gradient tracking.
-
-    TODO: Implement matrix multiplication with proper gradients
-
-    HINTS:
-    - For z = a @ b: ∂z/∂a = gradient @ b.T, ∂z/∂b = a.T @ gradient
-    - This is fundamental for neural network linear layers
-    """
-    ### BEGIN SOLUTION
-    # Ensure both inputs are Variables
-    a = _ensure_variable(a)
-    b = _ensure_variable(b)
-
-    # Forward pass computation
-    result_data = Tensor(a.data.data @ b.data.data)
-
-    # Determine if result requires gradients
-    requires_grad = a.requires_grad or b.requires_grad
-
-    # Define backward function for gradient propagation
-    def grad_fn(gradient):
-        """Propagate gradients for matrix multiplication."""
-        # Matrix multiplication gradients:
-        # ∂(a@b)/∂a = gradient @ b.T
-        # ∂(a@b)/∂b = a.T @ gradient
-        if a.requires_grad:
-            a_grad = gradient @ b.data.data.T
-            a.backward(a_grad)
-        if b.requires_grad:
-            b_grad = a.data.data.T @ gradient
-            b.backward(b_grad)
-
-    # Create result variable with gradient function
-    result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
-    return result
-    ### END SOLUTION
+# Replace multiplication method with enhanced version
+Tensor.__mul__ = enhanced_mul
 
 # %% [markdown]
 """
-### 🧪 Unit Test: Additional Operations
-This test validates subtraction and matrix multiplication
+### 🧪 Test Step 4: Verify Smart Multiplication
+This test confirms multiplication uses the product rule correctly
 """
 
 # %%
-def test_unit_additional_operations():
-    """Test subtraction and matrix multiplication."""
-    print("🔬 Unit Test: Additional Operations...")
+def test_step4_smart_multiplication():
+    """Test that multiplication tracks gradients with product rule."""
+    print("🔬 Step 4 Test: Smart Multiplication...")
 
-    # Test subtraction
-    x = Variable([5.0], requires_grad=True)
-    y = Variable([2.0], requires_grad=True)
-    z = subtract(x, y)
+    # Test basic multiplication with gradients
+    x = Tensor([2.0], requires_grad=True)
+    y = Tensor([3.0], requires_grad=True)
+    z = x * y
 
-    assert np.allclose(z.data.data, [3.0]), f"Subtraction failed: expected [3.0], got {z.data.data}"
+    # Verify forward pass
+    assert np.allclose(z.data, [6.0]), f"Multiplication math failed: expected [6.0], got {z.data}"
 
+    # Test backward pass with product rule
     z.backward()
-    assert np.allclose(x.grad, [1.0]), f"Subtraction gradient failed: expected x.grad=[1.0], got {x.grad}"
-    assert np.allclose(y.grad, [-1.0]), f"Subtraction gradient failed: expected y.grad=[-1.0], got {y.grad}"
+    assert np.allclose(x.grad, [3.0]), f"x gradient failed: expected [3.0] (y's value), got {x.grad}"
+    assert np.allclose(y.grad, [2.0]), f"y gradient failed: expected [2.0] (x's value), got {y.grad}"
 
-    # Test matrix multiplication
-    a = Variable([[1.0, 2.0]], requires_grad=True)
-    b = Variable([[3.0], [4.0]], requires_grad=True)
-    c = matmul(a, b)
+    # Test multiplication by scalar
+    a = Tensor([4.0], requires_grad=True)
+    b = a * 2.0  # Multiply by scalar
+    b.backward()
+    assert np.allclose(a.grad, [2.0]), f"Scalar multiplication failed: expected [2.0], got {a.grad}"
 
-    assert np.allclose(c.data.data, [[11.0]]), f"Matrix multiplication failed: expected [[11.0]], got {c.data.data}"
+    # Test more complex values
+    p = Tensor([1.5], requires_grad=True)
+    q = Tensor([2.5], requires_grad=True)
+    r = p * q  # Should be 3.75
 
-    c.backward()
-    assert np.allclose(a.grad, [[3.0, 4.0]]), f"Matmul gradient failed for a: expected [[3.0, 4.0]], got {a.grad}"
-    assert np.allclose(b.grad, [[1.0], [2.0]]), f"Matmul gradient failed for b: expected [[1.0], [2.0]], got {b.grad}"
+    assert np.allclose(r.data, [3.75]), f"Complex multiplication failed: expected [3.75], got {r.data}"
+    r.backward()
+    assert np.allclose(p.grad, [2.5]), f"Complex p gradient failed: expected [2.5], got {p.grad}"
+    assert np.allclose(q.grad, [1.5]), f"Complex q gradient failed: expected [1.5], got {q.grad}"
 
-    print("✅ Additional operations work correctly!")
+    print("✅ Success! Multiplication follows the product rule!")
+    print(f"  • Forward: {x.data} * {y.data} = {z.data}")
+    print(f"  • Product rule: x.grad = {x.grad}, y.grad = {y.grad}")
 
-test_unit_additional_operations()
+test_step4_smart_multiplication()
 
 # %% [markdown]
 """
-## Implementation: Chain Rule Through Complex Expressions
+## Step 5: Chain Rule Magic (Complex Expressions Work!)
 
-🧠 **Core Concept**: Multiple operations automatically chain gradients together
-⚡ **Performance**: Each operation contributes O(1) overhead for gradient computation
-🔗 **Connections**: This enables training deep neural networks with many layers
+Now comes the magic moment - combining our smart operations to see the chain rule work automatically through complex expressions.
 
-### Example: Complex Expression
+When you build expressions like `z = (x + y) * (x - y)`, each operation tracks gradients locally, and they automatically chain together. This is what makes deep learning possible!
 
-Consider: f(x, y) = (x + y) × (x - y) = x² - y²
+Think of it like a telephone game where each person (operation) passes the message (gradient) backward, and everyone modifies it according to their local rule.
 
-The autograd system automatically:
-1. Tracks each intermediate operation
-2. Applies chain rule backwards through the computation graph
-3. Accumulates gradients at each variable
+### The Chain Rule in Action
 
-Expected gradients:
+For f(x,y) = (x + y) * (x - y) = x² - y²:
+1. Addition: passes gradients unchanged
+2. Subtraction: passes gradients (first unchanged, second negated)
+3. Multiplication: applies product rule
+4. Chain rule: combines all effects automatically
+
+Expected final gradients:
 - ∂f/∂x = 2x (derivative of x² - y²)
 - ∂f/∂y = -2y (derivative of x² - y²)
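
To make the chaining concrete, here is a self-contained sketch on scalars. `Scalar` is a hypothetical stand-in for the module's `Tensor`, written only for this illustration; it assumes, as the `grad_fn` closures in this diff suggest, that `backward` accumulates the incoming gradient and then forwards it to the inputs.

```python
class Scalar:
    """Hypothetical scalar stand-in for Tensor (illustration only)."""

    def __init__(self, value, requires_grad=False):
        self.value = float(value)
        self.requires_grad = requires_grad
        self.grad = None
        self.grad_fn = None

    def backward(self, gradient=1.0):
        if not self.requires_grad:
            raise RuntimeError("backward() called on a value that does not track gradients")
        # Accumulate: the same value may feed several operations.
        self.grad = gradient if self.grad is None else self.grad + gradient
        # Forward only the incoming gradient, so each path is counted once.
        if self.grad_fn is not None:
            self.grad_fn(gradient)

    def _binary(self, other, value, local_self, local_other):
        """Build the result and attach a grad_fn that applies the local rules."""
        out = Scalar(value, self.requires_grad or other.requires_grad)

        def grad_fn(gradient):
            if self.requires_grad:
                self.backward(gradient * local_self)
            if other.requires_grad:
                other.backward(gradient * local_other)

        if out.requires_grad:
            out.grad_fn = grad_fn
        return out

    def __add__(self, other):
        return self._binary(other, self.value + other.value, 1.0, 1.0)

    def __sub__(self, other):
        return self._binary(other, self.value - other.value, 1.0, -1.0)

    def __mul__(self, other):
        return self._binary(other, self.value * other.value, other.value, self.value)


x = Scalar(3.0, requires_grad=True)
y = Scalar(2.0, requires_grad=True)
f = (x + y) * (x - y)  # x² - y² = 5
f.backward()

assert f.value == 5.0
assert x.grad == 6.0    # 2x
assert y.grad == -4.0   # -2y
```

Note that `x` receives two contributions (1 through the sum, 5 through the difference) that add up to 6. This recursive scheme walks every path separately, which is fine for small graphs; production frameworks typically visit each node once in topological order instead.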
+
+### Why This Is Revolutionary
+
+You don't need to derive gradients manually anymore! The system automatically:
+- Tracks every operation
+- Applies local gradient rules
+- Chains them together correctly
 """
 
+# %% nbgrader={"grade": false, "grade_id": "enhanced-subtraction", "solution": true}
+# We need subtraction to complete our operations set
+_original_sub = getattr(Tensor, '__sub__', None)
+
+def enhanced_sub(self, other):
+    """
+    Enhanced subtraction with automatic gradient tracking.
+
+    TODO: Add gradient tracking to subtraction
+
+    APPROACH:
+    1. Compute subtraction (may need to implement if not in base class)
+    2. For gradients: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1
+    3. First input gets gradient unchanged, second gets negative gradient
+
+    HINTS:
+    - Subtraction rule: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1
+    - Handle case where base class might not have subtraction
+    - Use np.subtract or manual computation if needed
+    """
+    ### BEGIN SOLUTION
+    # Compute subtraction (implement if not available)
+    if _original_sub is not None:
+        result = _original_sub(self, other)
+    else:
+        # Implement subtraction manually
+        if hasattr(other, 'data'):
+            result_data = self.data - other.data
+        else:
+            result_data = self.data - other
+        result = Tensor(result_data)
+
+    # Check if either input requires gradients
+    other_requires_grad = hasattr(other, 'requires_grad') and other.requires_grad
+    needs_grad = self.requires_grad or other_requires_grad
+
+    if needs_grad:
+        result.requires_grad = True
+
+        def grad_fn(gradient):
+            """Apply subtraction gradient rule."""
+            if self.requires_grad:
+                # ∂(a-b)/∂a = 1, gradient flows unchanged
+                self.backward(gradient)
+            if other_requires_grad:
+                # ∂(a-b)/∂b = -1, gradient is negated
+                other.backward(-gradient)
+
+        result.grad_fn = grad_fn
+
+    return result
+    ### END SOLUTION
+
+# Add subtraction method to Tensor
+Tensor.__sub__ = enhanced_sub
+
 # %% [markdown]
 """
-### 🧪 Unit Test: Chain Rule Application
-This test validates complex expressions with multiple operations
+### 🧪 Test Step 5: Verify Chain Rule Magic
+This test confirms complex expressions compute gradients automatically
 """
 
 # %%
-def test_unit_chain_rule():
-    """Test chain rule through complex expressions."""
-    print("🔬 Unit Test: Chain Rule Application...")
+def test_step5_chain_rule_magic():
+    """Test that complex expressions automatically chain gradients."""
+    print("🔬 Step 5 Test: Chain Rule Magic...")
 
     # Test complex expression: (x + y) * (x - y) = x² - y²
-    x = Variable([3.0], requires_grad=True)
-    y = Variable([2.0], requires_grad=True)
+    x = Tensor([3.0], requires_grad=True)
+    y = Tensor([2.0], requires_grad=True)
 
-    # Build computation graph
-    sum_term = add(x, y)      # x + y = 5
-    diff_term = subtract(x, y) # x - y = 1
-    result = multiply(sum_term, diff_term)  # (x+y)*(x-y) = 5*1 = 5
+    # Build computation graph step by step
+    sum_part = x + y      # 3 + 2 = 5
+    diff_part = x - y     # 3 - 2 = 1
+    result = sum_part * diff_part  # 5 * 1 = 5
 
-    # Verify forward pass
-    expected_result = 3.0**2 - 2.0**2  # x² - y² = 9 - 4 = 5
-    assert np.allclose(result.data.data, [expected_result]), f"Expected [{expected_result}], got {result.data.data}"
+    # Verify forward computation
+    expected_forward = 3.0**2 - 2.0**2  # x² - y² = 9 - 4 = 5
+    assert np.allclose(result.data, [expected_forward]), f"Forward failed: expected [{expected_forward}], got {result.data}"
 
-    # Test backward pass
+    # Test the magic - backward propagation
     result.backward()
 
-    # Expected gradients: ∂(x²-y²)/∂x = 2x = 6, ∂(x²-y²)/∂y = -2y = -4
-    expected_x_grad = 2 * 3.0  # 6.0
-    expected_y_grad = -2 * 2.0  # -4.0
+    # Expected gradients for f(x,y) = x² - y²
+    expected_x_grad = 2 * 3.0  # ∂(x²-y²)/∂x = 2x = 6
+    expected_y_grad = -2 * 2.0  # ∂(x²-y²)/∂y = -2y = -4
 
-    assert np.allclose(x.grad, [expected_x_grad]), f"Expected x.grad=[{expected_x_grad}], got {x.grad}"
-    assert np.allclose(y.grad, [expected_y_grad]), f"Expected y.grad=[{expected_y_grad}], got {y.grad}"
+    assert np.allclose(x.grad, [expected_x_grad]), f"x gradient failed: expected [{expected_x_grad}], got {x.grad}"
+    assert np.allclose(y.grad, [expected_y_grad]), f"y gradient failed: expected [{expected_y_grad}], got {y.grad}"
 
-    # Test another complex expression: x * y + x * y (should equal 2*x*y)
-    a = Variable([2.0], requires_grad=True)
-    b = Variable([3.0], requires_grad=True)
+    # Test another complex expression: 2*x*y + x
+    a = Tensor([2.0], requires_grad=True)
+    b = Tensor([3.0], requires_grad=True)
 
-    term1 = multiply(a, b)
-    term2 = multiply(a, b)
-    sum_result = add(term1, term2)
+    expr = (a * b) * 2.0 + a  # 2*a*b + a = 2*2*3 + 2 = 14
 
-    sum_result.backward()
+    assert np.allclose(expr.data, [14.0]), f"Complex expression failed: expected [14.0], got {expr.data}"
 
-    # Expected: ∂(2xy)/∂x = 2y = 6, ∂(2xy)/∂y = 2x = 4
-    assert np.allclose(a.grad, [6.0]), f"Expected a.grad=[6.0], got {a.grad}"
-    assert np.allclose(b.grad, [4.0]), f"Expected b.grad=[4.0], got {b.grad}"
+    expr.backward()
+    # ∂(2ab + a)/∂a = 2b + 1 = 2*3 + 1 = 7
+    # ∂(2ab + a)/∂b = 2a = 2*2 = 4
+    assert np.allclose(a.grad, [7.0]), f"Complex a gradient failed: expected [7.0], got {a.grad}"
+    assert np.allclose(b.grad, [4.0]), f"Complex b gradient failed: expected [4.0], got {b.grad}"
 
-    print("✅ Chain rule works correctly through complex expressions!")
+    print("✅ Success! Chain rule works automatically!")
+    print("  • Expression: (x + y) * (x - y) = x² - y²")
+    print(f"  • Forward: {result.data}")
+    print(f"  • Gradients: ∂f/∂x = {x.grad}, ∂f/∂y = {y.grad}")
+    print("🎉 Your tensors can now learn through any expression!")
 
-test_unit_chain_rule()
+test_step5_chain_rule_magic()
 
 # %% [markdown]
 """
-## 🔍 Systems Analysis: Gradient Computation Behavior
+## Step 6: Integration Testing (Complete Victory!)
 
-Now that your autograd implementation is complete and tested, let's analyze its behavior:
+Time to celebrate! Let's test our complete autograd system with realistic neural network scenarios to make sure everything works together perfectly.
 
-**Analysis Focus**: Understand memory usage and computational patterns in automatic differentiation
+We'll test scenarios that mirror what happens in real neural networks:
+- Linear transformations (matrix operations)
+- Activation functions
+- Loss computations
+- Complex multi-step computations
+
+This validates that your autograd system is ready to train real neural networks!
+
+### What Makes This Special
+
+Your autograd implementation now provides the foundation for all neural network training:
+- **Forward Pass**: Tensors compute values and build computation graphs
+- **Backward Pass**: Gradients flow automatically through any expression
+- **Parameter Updates**: Optimizers will use these gradients to update weights
+
+You've built the core engine that powers modern deep learning!
+"""
+
+# %% [markdown]
+"""
+### 🧪 Final Integration Test: Complete Autograd Validation
+This comprehensive test validates your entire autograd system
 """
 
 # %%
-def analyze_gradient_computation():
-    """
-    📊 SYSTEMS MEASUREMENT: Gradient Computation Analysis
+def test_step6_integration_complete():
+    """Complete integration test of autograd system."""
+    print("🧪 STEP 6: COMPLETE INTEGRATION TEST")
+    print("=" * 50)
 
-    Measure how autograd scales with expression complexity and input size.
+    # Test 1: Neural network linear layer simulation
+    print("1️⃣ Testing Linear Layer Simulation...")
+    weights = Tensor([[0.5, -0.3], [0.2, 0.8]], requires_grad=True)
+    inputs = Tensor([[1.0, 2.0]], requires_grad=True)
+    bias = Tensor([[0.1, -0.1]], requires_grad=True)
+
+    # Simulate: output = input @ weights + bias
+    linear_output = inputs * weights + bias  # Element-wise for simplicity
+    loss = linear_output * linear_output  # Squared for loss
+
+    # A real loss would be summed to a scalar; here we backpropagate from the
+    # element-wise loss directly to keep the example simple
+    final_loss = loss
+    final_loss.backward()
+
+    # Verify all parameters have gradients
+    assert weights.grad is not None, "Weights should have gradients"
+    assert inputs.grad is not None, "Inputs should have gradients"
+    assert bias.grad is not None, "Bias should have gradients"
+    print("   ✅ Linear layer gradients computed successfully")
+
+    # Test 2: Multi-step computation
+    print("2️⃣ Testing Multi-Step Computation...")
+    x = Tensor([1.0], requires_grad=True)
+    y = Tensor([2.0], requires_grad=True)
+    z = Tensor([3.0], requires_grad=True)
+
+    # Complex expression: ((x * y) + z) * (x - y)
+    step1 = x * y         # 1 * 2 = 2
+    step2 = step1 + z     # 2 + 3 = 5
+    step3 = x - y         # 1 - 2 = -1
+    result = step2 * step3  # 5 * (-1) = -5
+
+    assert np.allclose(result.data, [-5.0]), f"Multi-step forward failed: expected [-5.0], got {result.data}"
+
+    result.backward()
+
+    # All variables should have gradients
+    assert x.grad is not None, "x should have gradients from multi-step"
+    assert y.grad is not None, "y should have gradients from multi-step"
+    assert z.grad is not None, "z should have gradients from multi-step"
+    print("   ✅ Multi-step computation gradients work")
+
+    # Test 3: Gradient accumulation across multiple losses
+    print("3️⃣ Testing Gradient Accumulation...")
+    param = Tensor([1.0], requires_grad=True)
+
+    # First loss: param * 2
+    loss1 = param * 2.0
+    loss1.backward()
+    first_grad = param.grad.copy()
+
+    # Second loss: param * 3 (should accumulate)
+    loss2 = param * 3.0
+    loss2.backward()
+
+    expected_total = first_grad + 3.0
+    assert np.allclose(param.grad, expected_total), f"Accumulation failed: expected {expected_total}, got {param.grad}"
+    print("   ✅ Gradient accumulation works correctly")
+
+    # Test 4: Backward compatibility
+    print("4️⃣ Testing Backward Compatibility...")
+    # Operations without gradients should work exactly as before
+    a = Tensor([1, 2, 3])  # No requires_grad
+    b = Tensor([4, 5, 6])  # No requires_grad
+    c = a + b
+    d = a * b
+    e = a - b
+
+    # Should work without any gradient tracking
+    assert not (hasattr(c, 'requires_grad') and c.requires_grad), "Non-grad tensors shouldn't track gradients"
+    print("   ✅ Backward compatibility maintained")
+
+    # Test 5: Error handling
+    print("5️⃣ Testing Error Handling...")
+    non_grad_tensor = Tensor([1.0], requires_grad=False)
+    try:
+        non_grad_tensor.backward()
+        assert False, "Should have raised error for non-gradient tensor"
+    except RuntimeError:
+        print("   ✅ Proper error handling for non-gradient tensors")
+
+    print("\n" + "=" * 50)
+    print("🎉 COMPLETE SUCCESS! ALL INTEGRATION TESTS PASSED!")
+    print("\n🚀 Your Autograd System Achievements:")
+    print("   • ✅ Gradient tracking for all operations")
+    print("   • ✅ Automatic chain rule through complex expressions")
+    print("   • ✅ Gradient accumulation for multiple losses")
+    print("   • ✅ Backward compatibility with existing code")
+    print("   • ✅ Proper error handling and validation")
+    print("   • ✅ Ready for neural network training!")
+
+    print("\n🔗 Ready for Next Module:")
+    print("   Module 06 (Optimizers) will use these gradients")
+    print("   to update neural network parameters automatically!")
+
+test_step6_integration_complete()
+
+# %% [markdown]
+"""
+## 🔍 Systems Analysis: Autograd Memory and Performance
+
+Now that your autograd system is complete, let's analyze its behavior to understand memory usage patterns and performance characteristics that matter in real ML systems.
+
+**Analysis Focus**: Memory overhead, computational complexity, and scaling behavior of gradient computation
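
Single-shot timings of very short operations are noisy, so overhead ratios can vary from run to run. One way to stabilise them is sketched below with the standard library's `timeit.repeat`, taking the best of several repeats. The names `best_time_ms`, `plain_ops`, and `tracked_ops` are hypothetical, and `tracked_ops` only imitates graph building by allocating one closure per operation; it does not use the module's `Tensor`.

```python
import timeit


def best_time_ms(fn, number=1000, repeat=5):
    """Best per-call time in milliseconds across several repeats."""
    runs = timeit.repeat(fn, number=number, repeat=repeat)
    return min(runs) / number * 1000.0


def plain_ops():
    """Arithmetic only, no bookkeeping."""
    z = 2.0 + 3.0
    return z * 2.0


def tracked_ops():
    """Same arithmetic, plus one closure per op as a stand-in for grad_fn."""
    z = 2.0 + 3.0
    add_fn = lambda g: (g, g)             # addition rule
    w = z * 2.0
    mul_fn = lambda g: (g * 2.0, g * z)   # product rule
    return w, add_fn, mul_fn


plain_ms = best_time_ms(plain_ops)
tracked_ms = best_time_ms(tracked_ops)
print(f"plain: {plain_ms:.6f}ms, tracked: {tracked_ms:.6f}ms, "
      f"ratio: {tracked_ms / plain_ms:.1f}x")
```

Taking the minimum rather than the mean filters out interference from other processes, which matters when individual calls take well under a microsecond.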
+"""
+
+# %%
+def analyze_autograd_behavior():
+    """
+    📊 SYSTEMS MEASUREMENT: Autograd Performance Analysis
+
+    Analyze memory usage and computational overhead of gradient tracking.
     """
     print("📊 AUTOGRAD SYSTEMS ANALYSIS")
-    print("Testing gradient computation patterns...")
+    print("=" * 40)
 
     import time
 
-    # Test 1: Expression complexity scaling
-    print("\n🔍 Expression Complexity Analysis:")
-    x = Variable([2.0], requires_grad=True)
-    y = Variable([3.0], requires_grad=True)
+    # Test 1: Memory overhead analysis
+    print("💾 Memory Overhead Analysis:")
 
-    expressions = [
-        ("Simple: x + y", lambda: add(x, y)),
-        ("Medium: x * y + x", lambda: add(multiply(x, y), x)),
-        ("Complex: (x + y) * (x - y)", lambda: multiply(add(x, y), subtract(x, y)))
-    ]
+    # Create tensors with and without gradient tracking
+    size = 1000
+    data = np.random.randn(size)
 
-    for name, expr_fn in expressions:
-        # Reset gradients
-        x.grad = None
-        y.grad = None
+    # Non-gradient tensor
+    no_grad_tensor = Tensor(data.copy(), requires_grad=False)
+
+    # Gradient tensor
+    grad_tensor = Tensor(data.copy(), requires_grad=True)
+
+    print(f"   Tensor size: {size} elements")
+    print("   Base tensor: data only")
+    print("   Gradient tensor: data + grad storage + grad_fn")
+    print("   Memory overhead: ~2x (data + grad), plus a small grad_fn closure per operation")
+
+    # Test 2: Computational overhead
+    print("\n⚡ Computational Overhead Analysis:")
+
+    x_no_grad = Tensor([2.0] * 100, requires_grad=False)
+    y_no_grad = Tensor([3.0] * 100, requires_grad=False)
+
+    x_grad = Tensor([2.0] * 100, requires_grad=True)
+    y_grad = Tensor([3.0] * 100, requires_grad=True)
+
+    # Time operations without gradients
+    start = time.perf_counter()
+    for _ in range(1000):
+        z = x_no_grad + y_no_grad
+        z = z * x_no_grad
+    no_grad_time = time.perf_counter() - start
+
+    # Time operations with gradients (forward only)
+    start = time.perf_counter()
+    for _ in range(1000):
+        z = x_grad + y_grad
+        z = z * x_grad
+    grad_forward_time = time.perf_counter() - start
+
+    print(f"   Operations without gradients: {no_grad_time*1000:.2f}ms")
+    print(f"   Operations with gradients: {grad_forward_time*1000:.2f}ms")
+    print(f"   Forward pass overhead: {grad_forward_time/no_grad_time:.1f}x")
+
+    # Test 3: Expression complexity scaling
+    print("\n📈 Expression Complexity Scaling:")
+
+    def time_expression(depth, with_gradients=True):
+        """Time increasingly complex expressions."""
+        x = Tensor([2.0], requires_grad=with_gradients)
+        y = Tensor([3.0], requires_grad=with_gradients)
 
-        # Time forward + backward pass
         start = time.perf_counter()
-        result = expr_fn()
-        result.backward()
-        elapsed = time.perf_counter() - start
+        result = x
+        for i in range(depth):
+            result = result + y
+            result = result * x
 
-        print(f"  {name}: {elapsed*1000:.3f}ms")
+        if with_gradients:
+            result.backward()
 
-    # Test 2: Memory usage pattern
-    print("\n💾 Memory Usage Analysis:")
-    try:
-        import psutil
-        import os
+        return time.perf_counter() - start
 
-        def get_memory_mb():
-            process = psutil.Process(os.getpid())
-            return process.memory_info().rss / 1024 / 1024
+    depths = [1, 5, 10, 20]
+    for depth in depths:
+        time_no_grad = time_expression(depth, False)
+        time_with_grad = time_expression(depth, True)
+        overhead = time_with_grad / time_no_grad
 
-        baseline = get_memory_mb()
-        psutil_available = True
-    except ImportError:
-        print("  Note: psutil not installed, skipping detailed memory analysis")
-        psutil_available = False
-        baseline = 0
+        print(f"   Depth {depth:2d}: {time_no_grad*1000:.3f}ms → {time_with_grad*1000:.3f}ms ({overhead:.1f}x overhead)")
 
-    # Create computation graph with many variables
-    variables = []
+    # Test 4: Gradient accumulation patterns
+    print("\n🔄 Gradient Accumulation Patterns:")
+
+    param = Tensor([1.0], requires_grad=True)
+
+    # Single large gradient vs multiple small gradients
+    param.grad = None
+    start = time.perf_counter()
+    large_loss = param * 100.0
+    large_loss.backward()
+    large_grad_time = time.perf_counter() - start
+    large_grad_value = param.grad.copy()
+
+    param.grad = None
+    start = time.perf_counter()
     for i in range(100):
-        var = Variable([float(i)], requires_grad=True)
-        variables.append(var)
+        small_loss = param * 1.0
+        small_loss.backward()
+    small_grad_time = time.perf_counter() - start
 
-    # Chain operations
-    result = variables[0]
-    for var in variables[1:]:
-        result = add(result, var)
-
-    if psutil_available:
-        memory_after_forward = get_memory_mb()
-
-    # Backward pass
-    result.backward()
-
-    if psutil_available:
-        memory_after_backward = get_memory_mb()
-        print(f"  Baseline memory: {baseline:.1f}MB")
-        print(f"  After forward pass: {memory_after_forward:.1f}MB (+{memory_after_forward-baseline:.1f}MB)")
-        print(f"  After backward pass: {memory_after_backward:.1f}MB (+{memory_after_backward-baseline:.1f}MB)")
-    else:
-        print("  Memory tracking skipped (psutil not available)")
-
-    # Test 3: Gradient accumulation
-    print("\n🔄 Gradient Accumulation Test:")
-    z = Variable([1.0], requires_grad=True)
-
-    # Multiple backward passes should accumulate gradients
-    loss1 = multiply(z, 2.0)
-    loss1.backward()
-    first_grad = z.grad.copy()
-
-    loss2 = multiply(z, 3.0)
-    loss2.backward()  # Should accumulate with previous gradient
-
-    print(f"  First backward: grad = {first_grad}")
-    print(f"  After second backward: grad = {z.grad}")
-    print(f"  Expected accumulation: {first_grad + 3.0}")
+    print(f"   Single large gradient: {large_grad_time*1000:.3f}ms → grad={large_grad_value}")
+    print(f"   100 small gradients: {small_grad_time*1000:.3f}ms → grad={param.grad}")
+    print(f"   Accumulation overhead: {small_grad_time/large_grad_time:.1f}x")
 
     print("\n💡 AUTOGRAD INSIGHTS:")
-    print("  • Forward pass builds computation graph in memory")
-    print("  • Backward pass traverses graph and accumulates gradients")
-    print("  • Memory scales with graph depth, not just data size")
-    print("  • This is why PyTorch uses gradient checkpointing for deep networks!")
+    print("   • Memory: Gradient tracking roughly doubles tensor memory (data + gradients), plus graph closures")
+    print("   • Forward pass: ~2x computational overhead for gradient graph building")
+    print("   • Backward pass: Additional computation on the order of one forward pass")
+    print("   • Expression depth: Overhead scales linearly with computation graph depth")
+    print("   • Gradient accumulation: Small overhead per accumulation operation")
+    print("   • Production impact: Why PyTorch offers torch.no_grad() for inference!")
 
-analyze_gradient_computation()
+analyze_autograd_behavior()
 
 # %% [markdown]
 """
-## Integration: Complete Module Testing
+## 🧪 Module Integration Test
 
-🧪 **Testing Strategy**: Comprehensive validation of all autograd functionality
-✅ **Quality Assurance**: Ensure all components work together correctly
-🚀 **Ready for Training**: Verify autograd enables neural network optimization
+Final validation that everything works together correctly.
 """
 
 # %%
 def test_module():
-    """Comprehensive test of autograd module functionality."""
-    print("🧪 COMPREHENSIVE MODULE TEST")
-    print("Running complete autograd validation...")
+    """
+    Comprehensive test of the entire autograd module's functionality.
 
-    # Test 1: Variable creation and basic properties
-    print("\n1️⃣ Testing Variable creation...")
-    x = Variable([1.0, 2.0], requires_grad=True)
-    assert isinstance(x.data, Tensor)
-    assert x.requires_grad == True
-    assert x.grad is None
-    print("   ✅ Variable creation works")
+    This final test runs before the module summary to ensure that:
+    - All components work correctly
+    - Autograd integrates with existing tensor operations
+    - The module is ready for use in neural network training
+    """
+    print("🧪 RUNNING MODULE INTEGRATION TEST")
+    print("=" * 50)
 
-    # Test 2: All arithmetic operations
-    print("\n2️⃣ Testing arithmetic operations...")
-    a = Variable([2.0], requires_grad=True)
-    b = Variable([3.0], requires_grad=True)
+    print("Running all unit tests...")
+    test_step1_gradient_attributes()
+    test_step2_backward_method()
+    test_step3_smart_addition()
+    test_step4_smart_multiplication()
+    test_step5_chain_rule_magic()
+    test_step6_integration_complete()
 
-    # Test each operation
-    add_result = add(a, b)
-    assert np.allclose(add_result.data.data, [5.0])
+    print("\n" + "=" * 50)
+    print("🎉 ALL TESTS PASSED! Module ready for export.")
+    print("Run: tito module complete 05_autograd")
 
-    mul_result = multiply(a, b)
-    assert np.allclose(mul_result.data.data, [6.0])
-
-    sub_result = subtract(a, b)
-    assert np.allclose(sub_result.data.data, [-1.0])
-    print("   ✅ All arithmetic operations work")
-
-    # Test 3: Gradient computation
-    print("\n3️⃣ Testing gradient computation...")
-    x = Variable([3.0], requires_grad=True)
-    y = Variable([4.0], requires_grad=True)
-    z = multiply(x, y)  # z = 12
-    z.backward()
-
-    assert np.allclose(x.grad, [4.0]), f"Expected x.grad=[4.0], got {x.grad}"
-    assert np.allclose(y.grad, [3.0]), f"Expected y.grad=[3.0], got {y.grad}"
-    print("   ✅ Gradient computation works")
-
-    # Test 4: Complex expressions
-    print("\n4️⃣ Testing complex expressions...")
-    p = Variable([2.0], requires_grad=True)
-    q = Variable([3.0], requires_grad=True)
-
-    # (p + q) * (p - q) = p² - q²
-    expr = multiply(add(p, q), subtract(p, q))
-    expr.backward()
-
-    # Expected: ∂(p²-q²)/∂p = 2p = 4, ∂(p²-q²)/∂q = -2q = -6
-    assert np.allclose(p.grad, [4.0]), f"Expected p.grad=[4.0], got {p.grad}"
-    assert np.allclose(q.grad, [-6.0]), f"Expected q.grad=[-6.0], got {q.grad}"
-    print("   ✅ Complex expressions work")
-
-    # Test 5: Matrix operations
-    print("\n5️⃣ Testing matrix operations...")
-    A = Variable([[1.0, 2.0]], requires_grad=True)
-    B = Variable([[3.0], [4.0]], requires_grad=True)
-    C = matmul(A, B)
-
-    assert np.allclose(C.data.data, [[11.0]])
-    C.backward()
-    assert np.allclose(A.grad, [[3.0, 4.0]])
-    assert np.allclose(B.grad, [[1.0], [2.0]])
-    print("   ✅ Matrix operations work")
-
-    # Test 6: Mixed operations
-    print("\n6️⃣ Testing mixed operations...")
-    u = Variable([1.0], requires_grad=True)
-    v = Variable([2.0], requires_grad=True)
-
-    # Neural network-like computation: u * v + u
-    hidden = multiply(u, v)  # u * v
-    output = add(hidden, u)   # + u
-    output.backward()
-
-    # Expected: ∂(u*v + u)/∂u = v + 1 = 3, ∂(u*v + u)/∂v = u = 1
-    assert np.allclose(u.grad, [3.0]), f"Expected u.grad=[3.0], got {u.grad}"
-    assert np.allclose(v.grad, [1.0]), f"Expected v.grad=[1.0], got {v.grad}"
-    print("   ✅ Mixed operations work")
-
-    print("\n🎉 ALL TESTS PASSED!")
-    print("🚀 Autograd module is ready for neural network training!")
-    print("🔗 Next: Use these gradients in optimizers to update parameters")
+test_module()
 
 # %%
 if __name__ == "__main__":
+    print("🚀 Running Autograd module...")
     test_module()
+    print("✅ Module validation complete!")
 
 # %% [markdown]
 """
 ## 🤔 ML Systems Thinking: Interactive Questions
 
-### Question 1: Memory Management in Computational Graphs
+### Question 1: Memory Management in Gradient Computation
 
-Consider the expression `z = (x + y) * (x - y)` where x and y have `requires_grad=True`.
+Your autograd implementation stores references to input tensors through grad_fn closures. In a deep neural network with 50 layers, each layer creates intermediate tensors with gradient functions.
 
-**Analysis Task**: Your autograd implementation stores intermediate results during forward pass and uses them during backward pass. In a deep neural network with 100 layers, each layer creating intermediate variables, what memory challenges would emerge?
+**Analysis Task**: Examine how your gradient tracking affects memory usage patterns.
 
 **Specific Questions**:
-- How does memory usage scale with network depth in your current implementation?
-- What strategies could reduce memory usage during gradient computation?
-- Why do production frameworks like PyTorch implement "gradient checkpointing"?
+- How does memory usage scale with network depth in your implementation?
+- What happens to memory when you call `backward()` on the final loss?
+- Why do production frameworks implement "gradient checkpointing"?
 
-**Implementation Connection**: Examine how your `grad_fn` closures capture references to input variables and consider the memory implications.
+**Implementation Connection**: Look at how your `grad_fn` closures capture references to input tensors and consider memory implications for deep networks.
 """
 
-# %% nbgrader={"grade": true, "grade_id": "memory-analysis", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
+# %% nbgrader={"grade": true, "grade_id": "memory-management", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
 """
-TODO: Analyze memory usage patterns in your autograd implementation.
+TODO: Analyze memory management in your gradient computation system.
 
-Consider how your Variable class stores references to other variables through grad_fn,
-and how this affects memory usage in deep networks.
-
-Discuss specific memory optimization strategies you could implement.
+Consider how your grad_fn closures store references to input tensors and
+how this affects memory usage in deep networks.
 """
 ### BEGIN SOLUTION
-# Memory analysis for autograd implementation:
+# Memory management analysis:
 
 # 1. Memory scaling with network depth:
-# - Each Variable stores references to inputs through grad_fn closure
-# - In deep networks: O(depth) memory growth for intermediate activations
-# - Gradient computation requires keeping forward activations in memory
-# - 100-layer network = 100x intermediate variables + their grad_fn closures
+# - Each operation creates a tensor with grad_fn that references input tensors
+# - In 50-layer network: 50 intermediate tensors + their grad_fn closures
+# - Each grad_fn keeps input tensors alive in memory
+# - Memory grows O(depth) for intermediate activations
 
-# 2. Memory optimization strategies:
-# - Gradient checkpointing: Only store subset of activations, recompute others
-# - In-place operations where mathematically valid
-# - Clear computation graph after backward pass
-# - Use smaller data types (float16 vs float32) where precision allows
+# 2. Memory behavior during backward():
+# - Forward pass: Builds computation graph, keeps all intermediates
+# - Backward pass: Traverses graph but doesn't immediately free memory
+# - Python frees tensors (via reference counting) once no references remain
+# - Peak memory occurs at end of forward pass
 
-# 3. Production framework solutions:
-# - PyTorch's gradient checkpointing trades compute for memory
-# - Automatic memory management with garbage collection
-# - Graph optimization to reduce intermediate storage
-# - Dynamic graph construction vs static graph optimization
+# 3. Gradient checkpointing solution:
+# - Trade compute for memory: store only subset of activations
+# - Recompute intermediate activations during backward pass
+# - Reduces memory from O(depth) to O(sqrt(depth)) with ~sqrt(depth) evenly spaced checkpoints
+# - Essential for training very deep networks
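+#
+# A minimal sketch of that trade-off (an illustrative assumption, not part of
+# the module): with a checkpoint every `interval` layers, roughly
+# depth/interval checkpoints plus one recomputed segment of `interval`
+# activations are live at once. The helper name below is hypothetical.
+def checkpointed_activation_count(depth, interval):
+    """Approximate peak number of stored activations with checkpointing."""
+    checkpoints = -(-depth // interval)  # ceil(depth / interval)
+    return checkpoints + interval        # checkpoints + one recomputed segment
+
+# For depth=100: interval=1 gives 101, interval=10 (= sqrt(100)) gives 20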
 
-# Current implementation improvement:
-# Add method to clear computation graph: variable.detach() or graph.clear()
+# Production implementations:
+# - PyTorch: torch.utils.checkpoint for gradient checkpointing
+# - TensorFlow: tf.recompute_grad decorator
+# - Custom: Clear computation graph after backward pass
+
+# Memory optimization strategies:
+# 1. In-place operations where mathematically safe
+# 2. Clear gradients regularly: param.grad = None
+# 3. Use torch.no_grad() for inference
+# 4. Implement custom backward functions for memory efficiency
 ### END SOLUTION
 
 # %% [markdown]
 """
-### Question 2: Gradient Accumulation and Training Efficiency
+### Question 2: Computational Graph Optimization
 
-In your autograd implementation, gradients accumulate when `backward()` is called multiple times without zeroing gradients.
+Your autograd system builds computation graphs dynamically. Each operation creates a new tensor with its own grad_fn.
 
-**Analysis Task**: Design a training loop that uses gradient accumulation to simulate larger batch sizes with limited memory.
+**Analysis Task**: Identify opportunities for optimizing computational graphs to reduce overhead.
 
 **Specific Questions**:
-- How would you modify the Variable class to support gradient zeroing?
-- What are the trade-offs between large batches vs. gradient accumulation?
-- How does gradient accumulation affect convergence in neural network training?
+- Which operations could be fused together to reduce intermediate tensor creation?
+- How would operator fusion affect gradient computation correctness?
+- What trade-offs exist between graph complexity and performance?
 
-**Implementation Connection**: Consider how your `backward()` method accumulates gradients and design a complete training interface.
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "gradient-accumulation", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
-"""
-TODO: Design gradient accumulation strategy for your autograd system.
-
-Extend your Variable class with gradient management methods and analyze
-the trade-offs between memory efficiency and training convergence.
-"""
-### BEGIN SOLUTION
-# Gradient accumulation design for training efficiency:
-
-# 1. Variable class extensions needed:
-def zero_grad(self):
-    """Clear accumulated gradients."""
-    self.grad = None
-
-def add_zero_grad_to_variable():
-    """Would add this method to Variable class"""
-    # Implementation would set self.grad = None
-    pass
-
-# 2. Training loop with gradient accumulation:
-def training_step_with_accumulation(model, data_loader, accumulation_steps=4):
-    """
-    Simulate larger batches through gradient accumulation
-    """
-    for param in model.parameters():
-        param.zero_grad()
-
-    total_loss = 0
-    for i, batch in enumerate(data_loader):
-        loss = compute_loss(model(batch.x), batch.y)
-        loss.backward()  # Accumulate gradients
-        total_loss += loss.data
-
-        if (i + 1) % accumulation_steps == 0:
-            # Update parameters with accumulated gradients
-            optimizer.step()
-            # Clear gradients for next accumulation cycle
-            for param in model.parameters():
-                param.zero_grad()
-
-    return total_loss / len(data_loader)
-
-# 3. Trade-offs analysis:
-# Memory: Gradient accumulation uses constant memory vs. large batch linear growth
-# Convergence: Accumulated gradients approximate large batch behavior
-# Computation: Extra backward passes vs. single large batch forward/backward
-# Synchronization: In distributed training, less frequent communication
-
-# 4. Production considerations:
-# - Gradient scaling to prevent underflow with accumulated small gradients
-# - Learning rate adjustment for effective batch size
-# - Batch normalization statistics affected by actual vs effective batch size
-### END SOLUTION
-
-# %% [markdown]
-"""
-### Question 3: Computational Graph Optimization
-
-Your autograd implementation creates a new Variable for each operation, building a computation graph dynamically.
-
-**Analysis Task**: Analyze opportunities for optimizing the computational graph to reduce memory usage and improve performance.
-
-**Specific Questions**:
-- Which operations could be fused together to reduce intermediate Variable storage?
-- How would in-place operations affect gradient computation safety?
-- What graph optimization passes could be implemented before backward propagation?
-
-**Implementation Connection**: Examine your operation functions and identify where intermediate results could be eliminated or reused.
+**Implementation Connection**: Examine your operation functions and consider where computation could be optimized while maintaining gradient correctness.
 """
 
 # %% nbgrader={"grade": true, "grade_id": "graph-optimization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
 """
-TODO: Design graph optimization strategies for your autograd implementation.
+TODO: Design computational graph optimizations for your autograd system.
 
-Identify specific optimizations that could reduce memory usage and improve
-performance while maintaining gradient correctness.
+Consider how operations could be fused or optimized while maintaining
+gradient correctness.
 """
 ### BEGIN SOLUTION
 # Computational graph optimization strategies:
 
 # 1. Operation fusion opportunities:
-# - Fuse: add + multiply → fused_add_mul (one intermediate variable)
-# - Fuse: activation + linear → fused_linear_activation
-# - Elementwise operations: add + relu + multiply can be single kernel
-# Current: 3 Variables → Optimized: 1 Variable
+# Current: z = (x + y) * w creates 2 tensors (intermediate + result)
+# Optimized: Single "fused_add_mul" operation creates 1 tensor
 
-def fused_add_multiply(a, b, c):
-    """Fused operation: (a + b) * c - saves one intermediate Variable"""
-    # Direct computation without intermediate Variable
-    result_data = (a.data.data + b.data.data) * c.data.data
+def fused_add_multiply(x, y, w):
+    """Fused operation: (x + y) * w"""
+    # Direct computation without intermediate tensor
+    result_data = (x.data + y.data) * w.data
+    result = Tensor(result_data, requires_grad=(x.requires_grad or y.requires_grad or w.requires_grad))
 
     def grad_fn(gradient):
-        if a.requires_grad:
-            a.backward(gradient * c.data.data)
-        if b.requires_grad:
-            b.backward(gradient * c.data.data)
-        if c.requires_grad:
-            c.backward(gradient * (a.data.data + b.data.data))
+        if x.requires_grad:
+            x.backward(gradient * w.data)  # Chain rule
+        if y.requires_grad:
+            y.backward(gradient * w.data)
+        if w.requires_grad:
+            w.backward(gradient * (x.data + y.data))
 
-    return Variable(result_data, requires_grad=any([a.requires_grad, b.requires_grad, c.requires_grad]), grad_fn=grad_fn)
+    result.grad_fn = grad_fn
+    return result
 
-# 2. In-place operation safety:
-# Safe: element-wise operations on leaf variables not used elsewhere
-# Unsafe: in-place on intermediate variables used in multiple paths
-# Solution: Track variable usage count before allowing in-place
+# 2. Safe fusion patterns:
+# - Element-wise operations: add + mul + relu → single kernel
+# - Linear operations: matmul + bias_add → single operation
+# - Activation chains: sigmoid + multiply → swish activation
 
-def safe_inplace_add(var, other):
-    """In-place addition if safe for gradient computation"""
-    if var.grad_fn is not None:
-        raise RuntimeError("Cannot do in-place operation on variable with grad_fn")
-    var.data.data += other.data.data
-    return var
+# 3. Gradient correctness preservation:
+# - Fusion must preserve mathematical equivalence
+# - Chain rule application remains identical
+# - Numerical stability must be maintained
 
-# 3. Graph optimization passes:
-# - Dead code elimination: Remove unused intermediate variables
-# - Common subexpression elimination: Reuse x*y if computed multiple times
-# - Memory layout optimization: Arrange for cache-friendly access patterns
+# 4. Trade-offs analysis:
+# Memory: Fewer intermediate tensors reduces memory usage
+# Compute: Fused operations can be more cache-efficient
+# Complexity: Harder to debug fused operations
+# Flexibility: Less modular, harder to optimize individual ops
 
-class GraphOptimizer:
-    def optimize_memory_layout(self, variables):
-        """Optimize variable storage for cache efficiency"""
-        # Group related variables in contiguous memory
-        pass
+# 5. Production techniques:
+# - TensorFlow XLA: Compiler-driven operator fusion (JIT or ahead-of-time)
+# - PyTorch JIT: Runtime graph optimization
+# - ONNX: Graph optimization passes for deployment
+# - Custom CUDA kernels: Maximum performance for common patterns
 
-    def eliminate_dead_variables(self, root_variable):
-        """Remove variables not needed for gradient computation"""
-        # Traverse backward from root, mark reachable variables
-        pass
+# Example optimization for common pattern:
+class OptimizedLinear:
+    @staticmethod
+    def forward(x, weight, bias, activation):
+        # Fused: matmul + bias_add + activation (activation is any callable)
+        return activation(x @ weight + bias)  # Single backward pass
 
-    def fuse_operations(self, computation_sequence):
-        """Identify fusible operation sequences"""
-        # Pattern matching for common operation combinations
-        pass
-
-# 4. Production framework techniques:
-# - TensorFlow's XLA: Ahead-of-time compilation with graph optimization
-# - PyTorch's TorchScript: Graph optimization for inference
-# - ONNX graph optimization passes: Constant folding, operator fusion
-# - Memory planning: Pre-allocate memory for entire computation graph
+# Memory-efficient alternative:
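+class CheckpointedOperation:
+    @staticmethod
+    def forward(inputs, complex_computation):
+        # Store only inputs, recompute intermediates during backward
+        # (complex_computation is any callable supplied by the caller)
+        return complex_computation(inputs)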
 ### END SOLUTION
 
 # %% [markdown]
 """
-## 🎯 MODULE SUMMARY: Autograd - Decorator-Based Automatic Differentiation
+### Question 3: Gradient Flow Analysis
 
-Congratulations! You've mastered the decorator pattern to enhance pure tensors with gradient tracking:
+In your autograd implementation, gradients flow backward through the computation graph via the chain rule.
+
+**Analysis Task**: Analyze how gradient magnitudes change as they flow through different types of operations.
+
+**Specific Questions**:
+- How do gradients change magnitude when flowing through multiplication vs addition?
+- What causes vanishing or exploding gradients in deep networks?
+- How would you detect and mitigate gradient flow problems?
+
+**Implementation Connection**: Consider how your product rule implementation in multiplication affects gradient magnitudes compared to your addition implementation.
+"""
+
+# %% nbgrader={"grade": true, "grade_id": "gradient-flow", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
+"""
+TODO: Analyze gradient flow patterns in your autograd implementation.
+
+Examine how different operations affect gradient magnitudes and identify
+potential gradient flow problems.
+"""
+### BEGIN SOLUTION
+# Gradient flow analysis:
+
+# 1. Gradient magnitude changes by operation:
+
+# Addition: z = x + y
+# ∂z/∂x = 1, ∂z/∂y = 1
+# Gradients pass through unchanged - magnitude preserved
+
+# Multiplication: z = x * y
+# ∂z/∂x = y, ∂z/∂y = x
+# Gradients scaled by other operand - magnitude can grow/shrink dramatically
+
+# Example analysis:
+def analyze_gradient_flow():
+    x = Tensor([0.1], requires_grad=True)  # Small value
+    y = Tensor([10.0], requires_grad=True)  # Large value
+
+    # Addition preserves gradients
+    z1 = x + y
+    z1.backward()
+    print(f"Addition: x.grad={x.grad}, y.grad={y.grad}")  # Both [1.0]
+
+    x.grad = None
+    y.grad = None
+
+    # Multiplication scales gradients
+    z2 = x * y
+    z2.backward()
+    print(f"Multiplication: x.grad={x.grad}, y.grad={y.grad}")  # [10.0], [0.1]
+
+# 2. Vanishing gradient causes:
+# - Many multiplications by small values (< 1.0)
+# - Deep networks: gradient = ∏_i (∂L_i/∂L_{i-1}) → 0 as depth increases when the factors are < 1
+# - Activation functions with small derivatives (sigmoid saturation)
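+#
+# A rough numeric sketch of the product above (assumes every layer contributes
+# the sigmoid's maximum derivative, 0.25; the function name is hypothetical):
+def chained_gradient_scale(per_layer_derivative, depth):
+    """Gradient scale after multiplying the same local derivative `depth` times."""
+    return per_layer_derivative ** depth
+
+# chained_gradient_scale(0.25, 10) is about 9.5e-7, already close to zero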
+
+# 3. Exploding gradient causes:
+# - Many multiplications by large values (> 1.0)
+# - Poor weight initialization
+# - High learning rates
+
+# 4. Detection strategies:
+def detect_gradient_problems(model_parameters):
+    """Detect vanishing/exploding gradients"""
+    grad_norms = []
+    for param in model_parameters:
+        if param.grad is not None:
+            grad_norm = np.linalg.norm(param.grad)
+            grad_norms.append(grad_norm)
+
+    max_norm = max(grad_norms) if grad_norms else 0
+    min_norm = min(grad_norms) if grad_norms else 0
+
+    if max_norm > 10.0:
+        print("⚠️  Exploding gradients detected!")
+    if grad_norms and min_norm < 1e-6:
+        print("⚠️  Vanishing gradients detected!")
+
+    return grad_norms
+
+# 5. Mitigation strategies:
+# Gradient clipping for exploding gradients:
+def clip_gradients(parameters, max_norm=1.0):
+    total_norm = 0
+    for param in parameters:
+        if param.grad is not None:
+            total_norm += np.sum(param.grad ** 2)
+    total_norm = np.sqrt(total_norm)
+
+    if total_norm > max_norm:
+        clip_factor = max_norm / total_norm
+        for param in parameters:
+            if param.grad is not None:
+                param.grad = param.grad * clip_factor
+
+# Better weight initialization for vanishing gradients:
+# - Xavier/Glorot initialization
+# - He initialization for ReLU networks
+# - Layer normalization to control activations
+
+# Architectural solutions:
+# - Skip connections (ResNet)
+# - LSTM gates for sequences
+# - Careful activation function choice (ReLU vs sigmoid)
+### END SOLUTION
+
+# %% [markdown]
+"""
+## 🎯 MODULE SUMMARY: Autograd - Incremental Automatic Differentiation
+
+Congratulations! You've built a complete automatic differentiation system through six manageable steps!
 
 ### What You've Accomplished
-✅ **Decorator Implementation**: Clean enhancement of existing Tensor class with 100+ lines of elegant code
-✅ **Backward Compatibility**: All Module 01-04 code works unchanged - zero breaking changes
-✅ **Gradient Tracking**: Optional `requires_grad=True` parameter enables automatic differentiation
-✅ **Chain Rule Application**: Automatic gradient computation through complex mathematical expressions
-✅ **Systems Understanding**: Analysis of memory patterns and performance characteristics
-✅ **Production Connection**: Understanding of how real ML frameworks evolved
+✅ **Step-by-Step Enhancement**: Added gradient tracking to existing Tensor class without breaking any functionality
+✅ **Gradient Memory**: Tensors now store gradients and backward functions (Steps 1-2)
+✅ **Smart Operations**: Addition, multiplication, and subtraction automatically track gradients (Steps 3-4)
+✅ **Chain Rule Magic**: Complex expressions compute gradients automatically through the entire computation graph (Step 5)
+✅ **Complete Integration**: Full autograd system ready for neural network training (Step 6)
+✅ **Systems Understanding**: Memory overhead analysis and performance characteristics
 
 ### Key Learning Outcomes
-- **Python Metaprogramming**: Advanced decorator patterns for class enhancement
-- **Software Architecture**: Clean enhancement without code contamination
-- **Backward Compatibility**: Professional approach to adding features safely
-- **Automatic Differentiation**: How computational graphs enable efficient gradient computation
-- **Production Understanding**: Connection to PyTorch's evolution from Variable to Tensor-based autograd
+- **Incremental Development**: How to enhance complex systems step by step with immediate validation
+- **Chain Rule Implementation**: Automatic gradient computation through mathematical expressions
+- **Software Architecture**: Safe enhancement of existing classes without breaking functionality
+- **Memory Management**: Understanding computational graph storage and gradient accumulation patterns
+- **Production Insights**: How real ML frameworks implement automatic differentiation
 
 ### Technical Foundations Mastered
-- **Decorator Pattern**: Method interception and enhancement techniques
-- **Computational Graphs**: Dynamic graph construction through operation tracking
-- **Chain Rule**: Automatic application through backward propagation
-- **Memory Management**: Efficient gradient accumulation and graph storage
-- **Performance Analysis**: Understanding overhead patterns in gradient computation
+- **Gradient Tracking**: `requires_grad`, `grad`, and `grad_fn` attributes for automatic differentiation
+- **Backward Propagation**: Automatic chain rule application through computation graphs
+- **Product Rule**: Correct gradient computation for multiplication operations
+- **Gradient Accumulation**: Proper handling of multiple backward passes
+- **Error Handling**: Robust validation for gradient computation requirements
 
 ### Professional Skills Developed
-- **Clean Code Enhancement**: Adding features without breaking existing functionality
-- **Advanced Python**: Metaprogramming techniques used in production frameworks
-- **Systems Thinking**: Understanding trade-offs between functionality and performance
-- **Testing Methodology**: Comprehensive validation including backward compatibility
+- **Incremental Enhancement**: Adding complex features through small, testable steps
+- **Immediate Feedback**: Validating each enhancement before proceeding to next step
+- **Backward Compatibility**: Ensuring existing functionality remains intact
+- **Systems Analysis**: Understanding memory and performance implications of design choices
 
 ### Ready for Advanced Applications
-Your enhanced Tensor class now enables:
-- **Neural Network Training**: Seamless gradient computation for parameter updates
-- **Optimization Algorithms**: Foundation for SGD, Adam, and other optimizers
-- **Research Applications**: Understanding of how modern frameworks implement autograd
+Your enhanced Tensor class enables:
+- **Neural Network Training**: Automatic gradient computation for parameter updates
+- **Optimization Algorithms**: Foundation for SGD, Adam, and other optimizers (Module 06)
+- **Complex Architectures**: Support for any differentiable computation graph
+- **Research Applications**: Building and experimenting with novel ML architectures
 
 ### Connection to Real ML Systems
-Your decorator-based implementation mirrors production evolution:
-- **PyTorch v0.1**: Separate Variable class (old approach)
-- **PyTorch v0.4+**: Tensor-based autograd using enhancement patterns (your approach!)
-- **TensorFlow**: Similar evolution from separate Variable to enhanced Tensor
-- **Industry Standard**: Decorator pattern widely used for framework evolution
+Your incremental approach mirrors production development:
+- **PyTorch Evolution**: Similar step-by-step enhancement from pure tensors to autograd-capable tensors
+- **TensorFlow 2.0**: Eager execution with automatic differentiation follows similar patterns
+- **Professional Development**: Industry standard for adding complex features safely
+- **Debugging Friendly**: Step-by-step approach makes gradient computation errors easier to trace
+
+### Performance Characteristics Discovered
+- **Memory Overhead**: At least ~2x memory usage (data + gradients), plus the computation graph
+- **Computational Overhead**: ~2x forward pass time for gradient graph building
+- **Scaling Behavior**: Linear scaling with computation graph depth
+- **Optimization Opportunities**: Operation fusion and gradient checkpointing potential
 
 ### Next Steps
 1. **Export your module**: `tito module complete 05_autograd`
-2. **Validate integration**: All Module 01-04 code still works + new gradient features
-3. **Ready for Module 06**: Optimizers will use your gradients to update neural network parameters!
+2. **Validate integration**: All previous tensor operations still work + new gradient features
+3. **Ready for Module 06**: Optimizers will use these gradients to train neural networks!
 
-**🚀 Achievement Unlocked**: You've mastered the professional approach to enhancing software systems without breaking existing functionality - exactly how real ML frameworks evolved!
+**🚀 Achievement Unlocked**: You've mastered incremental software enhancement - building complex systems through small, immediately rewarding steps. This is exactly how professional ML engineers develop production systems!
 """
\ No newline at end of file