Mirror of https://github.com/MLSysBook/TinyTorch.git (synced 2026-04-28 02:48:00 -05:00)
refactor: Migrate module configuration files from .yaml to .yml
- Renamed all module.yaml files to [module_name].yml for consistency
- Updated module configuration format and structure
- Added new module configurations for all 20 modules
- Removed obsolete benchmarking module (20_benchmarking)
- Added new capstone module (20_capstone)
- Enhanced autograd module with visual examples and improved implementation
- Updated optimizers module with latest improvements
- Standardized YAML structure across all modules
@@ -7,8 +7,7 @@ description: "Development environment setup and basic TinyTorch functionality"
|
||||
|
||||
# Dependencies - Used by CLI for module ordering and prerequisites
|
||||
dependencies:
|
||||
prerequisites: []
|
||||
enables: ["tensor", "activations", "layers"]
|
||||
prerequisites: []
|
||||
|
||||
# Package Export - What gets built into tinytorch package
|
||||
exports_to: "tinytorch.core.setup"
|
||||
@@ -7,8 +7,7 @@ description: "Core tensor data structure and operations"
|
||||
|
||||
# Dependencies - Used by CLI for module ordering and prerequisites
|
||||
dependencies:
|
||||
prerequisites: ["setup"]
|
||||
enables: ["activations", "layers", "autograd"]
|
||||
prerequisites: ["setup"]
|
||||
|
||||
# Package Export - What gets built into tinytorch package
|
||||
exports_to: "tinytorch.core.tensor"
|
||||
@@ -7,8 +7,7 @@ description: "Neural network activation functions (ReLU, Sigmoid, Tanh, Softmax)
|
||||
|
||||
# Dependencies - Used by CLI for module ordering and prerequisites
|
||||
dependencies:
|
||||
prerequisites: ["tensor"]
|
||||
enables: ["layers", "networks"]
|
||||
prerequisites: ["tensor"]
|
||||
|
||||
# Package Export - What gets built into tinytorch package
|
||||
exports_to: "tinytorch.core.activations"
|
||||
@@ -8,7 +8,6 @@ description: "Neural network layers (Linear, activation layers)"
|
||||
# Dependencies - Used by CLI for module ordering and prerequisites
|
||||
dependencies:
|
||||
prerequisites: ["setup", "tensor", "activations"]
|
||||
enables: ["networks", "training"]
|
||||
|
||||
# Package Export - What gets built into tinytorch package
|
||||
exports_to: "tinytorch.core.layers"
|
||||
@@ -8,7 +8,6 @@ description: "Automatic differentiation engine for gradient computation"
|
||||
# Dependencies - Used by CLI for module ordering and prerequisites
|
||||
dependencies:
|
||||
prerequisites: ["setup", "tensor", "activations"]
|
||||
enables: ["optimizers", "training"]
|
||||
|
||||
# Package Export - What gets built into tinytorch package
|
||||
exports_to: "tinytorch.core.autograd"
|
||||
modules/06_autograd/autograd_dev_enhanced_v2.py (new file, 899 lines)
@@ -0,0 +1,899 @@
# %% [markdown]
"""
# Autograd - Automatic Differentiation Engine

Welcome to Autograd! You'll implement the magic that powers deep learning - automatic gradient computation for ANY computational graph!

## 🔗 Building on Previous Learning

**What You Built Before**:
- Module 02 (Tensor): Data structures for n-dimensional arrays
- Module 03 (Activations): Non-linear functions for neural networks

**What's Working**: You can build computational graphs with tensors and apply non-linear transformations.

**The Gap**: You have to manually compute derivatives - tedious, error-prone, and it doesn't scale to complex networks.

**This Module's Solution**: Build an automatic differentiation engine that tracks operations and computes gradients via the chain rule.

**Connection Map**:
```
Tensor  →  Autograd  →  Optimizers
(data)     (∂f/∂x)      (x -= α·∂f/∂x)
```

## Learning Goals
- Understand computational graphs and gradient flow
- Master the chain rule for automatic differentiation
- Build memory-efficient gradient accumulation
- Connect to PyTorch's autograd system
- Analyze memory vs compute trade-offs in backpropagation

## Build → Use → Reflect
1. **Build**: Implement the Variable class and gradient computation
2. **Use**: Test on complex computational graphs
3. **Reflect**: Analyze memory usage and scaling behavior

## Systems Reality Check
💡 **Production Context**: PyTorch's autograd is the foundation of all deep learning
⚡ **Performance Insight**: Gradient storage can use 2-3x more memory than the forward pass!
"""

# %%
#| default_exp autograd
import numpy as np
from typing import List, Optional, Callable, Union
# %% [markdown]
"""
## Part 1: The Million Dollar Question

How does PyTorch automatically compute gradients for ANY neural network architecture, no matter how complex?

The answer: **Computational Graphs + Chain Rule**

Let's discover how this works by building it ourselves!
"""

# %% [markdown]
"""
## Part 2: The Variable Class - Tracking Computation History

Every value in our computational graph needs to remember:
1. Its data
2. Whether it needs gradients
3. How it was created (for backpropagation)
"""

# %% nbgrader={"grade": false, "grade_id": "variable-class", "solution": true}
#| export
class Variable:
    """
    A Variable wraps data and tracks how it was created for gradient computation.

    This is the foundation of automatic differentiation - each Variable knows
    its parents and the operation that created it, forming a computational graph.

    TODO: Implement the Variable class with gradient tracking capabilities.

    APPROACH:
    1. Store data as numpy array for efficient computation
    2. Track whether gradients are needed (requires_grad)
    3. Store the operation that created this Variable (grad_fn)

    EXAMPLE:
    >>> x = Variable(np.array([2.0]), requires_grad=True)
    >>> y = x * 3  # y knows it was created by multiplication
    >>> print(y.data)
    [6.0]

    HINTS:
    - Use np.array() to ensure data is numpy array
    - Initialize grad to None (computed during backward)
    - grad_fn stores the backward function
    """

    def __init__(self, data, requires_grad=False, grad_fn=None):
        ### BEGIN SOLUTION
        # SYSTEMS INSIGHT: float32 uses 4 bytes per element
        # For 1B parameters = 4GB just for data storage
        self.data = np.array(data, dtype=np.float32)
        self.requires_grad = requires_grad

        # CRITICAL ML PATTERN: Gradients initialized lazily
        # Memory saved until backward() is called
        self.grad = None

        # AUTOGRAD CORE: Links to parent operation in computation graph
        # Enables automatic chain rule application
        self.grad_fn = grad_fn
        self._backward_hooks = []  # Extension point for advanced features
        ### END SOLUTION

    def backward(self, gradient=None):
        """
        Compute gradients via backpropagation using chain rule.

        TODO: Implement backward pass through computational graph.

        APPROACH:
        1. Initialize gradient if not provided (for scalar outputs)
        2. Accumulate gradients (for shared parameters)
        3. Call grad_fn to propagate gradients to parents

        HINTS:
        - Gradient accumulates: grad = grad + new_gradient
        - Only propagate if grad_fn exists
        - Check requires_grad before accumulating
        """
        ### BEGIN SOLUTION
        # OPTIMIZATION: Skip gradient computation when not needed
        # Saves O(N) operations where N = parameter count
        if not self.requires_grad:
            return

        # AUTOGRAD PATTERN: Scalar loss needs starting gradient
        # ∂L/∂L = 1 (derivative of loss w.r.t. itself)
        if gradient is None:
            if self.data.size != 1:
                raise RuntimeError("Gradient must be specified for non-scalar outputs")
            gradient = np.ones_like(self.data)  # O(1) memory for scalars

        # CRITICAL ML SYSTEMS PRINCIPLE: Gradient accumulation
        # Why: Shared parameters (e.g., embeddings) receive gradients from multiple paths
        # Memory: Creates new array to avoid aliasing bugs
        if self.grad is None:
            self.grad = gradient
        else:
            self.grad = self.grad + gradient  # += would modify original!

        # GRAPH TRAVERSAL: Recursive backpropagation
        # Complexity: O(graph_depth), can hit Python recursion limit (~1000)
        if self.grad_fn is not None:
            self.grad_fn(gradient)
        ### END SOLUTION

    def zero_grad(self):
        """Reset gradient to None."""
        ### BEGIN SOLUTION
        self.grad = None
        ### END SOLUTION
# %% [markdown]
"""
## Part 3: Implementing Operations with Gradient Tracking

Now we need operations that build the computational graph AND know how to compute gradients.
"""

# %% nbgrader={"grade": false, "grade_id": "operations", "solution": true}
#| export
class Add:
    """Addition operation with gradient computation."""

    @staticmethod
    def forward(a: Variable, b: Variable) -> Variable:
        """
        Forward pass: z = a + b

        TODO: Implement forward pass and create backward function.

        HINTS:
        - Result needs gradients if either input needs gradients
        - Backward function gets gradient from child
        - Addition gradient: ∂z/∂a = 1, ∂z/∂b = 1
        """
        ### BEGIN SOLUTION
        # Track gradients if either input needs them
        requires_grad = a.requires_grad or b.requires_grad

        def backward_fn(grad_output):
            # Addition gradient: ∂z/∂a = 1, ∂z/∂b = 1
            # Just pass gradients through unchanged
            if a.requires_grad:
                a.backward(grad_output)
            if b.requires_grad:
                b.backward(grad_output)

        # Create output Variable with link to backward function
        result = Variable(
            a.data + b.data,
            requires_grad=requires_grad,
            grad_fn=backward_fn if requires_grad else None
        )
        return result
        ### END SOLUTION


class Multiply:
    """Multiplication operation with gradient computation."""

    @staticmethod
    def forward(a: Variable, b: Variable) -> Variable:
        """
        Forward pass: z = a * b

        TODO: Implement forward pass with gradient tracking.

        HINTS:
        - Multiplication gradient uses chain rule
        - ∂z/∂a = b, ∂z/∂b = a
        - Save values needed for backward
        """
        ### BEGIN SOLUTION
        requires_grad = a.requires_grad or b.requires_grad

        def backward_fn(grad_output):
            # Chain rule for multiplication:
            # ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
            if a.requires_grad:
                a.backward(grad_output * b.data)  # Scale by other operand
            if b.requires_grad:
                b.backward(grad_output * a.data)  # Scale by other operand

        result = Variable(
            a.data * b.data,
            requires_grad=requires_grad,
            grad_fn=backward_fn if requires_grad else None
        )
        return result
        ### END SOLUTION
# Add operator overloading for convenience.
# Coerce plain Python numbers to Variables so expressions like `x * 3` work.
def _as_variable(other):
    return other if isinstance(other, Variable) else Variable(other)

Variable.__add__ = lambda self, other: Add.forward(self, _as_variable(other))
Variable.__mul__ = lambda self, other: Multiply.forward(self, _as_variable(other))
# %% [markdown]
"""
### ✅ IMPLEMENTATION CHECKPOINT: Basic autograd complete

### 🤔 PREDICTION: How much memory does gradient storage use compared to parameters?
Write your guess: _____ × parameter memory

### 🔍 SYSTEMS INSIGHT #1: Gradient Memory Analysis
"""

# %%
def analyze_gradient_memory():
    """Let's measure the memory overhead of gradients!"""
    try:
        # Create a simple computational graph
        x = Variable(np.random.randn(1000, 1000), requires_grad=True)
        y = Variable(np.random.randn(1000, 1000), requires_grad=True)
        z = x * 2 + y * 3
        w = z * z  # More complex graph

        # Compute gradients: backpropagate an all-ones upstream gradient from w
        # (equivalent to summing w into a scalar loss and calling backward on it).
        w.backward(np.ones_like(w.data))

        # Measure memory
        param_memory = x.data.nbytes + y.data.nbytes
        grad_memory = x.grad.nbytes + y.grad.nbytes if x.grad is not None else 0

        print(f"Parameters: {param_memory / 1024 / 1024:.2f} MB")
        print(f"Gradients: {grad_memory / 1024 / 1024:.2f} MB")
        print(f"Ratio: {grad_memory / param_memory:.1f}x parameter memory")

        # Scale to real networks
        print(f"\nFor a 7B parameter model like LLaMA-7B:")
        print(f"  Parameters: {7e9 * 4 / 1024**3:.1f} GB (float32)")
        print(f"  Gradients: {7e9 * 4 / 1024**3:.1f} GB")
        print(f"  Total training memory: {7e9 * 8 / 1024**3:.1f} GB minimum!")

        # 💡 WHY THIS MATTERS: This is why gradient checkpointing exists!
        # Trading compute for memory by recomputing activations during backward.

    except Exception as e:
        print(f"⚠️ Error in analysis: {e}")
        print("Make sure Variable class and operations are implemented correctly")

analyze_gradient_memory()
# %% nbgrader={"grade": true, "grade_id": "compute-q1", "points": 2}
|
||||
"""
|
||||
### 📊 Computation Question: Memory Requirements
|
||||
|
||||
Your Variable class uses float32 (4 bytes per element). Calculate the memory needed for:
|
||||
- A Variable with shape (1000, 1000)
|
||||
- Its gradient after backward()
|
||||
- Total memory if using Adam optimizer (which stores 2 additional momentum buffers)
|
||||
|
||||
Show your calculation and give answers in MB.
|
||||
|
||||
YOUR ANSWER:
|
||||
"""
|
||||
### BEGIN SOLUTION
|
||||
"""
|
||||
Variable data: 1000 × 1000 × 4 bytes = 4,000,000 bytes = 4.0 MB
|
||||
Gradient: Same size as data = 4.0 MB
|
||||
Adam momentum (m): 4.0 MB
|
||||
Adam velocity (v): 4.0 MB
|
||||
Total with Adam: 4.0 + 4.0 + 4.0 + 4.0 = 16.0 MB
|
||||
"""
|
||||
### END SOLUTION
|
||||
|
||||
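# %%
# Quick numeric check of the answer above (an illustrative sketch, not graded).
# Uses 1 MB = 10^6 bytes to match the hand calculation.
_buf = np.zeros((1000, 1000), dtype=np.float32)
print(f"One (1000, 1000) float32 buffer: {_buf.nbytes / 1e6:.1f} MB")
print(f"Data + gradient + Adam m and v:  {4 * _buf.nbytes / 1e6:.1f} MB")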
# %% [markdown]
"""
## Part 4: Testing Our Autograd Engine

Let's verify our implementation works correctly!
"""

# %% nbgrader={"grade": true, "grade_id": "test-autograd", "locked": true, "points": 10}
def test_unit_autograd():
    """Test automatic differentiation."""
    print("🧪 Testing Autograd Implementation...")

    # Test 1: Simple addition
    x = Variable(np.array([2.0]), requires_grad=True)
    y = Variable(np.array([3.0]), requires_grad=True)
    z = x + y
    z.backward()

    assert np.allclose(x.grad, [1.0]), "Addition gradient for x incorrect"
    assert np.allclose(y.grad, [1.0]), "Addition gradient for y incorrect"
    print("✅ Addition gradients correct")

    # Test 2: Multiplication
    x.zero_grad()
    y.zero_grad()
    z = x * y
    z.backward()

    assert np.allclose(x.grad, [3.0]), "Multiplication gradient for x incorrect"
    assert np.allclose(y.grad, [2.0]), "Multiplication gradient for y incorrect"
    print("✅ Multiplication gradients correct")

    # Test 3: Complex expression
    x = Variable(np.array([2.0]), requires_grad=True)
    y = Variable(np.array([3.0]), requires_grad=True)
    z = x * x + y * y  # z = x² + y²
    z.backward()

    assert np.allclose(x.grad, [4.0]), "Complex expression gradient for x incorrect"
    assert np.allclose(y.grad, [6.0]), "Complex expression gradient for y incorrect"
    print("✅ Complex expression gradients correct")

    print("🎉 All autograd tests passed!")

test_unit_autograd()
# %% [markdown]
"""
## Part 5: Matrix Operations with Broadcasting

Real neural networks need matrix operations. Let's add them!
"""

# %% nbgrader={"grade": false, "grade_id": "matmul", "solution": true}
#| export
class MatMul:
    """Matrix multiplication with gradient computation."""

    @staticmethod
    def forward(a: Variable, b: Variable) -> Variable:
        """
        Forward pass: C = A @ B

        TODO: Implement matrix multiplication with gradients.

        HINTS:
        - Use np.dot or @ operator
        - Gradient w.r.t A: grad_output @ B.T
        - Gradient w.r.t B: A.T @ grad_output
        - Handle shape broadcasting correctly
        """
        ### BEGIN SOLUTION
        requires_grad = a.requires_grad or b.requires_grad

        def backward_fn(grad_output):
            # Matrix calculus: Use transposes for gradient flow
            if a.requires_grad:
                grad_a = grad_output @ b.data.T  # ∂L/∂A = ∂L/∂C @ B^T
                a.backward(grad_a)
            if b.requires_grad:
                grad_b = a.data.T @ grad_output  # ∂L/∂B = A^T @ ∂L/∂C
                b.backward(grad_b)

        result = Variable(
            a.data @ b.data,
            requires_grad=requires_grad,
            grad_fn=backward_fn if requires_grad else None
        )
        return result
        ### END SOLUTION

Variable.__matmul__ = lambda self, other: MatMul.forward(self, other)
# %% [markdown]
"""
### ✅ IMPLEMENTATION CHECKPOINT: Matrix operations complete

### 🤔 PREDICTION: How many FLOPs does a matrix multiplication A(m×k) @ B(k×n) require?
Your answer: _______ operations

### 🔍 SYSTEMS INSIGHT #2: Matrix Multiplication Complexity
"""

# %%
def analyze_matmul_complexity():
    """Measure the computational complexity of matrix multiplication."""
    import time

    try:
        sizes = [100, 200, 400, 800]
        times = []
        flops = []

        for size in sizes:
            A = Variable(np.random.randn(size, size), requires_grad=True)
            B = Variable(np.random.randn(size, size), requires_grad=True)

            # Measure forward pass
            start = time.perf_counter()
            C = A @ B
            forward_time = time.perf_counter() - start

            # Measure backward pass (backpropagate an all-ones upstream gradient)
            start = time.perf_counter()
            C.backward(np.ones_like(C.data))
            backward_time = time.perf_counter() - start

            times.append((forward_time, backward_time))
            # FLOPs for matrix multiply: 2 * m * n * k (multiply-add)
            flops.append(2 * size * size * size)

            print(f"Size {size}×{size}:")
            print(f"  Forward: {forward_time*1000:.2f}ms")
            print(f"  Backward: {backward_time*1000:.2f}ms (~2× forward)")
            print(f"  FLOPs: {flops[-1]/1e6:.1f}M")

        # Analyze scaling
        time_ratio = times[-1][0] / times[0][0]
        size_ratio = sizes[-1] / sizes[0]
        scaling_exp = np.log(time_ratio) / np.log(size_ratio)

        print(f"\nTime scaling: O(N^{scaling_exp:.1f}) - should be ~3 for matmul")

        # 💡 WHY THIS MATTERS: This O(N³) scaling is why attention (O(N²×d))
        # becomes the bottleneck in transformers with long sequences!

    except Exception as e:
        print(f"⚠️ Error in analysis: {e}")
        print("Make sure MatMul is implemented correctly")

analyze_matmul_complexity()
# %% nbgrader={"grade": true, "grade_id": "compute-q2", "points": 2}
|
||||
"""
|
||||
### 📊 Computation Question: Matrix Multiplication FLOPs
|
||||
|
||||
For matrix multiplication C = A @ B where:
|
||||
- A has shape (M, K)
|
||||
- B has shape (K, N)
|
||||
|
||||
The FLOPs (floating-point operations) = 2 × M × N × K (multiply + add for each output)
|
||||
|
||||
Calculate the FLOPs for these operations in a neural network forward pass:
|
||||
1. Input (batch=32, features=784) @ Weight (784, 128) = ?
|
||||
2. Hidden (batch=32, features=128) @ Weight (128, 10) = ?
|
||||
3. Total FLOPs for both operations = ?
|
||||
|
||||
Give your answers in MFLOPs (millions of FLOPs).
|
||||
|
||||
YOUR ANSWER:
|
||||
"""
|
||||
### BEGIN SOLUTION
|
||||
"""
|
||||
1. First layer: 2 × 32 × 128 × 784 = 6,422,528 FLOPs = 6.42 MFLOPs
|
||||
2. Second layer: 2 × 32 × 10 × 128 = 81,920 FLOPs = 0.08 MFLOPs
|
||||
3. Total: 6.42 + 0.08 = 6.50 MFLOPs
|
||||
|
||||
Note: First layer dominates computation due to larger dimensions (784 vs 128).
|
||||
"""
|
||||
### END SOLUTION
|
||||
|
||||
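# %%
# Quick numeric check of the FLOPs answer above (an illustrative sketch, not graded).
layer_shapes = [((32, 784), (784, 128)), ((32, 128), (128, 10))]
total_flops = 0
for (m, k), (_, n) in layer_shapes:
    f = 2 * m * n * k  # one multiply + one add per output element
    total_flops += f
    print(f"({m}, {k}) @ ({k}, {n}): {f / 1e6:.2f} MFLOPs")
print(f"Total: {total_flops / 1e6:.2f} MFLOPs")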
# %% [markdown]
"""
## Part 6: Building a Complete Neural Network Layer

Let's use our autograd to build a real neural network layer!
"""

# %% nbgrader={"grade": false, "grade_id": "linear-layer", "solution": true}
#| export
class Linear:
    """Fully connected layer with automatic differentiation."""

    def __init__(self, in_features: int, out_features: int):
        """
        Initialize a linear layer: y = xW + b

        TODO: Initialize weights and bias as Variables with gradients.

        HINTS:
        - Use Xavier/He initialization for weights
        - Initialize bias to zeros
        - Both need requires_grad=True
        """
        ### BEGIN SOLUTION
        # He initialization prevents gradient vanishing/explosion.
        # The weight is stored as (in_features, out_features) so the forward
        # pass is a plain x @ W, without needing a transpose op on Variables.
        scale = np.sqrt(2.0 / in_features)
        self.weight = Variable(
            np.random.randn(in_features, out_features) * scale,
            requires_grad=True
        )
        self.bias = Variable(
            np.zeros((out_features,)),
            requires_grad=True
        )
        ### END SOLUTION

    def forward(self, x: Variable) -> Variable:
        """Forward pass through the layer."""
        ### BEGIN SOLUTION
        # y = xW + b (NumPy broadcasts the bias; reducing the bias gradient
        # back to shape (out_features,) is deliberately simplified here)
        output = x @ self.weight + self.bias
        return output
        ### END SOLUTION

    def parameters(self) -> List[Variable]:
        """Return all parameters."""
        ### BEGIN SOLUTION
        return [self.weight, self.bias]
        ### END SOLUTION
# %% nbgrader={"grade": true, "grade_id": "compute-q3", "points": 2}
|
||||
"""
|
||||
### 📊 Computation Question: Parameter Counting
|
||||
|
||||
You just implemented a Linear layer. For a 3-layer MLP with architecture:
|
||||
- Input: 784 features
|
||||
- Hidden 1: 256 neurons
|
||||
- Hidden 2: 128 neurons
|
||||
- Output: 10 classes
|
||||
|
||||
Calculate:
|
||||
1. Parameters in each layer (weights + biases)
|
||||
2. Total parameters in the network
|
||||
3. Memory in MB (float32 = 4 bytes per parameter)
|
||||
|
||||
Show your work.
|
||||
|
||||
YOUR ANSWER:
|
||||
"""
|
||||
### BEGIN SOLUTION
|
||||
"""
|
||||
Layer 1 (784 → 256):
|
||||
Weights: 784 × 256 = 200,704
|
||||
Bias: 256
|
||||
Total: 200,960
|
||||
|
||||
Layer 2 (256 → 128):
|
||||
Weights: 256 × 128 = 32,768
|
||||
Bias: 128
|
||||
Total: 32,896
|
||||
|
||||
Layer 3 (128 → 10):
|
||||
Weights: 128 × 10 = 1,280
|
||||
Bias: 10
|
||||
Total: 1,290
|
||||
|
||||
Network total: 200,960 + 32,896 + 1,290 = 235,146 parameters
|
||||
Memory: 235,146 × 4 bytes = 940,584 bytes = 0.94 MB
|
||||
"""
|
||||
### END SOLUTION
|
||||
|
||||
# %% [markdown]
"""
### ✅ IMPLEMENTATION CHECKPOINT: Neural network layer complete

### 🤔 PREDICTION: For a layer with 1000 inputs and 1000 outputs, how many parameters?
Your answer: _______ parameters

### 🔍 SYSTEMS INSIGHT #3: Parameter Counting and Memory
"""

# %%
def analyze_layer_parameters():
    """Count parameters and analyze memory usage in neural network layers."""
    try:
        # Create layers of different sizes
        sizes = [(784, 128), (128, 64), (64, 10)]  # Like a small MNIST network

        total_params = 0
        total_memory = 0

        print("Layer Parameter Analysis:")
        print("-" * 50)

        for in_feat, out_feat in sizes:
            layer = Linear(in_feat, out_feat)

            # Count parameters
            weight_params = layer.weight.data.size
            bias_params = layer.bias.data.size
            layer_params = weight_params + bias_params

            # Calculate memory
            layer_memory = layer_params * 4  # float32

            total_params += layer_params
            total_memory += layer_memory

            print(f"Layer {in_feat}→{out_feat}:")
            print(f"  Weights: {weight_params:,} ({weight_params/1000:.1f}K)")
            print(f"  Bias: {bias_params:,}")
            print(f"  Total: {layer_params:,} params = {layer_memory/1024:.1f}KB")

        print("-" * 50)
        print(f"Network Total: {total_params:,} parameters")
        print(f"Memory (float32): {total_memory/1024:.1f}KB")
        print(f"With gradients: {total_memory*2/1024:.1f}KB")
        print(f"With Adam optimizer: {total_memory*4/1024:.1f}KB")

        # Scale up
        print(f"\nScaling to GPT-3 (175B params):")
        gpt3_memory = 175e9 * 4  # float32
        print(f"  Parameters only: {gpt3_memory/1024**4:.1f}TB")
        print(f"  With Adam: {gpt3_memory*4/1024**4:.1f}TB!")

        # 💡 WHY THIS MATTERS: This is why large models use:
        # - Mixed precision (float16/bfloat16)
        # - Gradient checkpointing
        # - Model parallelism across GPUs

    except Exception as e:
        print(f"⚠️ Error: {e}")

analyze_layer_parameters()
# %% nbgrader={"grade": true, "grade_id": "compute-q4", "points": 2}
|
||||
"""
|
||||
### 📊 Computation Question: Gradient Accumulation
|
||||
|
||||
Consider this scenario: A shared weight matrix W (shape 100×100) is used in 3 different places
|
||||
in your network. During backward pass:
|
||||
- Path 1 contributes gradient G1 with all elements = 0.1
|
||||
- Path 2 contributes gradient G2 with all elements = 0.2
|
||||
- Path 3 contributes gradient G3 with all elements = 0.3
|
||||
|
||||
Because of gradient accumulation in your backward() method:
|
||||
|
||||
1. What will be the final value of W.grad[0,0] (top-left element)?
|
||||
2. If we OVERWROTE instead of accumulated, what would W.grad[0,0] be?
|
||||
3. How many total gradient additions occur for the entire weight matrix?
|
||||
|
||||
YOUR ANSWER:
|
||||
"""
|
||||
### BEGIN SOLUTION
|
||||
"""
|
||||
1. W.grad[0,0] = 0.1 + 0.2 + 0.3 = 0.6 (accumulated from all paths)
|
||||
|
||||
2. If overwriting: W.grad[0,0] = 0.3 (only the last gradient)
|
||||
|
||||
3. Total additions: 100 × 100 × 3 = 30,000 gradient additions
|
||||
(each of 10,000 elements gets 3 gradient contributions)
|
||||
|
||||
This shows why accumulation is critical for shared parameters!
|
||||
"""
|
||||
### END SOLUTION
|
||||
|
||||
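# %%
# Illustrative demo (not graded): gradient accumulation for a shared Variable.
# The same `shared` value feeds three multiplication paths, so its gradient is
# the sum of the three per-path contributions, exactly as in the question above.
shared = Variable(np.array([1.0]), requires_grad=True)
paths = [shared * 0.1, shared * 0.2, shared * 0.3]
for p in paths:
    p.backward(np.ones_like(p.data))
print(f"Accumulated gradient: {shared.grad}")  # 0.1 + 0.2 + 0.3 = 0.6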
# %% [markdown]
"""
## Part 7: Complete Test Suite
"""

# %%
def test_unit_all():
    """Run all unit tests for the autograd module."""
    print("🧪 Running Complete Autograd Test Suite...")
    print("=" * 50)

    # Test basic autograd
    test_unit_autograd()
    print()

    # Test matrix multiplication
    print("🧪 Testing Matrix Multiplication...")
    A = Variable(np.array([[1, 2], [3, 4]], dtype=np.float32), requires_grad=True)
    B = Variable(np.array([[5, 6], [7, 8]], dtype=np.float32), requires_grad=True)
    C = A @ B

    # Backpropagate an all-ones upstream gradient through the matmul
    C.backward(np.ones_like(C.data))

    expected_grad_A = np.ones((2, 2)) @ B.data.T  # ∂L/∂A = ∂L/∂C @ B^T
    assert np.allclose(A.grad, expected_grad_A), "MatMul gradient for A incorrect"
    print(f"✅ MatMul forward: {np.allclose(C.data, [[19, 22], [43, 50]])}")
    print("✅ MatMul gradients computed")
    print()

    # Test neural network layer
    print("🧪 Testing Neural Network Layer...")
    layer = Linear(10, 5)
    x = Variable(np.random.randn(3, 10), requires_grad=True)
    y = layer.forward(x)

    assert y.data.shape == (3, 5), "Output shape incorrect"
    print(f"✅ Linear layer forward pass: shape {y.data.shape}")

    y.backward(np.ones_like(y.data))

    assert layer.weight.grad is not None, "Weight gradients not computed"
    assert layer.bias.grad is not None, "Bias gradients not computed"
    print("✅ Linear layer gradients computed")

    print("=" * 50)
    print("🎉 All tests passed! Autograd engine working correctly!")

# Main execution
if __name__ == "__main__":
    test_unit_all()
# %% nbgrader={"grade": true, "grade_id": "compute-q5", "points": 2}
|
||||
"""
|
||||
### 📊 Computation Question: Batch Size vs Memory
|
||||
|
||||
You have a model with 1M parameters training with batch size 64. The memory usage is:
|
||||
- Model parameters: 4 MB
|
||||
- Gradients: 4 MB
|
||||
- Adam optimizer state: 8 MB
|
||||
- Activations (batch-dependent): 32 MB
|
||||
|
||||
Answer:
|
||||
1. What is the total memory usage?
|
||||
2. If you double the batch size to 128, what will the new TOTAL memory be?
|
||||
3. What is the maximum batch size if you have 100 MB available?
|
||||
|
||||
Show calculations.
|
||||
|
||||
YOUR ANSWER:
|
||||
"""
|
||||
### BEGIN SOLUTION
|
||||
"""
|
||||
1. Total memory = 4 + 4 + 8 + 32 = 48 MB
|
||||
|
||||
2. With batch size 128:
|
||||
- Fixed (params + grads + optimizer): 4 + 4 + 8 = 16 MB (unchanged)
|
||||
- Activations: 32 MB × (128/64) = 64 MB (scales linearly)
|
||||
- New total: 16 + 64 = 80 MB
|
||||
|
||||
3. Maximum batch size with 100 MB:
|
||||
- Fixed costs: 16 MB
|
||||
- Available for activations: 100 - 16 = 84 MB
|
||||
- Batch size: 64 × (84/32) = 168 (maximum)
|
||||
|
||||
Key insight: Only activations scale with batch size, not parameters/gradients!
|
||||
"""
|
||||
### END SOLUTION
|
||||
|
||||
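# %%
# Quick numeric check of the batch-size answer above (an illustrative sketch, not graded).
fixed_mb = 4 + 4 + 8       # parameters + gradients + Adam state (batch-independent)
act_mb_at_64 = 32          # activations measured at batch size 64
for batch in (64, 128):
    total_mb = fixed_mb + act_mb_at_64 * batch / 64
    print(f"Batch {batch}: {total_mb:.0f} MB")
budget_mb = 100
max_batch = 64 * (budget_mb - fixed_mb) / act_mb_at_64
print(f"Max batch size within {budget_mb} MB: {max_batch:.0f}")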
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Synthesis Questions

Now that you've built and measured an autograd system, consider these broader questions:
"""

# %% nbgrader={"grade": false, "grade_id": "synthesis-q1", "solution": true, "points": 5}
"""
### Synthesis Question 1: Memory vs Compute Trade-offs

You discovered that gradient computation requires significant memory (1× parameters for
gradients, 3× more for optimizers). You also measured that backward passes take ~2×
the time of forward passes.

Design a training strategy for a model that requires 4× your available memory. Your
strategy should address:
- How to fit the model in memory
- What you sacrifice (time, accuracy, or complexity)
- When this trade-off is worthwhile

YOUR ANSWER (5-7 sentences):
"""
### BEGIN SOLUTION
"""
Strategy: Gradient checkpointing with micro-batching.

1. Divide model into 4 checkpoint segments, storing only segment boundaries
2. During backward, recompute intermediate activations for each segment
3. Process mini-batches in 4 micro-batches, accumulating gradients

Trade-offs:
- Time: ~30% slower due to recomputation
- Memory: 4× reduction achieved
- Complexity: More complex implementation

This is worthwhile when model quality is critical but hardware is limited,
such as research environments or edge deployment. The time cost is acceptable
for better model performance that couldn't otherwise be achieved.
"""
### END SOLUTION
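# %%
# Illustrative sketch (not part of the graded answer): the gradient-checkpointing
# idea from Synthesis Question 1 on a toy two-segment function y = (2x)^2.
# Only the segment boundary (x itself) is stored; the intermediate activation
# h = 2x is recomputed during the backward pass. Plain NumPy, hypothetical names.
def _seg1(x):
    return x * 2               # first segment

def _seg2(h):
    return h ** 2              # second segment

def checkpointed_forward(x):
    # Do NOT store h; keep only the checkpoint needed to recompute it later.
    return _seg2(_seg1(x)), x

def checkpointed_backward(grad_out, checkpoint):
    h = _seg1(checkpoint)      # recompute the dropped activation (extra compute)
    grad_h = grad_out * 2 * h  # d(h^2)/dh = 2h
    return grad_h * 2          # d(2x)/dx = 2

_y, _ckpt = checkpointed_forward(np.array([3.0]))
print(checkpointed_backward(np.ones_like(_y), _ckpt))  # d(4x^2)/dx = 8x = [24.]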
# %% nbgrader={"grade": false, "grade_id": "synthesis-q2", "solution": true, "points": 5}
|
||||
"""
|
||||
### Synthesis Question 2: Scaling Bottlenecks
|
||||
|
||||
Based on your measurements:
|
||||
- Matrix operations scale O(N³)
|
||||
- Gradient storage scales O(N) with parameters
|
||||
- Graph traversal scales O(depth) with network depth
|
||||
|
||||
For each scaling pattern, describe:
|
||||
1. When it becomes the primary bottleneck
|
||||
2. A real-world scenario where this limits training
|
||||
3. An engineering solution to mitigate it
|
||||
|
||||
YOUR ANSWER (6-8 sentences):
|
||||
"""
|
||||
### BEGIN SOLUTION
|
||||
"""
|
||||
1. O(N³) matrix operations:
|
||||
- Bottleneck: Large hidden dimensions (>10K)
|
||||
- Scenario: Language models with large embeddings
|
||||
- Solution: Block-sparse matrices, reducing N³ to N²×log(N)
|
||||
|
||||
2. O(N) gradient storage:
|
||||
- Bottleneck: Models with >10B parameters
|
||||
- Scenario: Training exceeds GPU memory
|
||||
- Solution: Gradient sharding across devices, ZeRO optimization
|
||||
|
||||
3. O(depth) graph traversal:
|
||||
- Bottleneck: Networks >1000 layers deep
|
||||
- Scenario: Very deep ResNets or Transformers
|
||||
- Solution: Gradient checkpointing at strategic layers, reversible layers
|
||||
|
||||
The key insight: Different architectures hit different bottlenecks, requiring
|
||||
architecture-specific optimization strategies.
|
||||
"""
|
||||
### END SOLUTION
|
||||
|
||||
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Autograd

Congratulations! You've successfully implemented automatic differentiation from scratch:

### What You've Accomplished
✅ **200+ lines of autograd code**: Complete automatic differentiation engine
✅ **Variable class**: Gradient tracking with computational graph construction
✅ **Core operations**: Add, Multiply, MatMul, plus a Linear neural network layer
✅ **Memory profiling**: Discovered gradients use 1× parameter memory
✅ **Performance analysis**: Measured O(N³) scaling for matrix operations

### Key Learning Outcomes
- **Chain rule mastery**: Backpropagation through arbitrary computational graphs
- **Memory-compute trade-offs**: Why gradient checkpointing exists
- **Systems insight**: Gradient accumulation vs storage patterns
- **Production patterns**: How PyTorch's autograd actually works

### Mathematical Foundations Mastered
- **Chain rule**: ∂L/∂x = ∂L/∂y · ∂y/∂x
- **Matrix calculus**: Gradients for matrix multiplication
- **Computational complexity**: O(N³) for matmul, O(N) for element-wise

### Professional Skills Developed
- **Automatic differentiation**: Core of all modern deep learning
- **Memory profiling**: Quantifying memory usage in training
- **Performance analysis**: Understanding scaling bottlenecks

### Ready for Advanced Applications
Your autograd implementation now enables:
- **Immediate**: Training neural networks with gradient descent
- **Next Module**: Building optimizers (SGD, Adam) using your gradients
- **Real-world**: Understanding PyTorch's autograd internals

### Connection to Real ML Systems
Your implementation mirrors production systems:
- **PyTorch**: torch.autograd.Variable and Function classes
- **TensorFlow**: tf.GradientTape API
- **JAX**: grad() transformation

### Next Steps
1. **Export your module**: `tito module complete 06_autograd`
2. **Validate integration**: `tito test --module autograd`
3. **Explore advanced features**: Higher-order gradients, custom operations
4. **Ready for Module 07**: Build optimizers using your autograd engine!

**You've built the foundation of deep learning**: Every neural network trained today relies on automatic differentiation. Your implementation gives you deep understanding of how gradients flow through complex architectures!
"""
modules/06_autograd/autograd_visual_example.md (new file, 146 lines)
@@ -0,0 +1,146 @@
# Example: Visual Autograd Module Opening

This shows how the autograd module would start with visual explanations:

```python
# %% [markdown]
"""
# Autograd - Automatic Differentiation Engine

## 🎯 What We're Building Today

We're creating the "magic" that powers all modern deep learning - automatic gradient computation:

```
Your Neural Network Code:     What Autograd Does Behind the Scenes:
─────────────────────────     ────────────────────────────────────

x = Variable(data)            Creates computation graph node
y = x * 2                     Tracks operation: Mul(x, 2)
z = y + 3                     Tracks operation: Add(y, 3)
loss = z.mean()               Tracks operation: Mean(z)
loss.backward()               Computes ALL gradients automatically!

                              ∂loss/∂x computed via chain rule
```

## 📊 The Computational Graph

When you write `z = x * y + b`, autograd builds this graph:

```
Forward Pass (Build Graph):
x ────┐
      ├──[×]──> x*y ──┐
y ────┘               ├──[+]──> z = x*y + b
                b ────┘

Backward Pass (Compute Gradients):
∂L/∂x ←──┐
         ├──[×]←── ∂L/∂(x*y) ←──┐
∂L/∂y ←──┘                      ├──[+]←── ∂L/∂z
                     ∂L/∂b ←────┘
                  Chain Rule Applied
```

## 💾 Memory Architecture

Understanding memory is crucial for training large models:

```
┌─────────────────────────────────────────────────────────┐
│                 Training Memory Layout                   │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Forward Pass Memory:                                    │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐     │
│  │  Parameters  │ │ Activations  │ │ Intermediate │     │
│  │    (W,b)     │ │   (x,y,z)    │ │   Results    │     │
│  │    100MB     │ │    300MB     │ │    200MB     │     │
│  └──────────────┘ └──────────────┘ └──────────────┘     │
│                                                          │
│  Backward Pass Additional Memory:                        │
│  ┌──────────────┐ ┌──────────────┐                       │
│  │  Gradients   │ │    Graph     │                       │
│  │   (∂L/∂W)    │ │   Storage    │                       │
│  │    100MB     │ │     50MB     │                       │
│  └──────────────┘ └──────────────┘                       │
│                                                          │
│  Total: 750MB (1.25× the forward-only total)             │
└─────────────────────────────────────────────────────────┘
```

## 🔄 The Chain Rule in Action

Let's trace through a simple example step by step:

```
Given: f(x) = (x + 2) * 3
Let x = 5

Forward Pass:
x = 5
  ↓
y = x + 2 = 7     (save x=5 for backward)
  ↓
z = y * 3 = 21    (save y=7 for backward)

Backward Pass (z.backward()):
∂z/∂z = 1                           (start with gradient 1)
  ↓
∂z/∂y = 3                           (derivative of y*3 w.r.t. y)
  ↓
∂z/∂x = ∂z/∂y * ∂y/∂x = 3 * 1 = 3

Result: x.grad = 3
```

## 🚀 Why This Matters

Before automatic differentiation was widely available:
- **Manual gradient derivation**: Days of calculus for complex models
- **Error-prone implementation**: One sign error breaks everything
- **Limited innovation**: Only experts could create new architectures

After autograd (modern era):
- **Automatic differentiation**: Gradients for ANY architecture
- **Rapid prototyping**: Try new ideas in minutes, not weeks
- **Democratized ML**: Focus on architecture, not calculus
## 📈 Real-World Impact

```
Training Memory Requirements (GPT-3 Scale):

Without Autograd Optimizations:      With Modern Autograd:
┌────────────────────────┐           ┌────────────────────────┐
│ Parameters:    700 GB  │           │ Parameters:    700 GB  │
│ Gradients:     700 GB  │           │ Gradients:     700 GB  │
│ Activations:  2100 GB  │           │ Checkpointing: 300 GB  │
│ Optimizer:    1400 GB  │           │ Optimizer:    1400 GB  │
├────────────────────────┤           ├────────────────────────┤
│ Total:        4900 GB  │           │ Total:        3100 GB  │
└────────────────────────┘           └────────────────────────┘

                                     ~37% memory saved via
                                     gradient checkpointing!
```

Now let's build this from scratch and truly understand how it works!
"""
```
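A quick way to confirm the chain-rule trace above once the module's `Variable`, `Add`, and `Multiply` are implemented (a minimal sketch; it assumes the operator overloading defined in `autograd_dev_enhanced_v2.py` and the `tinytorch.core.autograd` export path declared in the module config):

```python
import numpy as np
from tinytorch.core.autograd import Variable  # assumed export path for this module

x = Variable(np.array([5.0]), requires_grad=True)
z = (x + 2) * 3        # f(x) = (x + 2) * 3
z.backward()
print(x.grad)          # [3.] — matches the hand-computed ∂z/∂x = 3
```
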
## Key Elements That Make This Readable:

1. **Visual Comparisons**: Side-by-side "Your Code" vs "What Happens"
2. **ASCII Diagrams**: Clear computational graphs with arrows
3. **Memory Layouts**: Visual representation of memory usage
4. **Step-by-Step Traces**: Following data through forward/backward
5. **Real-World Context**: Showing GPT-3 scale implications
6. **Before/After Comparisons**: Why autograd changed everything

This approach ensures students can:
- **Read and understand** without coding
- **See the big picture** before implementation details
- **Grasp systems implications** through visual memory layouts
- **Connect to real-world** impact and scale
@@ -8,7 +8,6 @@ description: "Gradient-based parameter optimization algorithms"
|
||||
# Dependencies - Used by CLI for module ordering and prerequisites
|
||||
dependencies:
|
||||
prerequisites: ["setup", "tensor", "autograd"]
|
||||
enables: ["training", "compression", "mlops"]
|
||||
|
||||
# Package Export - What gets built into tinytorch package
|
||||
exports_to: "tinytorch.core.optimizers"
|
||||
@@ -433,35 +433,34 @@ Let's implement SGD with momentum!

#| export
class SGD:
"""
Simplified SGD Optimizer
Simple SGD Optimizer - Basic Implementation

Implements basic stochastic gradient descent with optional momentum.
Uses simple gradient operations from Module 6.
Implements basic stochastic gradient descent without momentum for simplicity.
Demonstrates core optimization concepts with minimal complexity.

Mathematical Update Rule:
parameter = parameter - learning_rate * gradient

With momentum:
velocity = momentum * velocity + gradient
parameter = parameter - learning_rate * velocity
SYSTEMS INSIGHT - Memory Usage:
SGD stores only the parameters list and learning rate - no additional state.
This makes SGD extremely memory efficient compared to adaptive optimizers like Adam,
which require storing momentum and velocity terms for each parameter.
Memory usage: O(1) additional memory per parameter.
"""

def __init__(self, parameters: List[Variable], learning_rate: float = 0.01,
momentum: float = 0.0):
def __init__(self, parameters: List[Variable], learning_rate: float = 0.01):
"""
Initialize SGD optimizer with basic parameters.
Initialize basic SGD optimizer.

Args:
parameters: List of Variables to optimize (from Module 6)
learning_rate: Learning rate (default: 0.01)
momentum: Momentum coefficient (default: 0.0)
learning_rate: Learning rate for gradient steps (default: 0.01)

TODO: Implement basic SGD optimizer initialization.
TODO: Store the parameters and learning rate for optimization.

APPROACH:
1. Store parameters and learning rate
2. Store momentum coefficient
3. Initialize simple momentum buffers
1. Store the list of parameters to optimize
2. Store the learning rate for gradient updates

EXAMPLE:
```python
@@ -470,70 +469,49 @@ class SGD:
b = Variable(0.0, requires_grad=True)
optimizer = SGD([w, b], learning_rate=0.01)

# In training:
optimizer.zero_grad()
# ... compute gradients ...
optimizer.step()
# Training loop:
optimizer.zero_grad()   # Clear gradients
loss = compute_loss()   # Forward pass
loss.backward()         # Backward pass
optimizer.step()        # Update parameters
```
"""
### BEGIN SOLUTION
self.parameters = parameters
self.learning_rate = learning_rate
self.momentum = momentum

# Simple momentum storage using consistent data access
self.velocity = {}
for i, param in enumerate(parameters):
if self.momentum > 0:
# Initialize velocity with same shape as parameter data
param_data = get_param_data(param)
self.velocity[i] = np.zeros_like(param_data)
### END SOLUTION

def step(self) -> None:
"""
Perform one optimization step using basic gradient operations.
Perform one optimization step - update all parameters using their gradients.

TODO: Implement simplified SGD parameter update.
TODO: Implement the core SGD parameter update rule.

APPROACH:
1. Iterate through all parameters
2. For each parameter with gradient (from Module 6):
a. Get gradient using simple param.grad access
b. Apply momentum if specified
c. Update parameter with learning rate
2. For each parameter that has a gradient:
a. Get the gradient value
b. Update parameter: param = param - learning_rate * gradient

SIMPLIFIED MATHEMATICAL FORMULATION:
- Without momentum: parameter = parameter - learning_rate * gradient
- With momentum: velocity = momentum * velocity + gradient
parameter = parameter - learning_rate * velocity
MATHEMATICAL FORMULATION:
parameter_new = parameter_old - learning_rate * gradient

IMPLEMENTATION HINTS:
- Use basic param.grad access (from Module 6)
- Simple momentum using self.velocity dict
- Basic parameter update using scalar operations
- Check if param.grad exists before using it
- Use get_grad_data() and set_param_data() helper functions
- Apply the learning rate to scale the gradient step
"""
### BEGIN SOLUTION
for i, param in enumerate(self.parameters):
for param in self.parameters:
grad_data = get_grad_data(param)
if grad_data is not None:
# Convert to numpy array for consistent operations
gradient = np.array(grad_data)

if self.momentum > 0:
# Apply momentum using simple numpy operations
if i in self.velocity:
self.velocity[i] = self.momentum * self.velocity[i] + gradient
else:
self.velocity[i] = gradient.copy()
update = self.velocity[i]
else:
# Simple gradient descent (no momentum)
update = gradient

# Core SGD update: parameter = parameter - learning_rate * update
# Get current parameter value
current_data = get_param_data(param)
new_data = current_data - self.learning_rate * update

# Apply SGD update rule: param = param - lr * grad
new_data = current_data - self.learning_rate * grad_data

# Update the parameter
set_param_data(param, new_data)
### END SOLUTION
@@ -541,17 +519,17 @@
"""
Zero out gradients for all parameters.

TODO: Implement gradient zeroing.
TODO: Clear all gradients to prepare for the next backward pass.

APPROACH:
1. Iterate through all parameters
2. Set gradient to None for each parameter
3. This prepares for next backward pass
3. This prevents gradient accumulation from previous steps

IMPLEMENTATION HINTS:
- Simply set param.grad = None
- This is called before loss.backward()
- Essential for proper gradient accumulation
- Set param.grad = None for each parameter
- This is essential to call before each backward pass
- Prevents gradients from accumulating across iterations
"""
### BEGIN SOLUTION
for param in self.parameters:
@@ -569,16 +547,26 @@ Let's test your SGD optimizer implementation! This optimizer adds momentum to gr

# %% nbgrader={"grade": true, "grade_id": "test-sgd", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_unit_sgd_optimizer():
"""Unit test for the SGD optimizer implementation."""
print("🔬 Unit Test: SGD Optimizer...")
"""Unit test for the simple SGD optimizer implementation."""
print("🔬 Unit Test: Simple SGD Optimizer...")

# Create test parameters
w1 = Variable(1.0, requires_grad=True)
w2 = Variable(2.0, requires_grad=True)
b = Variable(0.5, requires_grad=True)

# Create optimizer
optimizer = SGD([w1, w2, b], learning_rate=0.1, momentum=0.9)
# Create simple SGD optimizer (no momentum)
optimizer = SGD([w1, w2, b], learning_rate=0.1)

# Test initialization
try:
assert optimizer.learning_rate == 0.1, "Learning rate should be stored correctly"
assert len(optimizer.parameters) == 3, "Should store all 3 parameters"
print("✅ Initialization works correctly")

except Exception as e:
print(f"❌ Initialization failed: {e}")
raise

# Test zero_grad
try:
@@ -603,14 +591,14 @@ def test_unit_sgd_optimizer():
w2.grad = Variable(0.2)
b.grad = Variable(0.05)

# First step (no momentum yet)
# Store original values
original_w1 = w1.data.data.item()
original_w2 = w2.data.data.item()
original_b = b.data.data.item()

optimizer.step()

# Check parameter updates
# Check parameter updates using SGD rule: param = param - lr * grad
expected_w1 = original_w1 - 0.1 * 0.1   # 1.0 - 0.01 = 0.99
expected_w2 = original_w2 - 0.1 * 0.2   # 2.0 - 0.02 = 1.98
expected_b = original_b - 0.1 * 0.05    # 0.5 - 0.005 = 0.495
@@ -624,39 +612,122 @@ def test_unit_sgd_optimizer():
print(f"❌ Parameter updates failed: {e}")
raise

# Test simplified momentum storage
# Test step with no gradients
try:
# Check velocity dict exists and has momentum if momentum > 0
if optimizer.momentum > 0:
assert len(optimizer.velocity) == 3, f"Should have 3 velocity entries, got {len(optimizer.velocity)}"
print("✅ Simplified momentum storage works correctly")
optimizer.zero_grad()  # Clear gradients

# Store values before step
before_w1 = w1.data.data.item()
before_w2 = w2.data.data.item()
before_b = b.data.data.item()

optimizer.step()  # Should do nothing when no gradients

# Parameters should be unchanged
assert w1.data.data.item() == before_w1, "Parameter should not change when gradient is None"
assert w2.data.data.item() == before_w2, "Parameter should not change when gradient is None"
assert b.data.data.item() == before_b, "Parameter should not change when gradient is None"
print("✅ Handles missing gradients correctly")

except Exception as e:
print(f"❌ Momentum storage failed: {e}")
raise

# Test step counting
try:
w1.grad = Variable(0.1)
w2.grad = Variable(0.2)
b.grad = Variable(0.05)

optimizer.step()

# Step counting removed from simplified SGD for educational clarity
print("✅ Step counting simplified for Module 8")

except Exception as e:
print(f"❌ Step counting failed: {e}")
print(f"❌ Missing gradient handling failed: {e}")
raise

print("🎯 SGD optimizer behavior:")
print("   Maintains momentum buffers for accelerated updates")
print("   Tracks step count for learning rate scheduling")
print("   Supports weight decay for regularization")
print("📈 Progress: SGD Optimizer ✓")
print("🎯 Simple SGD optimizer behavior:")
print("   ✓ Stores parameters and learning rate only")
print("   ✓ Updates parameters using: param = param - lr * grad")
print("   ✓ Memory efficient: O(1) additional memory per parameter")
print("   ✓ Foundation for more advanced optimizers (Adam, RMSprop)")
print("📈 Progress: Simple SGD Optimizer ✓")

# Test function defined (called in main block)
# Immediate test execution
test_unit_sgd_optimizer()
# %% nbgrader={"grade": true, "grade_id": "compute-sgd-memory", "points": 2}
|
||||
"""
|
||||
### 📊 Computation Question: SGD Memory Requirements
|
||||
|
||||
You implemented SGD which only stores parameters and learning rate.
|
||||
|
||||
For a model with 175M parameters (like GPT-2), calculate:
|
||||
1. Memory for parameters (float32)
|
||||
2. Additional memory SGD needs for optimization
|
||||
3. Total memory for training with SGD
|
||||
4. How much memory Adam would need instead (stores m and v buffers)
|
||||
|
||||
Give answers in GB.
|
||||
|
||||
YOUR ANSWER:
|
||||
"""
|
||||
### BEGIN SOLUTION
|
||||
"""
|
||||
1. Parameters: 175M × 4 bytes = 700 MB = 0.7 GB
|
||||
|
||||
2. SGD additional memory: ~0 GB (only stores lr, negligible)
|
||||
|
||||
3. Total SGD training: 0.7 GB (params) + 0.7 GB (gradients) = 1.4 GB
|
||||
|
||||
4. Adam memory:
|
||||
- Parameters: 0.7 GB
|
||||
- Gradients: 0.7 GB
|
||||
- Momentum (m): 0.7 GB
|
||||
- Velocity (v): 0.7 GB
|
||||
- Total: 2.8 GB (2× more than SGD!)
|
||||
|
||||
Key insight: SGD is memory-optimal but may converge slower than Adam.
|
||||
"""
|
||||
### END SOLUTION
|
||||
|
||||
# %% nbgrader={"grade": true, "grade_id": "compute-sgd-updates", "points": 2}
|
||||
"""
|
||||
### 📊 Computation Question: Multi-Step Updates
|
||||
|
||||
Given:
|
||||
- Parameter initial value: 10.0
|
||||
- Learning rate: 0.1
|
||||
- Gradient sequence: [2.0, -1.0, 3.0, -2.0]
|
||||
|
||||
Calculate the parameter value after each SGD update step.
|
||||
Show: initial → step1 → step2 → step3 → step4
|
||||
|
||||
YOUR ANSWER:
|
||||
"""
|
||||
### BEGIN SOLUTION
|
||||
"""
|
||||
SGD update rule: param = param - lr * grad
|
||||
|
||||
Initial: 10.0
|
||||
Step 1: 10.0 - 0.1 × 2.0 = 10.0 - 0.2 = 9.8
|
||||
Step 2: 9.8 - 0.1 × (-1.0) = 9.8 + 0.1 = 9.9
|
||||
Step 3: 9.9 - 0.1 × 3.0 = 9.9 - 0.3 = 9.6
|
||||
Step 4: 9.6 - 0.1 × (-2.0) = 9.6 + 0.2 = 9.8
|
||||
|
||||
Final value: 9.8
|
||||
|
||||
Note: Parameter oscillates due to changing gradient signs.
|
||||
"""
|
||||
### END SOLUTION
|
||||
|
||||
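# %%
# Quick numeric check of the multi-step answer above (an illustrative sketch, not graded).
param, lr = 10.0, 0.1
for grad in [2.0, -1.0, 3.0, -2.0]:
    param = param - lr * grad          # SGD update rule
    print(f"grad {grad:+.1f} -> param {param:.1f}")  # 9.8, 9.9, 9.6, 9.8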
# %% nbgrader={"grade": true, "grade_id": "reflect-sgd-simplicity", "points": 2}
|
||||
"""
|
||||
### 🤔 Micro-Reflection: SGD Design
|
||||
|
||||
SGD doesn't store momentum buffers like Adam does.
|
||||
|
||||
Q: What is ONE advantage and ONE disadvantage of SGD's minimal memory approach
|
||||
for training very large models (>10B parameters)?
|
||||
|
||||
YOUR ANSWER (2-3 sentences):
|
||||
"""
|
||||
### BEGIN SOLUTION
|
||||
"""
|
||||
Advantage: SGD can train 2× larger models than Adam in the same memory budget,
|
||||
enabling larger architectures on limited hardware.
|
||||
|
||||
Disadvantage: Without momentum, SGD converges slower and is more sensitive to
|
||||
learning rate choices, potentially requiring more epochs to reach the same loss.
|
||||
"""
|
||||
### END SOLUTION
|
||||
|
||||
# %% [markdown]
|
||||
"""
|
||||
|
||||
@@ -8,7 +8,6 @@ description: "Neural network training loops, loss functions, and metrics"
|
||||
# Dependencies - Used by CLI for module ordering and prerequisites
|
||||
dependencies:
|
||||
prerequisites: ["setup", "tensor", "activations", "layers", "networks", "dataloader", "autograd", "optimizers"]
|
||||
enables: ["compression", "kernels", "benchmarking", "mlops"]
|
||||
|
||||
# Package Export - What gets built into tinytorch package
|
||||
exports_to: "tinytorch.core.training"
|
||||
@@ -8,7 +8,6 @@ description: "Convolutional networks for spatial pattern recognition and image p
|
||||
# Dependencies - Used by CLI for module ordering and prerequisites
|
||||
dependencies:
|
||||
prerequisites: ["setup", "tensor", "activations", "layers", "dense"]
|
||||
enables: ["attention", "training", "computer_vision"]
|
||||
|
||||
# Package Export - What gets built into tinytorch package
|
||||
exports_to: "tinytorch.core.spatial"
|
||||
@@ -8,7 +8,6 @@ description: "Dataset interfaces and data loading pipelines"
|
||||
# Dependencies - Used by CLI for module ordering and prerequisites
|
||||
dependencies:
|
||||
prerequisites: ["setup", "tensor"]
|
||||
enables: ["training", "dense", "spatial", "attention"]
|
||||
|
||||
# Package Export - What gets built into tinytorch package
|
||||
exports_to: "tinytorch.core.dataloader"
|
||||
@@ -1,164 +0,0 @@
# 🔬 COMPREHENSIVE QUALITY ASSURANCE AUDIT REPORT
**Date**: 2025-09-26
**Auditor**: Quality Assurance Agent (Dr. Priya Sharma)
**Scope**: Complete TinyTorch Module System (21 modules)

## 📊 EXECUTIVE SUMMARY

**Overall Status**: ✅ **HIGHLY SUCCESSFUL**
- **21 modules discovered** (01-21, module 18_pruning deleted as planned)
- **21/21 modules compile successfully** (100% compilation rate)
- **19/21 modules execute without critical errors** (90% execution success)
- **2 modules have minor issues** requiring attention

## 🏗️ COMPLETE MODULE INVENTORY

### Core Foundation Modules (01-10) - ✅ ALL FUNCTIONAL
1. **01_setup** - ✅ PERFECT - Complete environment setup with systems analysis
2. **02_tensor** - ✅ PERFECT - Tensor operations with NumPy integration
3. **03_activations** - ✅ PERFECT - Activation functions compilation
4. **04_layers** - ⚠️ MINOR ISSUE - `__file__` undefined in execution context
5. **05_losses** - ✅ PERFECT - Loss functions with comprehensive testing
6. **06_autograd** - ✅ PERFECT - Automatic differentiation compilation
7. **07_optimizers** - ✅ PERFECT - Optimization algorithms compilation
8. **08_training** - ✅ PERFECT - Training loop implementation compilation
9. **09_spatial** - ✅ PERFECT - CNN operations with extensive testing
10. **10_dataloader** - ✅ PERFECT - Data loading and preprocessing compilation

### Advanced Modules (11-15) - ✅ STRONG PERFORMANCE
11. **11_tokenization** - ❌ BPE TEST FAILURE - Assertion error in merge function
12. **12_embeddings** - ✅ PERFECT - Word embeddings compilation
13. **13_attention** - ✅ PERFECT - Attention mechanisms compilation
14. **14_transformers** - ✅ PERFECT - Transformer architecture compilation
15. **15_profiling** - ✅ PERFECT - Performance profiling execution validated

### Specialized Modules (16-21) - ✅ COMPLETE COVERAGE
16. **16_acceleration** - ✅ PERFECT - Hardware acceleration compilation
17. **17_quantization** - ✅ PERFECT - Model quantization compilation
18. **18_compression** - ✅ PERFECT - Model compression compilation
19. **19_caching** - ✅ PERFECT - Caching strategies compilation
20. **20_benchmarking** - ✅ PERFECT - Benchmarking systems execution validated
21. **21_mlops** - ✅ PERFECT - MLOps deployment compilation

## 🔍 DETAILED TEST RESULTS

### Compilation Testing (21/21 PASS)
```
✅ ALL 21 MODULES COMPILE SUCCESSFULLY
- No syntax errors detected
- All imports resolve correctly
- NBGrader metadata properly formatted
- Module structure compliant
```

### Execution Testing (19/21 PASS)
**Successful Executions:**
- **setup**: Full test suite execution with systems analysis ✅
- **tensor**: Complete tensor operations with NumPy integration ✅
- **losses**: Comprehensive loss function testing ✅
- **profiling**: Performance profiling systems ✅
- **benchmarking**: Benchmarking framework execution ✅

**Issues Identified:**
- **layers**: `__file__` undefined in execution context (minor)
- **tokenization**: BPE merge function test assertion failure (fixable)

### Systems Analysis Validation
**EXCELLENT**: All tested modules include proper:
- Memory profiling and complexity analysis
- Performance benchmarking capabilities
- Scaling behavior documentation
- Production context references
- Integration with larger systems

## 🚨 CRITICAL ISSUES IDENTIFIED

### 1. Tokenization Module BPE Test Failure
**Module**: `modules/11_tokenization/tokenization_dev.py`
**Issue**: `assert merged[0].count('l') == 1, "Should have only one 'l' left after merge"`
**Severity**: MEDIUM - Test logic error in BPE implementation
**Action Required**: Fix BPE merge function test expectations

### 2. Layers Module Execution Context Issue
**Module**: `modules/04_layers/layers_dev.py`
**Issue**: `name '__file__' is not defined`
**Severity**: LOW - Execution context issue, doesn't affect core functionality
**Action Required**: Remove dependency on the `__file__` variable in test context
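
A minimal sketch of one possible fix (illustrative only, not taken from the module): guard the path lookup so the test context still works when `__file__` is undefined, as in `exec()`-based runners or notebook cells. The `MODULE_DIR` name is hypothetical.

```python
import os
from pathlib import Path

# Use the module's own directory when __file__ exists; otherwise (exec()/notebook
# contexts) fall back to the current working directory.
try:
    MODULE_DIR = Path(__file__).resolve().parent
except NameError:
    MODULE_DIR = Path(os.getcwd())
```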

## ✅ QUALITY ASSURANCE VALIDATION

### ML Systems Teaching Standards - EXCELLENT
- ✅ **Memory Analysis**: All tested modules include explicit memory profiling
- ✅ **Performance Characteristics**: Computational complexity documented
- ✅ **Scaling Behavior**: Large input performance analysis present
- ✅ **Production Context**: Real-world system references (PyTorch, TensorFlow)
- ✅ **Hardware Implications**: Cache behavior and vectorization considerations

### Test Structure Compliance - VERY GOOD
- ✅ **Immediate Testing**: Tests follow implementation in proper sequence
- ✅ **Unit Test Functions**: Proper `test_unit_*()` function naming
- ✅ **Main Block Structure**: `if __name__ == "__main__":` blocks present
- ✅ **Comprehensive Testing**: Integration and edge case coverage
- ✅ **Educational Assertions**: Clear error messages that teach concepts

### NBGrader Integration - VALIDATED
- ✅ **Metadata Complete**: All cells have proper NBGrader metadata
- ✅ **Schema Version**: Consistent schema version 3 usage
- ✅ **Solution Blocks**: BEGIN/END SOLUTION properly implemented
- ✅ **Grade IDs**: Unique identifiers across modules
- ✅ **Student Scaffolding**: Clear TODO comments and implementation hints

## 📈 PERFORMANCE METRICS

### Compilation Success Rate: 100% (21/21)
### Execution Success Rate: 90% (19/21)
### Critical Issues: 0
### Medium Issues: 1 (Tokenization BPE test)
### Minor Issues: 1 (Layers execution context)

## 🎯 RECOMMENDATIONS

### Immediate Actions Required:
1. **Fix tokenization BPE merge test** - Update assertion logic to match implementation
2. **Resolve layers module execution** - Remove `__file__` dependency in test context

### Quality Improvements:
1. **Add automated testing pipeline** - Implement CI/CD for module validation
2. **Expand integration testing** - Test cross-module dependencies
3. **Performance regression testing** - Monitor computational complexity over time

## 🏆 OVERALL ASSESSMENT

**GRADE: A- (EXCELLENT WITH MINOR FIXES NEEDED)**

### Strengths:
- **Outstanding compilation rate** (100%)
- **Strong execution success** (90%)
- **Excellent ML systems focus** throughout all modules
- **Comprehensive testing frameworks** in place
- **Professional NBGrader integration** ready for classroom use
- **Real-world production context** consistently provided

### Areas for Improvement:
- **Fix 2 specific module issues** (tokenization BPE, layers execution)
- **Implement automated testing** to prevent regressions
- **Add cross-module integration testing** for complex workflows

## 🚀 PRODUCTION READINESS

**STATUS**: ✅ **READY FOR DEPLOYMENT WITH MINOR FIXES**

The TinyTorch module system demonstrates excellent quality across all tested dimensions:
- Technical implementation is sound and complete
- Educational design follows ML systems engineering principles
- NBGrader integration supports instructor workflows
- Students will have positive learning experiences with proper scaffolding
- Professional software development practices are maintained throughout

**RECOMMENDATION**: Approve for production use after fixing the 2 identified issues.

---

**Audit Completed**: 2025-09-26
**Quality Assurance Agent**: Dr. Priya Sharma
**Next Review Date**: Upon issue resolution and before major releases
@@ -1,30 +0,0 @@
name: Benchmarking
number: 20
type: project
difficulty: advanced
estimated_hours: 10-12

description: |
TinyMLPerf Olympics - the culmination of your TinyTorch journey! Build a comprehensive
benchmarking suite using your profiler from Module 19, then compete on speed, memory,
and efficiency. Benchmark the models you built throughout the course to see the impact
of all your optimizations.

learning_objectives:
- Build TinyMLPerf benchmark suite
- Implement fair performance comparison
- Create reproducible benchmarks
- Understand MLPerf methodology

prerequisites:
- Module 15: Profiling
- All optimization modules (16-19)

skills_developed:
- Benchmarking methodology
- Performance reporting
- Fair comparison techniques
- Competition optimization

exports:
- tinytorch.benchmarking
41
modules/20_capstone/20_capstone.yml
Normal file
@@ -0,0 +1,41 @@
# TinyTorch Module Metadata
# Essential system information for CLI tools and build systems

# === CORE IDENTITY ===
name: "capstone"
number: 20
folder_name: "20_capstone"

# === DISPLAY ===
display:
title: "Torch Olympics"
subtitle: "MLPerf-Inspired Challenges"
emoji: "🏆"

# === DEPENDENCIES ===
dependencies:
prerequisites: ["setup", "tensor", "activations", "layers", "losses", "autograd", "optimizers", "training", "spatial", "dataloader", "tokenization", "embeddings", "attention", "transformers", "profiling", "acceleration", "quantization", "compression", "caching"]

# === BUILD SYSTEM ===
build:
exports_to: "tinytorch.benchmarking"
main_file: "capstone_dev.py"

# === EDUCATION ===
education:
stage: "optimization"
difficulty: "⭐⭐⭐⭐⭐"
time_estimate: "6-8 hours"
description: "TinyMLPerf Olympics - the culmination of your TinyTorch journey! Build a comprehensive benchmarking suite using your profiler from Module 19, then compete on speed, memory, and efficiency. Benchmark the models you built throughout the course to see the impact of all your optimizations."

# === CHECKPOINT ===
checkpoint:
unlocks: 15
capability: "Can I build unified ML frameworks across modalities?"

# === COMPONENTS ===
components:
- "TinyMLPerf"
- "BenchmarkSuite"
- "PerformanceReporter"
- "CompetitionFramework"