Add integration tests for training flow and NLP pipeline

New tests that catch module boundary bugs:

test_training_flow.py:
- Optimizer actually updates weights (SGD, Adam)
- Training reduces loss over iterations
- Gradient chain not broken through 5 layers
- Input receives gradients
- zero_grad works correctly
- Batch gradients are averaged

test_nlp_pipeline_flow.py:
- Embedding receives gradients
- Repeated tokens accumulate gradients
- Attention projections receive gradients
- Attention input receives gradients (xfail - known issue)
- Transformer block gradient flow
- Complete NLP pipeline end-to-end

README.md:
- Integration test philosophy
- Good vs bad integration test examples
- Coverage gaps and how to fill them
Vijay Janapa Reddi
2025-12-02 22:14:08 -05:00
parent 3a885601f9
commit fb20e255c9
3 changed files with 847 additions and 0 deletions

tests/integration/README.md

@@ -0,0 +1,142 @@
# Integration Tests
## Philosophy
Integration tests catch bugs that **unit tests miss** - specifically bugs at **module boundaries** where one module's output becomes another module's input.
### The Gradient Flow Pattern
The gold standard is `test_gradient_flow.py`. It verifies:
1. **Gradients exist** (not None)
2. **Gradients are non-zero** (actually computed)
3. **Gradients flow through each layer** (chain not broken)
4. **Training actually works** (loss decreases)
This pattern catches the most common and frustrating bugs students encounter.
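Here is a minimal sketch of the whole pattern as a single test, using the `tinytorch` API from the test files in this commit (exact loss values depend on how `MSELoss` is defined, but one small SGD step on this quadratic should reduce the loss):
```python
import numpy as np
from tinytorch.core.tensor import Tensor
from tinytorch.core.layers import Linear
from tinytorch.core.losses import MSELoss
from tinytorch.core.optimizers import SGD
from tinytorch.core.autograd import enable_autograd

enable_autograd()

def test_gradient_flow_pattern():
    layer = Linear(2, 1)
    x = Tensor([[1.0, 2.0]], requires_grad=True)
    target = Tensor([[3.0]])
    loss = MSELoss().forward(layer.forward(x), target)
    loss.backward()

    assert layer.weight.grad is not None    # 1. gradients exist
    assert np.any(layer.weight.grad != 0)   # 2. gradients are non-zero
    assert x.grad is not None               # 3. chain reaches the input

    before = float(loss.data)
    SGD([layer.weight, layer.bias], lr=0.1).step()
    after = float(MSELoss().forward(layer.forward(x), target).data)
    assert after < before                   # 4. training actually works
```
If any one of the four assertions fails, the failure points at the exact stage where the chain broke.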
## Test Categories
### 🔥 Critical (Must Pass)
| Test File | What It Catches | Modules |
|-----------|-----------------|---------|
| `test_gradient_flow.py` | Broken backpropagation | 01-07 |
| `test_training_flow.py` | Training loop failures | 05-07 |
| `test_nlp_pipeline_flow.py` | NLP stack issues | 10-13 |
| `test_cnn_integration.py` | CNN gradient issues | 09 |
### 📋 Standard (Should Pass)
| Test File | What It Catches | Modules |
|-----------|-----------------|---------|
| `test_dataloader_integration.py` | Data pipeline issues | 08 |
| `test_api_simplification_integration.py` | API compatibility | All |
### 🔬 Scenario Tests
These test complete use cases:
- `integration_xor_test.py` - XOR learning (classic test)
- `integration_mnist_test.py` - MNIST classification
- `integration_cnn_test.py` - CNN on images
- `integration_tinygpt_test.py` - Language model training
## What Makes a Good Integration Test
### ✅ Good Integration Test
```python
def test_gradients_flow_through_mlp():
    """Gradients must reach all layers"""
    layers = [Linear(4, 4) for _ in range(5)]
    x = Tensor(np.random.randn(1, 4), requires_grad=True)
    target = Tensor(np.zeros((1, 4)))
    h = x
    for layer in layers:
        h = relu(layer(h))
    loss = mse_loss(h, target)
    loss.backward()

    # ALL layers must have gradients
    for i, layer in enumerate(layers):
        assert layer.weight.grad is not None, f"Layer {i} has no gradient!"
```
**Why it's good:**
- Tests the **boundary** between layers
- Catches gradient chain breaks
- Clear error message tells you WHERE it broke
### ❌ Bad Integration Test
```python
def test_linear_layer():
    """Test linear layer works"""
    layer = Linear(2, 3)
    x = Tensor([[1, 2]])
    y = layer(x)
    assert y.shape == (1, 3)
```
**Why it's bad:**
- This is a **unit test**, not integration
- Doesn't test interaction with other modules
- Belongs in `tests/03_layers/`
## Running Tests
```bash
# Run all integration tests
pytest tests/integration/ -v
# Run only gradient flow tests
pytest tests/integration/test_gradient_flow.py -v
# Run only training flow tests
pytest tests/integration/test_training_flow.py -v
# Run quick smoke tests (for CI)
pytest tests/integration/ -v -m quick
# Run with detailed output on failure
pytest tests/integration/ -v --tb=long
```
## Adding New Integration Tests
When adding a new module (e.g., Module 14: Profiling), ask:
1. **What other modules does it interact with?**
- Profiling interacts with training loops (07) and models (03)
2. **What could break at the boundary?**
- Profiling hooks might interfere with autograd
- Timing might change tensor operations
3. **Write a test that exercises the boundary:**
```python
def test_profiling_does_not_break_training():
    """Profiling should not interfere with gradient flow"""
    with profiler.profile():
        loss = model(x)
        loss.backward()  # Should still work!
    assert model.weight.grad is not None
```
## Coverage Gaps
### Currently Missing
| Module | Integration Test Needed |
|--------|------------------------|
| 14 Profiling | Profiler + training loop |
| 15 Quantization | Quantized model accuracy |
| 16 Compression | Compressed model still trains |
| 17 Memoization | Cached ops maintain correctness |
| 18 Acceleration | Accelerated ops match baseline |
### How to Fill Gaps
For each gap, create a test that:
1. Uses the module in a **realistic scenario**
2. Verifies **correctness** (not just "doesn't crash")
3. Checks **boundaries** with connected modules, as in the sketch below
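For example, here is a hedged sketch for the Module 18 gap. The `accelerated_matmul` import and the `Tensor.matmul` baseline call are hypothetical placeholders (Module 18 does not exist yet); adjust the names to the real API when it lands:
```python
import numpy as np
from tinytorch.core.tensor import Tensor
from tinytorch.core.losses import MSELoss

def test_accelerated_matmul_matches_baseline():
    """Accelerated op must match the baseline AND keep gradients flowing."""
    # Hypothetical import - replace with the real Module 18 API
    from tinytorch.core.acceleration import accelerated_matmul

    a = Tensor(np.random.randn(8, 4), requires_grad=True)
    b = Tensor(np.random.randn(4, 8), requires_grad=True)

    baseline = a.matmul(b)            # assumes a Tensor.matmul baseline op
    fast = accelerated_matmul(a, b)

    # Correctness: outputs agree within tolerance (not just "doesn't crash")
    assert np.allclose(baseline.data, fast.data, atol=1e-5)

    # Boundary: the accelerated path must not break autograd
    loss = MSELoss().forward(fast, Tensor(np.zeros((8, 8))))
    loss.backward()
    assert a.grad is not None and b.grad is not None
```
The same shape works for Modules 14-17: run the new module inside a small training step and assert both numerical agreement and gradient flow.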

tests/integration/test_nlp_pipeline_flow.py

@@ -0,0 +1,349 @@
"""
NLP Pipeline Flow Integration Tests
====================================
Tests that the NLP pipeline works end-to-end:
1. Tokenization produces valid token IDs
2. Embeddings convert tokens to vectors
3. Attention mechanisms process sequences
4. Transformers combine everything correctly
5. Gradients flow back through the entire pipeline
These tests catch issues at module boundaries in the NLP stack.
Modules tested: 10-13 (Tokenization → Embeddings → Attention → Transformers)
"""
import pytest
import numpy as np
import sys
from pathlib import Path
# Add project root
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
from tinytorch.core.tensor import Tensor
from tinytorch.core.autograd import enable_autograd
# Enable autograd
enable_autograd()
class TestEmbeddingGradientFlow:
    """
    Critical Test: Verify gradients flow through embeddings.

    Common bugs caught:
    - Embedding lookup not differentiable
    - Wrong gradient accumulation for repeated tokens
    - Shape mismatches between embedding and attention
    """

    def test_embedding_receives_gradients(self):
        """Embedding weights must receive gradients during training"""
        try:
            from tinytorch.core.embeddings import Embedding
        except ImportError:
            pytest.skip("Embedding module not yet implemented")

        vocab_size = 100
        embed_dim = 32
        embedding = Embedding(vocab_size, embed_dim)

        # Token IDs (integers)
        token_ids = [1, 5, 3, 7, 2]

        # Forward pass
        embedded = embedding.forward(token_ids)

        # Simple loss: sum of embeddings
        loss = Tensor(np.array([[embedded.data.sum()]]), requires_grad=True)
        loss.backward()

        # Embedding weights should have gradients
        assert embedding.weight.grad is not None, (
            "Embedding weights did not receive gradients!"
        )

        # Only used token embeddings should have non-zero gradients
        for token_id in token_ids:
            grad_row = embedding.weight.grad[token_id]
            assert np.any(grad_row != 0), (
                f"Token {token_id} embedding has zero gradient!"
            )

    def test_repeated_tokens_accumulate_gradients(self):
        """Same token appearing twice should have accumulated gradient"""
        try:
            from tinytorch.core.embeddings import Embedding
        except ImportError:
            pytest.skip("Embedding module not yet implemented")

        vocab_size = 10
        embed_dim = 4
        embedding = Embedding(vocab_size, embed_dim)

        # Token 5 appears twice
        token_ids = [5, 2, 5, 3]
        embedded = embedding.forward(token_ids)

        # Loss that weights all positions equally
        loss = Tensor(np.array([[embedded.data.sum()]]), requires_grad=True)
        loss.backward()

        # Token 5 should have ~2x the gradient of token 2 or 3
        grad_5 = np.linalg.norm(embedding.weight.grad[5])
        grad_2 = np.linalg.norm(embedding.weight.grad[2])

        # Allow some tolerance
        assert grad_5 > grad_2 * 1.5, (
            f"Repeated token gradient not accumulated!\n"
            f"  Token 5 (appears 2x) grad: {grad_5}\n"
            f"  Token 2 (appears 1x) grad: {grad_2}\n"
            f"  Expected ratio ~2, got {grad_5/grad_2:.2f}"
        )
class TestAttentionGradientFlow:
    """
    Critical Test: Verify gradients flow through attention mechanism.

    Common bugs caught:
    - Softmax gradient issues
    - Attention weights not differentiable
    - Query/Key/Value projection gradients
    """

    def test_attention_all_projections_receive_gradients(self):
        """Q, K, V projections must all receive gradients"""
        try:
            from tinytorch.core.attention import MultiHeadAttention
        except ImportError:
            pytest.skip("Attention module not yet implemented")

        embed_dim = 32
        num_heads = 4
        seq_len = 8
        batch_size = 2

        attention = MultiHeadAttention(embed_dim, num_heads)

        # Random input sequence
        x = Tensor(
            np.random.randn(batch_size, seq_len, embed_dim),
            requires_grad=True
        )

        # Forward pass (self-attention - single input for Q, K, V)
        output = attention.forward(x)

        # Simple loss
        loss = Tensor(np.array([[output.data.sum()]]), requires_grad=True)
        loss.backward()

        # All projection matrices should have gradients
        projections = ['W_q', 'W_k', 'W_v', 'W_o']
        for proj_name in projections:
            if hasattr(attention, proj_name):
                proj = getattr(attention, proj_name)
                if hasattr(proj, 'weight'):
                    assert proj.weight.grad is not None, (
                        f"{proj_name} did not receive gradients!"
                    )

    @pytest.mark.xfail(reason="Known issue: Attention gradient flow needs fix - see Module 12")
    def test_attention_input_receives_gradients(self):
        """Input to attention must receive gradients for residual connections"""
        try:
            from tinytorch.core.attention import MultiHeadAttention
        except ImportError:
            pytest.skip("Attention module not yet implemented")

        embed_dim = 16
        num_heads = 2
        attention = MultiHeadAttention(embed_dim, num_heads)

        x = Tensor(
            np.random.randn(1, 4, embed_dim),
            requires_grad=True
        )

        output = attention.forward(x)
        loss = Tensor(np.array([[output.data.sum()]]), requires_grad=True)
        loss.backward()

        assert x.grad is not None, (
            "Input to attention did not receive gradients!\n"
            "This breaks residual connections in Transformers."
        )
        assert x.grad.shape == x.shape, (
            f"Input gradient shape mismatch: {x.grad.shape} vs {x.shape}"
        )
class TestTransformerGradientFlow:
    """
    Critical Test: Verify gradients flow through complete Transformer.

    Common bugs caught:
    - Residual connection gradients
    - Layer norm gradient issues
    - Deep network vanishing gradients
    """

    def test_transformer_block_gradient_flow(self):
        """Gradients must flow through a complete transformer block"""
        try:
            from tinytorch.core.transformers import TransformerBlock
        except ImportError:
            pytest.skip("Transformer module not yet implemented")

        embed_dim = 32
        num_heads = 4
        ff_dim = 64

        block = TransformerBlock(embed_dim, num_heads, ff_dim)

        x = Tensor(
            np.random.randn(1, 8, embed_dim),
            requires_grad=True
        )

        output = block.forward(x)
        loss = Tensor(np.array([[output.data.sum()]]), requires_grad=True)
        loss.backward()

        # Input must receive gradients (for stacking blocks)
        assert x.grad is not None, (
            "Transformer block input did not receive gradients!"
        )

        # Gradient should not be too small (vanishing)
        grad_norm = np.linalg.norm(x.grad)
        assert grad_norm > 1e-6, (
            f"Vanishing gradients in transformer block: {grad_norm}"
        )

    def test_stacked_transformer_blocks(self):
        """Gradients must flow through multiple stacked blocks"""
        try:
            from tinytorch.core.transformers import TransformerBlock
        except ImportError:
            pytest.skip("Transformer module not yet implemented")

        embed_dim = 32
        num_heads = 4
        ff_dim = 64
        num_layers = 4

        blocks = [TransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]

        x = Tensor(
            np.random.randn(1, 8, embed_dim),
            requires_grad=True
        )

        # Forward through all blocks
        h = x
        for block in blocks:
            h = block.forward(h)

        loss = Tensor(np.array([[h.data.sum()]]), requires_grad=True)
        loss.backward()

        # Input must receive gradients through all layers
        assert x.grad is not None, (
            f"Gradients did not flow through {num_layers} transformer blocks!"
        )

        # Check gradient magnitude is reasonable
        grad_norm = np.linalg.norm(x.grad)
        assert grad_norm > 1e-8, (
            f"Severe vanishing gradients through {num_layers} blocks: {grad_norm}"
        )
class TestNLPPipelineEndToEnd:
    """
    Integration Test: Full NLP pipeline from tokens to loss.

    This tests the complete flow:
    tokens → embedding → attention → linear → loss
    """

    def test_complete_nlp_forward_backward(self):
        """Complete NLP pipeline must work end-to-end"""
        try:
            from tinytorch.core.embeddings import Embedding
            from tinytorch.core.attention import MultiHeadAttention
            from tinytorch.core.layers import Linear
            from tinytorch.core.losses import CrossEntropyLoss
        except ImportError:
            pytest.skip("NLP modules not yet implemented")

        vocab_size = 100
        embed_dim = 32
        num_heads = 4
        num_classes = 10
        seq_len = 8

        # Build pipeline
        embedding = Embedding(vocab_size, embed_dim)
        attention = MultiHeadAttention(embed_dim, num_heads)
        classifier = Linear(embed_dim, num_classes)
        loss_fn = CrossEntropyLoss()

        # Input: token IDs
        token_ids = list(np.random.randint(0, vocab_size, seq_len))
        target = Tensor(np.array([[3]]))  # Class 3

        # Forward pass
        embedded = embedding.forward(token_ids)  # [seq_len, embed_dim]

        # Reshape for attention: add batch dimension
        embedded_batched = Tensor(embedded.data.reshape(1, seq_len, embed_dim), requires_grad=True)
        attended = attention.forward(embedded_batched)  # [1, seq_len, embed_dim]

        # Mean pooling over the sequence dimension (axis=1)
        pooled = Tensor(attended.data.mean(axis=1), requires_grad=True)  # [1, embed_dim]
        logits = classifier.forward(pooled)  # [1, num_classes]
        loss = loss_fn.forward(logits, target)

        # Backward pass
        loss.backward()

        # Verify gradients flowed to embedding
        assert embedding.weight.grad is not None, (
            "Gradients did not flow back to embeddings!"
        )

        # Verify classifier received gradients
        assert classifier.weight.grad is not None, (
            "Classifier did not receive gradients!"
        )
# Quick smoke tests for CI
@pytest.mark.quick
class TestQuickNLPSmoke:
    """Fast tests for CI"""

    def test_embedding_forward_works(self):
        """Embedding forward should not crash"""
        try:
            from tinytorch.core.embeddings import Embedding
        except ImportError:
            pytest.skip("Embedding module not yet implemented")

        embedding = Embedding(100, 32)
        result = embedding.forward([1, 2, 3])
        assert result.shape[0] == 3
        assert result.shape[1] == 32


if __name__ == "__main__":
    pytest.main([__file__, "-v"])

tests/integration/test_training_flow.py

@@ -0,0 +1,356 @@
"""
Training Flow Integration Tests
================================
Tests that the complete training pipeline works:
1. Forward pass produces valid outputs
2. Loss computes correctly
3. Backward pass populates gradients
4. Optimizer updates weights
5. Loss decreases over iterations
These tests catch issues that unit tests miss - where modules
work individually but fail when connected.
Modules tested: 01-07 (Tensor → Training)
"""
import pytest
import numpy as np
import sys
from pathlib import Path
# Add project root
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
from tinytorch.core.tensor import Tensor
from tinytorch.core.layers import Linear
from tinytorch.core.activations import ReLU, Sigmoid
from tinytorch.core.losses import MSELoss, CrossEntropyLoss
from tinytorch.core.optimizers import SGD, Adam
from tinytorch.core.autograd import enable_autograd
# Enable autograd for all tests
enable_autograd()
class TestOptimizerActuallyUpdatesWeights:
    """
    Critical Test: Verify optimizer.step() actually changes weights.

    Common bugs caught:
    - Optimizer not connected to parameters
    - Gradients not flowing to weights
    - Learning rate is zero
    - step() not implemented correctly
    """

    def test_sgd_updates_weights(self):
        """SGD must modify weights after step()"""
        layer = Linear(2, 1)
        optimizer = SGD([layer.weight, layer.bias], lr=0.1)

        # Store initial weights
        initial_weight = layer.weight.data.copy()
        initial_bias = layer.bias.data.copy()

        # Forward + backward
        x = Tensor([[1.0, 2.0]], requires_grad=True)
        target = Tensor([[5.0]])
        output = layer.forward(x)
        loss = MSELoss().forward(output, target)
        loss.backward()

        # Verify gradients exist
        assert layer.weight.grad is not None, "Weight gradient is None!"
        assert layer.bias.grad is not None, "Bias gradient is None!"

        # Step should update weights
        optimizer.step()

        # Weights MUST be different
        weight_changed = not np.allclose(initial_weight, layer.weight.data)
        bias_changed = not np.allclose(initial_bias, layer.bias.data)
        assert weight_changed, (
            f"SGD.step() did not change weights!\n"
            f"  Before: {initial_weight}\n"
            f"  After: {layer.weight.data}\n"
            f"  Grad: {layer.weight.grad}"
        )
        assert bias_changed, "SGD.step() did not change bias!"

    def test_adam_updates_weights(self):
        """Adam must modify weights after step()"""
        layer = Linear(2, 1)
        optimizer = Adam([layer.weight, layer.bias], lr=0.1)

        initial_weight = layer.weight.data.copy()

        x = Tensor([[1.0, 2.0]], requires_grad=True)
        target = Tensor([[5.0]])
        output = layer.forward(x)
        loss = MSELoss().forward(output, target)
        loss.backward()
        optimizer.step()

        assert not np.allclose(initial_weight, layer.weight.data), (
            "Adam.step() did not change weights!"
        )
class TestTrainingReducesLoss:
    """
    Critical Test: Verify that training actually reduces loss.

    Common bugs caught:
    - Gradients have wrong sign
    - Learning rate too high (divergence)
    - Optimizer not using gradients correctly
    - Loss function returning wrong values
    """

    def test_mlp_loss_decreases(self):
        """A simple MLP must learn an XOR-like pattern"""
        # Simple 2-layer network
        layer1 = Linear(2, 4)
        relu = ReLU()
        layer2 = Linear(4, 1)
        sigmoid = Sigmoid()
        loss_fn = MSELoss()

        params = [layer1.weight, layer1.bias, layer2.weight, layer2.bias]
        optimizer = SGD(params, lr=0.5)

        # XOR-like data
        X = Tensor([
            [0., 0.],
            [0., 1.],
            [1., 0.],
            [1., 1.]
        ], requires_grad=True)
        y = Tensor([[0.], [1.], [1.], [0.]])

        # Track loss over time
        losses = []
        for epoch in range(100):
            # Zero gradients
            for p in params:
                if p.grad is not None:
                    p.grad = np.zeros_like(p.grad)

            # Forward
            h = relu.forward(layer1.forward(X))
            out = sigmoid.forward(layer2.forward(h))
            loss = loss_fn.forward(out, y)
            losses.append(float(loss.data))

            # Backward
            loss.backward()

            # Update
            optimizer.step()

        # Loss MUST decrease
        initial_loss = losses[0]
        final_loss = losses[-1]
        assert final_loss < initial_loss, (
            f"Training did not reduce loss!\n"
            f"  Initial: {initial_loss:.4f}\n"
            f"  Final: {final_loss:.4f}\n"
            f"  Loss history: {losses[:5]}...{losses[-5:]}"
        )

        # Loss should decrease significantly (at least 20%)
        improvement = (initial_loss - final_loss) / initial_loss
        assert improvement > 0.2, (
            f"Training improved loss by only {improvement*100:.1f}%\n"
            f"  Expected at least 20% improvement"
        )
class TestGradientChainNotBroken:
    """
    Critical Test: Verify gradient chain is not broken.

    Common bugs caught:
    - requires_grad not propagating
    - Operations not recording grad_fn
    - Intermediate tensors breaking the chain
    """

    def test_deep_network_gradient_chain(self):
        """Gradients must flow through 5 layers"""
        layers = [Linear(4, 4) for _ in range(5)]
        relu = ReLU()

        x = Tensor(np.random.randn(1, 4), requires_grad=True)
        target = Tensor(np.random.randn(1, 4))

        # Forward through all layers
        h = x
        for layer in layers:
            h = relu.forward(layer.forward(h))

        loss = MSELoss().forward(h, target)
        loss.backward()

        # ALL layers must have gradients
        for i, layer in enumerate(layers):
            assert layer.weight.grad is not None, (
                f"Layer {i} weight.grad is None - gradient chain broken!"
            )
            assert layer.bias.grad is not None, (
                f"Layer {i} bias.grad is None - gradient chain broken!"
            )
            # Gradients should be non-trivial
            grad_norm = np.linalg.norm(layer.weight.grad)
            assert grad_norm > 1e-10, (
                f"Layer {i} has vanishing gradients: {grad_norm}"
            )

    def test_input_receives_gradients(self):
        """Input tensor must receive gradients for visualization/debugging"""
        layer = Linear(3, 2)
        x = Tensor([[1., 2., 3.]], requires_grad=True)
        target = Tensor([[1., 0.]])

        output = layer.forward(x)
        loss = MSELoss().forward(output, target)
        loss.backward()

        assert x.grad is not None, "Input tensor did not receive gradients!"
        assert x.grad.shape == x.shape, (
            f"Input gradient shape mismatch: {x.grad.shape} vs {x.shape}"
        )
class TestZeroGradWorks:
    """
    Critical Test: Verify zero_grad clears gradients properly.

    Common bugs caught:
    - Gradients accumulating across batches
    - zero_grad not actually zeroing
    - Memory leaks from gradient accumulation
    """

    def test_gradients_dont_accumulate_after_zero_grad(self):
        """Gradients must not accumulate when zero_grad is called"""
        layer = Linear(2, 1)
        optimizer = SGD([layer.weight, layer.bias], lr=0.1)

        x = Tensor([[1., 2.]], requires_grad=True)
        target = Tensor([[1.]])

        # First forward/backward
        out1 = layer.forward(x)
        loss1 = MSELoss().forward(out1, target)
        loss1.backward()
        grad_after_first = layer.weight.grad.copy()

        # Zero gradients
        optimizer.zero_grad()

        # Verify zeroed
        assert layer.weight.grad is None or np.allclose(layer.weight.grad, 0), (
            "zero_grad() did not clear weight gradients!"
        )

        # Second forward/backward
        out2 = layer.forward(x)
        loss2 = MSELoss().forward(out2, target)
        loss2.backward()
        grad_after_second = layer.weight.grad.copy()

        # Gradients should be similar magnitude (not accumulated)
        ratio = np.linalg.norm(grad_after_second) / np.linalg.norm(grad_after_first)
        assert 0.5 < ratio < 2.0, (
            f"Gradients appear to be accumulating!\n"
            f"  First grad norm: {np.linalg.norm(grad_after_first)}\n"
            f"  Second grad norm: {np.linalg.norm(grad_after_second)}\n"
            f"  Ratio: {ratio} (should be ~1.0)"
        )
class TestBatchTraining:
    """
    Critical Test: Verify batch training works correctly.

    Common bugs caught:
    - Shape mismatches with batches
    - Mean vs sum reduction issues
    - Gradient scaling problems
    """

    def test_batch_gradients_are_averaged(self):
        """Gradients should be averaged over batch (not summed)"""
        layer = Linear(2, 1)

        # Single sample
        x1 = Tensor([[1., 2.]], requires_grad=True)
        target1 = Tensor([[3.]])
        out1 = layer.forward(x1)
        loss1 = MSELoss().forward(out1, target1)
        loss1.backward()
        single_grad = layer.weight.grad.copy()

        # Reset
        layer.weight.grad = None
        layer.bias.grad = None

        # Batch of same sample repeated 4 times
        x_batch = Tensor([[1., 2.]] * 4, requires_grad=True)
        target_batch = Tensor([[3.]] * 4)
        out_batch = layer.forward(x_batch)
        loss_batch = MSELoss().forward(out_batch, target_batch)
        loss_batch.backward()
        batch_grad = layer.weight.grad.copy()

        # Gradients should be similar (averaged, not 4x)
        ratio = np.linalg.norm(batch_grad) / np.linalg.norm(single_grad)
        assert 0.8 < ratio < 1.2, (
            f"Batch gradients not properly averaged!\n"
            f"  Single sample grad norm: {np.linalg.norm(single_grad)}\n"
            f"  Batch (4x same) grad norm: {np.linalg.norm(batch_grad)}\n"
            f"  Ratio: {ratio:.2f} (should be ~1.0)"
        )
# Quick smoke test for CI
@pytest.mark.quick
class TestQuickTrainingSmoke:
    """Fast tests for CI - just verify nothing crashes"""

    def test_simple_training_step(self):
        """One training step should not crash"""
        layer = Linear(2, 1)
        opt = SGD([layer.weight, layer.bias], lr=0.1)

        x = Tensor([[1., 2.]], requires_grad=True)
        y = Tensor([[1.]])

        out = layer.forward(x)
        loss = MSELoss().forward(out, y)
        loss.backward()
        opt.step()

        assert True  # If we got here, it works


if __name__ == "__main__":
    pytest.main([__file__, "-v"])