TinyTorch

mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-05-28 01:36:58 -05:00

Files

Vijay Janapa Reddi cbf553f1c7 fix(autograd): Complete transformer gradient flow - ALL PARAMETERS NOW WORK!

Critical fixes to enable full gradient flow through transformer:

1. PermuteBackward:
   - Added general axis permutation backward function
   - Handles multi-dimensional transposes like (0, 2, 1, 3)
   - Fixes MultiHeadAttention breaking the computation graph by calling raw np.transpose

2. GELUBackward:
   - Implemented GELU activation gradient
   - Uses the derivative of the tanh approximation of GELU
   - Patched GELU.forward() in enable_autograd()

3. MultiHeadAttention fixes:
   - Replaced raw np.transpose with permute_axes helper
   - Now attaches PermuteBackward to preserve computation graph
   - Q/K/V projections now receive gradients ✅
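
For illustration, the two backward rules described in items 1 and 2 can be sketched in plain NumPy. This is a minimal sketch, not the code in this commit: the function names permute_backward, gelu_tanh and gelu_tanh_backward are hypothetical, and the sketch assumes the forward permutation is np.transpose(x, axes) and that GELU uses the standard tanh approximation with constant 0.044715.

```python
import numpy as np

def permute_backward(grad_output, axes):
    """Gradient of y = np.transpose(x, axes) with respect to x.

    Hypothetical helper: the backward of a permutation is the
    inverse permutation applied to the upstream gradient.
    """
    # argsort of a permutation yields its inverse permutation
    inverse_axes = tuple(int(a) for a in np.argsort(axes))
    return np.transpose(grad_output, inverse_axes)

def gelu_tanh(x):
    """GELU forward pass using the tanh approximation (assumed form)."""
    c = np.sqrt(2.0 / np.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x**3)))

def gelu_tanh_backward(grad_output, x):
    """Gradient of gelu_tanh with respect to x, times the upstream gradient."""
    c = np.sqrt(2.0 / np.pi)
    u = c * (x + 0.044715 * x**3)
    t = np.tanh(u)
    du_dx = c * (1.0 + 3.0 * 0.044715 * x**2)
    # product rule on 0.5 * x * (1 + tanh(u)); d tanh(u)/du = 1 - tanh(u)^2
    dgelu_dx = 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t**2) * du_dx
    return grad_output * dgelu_dx
```

With axes = (0, 2, 1, 3) the inverse permutation is the same tuple, since swapping axes 1 and 2 undoes itself; a permutation such as (0, 2, 3, 1) would need (0, 3, 1, 2) on the way back, which is why the sketch computes the inverse rather than reusing axes.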

Results:
- Before: 0/21 parameters with gradients (0%)
- After: 21/21 parameters with gradients (100%) ✅
- Single-batch overfit loss: 4.66 → 0.10 (97.9% reduction) ✅
- ALL Phase 1 architecture tests PASS ✅

Gradient flow verified through:
- Token + Position embeddings ✅
- LayerNorm (all 3 instances) ✅
- Multi-Head Attention (Q, K, V, out projections) ✅
- MLP (both linear layers) ✅
- LM head ✅
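
A "parameters with gradients" count like the one reported above could be produced by a check along these lines. This is only a sketch under stated assumptions: Param is a hypothetical stand-in for a trainable tensor with a .grad attribute, gradient_coverage is a hypothetical helper, and treating an all-zero gradient as missing is a choice made here, not something the commit states.

```python
import numpy as np

class Param:
    """Hypothetical stand-in for a trainable tensor with a .grad slot."""
    def __init__(self, name, data):
        self.name = name
        self.data = np.asarray(data, dtype=float)
        self.grad = None  # filled in by a backward pass

def gradient_coverage(params):
    """Return (count_with_grad, total, names_missing_grad).

    A parameter counts as missing when its grad is None or all zeros
    (an assumption of this sketch; all-zero gradients can be legitimate).
    """
    missing = [p.name for p in params
               if p.grad is None or not np.any(p.grad)]
    return len(params) - len(missing), len(params), missing

# Usage: one parameter received a gradient, the other did not.
params = [Param("q_proj", np.ones((2, 2))), Param("k_proj", np.ones((2, 2)))]
params[0].grad = np.full((2, 2), 0.5)
have, total, missing = gradient_coverage(params)
print(f"{have}/{total} parameters with gradients; missing: {missing}")
```

Listing the missing names alongside the count makes it easier to see which layer cut the graph, as the raw np.transpose call did for the Q/K/V projections.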

The transformer architecture is now fully differentiable!

2025-10-28 08:18:20 -04:00

01_tensor

fix(module-01): Fix batched matmul and transpose grad preservation

2025-10-27 20:28:53 -04:00

02_activations

fix(module-02): Rewrite Softmax to use Tensor operations

2025-10-27 20:29:35 -04:00

03_layers

fix(module-03): Rewrite Dropout to use Tensor operations

2025-10-27 20:29:43 -04:00

04_losses

feat: Add Milestone 04 (CNN Revolution 1998) + Clean spatial imports