mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-28 01:36:58 -05:00
Critical fixes to enable full gradient flow through transformer: 1. PermuteBackward: - Added general axis permutation backward function - Handles multi-dimensional transposes like (0, 2, 1, 3) - Fixed MultiHeadAttention breaking graph with np.transpose 2. GELUBackward: - Implemented GELU activation gradient - Uses tanh approximation derivative formula - Patched GELU.forward() in enable_autograd() 3. MultiHeadAttention fixes: - Replaced raw np.transpose with permute_axes helper - Now attaches PermuteBackward to preserve computation graph - Q/K/V projections now receive gradients ✅ Results: - Before: 0/21 parameters with gradients (0%) - After: 21/21 parameters with gradients (100%) ✅ - Single batch overfit: 4.66 → 0.10 (97.9% improvement!) ✅ - ALL Phase 1 architecture tests PASS ✅ Gradient flow verified through: - Token + Position embeddings ✅ - LayerNorm (all 3 instances) ✅ - Multi-Head Attention (Q, K, V, out projections) ✅ - MLP (both linear layers) ✅ - LM head ✅ The transformer architecture is now fully differentiable!