Major rewrite for gradient flow:

- scaled_dot_product_attention: use Tensor ops (matmul, transpose, softmax)
- MultiHeadAttention: process all heads in parallel with 4D batched tensors
- No explicit batch loops or .data extraction
- Proper mask broadcasting for the (batch * heads) dimension

This is the most complex fix: attention is now fully differentiable end-to-end.
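For reference, a minimal NumPy sketch of the shape logic this commit describes. The function name scaled_dot_product_attention matches the commit; everything else (argument names, mask convention, the -1e9 fill value) is illustrative, not TinyTorch's actual API. In the real code, TinyTorch Tensor ops (matmul, transpose, softmax) replace the NumPy calls so the computation stays differentiable end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq, head_dim) -- all heads processed at once,
    # with no explicit per-batch loop.
    d_k = q.shape[-1]
    # Scores: (batch, heads, seq, seq); matmul broadcasts over leading dims.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)
    if mask is not None:
        # A (1, 1, seq, seq) mask broadcasts across both batch and heads.
        # If heads are instead folded into a flat (batch * heads) leading
        # dimension, the mask must be repeated per head before applying.
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v  # (batch, heads, seq, head_dim)

# Example: batch 2, 4 heads, sequence length 5, head_dim 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 5, 8))
k = rng.standard_normal((2, 4, 5, 8))
v = rng.standard_normal((2, 4, 5, 8))
causal = np.tril(np.ones((5, 5), dtype=bool))[None, None]  # (1, 1, 5, 5)
out = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape)  # (2, 4, 5, 8)
```

Processing the heads as one 4D batched matmul, rather than looping and extracting raw .data, is what lets gradients flow through every head in a single graph.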