Major rewrite for gradient flow:
- scaled_dot_product_attention: Use Tensor ops (matmul, transpose, softmax)
- MultiHeadAttention: Process all heads in parallel with 4D batched tensors
- No explicit batch loops or .data extraction
- Proper mask broadcasting for (batch * heads) dimension
This is the most complex fix - attention is now fully differentiable end-to-end