feat: Add comprehensive integration tests for attention module

 Created test_tensor_attention_integration.py:
- Basic tensor-attention integration with real TinyTorch components
- Self-attention wrapper testing with proper Tensor objects
- Attention masking integration (causal, padding, bidirectional)
- Batched tensor processing and different data types
- Numerical stability and gradient flow compatibility

 Created test_attention_pipeline_integration.py:
- Complete transformer-like pipeline testing
- Multi-layer attention stacks (transformer encoders)
- Causal masking for language modeling workflows
- Encoder-decoder architecture integration
- Cross-module integration with dense layers and activations
- Real-world scenarios: sequence classification, seq2seq translation
- Scalability testing across different sequence lengths and dimensions

 Updated tests/README.md:
- Documented new attention integration tests (15→17 total tests)
- Organized tests by category (Foundation, Architecture, Training, Inference Serving)
- Added specific usage examples for attention tests
- Clear documentation of test coverage and purpose

Integration tests ensure:
- Attention works with real Tensor objects (not mocks)
- Cross-module compatibility with dense, spatial, activations
- Complete ML workflows (classification, translation, transformers)
- Realistic transformer architectures and patterns
- System-level regression detection for attention functionality
Vijay Janapa Reddi
2025-07-18 00:21:48 -04:00
parent 59d58718f9
commit a9ee348355
3 changed files with 647 additions and 3 deletions


@@ -2,7 +2,7 @@
## ⚠️ **CRITICAL DIRECTORY - DO NOT DELETE**
This directory contains **15 integration test files** that verify cross-module functionality across the entire TinyTorch system. These tests represent significant development effort and are essential for:
This directory contains **17 integration test files** that verify cross-module functionality across the entire TinyTorch system. These tests represent significant development effort and are essential for:
- **Module integration validation**
- **Cross-component compatibility**
@@ -14,13 +14,42 @@ This directory contains **15 integration test files** that verify cross-module f
- `test_utils.py` - Shared testing utilities
- `test_integration_report.md` - Test documentation
## 🧪 **Integration Test Coverage**
### Foundation Integration
- `test_tensor_activations_integration.py` - Tensor + Activations
- `test_layers_networks_integration.py` - Layers + Dense Networks
- `test_tensor_autograd_integration.py` - Tensor + Autograd
### Architecture Integration
- `test_tensor_attention_integration.py` - **NEW**: Tensor + Attention mechanisms
- `test_attention_pipeline_integration.py` - **NEW**: Complete transformer-like pipelines
- `test_tensor_cnn_integration.py` - Tensor + Spatial/CNN
- `test_cnn_networks_integration.py` - Spatial + Dense Networks
- `test_cnn_pipeline_integration.py` - Complete CNN pipelines
### Training & Data Integration
- `test_dataloader_tensor_integration.py` - DataLoader + Tensor
- `test_training_integration.py` - Complete training workflows
- `test_ml_pipeline_integration.py` - End-to-end ML pipelines
### Inference Serving Integration
- `test_compression_integration.py` - Model compression
- `test_kernels_integration.py` - Custom operations
- `test_benchmarking_integration.py` - Performance measurement
- `test_mlops_integration.py` - Deployment and serving
## 🔧 **Usage**
```bash
# Run all integration tests
pytest tests/ -v
# Run specific module integration
pytest tests/test_tensor_activations_integration.py -v
pytest tests/test_tensor_attention_integration.py -v
pytest tests/test_attention_pipeline_integration.py -v
# Run attention-related tests
pytest tests/ -k "attention" -v
```
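The attention tests selected above include padding-mask checks that expect a mask of shape `(batch, max_length, max_length)`. The following is a minimal sketch of such a mask, not TinyTorch's `create_padding_mask`: the helper name `padding_mask_sketch` is hypothetical, and the convention (1 = visible key, 0 = padding, applied identically to every query row) is an assumption.

```python
import numpy as np

def padding_mask_sketch(lengths, max_length):
    """Return a (batch, max_length, max_length) mask; 1 = visible key, 0 = padding.

    Hypothetical helper for illustration only; the real TinyTorch
    create_padding_mask may use a different convention.
    """
    lengths = np.asarray(lengths)
    # key_visible[b, j] is True when key position j lies inside sequence b
    key_visible = np.arange(max_length)[None, :] < lengths[:, None]
    # Repeat the key-visibility row for every query position
    mask = np.repeat(key_visible[:, None, :], max_length, axis=1)
    return mask.astype(np.float32)

mask = padding_mask_sketch([4, 2, 3], max_length=4)
assert mask.shape == (3, 4, 4)
assert np.all(mask[0] == 1)          # full-length sequence: nothing is masked
assert np.all(mask[1][:, 2:] == 0)   # length-2 sequence: keys 2 and 3 are hidden
```

Under this convention a full-length sequence yields an all-ones mask, so passing it to attention has no masking effect.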
## 🚨 **Recovery Instructions**
@@ -31,4 +60,8 @@ git status # Verify recovery
```
## 📊 **Test Coverage**
These integration tests complement the inline tests in each module's `*_dev.py` files, providing comprehensive system validation.
These integration tests complement the inline tests in each module's `*_dev.py` files, providing comprehensive system validation with focus on:
- **Real component integration** (not mocks)
- **Cross-module compatibility**
- **Realistic ML workflows** (classification, seq2seq, transformers)
- **Performance and scalability**
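As a rough illustration of the properties these tests assert (each row of the attention matrix sums to 1, and a causal mask removes attention to future positions), here is a minimal NumPy sketch. It is not TinyTorch's implementation: `reference_attention` and `causal_mask` are hypothetical stand-ins, and the mask convention (1 = may attend, 0 = blocked) is an assumption inferred from the tests.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len)))

def reference_attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Hypothetical reference, assuming mask == 0 marks blocked positions.
    """
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a large negative score, so softmax maps them to ~0
        scores = np.where(mask == 0, -1e9, scores)
    # Subtract the row maximum before exponentiating for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8)) * 0.1
out, w = reference_attention(x, x, x, causal_mask(5))
assert out.shape == (5, 8)
assert np.allclose(w.sum(axis=-1), 1.0)   # rows are probability distributions
assert np.all(np.triu(w, k=1) < 1e-6)     # no attention to future positions
```

The three assertions mirror the shape, normalization, and causal-masking checks in the integration tests, using plain arrays in place of `Tensor.data`.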


@@ -0,0 +1,369 @@
"""
Integration Tests - Attention Pipeline
Tests attention mechanism in complete ML pipelines.
Uses actual TinyTorch components to verify transformer-like architectures work correctly.
"""
import pytest
import numpy as np
from test_utils import setup_integration_test

# Ensure proper setup before importing
setup_integration_test()

# Import ONLY from TinyTorch package
from tinytorch.core.tensor import Tensor
from tinytorch.core.attention import scaled_dot_product_attention, SelfAttention, create_causal_mask
from tinytorch.core.layers import Dense
from tinytorch.core.activations import ReLU, Softmax
from tinytorch.core.dense import Sequential


class TestAttentionPipelineIntegration:
    """Test attention in complete ML pipelines with other modules."""

    def test_attention_dense_pipeline(self):
        """Test attention followed by dense layers (transformer-like)."""
        seq_len, d_model = 8, 16
        vocab_size = 10

        # Create input sequence (like word embeddings)
        embeddings = Tensor(np.random.randn(seq_len, d_model) * 0.1)

        # Step 1: Self-attention (transformer block)
        self_attn = SelfAttention(d_model)
        attended_output, attention_weights = self_attn(embeddings.data)

        # Step 2: Dense feedforward network
        dense_network = Sequential([
            Dense(input_size=d_model, output_size=d_model * 2),
            ReLU(),
            Dense(input_size=d_model * 2, output_size=d_model)
        ])

        # Apply dense network to each position
        ff_outputs = []
        for i in range(seq_len):
            pos_input = Tensor(attended_output[i:i+1])  # Single position
            pos_output = dense_network(pos_input)
            ff_outputs.append(pos_output.data)
        final_output = np.concatenate(ff_outputs, axis=0)

        # Step 3: Final classification head
        classifier = Dense(input_size=d_model, output_size=vocab_size)

        # Classify each position
        predictions = []
        for i in range(seq_len):
            pos_input = Tensor(final_output[i:i+1])
            pred = classifier(pos_input)
            predictions.append(pred.data)
        final_predictions = np.concatenate(predictions, axis=0)

        # Verify complete pipeline
        assert final_predictions.shape == (seq_len, vocab_size), f"Expected shape {(seq_len, vocab_size)}, got {final_predictions.shape}"
        assert not np.any(np.isnan(final_predictions)), "Pipeline should not produce NaN"
        assert attention_weights.shape == (seq_len, seq_len), "Attention weights should be preserved"

        # Verify attention weights are sensible
        assert np.allclose(np.sum(attention_weights, axis=-1), 1.0), "Attention weights should sum to 1"

    def test_multi_layer_attention_pipeline(self):
        """Test multiple attention layers in sequence (like transformer encoder)."""
        seq_len, d_model = 6, 12
        num_layers = 3

        # Initial embeddings
        x = Tensor(np.random.randn(seq_len, d_model) * 0.1)
        current_input = x.data

        # Multi-layer attention (transformer encoder stack)
        attention_layers = []
        for layer in range(num_layers):
            # Create attention layer
            attn = SelfAttention(d_model)
            attention_layers.append(attn)

            # Apply attention
            attn_output, attn_weights = attn(current_input)

            # Simple residual connection (add input to output)
            if current_input.shape == attn_output.shape:
                current_input = current_input + attn_output
            else:
                current_input = attn_output

            # Verify intermediate outputs
            assert attn_output.shape == (seq_len, d_model), f"Layer {layer} output shape wrong"
            assert attn_weights.shape == (seq_len, seq_len), f"Layer {layer} attention weights wrong"
            assert np.allclose(np.sum(attn_weights, axis=-1), 1.0), f"Layer {layer} weights should sum to 1"

        # Final output verification
        assert current_input.shape == (seq_len, d_model), "Multi-layer output should preserve shape"
        assert not np.any(np.isnan(current_input)), "Multi-layer pipeline should not produce NaN"

    def test_attention_with_causal_masking_pipeline(self):
        """Test attention with causal masking in language modeling pipeline."""
        seq_len, d_model = 10, 16
        vocab_size = 20

        # Create sequence (like tokens in a sentence)
        token_embeddings = Tensor(np.random.randn(seq_len, d_model) * 0.1)

        # Create causal mask for autoregressive generation
        causal_mask = create_causal_mask(seq_len)

        # Apply causal self-attention
        attn_output, attn_weights = scaled_dot_product_attention(
            token_embeddings.data, token_embeddings.data, token_embeddings.data, causal_mask
        )

        # Language modeling head (predict next token)
        lm_head = Dense(input_size=d_model, output_size=vocab_size)

        # Generate predictions for each position
        predictions = []
        for pos in range(seq_len):
            pos_input = Tensor(attn_output[pos:pos+1])
            pred = lm_head(pos_input)
            predictions.append(pred.data)
        all_predictions = np.concatenate(predictions, axis=0)

        # Verify causal masking worked
        assert np.all(np.triu(attn_weights, k=1) < 1e-6), "Causal mask should prevent future attention"
        assert all_predictions.shape == (seq_len, vocab_size), "Should predict vocabulary for each position"

        # Verify predictions are reasonable for language modeling
        for pos in range(seq_len):
            pos_pred = all_predictions[pos]
            assert not np.any(np.isnan(pos_pred)), f"Position {pos} predictions should not be NaN"
            assert len(pos_pred) == vocab_size, f"Position {pos} should predict full vocabulary"

    def test_attention_encoder_decoder_pipeline(self):
        """Test attention in encoder-decoder architecture."""
        src_len, tgt_len, d_model = 6, 8, 12

        # Source sequence (encoder input)
        src_embeddings = Tensor(np.random.randn(src_len, d_model) * 0.1)

        # Target sequence (decoder input)
        tgt_embeddings = Tensor(np.random.randn(tgt_len, d_model) * 0.1)

        # Encoder: self-attention on source
        encoder = SelfAttention(d_model)
        encoder_output, encoder_weights = encoder(src_embeddings.data)

        # Decoder: self-attention on target with causal mask
        decoder = SelfAttention(d_model)
        causal_mask = create_causal_mask(tgt_len)
        decoder_self_attn, decoder_weights = scaled_dot_product_attention(
            tgt_embeddings.data, tgt_embeddings.data, tgt_embeddings.data, causal_mask
        )

        # Cross-attention: target queries attend to source keys/values
        cross_attn_output, cross_attn_weights = scaled_dot_product_attention(
            decoder_self_attn,  # Queries from decoder
            encoder_output,     # Keys from encoder
            encoder_output      # Values from encoder
        )

        # Final output layer
        output_layer = Dense(input_size=d_model, output_size=d_model)
        final_outputs = []
        for pos in range(tgt_len):
            pos_input = Tensor(cross_attn_output[pos:pos+1])
            pos_output = output_layer(pos_input)
            final_outputs.append(pos_output.data)
        final_sequence = np.concatenate(final_outputs, axis=0)

        # Verify encoder-decoder pipeline
        assert encoder_output.shape == (src_len, d_model), "Encoder output should preserve source shape"
        assert decoder_self_attn.shape == (tgt_len, d_model), "Decoder self-attention should preserve target shape"
        assert cross_attn_output.shape == (tgt_len, d_model), "Cross-attention should output target length"
        assert final_sequence.shape == (tgt_len, d_model), "Final output should match target length"

        # Verify attention patterns
        assert encoder_weights.shape == (src_len, src_len), "Encoder attention should be source x source"
        assert decoder_weights.shape == (tgt_len, tgt_len), "Decoder attention should be target x target"
        assert cross_attn_weights.shape == (tgt_len, src_len), "Cross-attention should be target x source"

        # Verify causal masking in decoder
        assert np.all(np.triu(decoder_weights, k=1) < 1e-6), "Decoder should have causal masking"

    def test_attention_with_multiple_architectures(self):
        """Test attention integrated with different architecture types."""
        seq_len, d_model = 8, 16

        # Input sequence
        x = Tensor(np.random.randn(seq_len, d_model) * 0.1)

        # Path 1: Attention → Dense layers
        attn1 = SelfAttention(d_model)
        attn_out1, _ = attn1(x.data)
        dense_path = Sequential([
            Dense(input_size=d_model, output_size=d_model // 2),
            ReLU(),
            Dense(input_size=d_model // 2, output_size=d_model)
        ])

        # Apply dense to sequence (position by position)
        dense_outputs = []
        for i in range(seq_len):
            pos_input = Tensor(attn_out1[i:i+1])
            pos_output = dense_path(pos_input)
            dense_outputs.append(pos_output.data)
        dense_result = np.concatenate(dense_outputs, axis=0)

        # Path 2: Multiple attention layers
        attn2 = SelfAttention(d_model)
        attn3 = SelfAttention(d_model)
        attn_out2, _ = attn2(x.data)
        attn_out3, _ = attn3(attn_out2)

        # Verify both paths work
        assert dense_result.shape == (seq_len, d_model), "Dense path should preserve sequence shape"
        assert attn_out3.shape == (seq_len, d_model), "Multi-attention path should preserve shape"
        assert not np.any(np.isnan(dense_result)), "Dense path should not produce NaN"
        assert not np.any(np.isnan(attn_out3)), "Multi-attention path should not produce NaN"

        # Verify they can be combined
        combined = dense_result + attn_out3
        assert combined.shape == (seq_len, d_model), "Combined paths should work"
        assert not np.any(np.isnan(combined)), "Combined result should not be NaN"

    def test_attention_scalability_pipeline(self):
        """Test attention pipeline with different sequence lengths and dimensions."""
        test_configs = [
            (4, 8),    # Small sequence, small dimension
            (16, 32),  # Medium sequence, medium dimension
            (32, 16),  # Long sequence, smaller dimension
            (8, 64),   # Short sequence, large dimension
        ]
        for seq_len, d_model in test_configs:
            # Create test data
            x = Tensor(np.random.randn(seq_len, d_model) * 0.1)

            # Attention pipeline
            attn = SelfAttention(d_model)
            attn_output, attn_weights = attn(x.data)

            # Dense post-processing
            post_process = Dense(input_size=d_model, output_size=min(d_model, 10))

            # Process each position
            processed_outputs = []
            for i in range(seq_len):
                pos_input = Tensor(attn_output[i:i+1])
                pos_output = post_process(pos_input)
                processed_outputs.append(pos_output.data)
            final_output = np.concatenate(processed_outputs, axis=0)

            # Verify scalability (report the failing config, not the whole list)
            assert attn_output.shape == (seq_len, d_model), f"Config ({seq_len}, {d_model}) attention output wrong"
            assert attn_weights.shape == (seq_len, seq_len), f"Config ({seq_len}, {d_model}) attention weights wrong"
            assert final_output.shape == (seq_len, min(d_model, 10)), f"Config ({seq_len}, {d_model}) final output wrong"
            assert not np.any(np.isnan(final_output)), f"Config ({seq_len}, {d_model}) should not produce NaN"


class TestAttentionRealWorldPipelines:
    """Test attention in realistic ML scenarios."""

    def test_sequence_classification_pipeline(self):
        """Test attention for sequence classification (like sentiment analysis)."""
        seq_len, d_model = 12, 24
        num_classes = 3

        # Input sequence (like sentence embeddings)
        sentence = Tensor(np.random.randn(seq_len, d_model) * 0.1)

        # Self-attention to capture dependencies
        attn = SelfAttention(d_model)
        attended_sequence, attn_weights = attn(sentence.data)

        # Global pooling (mean over sequence dimension)
        pooled_representation = np.mean(attended_sequence, axis=0, keepdims=True)

        # Classification head
        classifier = Sequential([
            Dense(input_size=d_model, output_size=d_model // 2),
            ReLU(),
            Dense(input_size=d_model // 2, output_size=num_classes)
        ])

        # Get classification scores
        pooled_tensor = Tensor(pooled_representation)
        class_scores = classifier(pooled_tensor)

        # Verify classification pipeline
        assert attended_sequence.shape == (seq_len, d_model), "Attention should preserve sequence shape"
        assert pooled_representation.shape == (1, d_model), "Pooling should create single representation"
        assert class_scores.shape == (1, num_classes), "Should output class scores"
        assert not np.any(np.isnan(class_scores.data)), "Classification should not produce NaN"

    def test_sequence_to_sequence_pipeline(self):
        """Test attention for sequence-to-sequence tasks (like translation)."""
        src_len, tgt_len, d_model = 10, 8, 20
        vocab_size = 30

        # Source sequence (input language)
        src_seq = Tensor(np.random.randn(src_len, d_model) * 0.1)

        # Target sequence (output language, teacher forcing)
        tgt_seq = Tensor(np.random.randn(tgt_len, d_model) * 0.1)

        # Encoder: process source sequence
        encoder = SelfAttention(d_model)
        encoded_src, src_attn = encoder(src_seq.data)

        # Decoder: process target with causal masking
        decoder = SelfAttention(d_model)
        causal_mask = create_causal_mask(tgt_len)
        decoded_tgt, tgt_attn = scaled_dot_product_attention(
            tgt_seq.data, tgt_seq.data, tgt_seq.data, causal_mask
        )

        # Cross-attention: target attends to source
        cross_attended, cross_attn = scaled_dot_product_attention(
            decoded_tgt,  # Target queries
            encoded_src,  # Source keys
            encoded_src   # Source values
        )

        # Output projection to vocabulary
        vocab_proj = Dense(input_size=d_model, output_size=vocab_size)

        # Generate output tokens
        output_tokens = []
        for pos in range(tgt_len):
            pos_input = Tensor(cross_attended[pos:pos+1])
            token_logits = vocab_proj(pos_input)
            output_tokens.append(token_logits.data)
        output_sequence = np.concatenate(output_tokens, axis=0)

        # Verify seq2seq pipeline
        assert encoded_src.shape == (src_len, d_model), "Encoder should preserve source shape"
        assert cross_attended.shape == (tgt_len, d_model), "Cross-attention should output target length"
        assert output_sequence.shape == (tgt_len, vocab_size), "Should generate vocab logits for each target position"

        # Verify attention patterns make sense
        assert src_attn.shape == (src_len, src_len), "Source attention should be src x src"
        assert tgt_attn.shape == (tgt_len, tgt_len), "Target attention should be tgt x tgt"
        assert cross_attn.shape == (tgt_len, src_len), "Cross-attention should be tgt x src"

        # Verify causal masking
        assert np.all(np.triu(tgt_attn, k=1) < 1e-6), "Target should have causal masking"


if __name__ == "__main__":
    pytest.main([__file__])


@@ -0,0 +1,242 @@
"""
Integration Tests - Tensor and Attention
Tests real integration between Tensor and Attention modules.
Uses actual TinyTorch components to verify they work together correctly.
"""
import pytest
import numpy as np
from test_utils import setup_integration_test

# Ensure proper setup before importing
setup_integration_test()

# Import ONLY from TinyTorch package
from tinytorch.core.tensor import Tensor
from tinytorch.core.attention import (
    scaled_dot_product_attention,
    SelfAttention,
    create_causal_mask,
    create_padding_mask,
    create_bidirectional_mask
)


class TestTensorAttentionIntegration:
    """Test real integration between Tensor and Attention modules."""

    def test_scaled_dot_product_attention_with_real_tensors(self):
        """Test scaled dot-product attention with real Tensor objects."""
        # Create Q, K, V as real Tensors
        seq_len, d_model = 4, 8
        np.random.seed(42)
        Q = Tensor(np.random.randn(seq_len, d_model) * 0.1)
        K = Tensor(np.random.randn(seq_len, d_model) * 0.1)
        V = Tensor(np.random.randn(seq_len, d_model) * 0.1)

        # Apply attention - should work with numpy arrays (tensor.data)
        output, weights = scaled_dot_product_attention(Q.data, K.data, V.data)

        # Verify output properties
        assert output.shape == (seq_len, d_model), f"Output shape should be {(seq_len, d_model)}, got {output.shape}"
        assert weights.shape == (seq_len, seq_len), f"Weights shape should be {(seq_len, seq_len)}, got {weights.shape}"
        assert np.allclose(np.sum(weights, axis=-1), 1.0), "Attention weights should sum to 1"
        assert np.all(weights >= 0), "All attention weights should be non-negative"

    def test_self_attention_with_real_tensors(self):
        """Test SelfAttention wrapper with real Tensor objects."""
        d_model = 16
        seq_len = 6

        # Create self-attention
        self_attn = SelfAttention(d_model)

        # Create input tensor
        x = Tensor(np.random.randn(seq_len, d_model) * 0.1)

        # Apply self-attention - should work with numpy arrays
        output, weights = self_attn(x.data)

        # Verify integration
        assert output.shape == x.data.shape, f"Output shape should match input {x.data.shape}, got {output.shape}"
        assert weights.shape == (seq_len, seq_len), f"Weights should be square {(seq_len, seq_len)}, got {weights.shape}"
        assert np.allclose(np.sum(weights, axis=-1), 1.0), "Self-attention weights should sum to 1"

    def test_attention_with_masking_and_tensors(self):
        """Test attention with masking using real Tensor inputs."""
        seq_len = 5
        d_model = 12

        # Create tensors
        Q = Tensor(np.random.randn(seq_len, d_model) * 0.1)
        K = Tensor(np.random.randn(seq_len, d_model) * 0.1)
        V = Tensor(np.random.randn(seq_len, d_model) * 0.1)

        # Test causal masking
        causal_mask = create_causal_mask(seq_len)
        output_causal, weights_causal = scaled_dot_product_attention(
            Q.data, K.data, V.data, causal_mask
        )

        # Verify causal masking worked
        assert np.all(np.triu(weights_causal, k=1) < 1e-6), "Causal mask should zero upper triangle"
        assert output_causal.shape == (seq_len, d_model), "Causal attention should preserve shape"

        # Test padding masking (padding_mask[0] belongs to the full-length sequence,
        # so this exercises the mask code path without hiding any position)
        lengths = [seq_len, seq_len-2]
        padding_mask = create_padding_mask(lengths, seq_len)
        output_padded, weights_padded = scaled_dot_product_attention(
            Q.data, K.data, V.data, padding_mask[0]
        )

        # Verify attention still behaves correctly with a padding mask applied
        assert output_padded.shape == (seq_len, d_model), "Padding attention should preserve shape"
        assert np.allclose(np.sum(weights_padded, axis=-1), 1.0), "Padded weights should still sum to 1"

    def test_attention_batched_tensors(self):
        """Test attention with batched Tensor inputs."""
        batch_size, seq_len, d_model = 3, 4, 8

        # Create batched tensors
        Q_batch = Tensor(np.random.randn(batch_size, seq_len, d_model) * 0.1)
        K_batch = Tensor(np.random.randn(batch_size, seq_len, d_model) * 0.1)
        V_batch = Tensor(np.random.randn(batch_size, seq_len, d_model) * 0.1)

        # Apply attention to batched data
        output, weights = scaled_dot_product_attention(Q_batch.data, K_batch.data, V_batch.data)

        # Verify batched processing
        assert output.shape == (batch_size, seq_len, d_model), "Batched output should preserve batch dimension"
        assert weights.shape == (batch_size, seq_len, seq_len), "Batched weights should preserve batch dimension"

        # Verify each batch item has proper attention weights
        for b in range(batch_size):
            assert np.allclose(np.sum(weights[b], axis=-1), 1.0), f"Batch {b} weights should sum to 1"

    def test_attention_with_different_tensor_dtypes(self):
        """Test attention works with different Tensor data types."""
        seq_len, d_model = 3, 6

        # Test with float32
        Q_f32 = Tensor(np.random.randn(seq_len, d_model).astype(np.float32))
        K_f32 = Tensor(np.random.randn(seq_len, d_model).astype(np.float32))
        V_f32 = Tensor(np.random.randn(seq_len, d_model).astype(np.float32))
        output_f32, weights_f32 = scaled_dot_product_attention(Q_f32.data, K_f32.data, V_f32.data)
        assert output_f32.dtype == np.float32, "Output should preserve float32 dtype"
        assert weights_f32.dtype == np.float32, "Weights should preserve float32 dtype"

        # Test with float64
        Q_f64 = Tensor(np.random.randn(seq_len, d_model).astype(np.float64))
        K_f64 = Tensor(np.random.randn(seq_len, d_model).astype(np.float64))
        V_f64 = Tensor(np.random.randn(seq_len, d_model).astype(np.float64))
        output_f64, weights_f64 = scaled_dot_product_attention(Q_f64.data, K_f64.data, V_f64.data)
        assert output_f64.dtype == np.float64, "Output should preserve float64 dtype"
        assert weights_f64.dtype == np.float64, "Weights should preserve float64 dtype"

    def test_attention_numerical_stability(self):
        """Test attention numerical stability with real Tensor inputs."""
        seq_len, d_model = 4, 8

        # Create tensors with extreme values
        Q_extreme = Tensor(np.random.randn(seq_len, d_model) * 10)  # Large values
        K_extreme = Tensor(np.random.randn(seq_len, d_model) * 10)
        V_extreme = Tensor(np.random.randn(seq_len, d_model) * 10)

        # Apply attention
        output, weights = scaled_dot_product_attention(Q_extreme.data, K_extreme.data, V_extreme.data)

        # Verify numerical stability
        assert not np.any(np.isnan(output)), "Attention output should not contain NaN"
        assert not np.any(np.isinf(output)), "Attention output should not contain Inf"
        assert not np.any(np.isnan(weights)), "Attention weights should not contain NaN"
        assert not np.any(np.isinf(weights)), "Attention weights should not contain Inf"
        assert np.allclose(np.sum(weights, axis=-1), 1.0), "Weights should sum to 1 even with extreme inputs"

    def test_attention_gradient_flow_compatibility(self):
        """Test attention compatibility with gradient computation requirements."""
        seq_len, d_model = 3, 4

        # Create small tensors for gradient testing
        Q = Tensor(np.random.randn(seq_len, d_model) * 0.1)
        K = Tensor(np.random.randn(seq_len, d_model) * 0.1)
        V = Tensor(np.random.randn(seq_len, d_model) * 0.1)

        # Apply attention
        output, weights = scaled_dot_product_attention(Q.data, K.data, V.data)

        # Verify output is suitable for gradient computation
        assert output.dtype in [np.float32, np.float64], "Output should be floating point for gradients"
        assert weights.dtype in [np.float32, np.float64], "Weights should be floating point for gradients"

        # Verify no degenerate cases that would break gradients
        assert np.all(np.abs(output) < 1e6), "Output values should be reasonable for gradient computation"
        assert np.all(weights > 1e-10), "No attention weights should be exactly zero (for gradient flow)"


class TestAttentionMaskingIntegration:
    """Test attention masking utilities with Tensor integration."""

    def test_causal_mask_integration(self):
        """Test causal mask creation and usage."""
        seq_len = 6

        # Create causal mask
        mask = create_causal_mask(seq_len)

        # Verify mask properties
        assert mask.shape == (seq_len, seq_len), f"Causal mask should be {(seq_len, seq_len)}"
        assert mask.dtype in [np.float32, np.float64, np.int32, np.int64], "Mask should be numeric"
        assert np.allclose(mask, np.tril(mask)), "Causal mask should be lower triangular"

        # Test with actual attention
        Q = Tensor(np.random.randn(seq_len, 4) * 0.1)
        output, weights = scaled_dot_product_attention(Q.data, Q.data, Q.data, mask)
        assert np.all(np.triu(weights, k=1) < 1e-6), "Causal mask should zero future positions"

    def test_padding_mask_integration(self):
        """Test padding mask creation and usage."""
        lengths = [4, 2, 3]
        max_length = 4

        # Create padding mask
        mask = create_padding_mask(lengths, max_length)

        # Verify mask properties
        assert mask.shape == (len(lengths), max_length, max_length), "Padding mask should have batch dimension"
        assert mask.dtype in [np.float32, np.float64, np.int32, np.int64], "Mask should be numeric"

        # Test with actual attention (first sequence)
        Q = Tensor(np.random.randn(max_length, 4) * 0.1)
        output, weights = scaled_dot_product_attention(Q.data, Q.data, Q.data, mask[0])

        # Verify masking worked for first sequence (length=4, should be all ones, no masking effect)
        assert np.all(weights >= 0), "All visible positions should have non-negative attention"

    def test_bidirectional_mask_integration(self):
        """Test bidirectional mask creation and usage."""
        seq_len = 5

        # Create bidirectional mask
        mask = create_bidirectional_mask(seq_len)

        # Verify mask properties
        assert mask.shape == (seq_len, seq_len), f"Bidirectional mask should be {(seq_len, seq_len)}"
        assert np.all(mask == 1), "Bidirectional mask should be all ones"

        # Test with actual attention
        Q = Tensor(np.random.randn(seq_len, 4) * 0.1)
        output, weights = scaled_dot_product_attention(Q.data, Q.data, Q.data, mask)

        # Verify all positions can attend to all positions
        assert np.all(weights > 0), "Bidirectional attention should allow all position interactions"


if __name__ == "__main__":
    pytest.main([__file__])