Module 10 (Tokenization) Integration Test Audit
- Date: 2025-11-25
- Auditor: QA Agent
- Status: CRITICAL ISSUES FOUND - test file contains the wrong module's content
Executive Summary
CRITICAL FINDING: The integration test file /tests/10_tokenization/test_progressive_integration.py contains WRONG MODULE CONTENT - it tests Module 11 (Training) instead of Module 10 (Tokenization).
- Current Coverage: 0% - no tokenization integration tests exist
- Missing Tests: 100% - all critical integration points untested
- Priority: HIGH - Module 10 has no integration validation
Current Test File Analysis
Problem: Wrong Module Tests
The file test_progressive_integration.py contains:
- ❌ Lines 3-6: Reference the wrong dependency chain (they mention "11_training")
- ❌ Classes: TestModule11TrainingCore, TestAdvancedTrainingFeatures
- ❌ Tests: training loops, loss functions, optimizers, CNN pipelines
- ❌ Imports: training.Trainer, training.CrossEntropyLoss, etc.
Root Cause: Copy-paste error from Module 11 template
Module 10 Actual Implementation
What Module 10 Provides
Location: tinytorch.text.tokenization
Classes Implemented:
- Tokenizer - Base class with encode/decode interface
- CharTokenizer - Character-level tokenization
- BPETokenizer - Byte Pair Encoding tokenizer
Key Methods:
- CharTokenizer.build_vocab(corpus) - Build vocabulary from text
- CharTokenizer.encode(text) - Text → token IDs (List[int])
- CharTokenizer.decode(tokens) - Token IDs → text
- BPETokenizer.train(corpus, vocab_size) - Learn BPE merges
- BPETokenizer.encode(text) - BPE encoding
- BPETokenizer.decode(tokens) - BPE decoding
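For orientation, a minimal usage sketch of the API listed above. It assumes vocab_size is a constructor argument of BPETokenizer (as the tests later in this audit use it) and that the method names match this listing; the actual module may differ slightly.

```python
from tinytorch.text.tokenization import BPETokenizer, CharTokenizer

# Character-level tokenization
char_tok = CharTokenizer()
char_tok.build_vocab(["hello world"])          # build vocabulary from a tiny corpus
ids = char_tok.encode("hello")                 # text -> List[int]
print(ids, char_tok.decode(ids))               # roundtrip should preserve the characters

# Byte Pair Encoding
bpe_tok = BPETokenizer(vocab_size=100)         # assumed constructor signature
bpe_tok.train(["hello world", "hello there"])  # learn merges from the corpus
print(bpe_tok.encode("hello world"))           # IDs depend on the learned merges
```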
Integration Points with Other Modules:
- Module 01 (Tensor): Can convert token IDs to Tensor (optional; see the sketch after this list)
- Module 11 (Embeddings): Token IDs feed into embedding layers
- Module 08 (DataLoader): Tokenizers process text datasets
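A short sketch of the first two integration points, assuming Tensor accepts a NumPy array (as its .data attribute suggests); the explicit int64 cast is a defensive choice, since Tensor's dtype inference is not specified here.

```python
import numpy as np

from tinytorch.core.tensor import Tensor
from tinytorch.text.tokenization import CharTokenizer

tokenizer = CharTokenizer()
tokenizer.build_vocab(["hello world"])

token_ids = tokenizer.encode("hello world")
# Cast explicitly to int64 so downstream embedding lookups receive integer indices
token_tensor = Tensor(np.array(token_ids, dtype=np.int64))
print(token_tensor.shape)  # (len(token_ids),) - a 1D sequence of token IDs
```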
Critical Integration Tests MISSING
Priority 1: Data Type Correctness (Bug-Catching Priority)
Missing Test: Tokenizers produce correct tensor dtypes
```python
import numpy as np

from tinytorch.core.tensor import Tensor
from tinytorch.text.tokenization import CharTokenizer


def test_tokenizer_produces_int64_tensors():
    """Verify tokenizers produce int64 token IDs for embedding layers."""
    # WHY CRITICAL: Embeddings expect int64 indices, not float32
    # BUG SCENARIO: If tokenizer returns floats, embedding lookup crashes
    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello world"])

    # Encode text
    token_ids = tokenizer.encode("hello")

    # CRITICAL: Must be integers, not floats
    assert all(isinstance(t, (int, np.integer)) for t in token_ids), \
        "Token IDs must be integers for embedding lookup"

    # If converting to Tensor, must be int64
    token_tensor = Tensor(token_ids)
    assert token_tensor.data.dtype == np.int64, \
        f"Expected int64 for embeddings, got {token_tensor.data.dtype}"
```
Bug This Catches: Type mismatch between tokenizer output and embedding input
Priority 2: Embedding Layer Integration (Module 11 Dependency)
Missing Test: Token sequences work with embeddings
```python
import numpy as np
import pytest

from tinytorch.core.tensor import Tensor


def test_tokenization_to_embedding_pipeline():
    """Test complete tokenization → embedding pipeline."""
    # WHY CRITICAL: This is the PRIMARY use case for tokenizers
    try:
        from tinytorch.text.embeddings import Embedding
        from tinytorch.text.tokenization import CharTokenizer

        # Build tokenizer
        tokenizer = CharTokenizer()
        corpus = ["hello", "world", "test"]
        tokenizer.build_vocab(corpus)
        vocab_size = len(tokenizer.vocab)
        embed_dim = 16

        # Create embedding layer
        embedding = Embedding(vocab_size, embed_dim)

        # Tokenize text
        text = "hello world"
        token_ids = tokenizer.encode(text)

        # CRITICAL: Shape compatibility
        token_tensor = Tensor(token_ids)
        assert token_tensor.shape == (len(token_ids),), \
            "Token IDs should be a 1D sequence"

        # Embedding lookup should work
        embedded = embedding(token_tensor)
        assert embedded.shape == (len(token_ids), embed_dim), \
            f"Expected shape ({len(token_ids)}, {embed_dim}), got {embedded.shape}"

        # Values should be actual embeddings, not zeros
        assert not np.allclose(embedded.data, 0), \
            "Embeddings should be non-zero (initialized randomly)"
    except ImportError:
        pytest.skip("Embeddings module not yet implemented")
```
Bug This Catches: Shape mismatches, dtype errors, index out-of-bounds
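One additional assertion worth adding alongside the pipeline test above (a suggested helper, not part of the original test): bound-check the token IDs against the vocabulary size, so an out-of-range index fails with a clear message rather than a cryptic embedding lookup error.

```python
def assert_ids_in_vocab(token_ids, vocab_size):
    """Suggested helper: every token ID must be a valid row index into the embedding table."""
    assert all(0 <= t < vocab_size for t in token_ids), \
        f"Token IDs out of range for vocab_size={vocab_size}: {token_ids}"
```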
Priority 3: BPE Edge Cases (Robustness)
Missing Test: BPE tokenizer handles edge cases
```python
from tinytorch.text.tokenization import BPETokenizer


def test_bpe_edge_cases():
    """Test BPE tokenizer robustness with edge cases."""
    tokenizer = BPETokenizer(vocab_size=100)

    # Edge Case 1: Empty string
    token_ids = tokenizer.encode("")
    assert token_ids == [], "Empty string should produce empty token list"
    decoded = tokenizer.decode([])
    assert decoded == "", "Empty tokens should decode to empty string"

    # Edge Case 2: Single character
    tokenizer.train(["a", "b", "c"])
    token_ids = tokenizer.encode("a")
    assert len(token_ids) > 0, "Single char should tokenize"
    assert tokenizer.decode(token_ids).strip() == "a", "Should roundtrip"

    # Edge Case 3: Unknown characters (after training on limited corpus)
    tokenizer.train(["hello", "world"])
    token_ids = tokenizer.encode("xyz")  # Characters not in training
    # Should handle gracefully with <UNK> token
    assert 0 in token_ids or tokenizer.token_to_id.get('<UNK>') in token_ids, \
        "Unknown characters should map to <UNK> token"

    # Edge Case 4: Very long text
    long_text = "hello " * 1000
    token_ids = tokenizer.encode(long_text)
    assert len(token_ids) > 0, "Long text should tokenize"
    assert all(isinstance(t, int) for t in token_ids), \
        "All tokens should be integers"

    # Edge Case 5: Special characters
    special_text = "hello, world! @#$%"
    token_ids = tokenizer.encode(special_text)
    decoded = tokenizer.decode(token_ids)
    # Should preserve word content even if punctuation changes
    assert "hello" in decoded or "world" in decoded, \
        "Should preserve core words"
```
Bug This Catches: Crashes on empty input, unknown character handling, memory issues
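For context, one plausible way encode() could provide the <UNK> fallback the edge-case test expects, shown here at the single-symbol level (a hedged sketch of the idea, not necessarily how the module implements it for BPE merges):

```python
def encode_with_unk(text, token_to_id, unk_token="<UNK>"):
    """Map each character to its ID, falling back to the <UNK> ID for unseen characters."""
    unk_id = token_to_id[unk_token]
    return [token_to_id.get(ch, unk_id) for ch in text]
```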
Priority 4: Vocabulary Consistency
Missing Test: Vocabulary consistency across encode/decode
```python
from tinytorch.text.tokenization import BPETokenizer, CharTokenizer


def test_vocabulary_encode_decode_consistency():
    """Verify vocabulary mappings are bidirectional and consistent."""
    # Test CharTokenizer
    char_tokenizer = CharTokenizer()
    corpus = ["abc", "def", "xyz"]
    char_tokenizer.build_vocab(corpus)

    # Check bidirectional mappings
    for token, token_id in char_tokenizer.token_to_id.items():
        assert char_tokenizer.id_to_token[token_id] == token, \
            f"Bidirectional mapping broken: {token} -> {token_id} -> {char_tokenizer.id_to_token[token_id]}"

    # Test roundtrip for all corpus text
    for text in corpus:
        token_ids = char_tokenizer.encode(text)
        decoded = char_tokenizer.decode(token_ids)
        # Should preserve characters (may have different spacing)
        for char in text:
            assert char in decoded, f"Lost character '{char}' in roundtrip"

    # Test BPETokenizer
    bpe_tokenizer = BPETokenizer(vocab_size=50)
    bpe_tokenizer.train(["hello world", "test data"])

    # Vocabulary should contain special tokens
    assert '<UNK>' in bpe_tokenizer.vocab, "BPE should have <UNK> token"
    assert bpe_tokenizer.token_to_id['<UNK>'] == 0, "<UNK> should be ID 0"

    # Test roundtrip
    text = "hello world"
    token_ids = bpe_tokenizer.encode(text)
    decoded = bpe_tokenizer.decode(token_ids)

    # Word content should be preserved (BPE may merge/split differently)
    for word in text.split():
        assert word in decoded, f"Lost word '{word}' in BPE roundtrip"
```
Bug This Catches: Vocabulary corruption, ID collisions, decode inconsistency
Priority 5: Batch Processing
Missing Test: Tokenizer handles batches correctly
```python
from tinytorch.text.tokenization import CharTokenizer


def test_tokenizer_batch_processing():
    """Test tokenizer works with batched text data."""
    tokenizer = CharTokenizer()
    corpus = ["hello", "world", "test", "data"]
    tokenizer.build_vocab(corpus)

    # Batch of texts
    texts = ["hello world", "test data", "new text"]

    # Encode batch
    batch_token_ids = [tokenizer.encode(text) for text in texts]

    # Check all are lists of ints
    for token_ids in batch_token_ids:
        assert isinstance(token_ids, list), "Each should be a list"
        assert all(isinstance(t, int) for t in token_ids), \
            "All tokens should be integers"

    # Check different texts produce different token sequences
    assert batch_token_ids[0] != batch_token_ids[1], \
        "Different texts should produce different token sequences"

    # Decode batch
    decoded_texts = [tokenizer.decode(token_ids) for token_ids in batch_token_ids]

    # Should preserve core content
    for original, decoded in zip(texts, decoded_texts):
        # May have spacing differences, but core words should match
        original_words = set(original.split())
        decoded_words = set(decoded.split())
        # At least some words should match
        assert len(original_words & decoded_words) > 0, \
            f"Lost all words in roundtrip: {original} -> {decoded}"
```
Bug This Catches: Batch size errors, state pollution between encodes
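A related check that directly targets the state-pollution risk named above (a suggested extra check, beyond the nine tests counted in this audit): encoding the same text twice must return identical IDs.

```python
def test_encode_is_deterministic():
    """Suggested check: repeated encode() calls must not depend on hidden tokenizer state."""
    from tinytorch.text.tokenization import CharTokenizer

    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello world"])

    first = tokenizer.encode("hello world")
    second = tokenizer.encode("hello world")
    assert first == second, "encode() results changed between identical calls"
```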
Priority 6: Memory and Performance
Missing Test: Tokenization memory usage and throughput
```python
import time

from tinytorch.text.tokenization import BPETokenizer, CharTokenizer


def test_tokenization_performance():
    """Test tokenization memory and throughput characteristics."""
    # Build tokenizers
    char_tokenizer = CharTokenizer()
    bpe_tokenizer = BPETokenizer(vocab_size=1000)

    # Training corpus
    corpus = ["hello world"] * 100
    char_tokenizer.build_vocab(corpus)
    bpe_tokenizer.train(corpus)

    # Test text (simulate real document)
    test_text = "hello world test data " * 100  # ~2,300 chars

    # Measure CharTokenizer throughput
    start = time.time()
    iterations = 1000
    for _ in range(iterations):
        token_ids = char_tokenizer.encode(test_text)
    char_time = time.time() - start
    char_throughput = (len(test_text) * iterations) / char_time
    print(f"CharTokenizer: {char_throughput:.0f} chars/sec")
    assert char_throughput > 10000, \
        f"CharTokenizer too slow: {char_throughput:.0f} chars/sec (expected >10K)"

    # Measure BPE throughput
    start = time.time()
    for _ in range(iterations):
        token_ids = bpe_tokenizer.encode(test_text)
    bpe_time = time.time() - start
    bpe_throughput = (len(test_text) * iterations) / bpe_time
    print(f"BPETokenizer: {bpe_throughput:.0f} chars/sec")
    # BPE should be slower (more complex), but still reasonable
    assert bpe_throughput > 1000, \
        f"BPETokenizer too slow: {bpe_throughput:.0f} chars/sec (expected >1K)"

    # Vocabulary size check
    assert len(char_tokenizer.vocab) < 500, \
        f"CharTokenizer vocab too large: {len(char_tokenizer.vocab)} (expected <500)"
    assert len(bpe_tokenizer.vocab) <= 1000, \
        f"BPETokenizer vocab exceeded limit: {len(bpe_tokenizer.vocab)}"
```
Bug This Catches: Performance regressions, memory leaks, vocabulary explosion
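The recommended test file structure below also lists a test_memory_usage stub with no body. One hedged way to flesh it out (sys.getsizeof only approximates a dict's footprint, so the threshold is illustrative and should be tuned once real corpora are used):

```python
import sys

from tinytorch.text.tokenization import BPETokenizer


def test_vocabulary_memory_usage():
    """Suggested check: the learned vocabulary mapping should stay small for a tiny corpus."""
    tokenizer = BPETokenizer(vocab_size=1000)
    tokenizer.train(["hello world test data"] * 100)

    # Rough upper bound on the token_to_id mapping itself
    vocab_bytes = sys.getsizeof(tokenizer.token_to_id)
    assert vocab_bytes < 1_000_000, \
        f"Vocabulary mapping unexpectedly large: {vocab_bytes} bytes"
```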
Priority 7: DataLoader Integration
Missing Test: Tokenizer integration with DataLoader
```python
import pytest

from tinytorch.core.tensor import Tensor


def test_tokenizer_dataloader_integration():
    """Test tokenizer works in DataLoader pipeline."""
    try:
        from tinytorch.core.data import Dataset, DataLoader
        from tinytorch.text.tokenization import CharTokenizer

        # Custom dataset with tokenization
        class TextDataset(Dataset):
            def __init__(self, texts, tokenizer):
                self.texts = texts
                self.tokenizer = tokenizer

            def __len__(self):
                return len(self.texts)

            def __getitem__(self, idx):
                text = self.texts[idx]
                token_ids = self.tokenizer.encode(text)
                # Return as tensor
                return Tensor(token_ids)

        # Build tokenizer
        tokenizer = CharTokenizer()
        texts = ["hello world", "test data", "sample text"]
        tokenizer.build_vocab(texts)

        # Create dataset and dataloader
        dataset = TextDataset(texts, tokenizer)
        dataloader = DataLoader(dataset, batch_size=2, shuffle=False)

        # Iterate batches
        batch_count = 0
        for batch in dataloader:
            batch_count += 1
            # Batch should be a tensor or a list of tensors
            if isinstance(batch, (list, tuple)):
                assert len(batch) <= 2, "Batch size should be 2"
                for item in batch:
                    assert hasattr(item, 'data') or isinstance(item, Tensor), \
                        "Items should be Tensors"
            else:
                # Single batch tensor
                assert hasattr(batch, 'data'), "Batch should be Tensor"

        assert batch_count > 0, "DataLoader should produce batches"
    except ImportError:
        pytest.skip("DataLoader not yet implemented")
```
Bug This Catches: DataLoader compatibility issues, batching errors
Regression Prevention Tests MISSING
Test: Prior Stack Still Works
Missing Test: Verify Modules 01-09 unchanged
```python
def test_no_prior_module_regression():
    """Ensure tokenization doesn't break prior modules."""
    # Module 01 (Tensor) should still work
    from tinytorch.core.tensor import Tensor
    x = Tensor([1, 2, 3])
    assert x.shape == (3,), "Tensor creation broken"

    # Module 02 (Activations) should still work
    try:
        from tinytorch.core.activations import ReLU
        relu = ReLU()
        y = relu(x)
        assert y.shape == x.shape, "Activation broken"
    except ImportError:
        pass  # Not implemented yet

    # Module 08 (DataLoader) should still work
    try:
        from tinytorch.core.data import Dataset, DataLoader

        class DummyDataset(Dataset):
            def __len__(self):
                return 5

            def __getitem__(self, idx):
                return idx

        dataset = DummyDataset()
        loader = DataLoader(dataset, batch_size=2)
        assert len(dataset) == 5, "Dataset broken"
    except ImportError:
        pass
```
Recommended Test File Structure
"""
Module 10: Progressive Integration Tests
Tests that Module 10 (Tokenization) works correctly AND integrates with prior modules.
DEPENDENCY CHAIN: 01_tensor → ... → 08_dataloader → 10_tokenization → 11_embeddings
This is where we enable text processing for NLP.
"""
class TestPriorStackStillWorking:
"""Quick regression checks that prior modules (01-09) still work."""
def test_tensor_operations_stable(self):
"""Verify Module 01 (Tensor) still works."""
def test_dataloader_stable(self):
"""Verify Module 08 (DataLoader) still works."""
class TestModule10TokenizationCore:
"""Test Module 10 (Tokenization) core functionality."""
def test_char_tokenizer_creation(self):
"""Test CharTokenizer initialization and vocab building."""
def test_char_tokenizer_encode_decode(self):
"""Test CharTokenizer encode/decode roundtrip."""
def test_bpe_tokenizer_training(self):
"""Test BPE tokenizer training on corpus."""
def test_bpe_tokenizer_encode_decode(self):
"""Test BPE encode/decode roundtrip."""
class TestTokenizationIntegration:
"""Test tokenization integration with other modules."""
def test_tokenizer_produces_correct_dtypes(self):
"""PRIORITY 1: Verify int64 output for embeddings."""
def test_tokenization_to_embedding_pipeline(self):
"""PRIORITY 2: Test complete tokenization → embedding flow."""
def test_tokenizer_dataloader_integration(self):
"""Test tokenizer in DataLoader pipeline."""
class TestTokenizationEdgeCases:
"""Test tokenization robustness with edge cases."""
def test_bpe_edge_cases(self):
"""PRIORITY 3: Empty strings, unknown tokens, special chars."""
def test_vocabulary_consistency(self):
"""PRIORITY 4: Bidirectional mappings, roundtrip integrity."""
def test_batch_processing(self):
"""PRIORITY 5: Batch encoding/decoding correctness."""
class TestTokenizationPerformance:
"""Test tokenization performance characteristics."""
def test_tokenization_throughput(self):
"""PRIORITY 6: Measure chars/sec, vocab size."""
def test_memory_usage(self):
"""Verify vocabulary doesn't consume excessive memory."""
class TestRegressionPrevention:
"""Ensure previous modules still work after Module 10."""
def test_no_tensor_regression(self):
"""Verify Module 01 (Tensor) unchanged."""
def test_no_dataloader_regression(self):
"""Verify Module 08 (DataLoader) unchanged."""
Summary Statistics
| Category | Missing Tests | Priority | Impact |
|---|---|---|---|
| Data Type Correctness | 1 | CRITICAL | Breaks embeddings |
| Embedding Integration | 1 | CRITICAL | Core use case |
| BPE Edge Cases | 1 | HIGH | Production robustness |
| Vocabulary Consistency | 1 | HIGH | Data integrity |
| Batch Processing | 1 | MEDIUM | Real-world usage |
| Performance | 1 | MEDIUM | Production viability |
| DataLoader Integration | 1 | MEDIUM | Pipeline integrity |
| Regression Prevention | 2 | HIGH | Stack stability |
- Total Missing Tests: 9 critical integration tests
- Current Test Coverage: 0% (wrong module)
- Recommended Action: REPLACE the entire test file
Recommended Action Plan
Phase 1: Immediate (Critical Fixes)
- REPLACE test_progressive_integration.py with correct Module 10 tests
- Implement Priority 1-2 tests (dtype correctness, embedding integration)
- Add BPE edge case tests (Priority 3)
Phase 2: Short-term (Robustness)
- Add vocabulary consistency tests (Priority 4)
- Add batch processing tests (Priority 5)
- Add regression prevention tests
Phase 3: Performance Validation
- Add performance benchmarks (Priority 6)
- Add DataLoader integration (Priority 7)
Bug-Catching Priorities (Ranked)
- Data Type Mismatch (CRITICAL): int vs float breaks embedding lookup
- Embedding Integration (CRITICAL): Core use case must work
- Unknown Token Handling (HIGH): Crashes on unseen characters
- Vocabulary Corruption (HIGH): Encode/decode inconsistency
- Empty Input Crashes (MEDIUM): Edge case handling
- Batch State Pollution (MEDIUM): Tokenizer state leaks between calls
- Performance Regression (LOW): Slow tokenization impacts pipelines
- Audit Completed: 2025-11-25
- Next Review: After test file replacement
- Sign-off: QA Agent - Integration Testing Team