
Module 10 (Tokenization) Integration Test Audit

Date: 2025-11-25
Auditor: QA Agent
Status: CRITICAL ISSUES FOUND - Test file contains completely wrong content


Executive Summary

CRITICAL FINDING: The integration test file /tests/10_tokenization/test_progressive_integration.py contains WRONG MODULE CONTENT - it tests Module 11 (Training) instead of Module 10 (Tokenization).

Current Coverage: 0% - No tokenization integration tests exist
Missing Tests: 100% - All critical integration points untested
Priority: HIGH - Module 10 has no integration validation


Current Test File Analysis

Problem: Wrong Module Tests

The file test_progressive_integration.py contains:

  • Lines 3-6: Reference the wrong dependency chain (they mention "11_training")
  • Classes: TestModule11TrainingCore, TestAdvancedTrainingFeatures
  • Tests: training loops, loss functions, optimizers, CNN pipelines
  • Imports: training.Trainer, training.CrossEntropyLoss, etc.

Root Cause: Copy-paste error from Module 11 template


Module 10 Actual Implementation

What Module 10 Provides

Location: tinytorch.text.tokenization

Classes Implemented:

  1. Tokenizer - Base class with encode/decode interface
  2. CharTokenizer - Character-level tokenization
  3. BPETokenizer - Byte Pair Encoding tokenizer

Key Methods:

  • CharTokenizer.build_vocab(corpus) - Build vocabulary from text
  • CharTokenizer.encode(text) - Text → token IDs (List[int])
  • CharTokenizer.decode(tokens) - Token IDs → text
  • BPETokenizer.train(corpus, vocab_size) - Learn BPE merges
  • BPETokenizer.encode(text) - BPE encoding
  • BPETokenizer.decode(tokens) - BPE decoding
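
For orientation, a minimal usage sketch of this interface (assuming the classes and method names listed above; exact signature details, e.g. whether vocab_size is passed to the BPETokenizer constructor or to train(), may differ in the actual module):

from tinytorch.text.tokenization import CharTokenizer, BPETokenizer

# Character-level: build a vocabulary from a corpus, then encode/decode
char_tok = CharTokenizer()
char_tok.build_vocab(["hello world"])
ids = char_tok.encode("hello")      # text -> List[int]
text = char_tok.decode(ids)         # token IDs -> text (roundtrip may normalize spacing)

# BPE: learn merges from a corpus, then encode/decode
bpe_tok = BPETokenizer(vocab_size=100)
bpe_tok.train(["hello world", "hello there"])
bpe_ids = bpe_tok.encode("hello there")
bpe_text = bpe_tok.decode(bpe_ids)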

Integration Points with Other Modules:

  • Module 01 (Tensor): Can convert token IDs to Tensor (optional)
  • Module 11 (Embeddings): Token IDs feed into embedding layers
  • Module 08 (DataLoader): Tokenizers process text datasets

Critical Integration Tests MISSING

Priority 1: Data Type Correctness (Bug-Catching Priority)

Missing Test: Tokenizers produce correct tensor dtypes

import numpy as np
from tinytorch.core.tensor import Tensor
from tinytorch.text.tokenization import CharTokenizer

def test_tokenizer_produces_int64_tensors():
    """Verify tokenizers produce int64 token IDs for embedding layers."""
    # WHY CRITICAL: Embeddings expect int64 indices, not float32
    # BUG SCENARIO: If tokenizer returns float, embedding lookup crashes

    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello world"])

    # Encode text
    token_ids = tokenizer.encode("hello")

    # CRITICAL: Must be integers, not floats
    assert all(isinstance(t, (int, np.integer)) for t in token_ids), \
        "Token IDs must be integers for embedding lookup"

    # If converting to Tensor, must be int64
    token_tensor = Tensor(token_ids)
    assert token_tensor.data.dtype == np.int64, \
        f"Expected int64 for embeddings, got {token_tensor.data.dtype}"

Bug This Catches: Type mismatch between tokenizer output and embedding input


Priority 2: Embedding Layer Integration (Module 11 Dependency)

Missing Test: Token sequences work with embeddings

import numpy as np
import pytest
from tinytorch.core.tensor import Tensor

def test_tokenization_to_embedding_pipeline():
    """Test complete tokenization → embedding pipeline."""
    # WHY CRITICAL: This is the PRIMARY use case for tokenizers

    try:
        from tinytorch.text.embeddings import Embedding
        from tinytorch.text.tokenization import CharTokenizer

        # Build tokenizer
        tokenizer = CharTokenizer()
        corpus = ["hello", "world", "test"]
        tokenizer.build_vocab(corpus)

        vocab_size = len(tokenizer.vocab)
        embed_dim = 16

        # Create embedding layer
        embedding = Embedding(vocab_size, embed_dim)

        # Tokenize text
        text = "hello world"
        token_ids = tokenizer.encode(text)

        # CRITICAL: Shape compatibility
        token_tensor = Tensor(token_ids)
        assert token_tensor.shape == (len(token_ids),), \
            "Token IDs should be 1D sequence"

        # Embedding lookup should work
        embedded = embedding(token_tensor)
        assert embedded.shape == (len(token_ids), embed_dim), \
            f"Expected shape ({len(token_ids)}, {embed_dim}), got {embedded.shape}"

        # Values should be actual embeddings, not zeros
        assert not np.allclose(embedded.data, 0), \
            "Embeddings should be non-zero (initialized randomly)"

    except ImportError:
        pytest.skip("Embeddings module not yet implemented")

Bug This Catches: Shape mismatches, dtype errors, index out-of-bounds


Priority 3: BPE Edge Cases (Robustness)

Missing Test: BPE tokenizer handles edge cases

from tinytorch.text.tokenization import BPETokenizer

def test_bpe_edge_cases():
    """Test BPE tokenizer robustness with edge cases."""
    tokenizer = BPETokenizer(vocab_size=100)

    # Edge Case 1: Empty string
    token_ids = tokenizer.encode("")
    assert token_ids == [], "Empty string should produce empty token list"

    decoded = tokenizer.decode([])
    assert decoded == "", "Empty tokens should decode to empty string"

    # Edge Case 2: Single character
    tokenizer.train(["a", "b", "c"])
    token_ids = tokenizer.encode("a")
    assert len(token_ids) > 0, "Single char should tokenize"
    assert tokenizer.decode(token_ids).strip() == "a", "Should roundtrip"

    # Edge Case 3: Unknown characters (after training on limited corpus)
    tokenizer.train(["hello", "world"])
    token_ids = tokenizer.encode("xyz")  # Characters not in training

    # Should handle gracefully with <UNK> token
    assert 0 in token_ids or tokenizer.token_to_id.get('<UNK>') in token_ids, \
        "Unknown characters should map to <UNK> token"

    # Edge Case 4: Very long text
    long_text = "hello " * 1000
    token_ids = tokenizer.encode(long_text)
    assert len(token_ids) > 0, "Long text should tokenize"
    assert all(isinstance(t, int) for t in token_ids), \
        "All tokens should be integers"

    # Edge Case 5: Special characters
    special_text = "hello, world! @#$%"
    token_ids = tokenizer.encode(special_text)
    decoded = tokenizer.decode(token_ids)
    # Should preserve word content even if punctuation changes
    assert "hello" in decoded or "world" in decoded, \
        "Should preserve core words"

Bug This Catches: Crashes on empty input, unknown character handling, memory issues


Priority 4: Vocabulary Consistency

Missing Test: Vocabulary consistency across encode/decode

from tinytorch.text.tokenization import CharTokenizer, BPETokenizer

def test_vocabulary_encode_decode_consistency():
    """Verify vocabulary mappings are bidirectional and consistent."""

    # Test CharTokenizer
    char_tokenizer = CharTokenizer()
    corpus = ["abc", "def", "xyz"]
    char_tokenizer.build_vocab(corpus)

    # Check bidirectional mappings
    for token, token_id in char_tokenizer.token_to_id.items():
        assert char_tokenizer.id_to_token[token_id] == token, \
            f"Bidirectional mapping broken: {token} -> {token_id} -> {char_tokenizer.id_to_token[token_id]}"

    # Test roundtrip for all corpus text
    for text in corpus:
        token_ids = char_tokenizer.encode(text)
        decoded = char_tokenizer.decode(token_ids)
        # Should preserve characters (may have different spacing)
        for char in text:
            assert char in decoded, f"Lost character '{char}' in roundtrip"

    # Test BPETokenizer
    bpe_tokenizer = BPETokenizer(vocab_size=50)
    bpe_tokenizer.train(["hello world", "test data"])

    # Vocabulary should contain special tokens
    assert '<UNK>' in bpe_tokenizer.vocab, "BPE should have <UNK> token"
    assert bpe_tokenizer.token_to_id['<UNK>'] == 0, "<UNK> should be ID 0"

    # Test roundtrip
    text = "hello world"
    token_ids = bpe_tokenizer.encode(text)
    decoded = bpe_tokenizer.decode(token_ids)

    # Should preserve word content (BPE may merge/split differently),
    # so require that at least one original word survives the roundtrip
    words = text.split()
    assert any(word in decoded for word in words), \
        f"BPE roundtrip lost all words: '{text}' -> '{decoded}'"

Bug This Catches: Vocabulary corruption, ID collisions, decode inconsistency


Priority 5: Batch Processing

Missing Test: Tokenizer handles batches correctly

from tinytorch.text.tokenization import CharTokenizer

def test_tokenizer_batch_processing():
    """Test tokenizer works with batched text data."""
    tokenizer = CharTokenizer()
    corpus = ["hello", "world", "test", "data"]
    tokenizer.build_vocab(corpus)

    # Batch of texts
    texts = ["hello world", "test data", "new text"]

    # Encode batch
    batch_token_ids = [tokenizer.encode(text) for text in texts]

    # Check all are lists of ints
    for token_ids in batch_token_ids:
        assert isinstance(token_ids, list), "Each should be a list"
        assert all(isinstance(t, int) for t in token_ids), \
            "All tokens should be integers"

    # Check different texts produce different token sequences
    assert batch_token_ids[0] != batch_token_ids[1], \
        "Different texts should produce different token sequences"

    # Decode batch
    decoded_texts = [tokenizer.decode(token_ids) for token_ids in batch_token_ids]

    # Should preserve core content
    for original, decoded in zip(texts, decoded_texts):
        # May have spacing differences, but core words should match
        original_words = set(original.split())
        decoded_words = set(decoded.split())

        # At least some words should match
        assert len(original_words & decoded_words) > 0, \
            f"Lost all words in roundtrip: {original} -> {decoded}"

Bug This Catches: Batch size errors, state pollution between encodes


Priority 6: Memory and Performance

Missing Test: Tokenization memory usage and throughput

from tinytorch.text.tokenization import CharTokenizer, BPETokenizer

def test_tokenization_performance():
    """Test tokenization memory and throughput characteristics."""
    import time

    # Build tokenizers
    char_tokenizer = CharTokenizer()
    bpe_tokenizer = BPETokenizer(vocab_size=1000)

    # Training corpus
    corpus = ["hello world"] * 100
    char_tokenizer.build_vocab(corpus)
    bpe_tokenizer.train(corpus)

    # Test text (simulate real document)
    test_text = "hello world test data " * 100  # ~400 chars

    # Measure CharTokenizer throughput
    start = time.time()
    iterations = 1000
    for _ in range(iterations):
        token_ids = char_tokenizer.encode(test_text)
    char_time = time.time() - start
    char_throughput = (len(test_text) * iterations) / char_time

    print(f"CharTokenizer: {char_throughput:.0f} chars/sec")
    assert char_throughput > 10000, \
        f"CharTokenizer too slow: {char_throughput:.0f} chars/sec (expected >10K)"

    # Measure BPE throughput
    start = time.time()
    for _ in range(iterations):
        token_ids = bpe_tokenizer.encode(test_text)
    bpe_time = time.time() - start
    bpe_throughput = (len(test_text) * iterations) / bpe_time

    print(f"BPETokenizer: {bpe_throughput:.0f} chars/sec")
    # BPE should be slower (more complex), but still reasonable
    assert bpe_throughput > 1000, \
        f"BPETokenizer too slow: {bpe_throughput:.0f} chars/sec (expected >1K)"

    # Vocabulary size check
    assert len(char_tokenizer.vocab) < 500, \
        f"CharTokenizer vocab too large: {len(char_tokenizer.vocab)} (expected <500)"

    assert len(bpe_tokenizer.vocab) <= 1000, \
        f"BPETokenizer vocab exceeded limit: {len(bpe_tokenizer.vocab)}"

Bug This Catches: Performance regressions, memory leaks, vocabulary explosion


Priority 7: DataLoader Integration

Missing Test: Tokenizer integration with DataLoader

import pytest
from tinytorch.core.tensor import Tensor

def test_tokenizer_dataloader_integration():
    """Test tokenizer works in DataLoader pipeline."""
    try:
        from tinytorch.core.data import Dataset, DataLoader
        from tinytorch.text.tokenization import CharTokenizer

        # Custom dataset with tokenization
        class TextDataset(Dataset):
            def __init__(self, texts, tokenizer):
                self.texts = texts
                self.tokenizer = tokenizer

            def __len__(self):
                return len(self.texts)

            def __getitem__(self, idx):
                text = self.texts[idx]
                token_ids = self.tokenizer.encode(text)
                # Return as tensor
                return Tensor(token_ids)

        # Build tokenizer
        tokenizer = CharTokenizer()
        texts = ["hello world", "test data", "sample text"]
        tokenizer.build_vocab(texts)

        # Create dataset and dataloader
        dataset = TextDataset(texts, tokenizer)
        dataloader = DataLoader(dataset, batch_size=2, shuffle=False)

        # Iterate batches
        batch_count = 0
        for batch in dataloader:
            batch_count += 1

            # Batch should be tensor or list of tensors
            if isinstance(batch, (list, tuple)):
                assert len(batch) <= 2, "Batch size should be 2"
                for item in batch:
                    assert hasattr(item, 'data') or isinstance(item, Tensor), \
                        "Items should be Tensors"
            else:
                # Single batch tensor
                assert hasattr(batch, 'data'), "Batch should be Tensor"

        assert batch_count > 0, "DataLoader should produce batches"

    except ImportError:
        pytest.skip("DataLoader not yet implemented")

Bug This Catches: DataLoader compatibility issues, batching errors


Regression Prevention Tests MISSING

Test: Prior Stack Still Works

Missing Test: Verify Modules 01-09 unchanged

def test_no_prior_module_regression():
    """Ensure tokenization doesn't break prior modules."""
    # Module 01 (Tensor) should still work
    from tinytorch.core.tensor import Tensor

    x = Tensor([1, 2, 3])
    assert x.shape == (3,), "Tensor creation broken"

    # Module 02 (Activations) should still work
    try:
        from tinytorch.core.activations import ReLU
        relu = ReLU()
        y = relu(x)
        assert y.shape == x.shape, "Activation broken"
    except ImportError:
        pass  # Not implemented yet

    # Module 08 (DataLoader) should still work
    try:
        from tinytorch.core.data import Dataset, DataLoader

        class DummyDataset(Dataset):
            def __len__(self):
                return 5
            def __getitem__(self, idx):
                return idx

        dataset = DummyDataset()
        loader = DataLoader(dataset, batch_size=2)
        assert len(dataset) == 5, "Dataset broken"
    except ImportError:
        pass

"""
Module 10: Progressive Integration Tests
Tests that Module 10 (Tokenization) works correctly AND integrates with prior modules.

DEPENDENCY CHAIN: 01_tensor → ... → 08_dataloader → 10_tokenization → 11_embeddings
This is where we enable text processing for NLP.
"""

class TestPriorStackStillWorking:
    """Quick regression checks that prior modules (01-09) still work."""

    def test_tensor_operations_stable(self):
        """Verify Module 01 (Tensor) still works."""

    def test_dataloader_stable(self):
        """Verify Module 08 (DataLoader) still works."""


class TestModule10TokenizationCore:
    """Test Module 10 (Tokenization) core functionality."""

    def test_char_tokenizer_creation(self):
        """Test CharTokenizer initialization and vocab building."""

    def test_char_tokenizer_encode_decode(self):
        """Test CharTokenizer encode/decode roundtrip."""

    def test_bpe_tokenizer_training(self):
        """Test BPE tokenizer training on corpus."""

    def test_bpe_tokenizer_encode_decode(self):
        """Test BPE encode/decode roundtrip."""


class TestTokenizationIntegration:
    """Test tokenization integration with other modules."""

    def test_tokenizer_produces_correct_dtypes(self):
        """PRIORITY 1: Verify int64 output for embeddings."""

    def test_tokenization_to_embedding_pipeline(self):
        """PRIORITY 2: Test complete tokenization → embedding flow."""

    def test_tokenizer_dataloader_integration(self):
        """Test tokenizer in DataLoader pipeline."""


class TestTokenizationEdgeCases:
    """Test tokenization robustness with edge cases."""

    def test_bpe_edge_cases(self):
        """PRIORITY 3: Empty strings, unknown tokens, special chars."""

    def test_vocabulary_consistency(self):
        """PRIORITY 4: Bidirectional mappings, roundtrip integrity."""

    def test_batch_processing(self):
        """PRIORITY 5: Batch encoding/decoding correctness."""


class TestTokenizationPerformance:
    """Test tokenization performance characteristics."""

    def test_tokenization_throughput(self):
        """PRIORITY 6: Measure chars/sec, vocab size."""

    def test_memory_usage(self):
        """Verify vocabulary doesn't consume excessive memory."""


class TestRegressionPrevention:
    """Ensure previous modules still work after Module 10."""

    def test_no_tensor_regression(self):
        """Verify Module 01 (Tensor) unchanged."""

    def test_no_dataloader_regression(self):
        """Verify Module 08 (DataLoader) unchanged."""

Summary Statistics

Category                 Missing Tests   Priority   Impact
Data Type Correctness    1               CRITICAL   Breaks embeddings
Embedding Integration    1               CRITICAL   Core use case
BPE Edge Cases           1               HIGH       Production robustness
Vocabulary Consistency   1               HIGH       Data integrity
Batch Processing         1               MEDIUM     Real-world usage
Performance              1               MEDIUM     Production viability
DataLoader Integration   1               MEDIUM     Pipeline integrity
Regression Prevention    2               HIGH       Stack stability

Total Missing Tests: 9 critical integration tests
Current Test Coverage: 0% (wrong module)
Recommended Action: REPLACE entire test file


Recommended Action Plan

Phase 1: Immediate (Critical Fixes)

  1. REPLACE test_progressive_integration.py with correct Module 10 tests
  2. Implement Priority 1-2 tests (dtype correctness, embedding integration)
  3. Add BPE edge case tests (Priority 3)

Phase 2: Short-term (Robustness)

  1. Add vocabulary consistency tests (Priority 4)
  2. Add batch processing tests (Priority 5)
  3. Add regression prevention tests

Phase 3: Performance Validation

  1. Add performance benchmarks (Priority 6)
  2. Add DataLoader integration (Priority 7)

Bug-Catching Priorities (Ranked)

  1. Data Type Mismatch (CRITICAL): int vs float breaks embedding lookup (see the sketch after this list)
  2. Embedding Integration (CRITICAL): Core use case must work
  3. Unknown Token Handling (HIGH): Crashes on unseen characters
  4. Vocabulary Corruption (HIGH): Encode/decode inconsistency
  5. Empty Input Crashes (MEDIUM): Edge case handling
  6. Batch State Pollution (MEDIUM): Tokenizer state leaks between calls
  7. Performance Regression (LOW): Slow tokenization impacts pipelines
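
To illustrate the top-ranked failure mode, a small standalone NumPy sketch (independent of any TinyTorch code) showing why float token IDs break an embedding-style table lookup while integer IDs work:

import numpy as np

vocab_size, embed_dim = 10, 4
embedding_table = np.random.randn(vocab_size, embed_dim)

# Integer token IDs: fancy indexing works as an embedding lookup
int_ids = np.array([1, 3, 5], dtype=np.int64)
vectors = embedding_table[int_ids]          # shape (3, 4)

# Float token IDs: the same lookup is rejected
float_ids = np.array([1.0, 3.0, 5.0], dtype=np.float32)
try:
    embedding_table[float_ids]
except IndexError as exc:
    print("float indices rejected:", exc)   # NumPy requires integer (or boolean) index arrays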

Audit Completed: 2025-11-25
Next Review: After test file replacement
Sign-off: QA Agent - Integration Testing Team