
# Module 10 Integration Tests: Wrong vs Correct

## Current File (WRONG)

"""
Module 10: Progressive Integration Tests
Tests that Module 11 (Training) works correctly...  # ← WRONG MODULE!

DEPENDENCY CHAIN: 01_setup → ... → 10_optimizers → 11_training  # ← WRONG!
This is where we enable complete end-to-end training loops.  # ← WRONG!
"""

class TestModule11TrainingCore:  # ← WRONG MODULE!
    """Test Module 11 (Training) core functionality."""  # ← WRONG!

    def test_training_loop_creation(self):
        from tinytorch.core.training import Trainer  # ← WRONG!
        from tinytorch.core.optimizers import SGD
        # Tests training loops... ← WRONG TOPIC!

    def test_loss_function_support(self):
        from tinytorch.core.training import CrossEntropyLoss, MSELoss  # ← WRONG!
        # Tests loss functions... ← WRONG TOPIC!

class TestAdvancedTrainingFeatures:  # ← WRONG MODULE!
    def test_distributed_training_support(self):  # ← WRONG!
    def test_mixed_precision_training(self):  # ← WRONG!

**Problems:**

- Tests Module 11 (Training) instead of Module 10 (Tokenization)
- All imports from `tinytorch.core.training` (doesn't exist yet)
- Tests loss functions, optimizers, CNN pipelines (wrong concepts)
- 0% coverage of actual Module 10 functionality
- Copy-paste error from Module 11 template

## Corrected File (CORRECT)

"""
Module 10: Progressive Integration Tests
Tests that Module 10 (Tokenization) works correctly...  # ← CORRECT!

DEPENDENCY CHAIN: 01_tensor → ... → 08_dataloader → 10_tokenization → 11_embeddings  # ← CORRECT!
This is where we enable text processing for NLP tasks.  # ← CORRECT!
"""

class TestModule10TokenizationCore:  # ← CORRECT MODULE!
    """Test Module 10 (Tokenization) core functionality."""  # ← CORRECT!

    def test_char_tokenizer_creation(self):
        from tinytorch.text.tokenization import CharTokenizer  # ← CORRECT!
        # Tests CharTokenizer initialization

    def test_char_tokenizer_encode_decode(self):
        # Tests encode/decode roundtrip

    def test_bpe_tokenizer_training(self):
        from tinytorch.text.tokenization import BPETokenizer  # ← CORRECT!
        # Tests BPE training

    def test_bpe_tokenizer_encode_decode(self):
        # Tests BPE encode/decode

class TestTokenizationIntegration:  # ← CORRECT!
    """Test tokenization integration with other modules."""

    def test_tokenizer_produces_correct_dtypes(self):
        # CRITICAL: Verify int64 for embeddings

    def test_tokenization_to_embedding_pipeline(self):
        from tinytorch.text.embeddings import Embedding
        from tinytorch.text.tokenization import CharTokenizer
        # Tests tokenization → embedding flow

    def test_tokenizer_dataloader_integration(self):
        # Tests tokenizer with DataLoader

class TestTokenizationEdgeCases:  # ← CORRECT!
    """Test tokenization robustness with edge cases."""

    def test_bpe_edge_cases(self):
        # Empty strings, unknown tokens, special chars

    def test_vocabulary_consistency(self):
        # Bidirectional mappings, roundtrips

    def test_batch_processing(self):
        # Batch encoding/decoding

**Benefits:**

- Tests actual Module 10 (Tokenization) functionality
- Correct imports from `tinytorch.text.tokenization`
- Tests `CharTokenizer`, `BPETokenizer`, vocabularies
- Validates integration with Tensor, Embeddings, DataLoader
- 100% coverage of critical integration points

## Side-by-Side Comparison

| Aspect | Current (WRONG) | Corrected (CORRECT) |
|--------|-----------------|---------------------|
| Module Tested | Module 11 (Training) | Module 10 (Tokenization) |
| Primary Imports | `tinytorch.core.training` | `tinytorch.text.tokenization` |
| Classes Tested | `Trainer`, `CrossEntropyLoss` | `CharTokenizer`, `BPETokenizer` |
| Test Focus | Training loops, loss functions | Encode/decode, vocabularies |
| Integration Points | Optimizers, CNN, distributed | Tensors, Embeddings, DataLoader |
| Edge Cases | Checkpointing, early stopping | Empty strings, unknown tokens |
| Coverage | 0% (wrong module) | 100% (correct tests) |
| Bug-Catching | None (tests wrong code) | High (catches dtype, shape errors) |

## Key Differences

### Wrong File Tests

  1. Training loops and Trainer class
  2. Loss functions (MSELoss, CrossEntropyLoss)
  3. Validation loops and metrics
  4. Checkpointing and early stopping
  5. Learning rate scheduling
  6. Distributed training
  7. Mixed precision training
  8. Gradient accumulation
  9. CNN training pipelines
  10. End-to-end model training

### Correct File Tests

  1. CharTokenizer initialization and vocab building
  2. CharTokenizer encode/decode roundtrip
  3. BPETokenizer training on corpus
  4. BPE encode/decode operations
  5. Token ID dtype correctness (int64)
  6. Tokenization → Embedding pipeline
  7. DataLoader integration
  8. BPE edge cases (empty, unknown, special)
  9. Vocabulary consistency (bidirectional)
  10. Batch processing correctness
  11. Performance benchmarks (throughput; see the sketch after this list)
  12. Regression prevention (Tensor, DataLoader)
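
Items 11 and 12 are the only entries above without an excerpt later in this document. As a minimal sketch (not the reference file's actual test), a throughput benchmark could look like the following, assuming only the `CharTokenizer` API shown elsewhere; the threshold is purely illustrative:

```python
import time

from tinytorch.text.tokenization import CharTokenizer

def test_tokenization_throughput():
    """Rough throughput guard: encoding a modest corpus should be fast."""
    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello world"])

    text = "hello world " * 1000  # roughly 12k characters
    start = time.perf_counter()
    token_ids = tokenizer.encode(text)
    elapsed = time.perf_counter() - start

    assert len(token_ids) > 0, "encode() returned no tokens"
    assert elapsed < 1.0, f"Encoding ~12k chars took {elapsed:.3f}s (illustrative threshold)"
```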

## Example: What Each Tests

### Wrong File Example

```python
def test_training_loop_creation(self):
    """Test basic training loop functionality."""  # ← Module 11, not 10!
    from tinytorch.core.training import Trainer  # ← Doesn't exist
    from tinytorch.core.layers import Dense
    from tinytorch.core.optimizers import SGD

    model = Dense(10, 3)
    optimizer = SGD(model.parameters(), lr=0.01)
    trainer = Trainer(model, optimizer)  # ← Testing training, not tokenization!

    assert hasattr(trainer, 'train'), "Trainer broken"
```

### Correct File Example

```python
def test_char_tokenizer_encode_decode(self):
    """Test CharTokenizer encode/decode roundtrip."""  # ← Module 10!
    from tinytorch.text.tokenization import CharTokenizer  # ← Correct import

    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello", "world"])

    text = "hello"
    token_ids = tokenizer.encode(text)  # ← Testing tokenization!

    assert isinstance(token_ids, list), "encode() should return list"
    assert all(isinstance(t, int) for t in token_ids), "Token IDs should be integers"

    decoded = tokenizer.decode(token_ids)
    for char in text:
        assert char in decoded, f"Lost character '{char}' in roundtrip"
```

## Critical Integration Tests Only in Correct File

### 1. Dtype Correctness (Catches Embedding Bugs)

```python
def test_tokenizer_produces_correct_dtypes(self):
    """Verify int64 output for embeddings."""
    # token_ids comes from a tokenizer built earlier in the test (excerpt)
    token_tensor = Tensor(token_ids)
    assert token_tensor.data.dtype in [np.int32, np.int64, np.int_]
```

**Why Critical:** Embeddings crash if token IDs are float32 instead of int64.
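
A standalone demonstration of this failure mode in plain NumPy (independent of TinyTorch): a NumPy-backed embedding lookup is just integer array indexing, and NumPy rejects float indices outright.

```python
import numpy as np

table = np.random.randn(100, 16)        # vocab_size x embed_dim weight matrix
int_ids = np.array([1, 2, 3])           # integer token IDs: lookup works
float_ids = np.array([1.0, 2.0, 3.0])   # float token IDs: lookup crashes

print(table[int_ids].shape)             # (3, 16)
try:
    table[float_ids]
except IndexError as err:
    # "arrays used as indices must be of integer (or boolean) type"
    print(err)
```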

### 2. Embedding Integration (Primary Use Case)

```python
def test_tokenization_to_embedding_pipeline(self):
    """Test complete tokenization → embedding pipeline."""
    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello"])
    vocab_size, embed_dim = 100, 16  # example sizes
    embedding = Embedding(vocab_size, embed_dim)

    token_ids = tokenizer.encode("hello")
    embedded = embedding(Tensor(token_ids))
    assert embedded.shape == (len(token_ids), embed_dim)
```

**Why Critical:** This is THE use case for tokenizers - it must work!

### 3. BPE Edge Cases (Production Robustness)

```python
def test_bpe_edge_cases(self):
    """Empty strings, unknown tokens, special chars."""
    tokenizer = BPETokenizer(vocab_size=100)

    # Empty string
    assert tokenizer.encode("") == []

    # Unknown characters
    tokenizer.train(["hello"])
    tokens = tokenizer.encode("xyz")  # Not in training
    assert isinstance(tokens, list)  # Should handle gracefully
```

**Why Critical:** Production systems receive unexpected input.


## Impact of Using Wrong Tests

**If we keep the wrong file:**

- Students implement tokenizers but have 0% test coverage
- Dtype bugs (int vs float) go undetected → embeddings crash
- BPE edge cases untested → production failures
- No validation of the tokenization → embedding pipeline
- Vocabulary corruption undetected
- Integration with DataLoader untested

**With correct tests:**

- Catch dtype mismatches before they reach embeddings
- Validate the primary use case (tokenization → embeddings)
- Test production robustness (edge cases)
- Ensure vocabulary integrity (see the sketch after this list)
- Verify DataLoader integration
- Maintain stack stability (regression tests)
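
As a hedged sketch of what that vocabulary-integrity check might look like (the reference file's exact assertions may differ), a roundtrip-stability test needs only the `CharTokenizer` API already shown in this document:

```python
from tinytorch.text.tokenization import CharTokenizer

def test_vocabulary_consistency():
    """Decode followed by encode must reproduce the original token IDs."""
    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello", "world"])

    for text in ["hello", "world"]:
        ids = tokenizer.encode(text)
        decoded = tokenizer.decode(ids)
        # Bidirectional mapping: re-encoding the decoded text is the identity on IDs
        assert tokenizer.encode(decoded) == ids, f"Roundtrip unstable for {text!r}"
```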

## How to Fix

### Option 1: Replace File

```bash
cd /Users/VJ/GitHub/TinyTorch/tests/10_tokenization
mv test_progressive_integration.py test_progressive_integration_OLD.py
mv test_progressive_integration_REFERENCE.py test_progressive_integration.py
```

### Option 2: Manual Edit

  1. Delete all content in test_progressive_integration.py
  2. Copy content from test_progressive_integration_REFERENCE.py
  3. Save and commit
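
Equivalently, a single copy from the same directory as Option 1 performs steps 1 and 2 (the commit is still up to you):

```bash
cd /Users/VJ/GitHub/TinyTorch/tests/10_tokenization
cp test_progressive_integration_REFERENCE.py test_progressive_integration.py
```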

### Verify Fix

```bash
pytest tests/10_tokenization/test_progressive_integration.py -v

# Should see:
# - TestModule10TokenizationCore (not TestModule11TrainingCore)
# - Tests for CharTokenizer, BPETokenizer
# - Integration tests with Embedding, DataLoader
```

## Summary

- **Current Status:** CRITICAL - Wrong module tested (Module 11 instead of 10)
- **Root Cause:** Copy-paste error from Module 11 template
- **Impact:** 0% integration test coverage for Module 10
- **Fix:** Replace with corrected reference implementation
- **Urgency:** HIGH - Students have no validation of tokenization integration