
# Module 10 Integration Tests: Wrong vs Correct

## Current File (WRONG)

"""
Module 10: Progressive Integration Tests
Tests that Module 11 (Training) works correctly...  # ← WRONG MODULE!

DEPENDENCY CHAIN: 01_setup → ... → 10_optimizers → 11_training  # ← WRONG!
This is where we enable complete end-to-end training loops.  # ← WRONG!
"""

class TestModule11TrainingCore:  # ← WRONG MODULE!
    """Test Module 11 (Training) core functionality."""  # ← WRONG!

    def test_training_loop_creation(self):
        from tinytorch.core.training import Trainer  # ← WRONG!
        from tinytorch.core.optimizers import SGD
        # Tests training loops... ← WRONG TOPIC!

    def test_loss_function_support(self):
        from tinytorch.core.training import CrossEntropyLoss, MSELoss  # ← WRONG!
        # Tests loss functions... ← WRONG TOPIC!

class TestAdvancedTrainingFeatures:  # ← WRONG MODULE!
    def test_distributed_training_support(self):  # ← WRONG!
    def test_mixed_precision_training(self):  # ← WRONG!

**Problems:**

- Tests Module 11 (Training) instead of Module 10 (Tokenization)
- All imports from `tinytorch.core.training` (doesn't exist yet)
- Tests loss functions, optimizers, CNN pipelines (wrong concepts)
- 0% coverage of actual Module 10 functionality
- Copy-paste error from Module 11 template

## Corrected File (CORRECT)

"""
Module 10: Progressive Integration Tests
Tests that Module 10 (Tokenization) works correctly...  # ← CORRECT!

DEPENDENCY CHAIN: 01_tensor → ... → 08_dataloader → 10_tokenization → 11_embeddings  # ← CORRECT!
This is where we enable text processing for NLP tasks.  # ← CORRECT!
"""

class TestModule10TokenizationCore:  # ← CORRECT MODULE!
    """Test Module 10 (Tokenization) core functionality."""  # ← CORRECT!

    def test_char_tokenizer_creation(self):
        from tinytorch.text.tokenization import CharTokenizer  # ← CORRECT!
        # Tests CharTokenizer initialization

    def test_char_tokenizer_encode_decode(self):
        # Tests encode/decode roundtrip

    def test_bpe_tokenizer_training(self):
        from tinytorch.text.tokenization import BPETokenizer  # ← CORRECT!
        # Tests BPE training

    def test_bpe_tokenizer_encode_decode(self):
        # Tests BPE encode/decode

class TestTokenizationIntegration:  # ← CORRECT!
    """Test tokenization integration with other modules."""

    def test_tokenizer_produces_correct_dtypes(self):
        # CRITICAL: Verify int64 for embeddings

    def test_tokenization_to_embedding_pipeline(self):
        from tinytorch.text.embeddings import Embedding
        from tinytorch.text.tokenization import CharTokenizer
        # Tests tokenization → embedding flow

    def test_tokenizer_dataloader_integration(self):
        # Tests tokenizer with DataLoader

class TestTokenizationEdgeCases:  # ← CORRECT!
    """Test tokenization robustness with edge cases."""

    def test_bpe_edge_cases(self):
        # Empty strings, unknown tokens, special chars

    def test_vocabulary_consistency(self):
        # Bidirectional mappings, roundtrips

    def test_batch_processing(self):
        # Batch encoding/decoding

**Benefits:**

- Tests actual Module 10 (Tokenization) functionality
- Correct imports from `tinytorch.text.tokenization`
- Tests `CharTokenizer`, `BPETokenizer`, vocabularies
- Validates integration with Tensor, Embeddings, DataLoader
- 100% coverage of critical integration points

## Side-by-Side Comparison

| Aspect | Current (WRONG) | Corrected (CORRECT) |
|--------|-----------------|---------------------|
| Module Tested | Module 11 (Training) | Module 10 (Tokenization) |
| Primary Imports | `tinytorch.core.training` | `tinytorch.text.tokenization` |
| Classes Tested | `Trainer`, `CrossEntropyLoss` | `CharTokenizer`, `BPETokenizer` |
| Test Focus | Training loops, loss functions | Encode/decode, vocabularies |
| Integration Points | Optimizers, CNN, distributed | Tensors, Embeddings, DataLoader |
| Edge Cases | Checkpointing, early stopping | Empty strings, unknown tokens |
| Coverage | 0% (wrong module) | 100% (correct tests) |
| Bug-Catching | None (tests wrong code) | High (catches dtype, shape errors) |

## Key Differences

### Wrong File Tests

  1. Training loops and Trainer class
  2. Loss functions (MSELoss, CrossEntropyLoss)
  3. Validation loops and metrics
  4. Checkpointing and early stopping
  5. Learning rate scheduling
  6. Distributed training
  7. Mixed precision training
  8. Gradient accumulation
  9. CNN training pipelines
  10. End-to-end model training

### Correct File Tests

  1. CharTokenizer initialization and vocab building
  2. CharTokenizer encode/decode roundtrip
  3. BPETokenizer training on corpus
  4. BPE encode/decode operations
  5. Token ID dtype correctness (int64)
  6. Tokenization → Embedding pipeline
  7. DataLoader integration
  8. BPE edge cases (empty, unknown, special)
  9. Vocabulary consistency (bidirectional)
  10. Batch processing correctness
  11. Performance benchmarks (throughput; see the sketch after this list)
  12. Regression prevention (Tensor, DataLoader)
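
Items 11 and 12 are the only entries above without an excerpt later in this document. As a minimal sketch (not the reference file's actual test), a throughput benchmark could look like the following, assuming only the `CharTokenizer` API shown elsewhere; the threshold is purely illustrative:

```python
import time

from tinytorch.text.tokenization import CharTokenizer

def test_tokenization_throughput():
    """Rough throughput guard: encoding a modest corpus should be fast."""
    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello world"])

    text = "hello world " * 1000  # roughly 12k characters
    start = time.perf_counter()
    token_ids = tokenizer.encode(text)
    elapsed = time.perf_counter() - start

    assert len(token_ids) > 0, "encode() returned no tokens"
    assert elapsed < 1.0, f"Encoding ~12k chars took {elapsed:.3f}s (illustrative threshold)"
```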

## Example: What Each Tests

### Wrong File Example

```python
def test_training_loop_creation(self):
    """Test basic training loop functionality."""  # ← Module 11, not 10!
    from tinytorch.core.training import Trainer  # ← Doesn't exist
    from tinytorch.core.layers import Dense
    from tinytorch.core.optimizers import SGD

    model = Dense(10, 3)
    optimizer = SGD(model.parameters(), lr=0.01)
    trainer = Trainer(model, optimizer)  # ← Testing training, not tokenization!

    assert hasattr(trainer, 'train'), "Trainer broken"
```

### Correct File Example

```python
def test_char_tokenizer_encode_decode(self):
    """Test CharTokenizer encode/decode roundtrip."""  # ← Module 10!
    from tinytorch.text.tokenization import CharTokenizer  # ← Correct import

    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello", "world"])

    text = "hello"
    token_ids = tokenizer.encode(text)  # ← Testing tokenization!

    assert isinstance(token_ids, list), "encode() should return list"
    assert all(isinstance(t, int) for t in token_ids), "Token IDs should be integers"

    decoded = tokenizer.decode(token_ids)
    for char in text:
        assert char in decoded, f"Lost character '{char}' in roundtrip"
```

## Critical Integration Tests Only in Correct File

### 1. Dtype Correctness (Catches Embedding Bugs)

```python
def test_tokenizer_produces_correct_dtypes(self):
    """Verify int64 output for embeddings."""
    # token_ids comes from a tokenizer built earlier in the test (excerpt)
    token_tensor = Tensor(token_ids)
    assert token_tensor.data.dtype in [np.int32, np.int64, np.int_]
```

**Why Critical:** Embeddings crash if token IDs are float32 instead of int64.
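
A standalone demonstration of this failure mode in plain NumPy (independent of TinyTorch): a NumPy-backed embedding lookup is just integer array indexing, and NumPy rejects float indices outright.

```python
import numpy as np

table = np.random.randn(100, 16)        # vocab_size x embed_dim weight matrix
int_ids = np.array([1, 2, 3])           # integer token IDs: lookup works
float_ids = np.array([1.0, 2.0, 3.0])   # float token IDs: lookup crashes

print(table[int_ids].shape)             # (3, 16)
try:
    table[float_ids]
except IndexError as err:
    # "arrays used as indices must be of integer (or boolean) type"
    print(err)
```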

### 2. Embedding Integration (Primary Use Case)

```python
def test_tokenization_to_embedding_pipeline(self):
    """Test complete tokenization → embedding pipeline."""
    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello"])
    vocab_size, embed_dim = 100, 16  # example sizes
    embedding = Embedding(vocab_size, embed_dim)

    token_ids = tokenizer.encode("hello")
    embedded = embedding(Tensor(token_ids))
    assert embedded.shape == (len(token_ids), embed_dim)
```

**Why Critical:** This is THE use case for tokenizers - it must work!

### 3. BPE Edge Cases (Production Robustness)

```python
def test_bpe_edge_cases(self):
    """Empty strings, unknown tokens, special chars."""
    tokenizer = BPETokenizer(vocab_size=100)

    # Empty string
    assert tokenizer.encode("") == []

    # Unknown characters
    tokenizer.train(["hello"])
    tokens = tokenizer.encode("xyz")  # Not in training
    assert isinstance(tokens, list)  # Should handle gracefully
```

**Why Critical:** Production systems receive unexpected input.


## Impact of Using Wrong Tests

**If we keep the wrong file:**

- Students implement tokenizers but have 0% test coverage
- Dtype bugs (int vs float) go undetected → embeddings crash
- BPE edge cases untested → production failures
- No validation of the tokenization → embedding pipeline
- Vocabulary corruption undetected
- Integration with DataLoader untested

**With correct tests:**

- Catch dtype mismatches before they reach embeddings
- Validate the primary use case (tokenization → embeddings)
- Test production robustness (edge cases)
- Ensure vocabulary integrity (see the sketch after this list)
- Verify DataLoader integration
- Maintain stack stability (regression tests)
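
As a hedged sketch of what that vocabulary-integrity check might look like (the reference file's exact assertions may differ), a roundtrip-stability test needs only the `CharTokenizer` API already shown in this document:

```python
from tinytorch.text.tokenization import CharTokenizer

def test_vocabulary_consistency():
    """Decode followed by encode must reproduce the original token IDs."""
    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello", "world"])

    for text in ["hello", "world"]:
        ids = tokenizer.encode(text)
        decoded = tokenizer.decode(ids)
        # Bidirectional mapping: re-encoding the decoded text is the identity on IDs
        assert tokenizer.encode(decoded) == ids, f"Roundtrip unstable for {text!r}"
```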

## How to Fix

### Option 1: Replace File

```bash
cd /Users/VJ/GitHub/TinyTorch/tests/10_tokenization
mv test_progressive_integration.py test_progressive_integration_OLD.py
mv test_progressive_integration_REFERENCE.py test_progressive_integration.py
```

### Option 2: Manual Edit

  1. Delete all content in test_progressive_integration.py
  2. Copy content from test_progressive_integration_REFERENCE.py
  3. Save and commit
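
Equivalently, a single copy from the same directory as Option 1 performs steps 1 and 2 (the commit is still up to you):

```bash
cd /Users/VJ/GitHub/TinyTorch/tests/10_tokenization
cp test_progressive_integration_REFERENCE.py test_progressive_integration.py
```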

### Verify Fix

```bash
pytest tests/10_tokenization/test_progressive_integration.py -v

# Should see:
# - TestModule10TokenizationCore (not TestModule11TrainingCore)
# - Tests for CharTokenizer, BPETokenizer
# - Integration tests with Embedding, DataLoader
```

## Summary

- **Current Status:** CRITICAL - Wrong module tested (Module 11 instead of 10)
- **Root Cause:** Copy-paste error from Module 11 template
- **Impact:** 0% integration test coverage for Module 10
- **Fix:** Replace with corrected reference implementation
- **Urgency:** HIGH - Students have no validation of tokenization integration