# Module 10 Integration Tests: Wrong vs Correct

## Current File (WRONG) ❌
"""
Module 10: Progressive Integration Tests
Tests that Module 11 (Training) works correctly... # ← WRONG MODULE!
DEPENDENCY CHAIN: 01_setup → ... → 10_optimizers → 11_training # ← WRONG!
This is where we enable complete end-to-end training loops. # ← WRONG!
"""
class TestModule11TrainingCore: # ← WRONG MODULE!
"""Test Module 11 (Training) core functionality.""" # ← WRONG!
def test_training_loop_creation(self):
from tinytorch.core.training import Trainer # ← WRONG!
from tinytorch.core.optimizers import SGD
# Tests training loops... ← WRONG TOPIC!
def test_loss_function_support(self):
from tinytorch.core.training import CrossEntropyLoss, MSELoss # ← WRONG!
# Tests loss functions... ← WRONG TOPIC!
class TestAdvancedTrainingFeatures: # ← WRONG MODULE!
def test_distributed_training_support(self): # ← WRONG!
def test_mixed_precision_training(self): # ← WRONG!
Problems:
- Tests Module 11 (Training) instead of Module 10 (Tokenization)
- All imports come from `tinytorch.core.training`, which doesn't exist yet
- Tests loss functions, optimizers, and CNN pipelines (wrong concepts)
- 0% coverage of actual Module 10 functionality
- Copy-paste error from the Module 11 template
## Corrected File (CORRECT) ✅
"""
Module 10: Progressive Integration Tests
Tests that Module 10 (Tokenization) works correctly... # ← CORRECT!
DEPENDENCY CHAIN: 01_tensor → ... → 08_dataloader → 10_tokenization → 11_embeddings # ← CORRECT!
This is where we enable text processing for NLP tasks. # ← CORRECT!
"""
class TestModule10TokenizationCore: # ← CORRECT MODULE!
"""Test Module 10 (Tokenization) core functionality.""" # ← CORRECT!
def test_char_tokenizer_creation(self):
from tinytorch.text.tokenization import CharTokenizer # ← CORRECT!
# Tests CharTokenizer initialization
def test_char_tokenizer_encode_decode(self):
# Tests encode/decode roundtrip
def test_bpe_tokenizer_training(self):
from tinytorch.text.tokenization import BPETokenizer # ← CORRECT!
# Tests BPE training
def test_bpe_tokenizer_encode_decode(self):
# Tests BPE encode/decode
class TestTokenizationIntegration: # ← CORRECT!
"""Test tokenization integration with other modules."""
def test_tokenizer_produces_correct_dtypes(self):
# CRITICAL: Verify int64 for embeddings
def test_tokenization_to_embedding_pipeline(self):
from tinytorch.text.embeddings import Embedding
from tinytorch.text.tokenization import CharTokenizer
# Tests tokenization → embedding flow
def test_tokenizer_dataloader_integration(self):
# Tests tokenizer with DataLoader
class TestTokenizationEdgeCases: # ← CORRECT!
"""Test tokenization robustness with edge cases."""
def test_bpe_edge_cases(self):
# Empty strings, unknown tokens, special chars
def test_vocabulary_consistency(self):
# Bidirectional mappings, roundtrips
def test_batch_processing(self):
# Batch encoding/decoding
Benefits:
- Tests actual Module 10 (Tokenization) functionality
- Correct imports from `tinytorch.text.tokenization`
- Tests CharTokenizer, BPETokenizer, and vocabularies
- Validates integration with Tensor, Embeddings, DataLoader
- 100% coverage of critical integration points
## Side-by-Side Comparison
| Aspect | Current (WRONG) | Corrected (CORRECT) |
|---|---|---|
| Module Tested | Module 11 (Training) | Module 10 (Tokenization) |
| Primary Imports | `tinytorch.core.training` | `tinytorch.text.tokenization` |
| Classes Tested | Trainer, CrossEntropyLoss | CharTokenizer, BPETokenizer |
| Test Focus | Training loops, loss functions | Encode/decode, vocabularies |
| Integration Points | Optimizers, CNN, distributed | Tensors, Embeddings, DataLoader |
| Edge Cases | Checkpointing, early stopping | Empty strings, unknown tokens |
| Coverage | 0% (wrong module) | 100% (correct tests) |
| Bug-Catching | None (tests wrong code) | High (catches dtype, shape errors) |
## Key Differences

### Wrong File Tests
- ❌ Training loops and Trainer class
- ❌ Loss functions (MSELoss, CrossEntropyLoss)
- ❌ Validation loops and metrics
- ❌ Checkpointing and early stopping
- ❌ Learning rate scheduling
- ❌ Distributed training
- ❌ Mixed precision training
- ❌ Gradient accumulation
- ❌ CNN training pipelines
- ❌ End-to-end model training
### Correct File Tests
- ✅ CharTokenizer initialization and vocab building
- ✅ CharTokenizer encode/decode roundtrip
- ✅ BPETokenizer training on corpus
- ✅ BPE encode/decode operations
- ✅ Token ID dtype correctness (int64)
- ✅ Tokenization → Embedding pipeline
- ✅ DataLoader integration
- ✅ BPE edge cases (empty, unknown, special)
- ✅ Vocabulary consistency (bidirectional mappings; see the sketch after this list)
- ✅ Batch processing correctness
- ✅ Performance benchmarks (throughput)
- ✅ Regression prevention (Tensor, DataLoader)
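
To make the vocabulary-consistency check concrete, here is a minimal sketch of what it might look like, assuming only the `CharTokenizer` API shown above (`build_vocab` / `encode` / `decode`):

```python
# Minimal sketch: vocabulary consistency via per-character roundtrips.
# Assumes the CharTokenizer API shown elsewhere in this document.
from tinytorch.text.tokenization import CharTokenizer

def test_vocabulary_consistency():
    corpus = ["hello", "world"]
    tokenizer = CharTokenizer()
    tokenizer.build_vocab(corpus)
    for char in sorted(set("".join(corpus))):
        ids = tokenizer.encode(char)          # char -> token ID(s)
        assert tokenizer.decode(ids) == char  # ID(s) -> the same char back
```

Any character that fails this roundtrip indicates a corrupted bidirectional mapping.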
## Example: What Each Tests

### Wrong File Example
```python
def test_training_loop_creation(self):
    """Test basic training loop functionality."""  # ← Module 11, not 10!
    from tinytorch.core.training import Trainer  # ← Doesn't exist
    from tinytorch.core.layers import Dense
    from tinytorch.core.optimizers import SGD

    model = Dense(10, 3)
    optimizer = SGD(model.parameters(), lr=0.01)
    trainer = Trainer(model, optimizer)  # ← Testing training, not tokenization!
    assert hasattr(trainer, 'train'), "Trainer broken"
```
### Correct File Example
```python
def test_char_tokenizer_encode_decode(self):
    """Test CharTokenizer encode/decode roundtrip."""  # ← Module 10!
    from tinytorch.text.tokenization import CharTokenizer  # ← Correct import

    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello", "world"])

    text = "hello"
    token_ids = tokenizer.encode(text)  # ← Testing tokenization!
    assert isinstance(token_ids, list), "encode() should return list"
    assert all(isinstance(t, int) for t in token_ids), "Token IDs should be integers"

    decoded = tokenizer.decode(token_ids)
    for char in text:
        assert char in decoded, f"Lost character '{char}' in roundtrip"
```
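
Once the corrected file is in place, this single test can be run in isolation via pytest's node-ID syntax:

```bash
pytest "tests/10_tokenization/test_progressive_integration.py::TestModule10TokenizationCore::test_char_tokenizer_encode_decode" -v
```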
## Critical Integration Tests Only in Correct File

### 1. Dtype Correctness (Catches Embedding Bugs)

```python
def test_tokenizer_produces_correct_dtypes(self):
    """Verify int64 output for embeddings."""
    tokenizer = CharTokenizer()            # imported as in the tests above
    tokenizer.build_vocab(["hello"])       # setup added so the snippet is self-contained
    token_ids = tokenizer.encode("hello")
    token_tensor = Tensor(token_ids)       # Tensor/np imported as elsewhere in the suite
    assert token_tensor.data.dtype in [np.int32, np.int64, np.int_]
```
Why Critical: Embeddings crash if token IDs are float32 instead of int64
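
The failure mode is easy to reproduce outside TinyTorch; this plain-NumPy illustration shows why float token IDs blow up at lookup time:

```python
import numpy as np

embedding_matrix = np.random.randn(10, 4)  # vocab_size=10, embed_dim=4
int_ids = np.array([1, 2, 3])              # what a tokenizer should produce
float_ids = np.array([1.0, 2.0, 3.0])      # what a buggy tokenizer might produce

print(embedding_matrix[int_ids].shape)     # (3, 4): integer IDs select rows
embedding_matrix[float_ids]                # raises IndexError: float arrays are not valid indices
```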
### 2. Embedding Integration (Primary Use Case)

```python
def test_tokenization_to_embedding_pipeline(self):
    """Test complete tokenization → embedding pipeline."""
    tokenizer = CharTokenizer()
    tokenizer.build_vocab(["hello"])       # vocab setup added for completeness
    vocab_size, embed_dim = 100, 8         # vocab_size chosen large enough to cover all IDs
    embedding = Embedding(vocab_size, embed_dim)

    token_ids = tokenizer.encode("hello")
    embedded = embedding(Tensor(token_ids))
    assert embedded.shape == (len(token_ids), embed_dim)
```
Why Critical: This is THE use case for tokenizers - must work!
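
Conceptually, the embedding lookup is just row selection from a `(vocab_size, embed_dim)` matrix, which explains the expected output shape; a plain-NumPy sketch of the same shape contract:

```python
import numpy as np

vocab_size, embed_dim = 50, 8
E = np.random.randn(vocab_size, embed_dim)  # one learned vector per token ID

token_ids = [3, 7, 7, 12]                   # e.g. the output of an encode() call
embedded = E[np.array(token_ids)]           # lookup = row selection
assert embedded.shape == (len(token_ids), embed_dim)
```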
### 3. BPE Edge Cases (Production Robustness)

```python
def test_bpe_edge_cases(self):
    """Empty strings, unknown tokens, special chars."""
    from tinytorch.text.tokenization import BPETokenizer

    tokenizer = BPETokenizer(vocab_size=100)
    tokenizer.train(["hello"])        # train first so encode() has a vocabulary

    # Empty string
    assert tokenizer.encode("") == []

    # Unknown characters
    tokens = tokenizer.encode("xyz")  # not in training corpus
    assert isinstance(tokens, list)   # should handle gracefully
```
Why Critical: Production systems receive unexpected input
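
TinyTorch's actual unknown-token policy isn't shown here, but a common graceful-degradation pattern is falling back to a reserved UNK ID; a hypothetical sketch (names and conventions are illustrative, not TinyTorch's API):

```python
UNK_ID = 0  # illustrative convention: reserve ID 0 for unknown tokens at vocab-build time

def encode_with_fallback(text, token_to_id):
    """Map each character to its ID, degrading gracefully on unseen input."""
    return [token_to_id.get(char, UNK_ID) for char in text]

vocab = {"h": 1, "e": 2, "l": 3, "o": 4}
assert encode_with_fallback("", vocab) == []            # empty string -> empty list
assert encode_with_fallback("xyz", vocab) == [0, 0, 0]  # unknown chars -> UNK, no crash
```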
## Impact of Using Wrong Tests

If we keep the wrong file:
- ❌ Students implement tokenizers but have 0% test coverage
- ❌ Dtype bugs (int vs float) go undetected → embeddings crash
- ❌ BPE edge cases untested → production failures
- ❌ No validation of tokenization → embedding pipeline
- ❌ Vocabulary corruption undetected
- ❌ Integration with DataLoader untested
With correct tests:
- ✅ Catch dtype mismatches before they reach embeddings
- ✅ Validate primary use case (tokenization → embeddings)
- ✅ Test production robustness (edge cases)
- ✅ Ensure vocabulary integrity
- ✅ Verify DataLoader integration
- ✅ Maintain stack stability (regression tests)
## How to Fix

### Option 1: Replace File

```bash
cd /Users/VJ/GitHub/TinyTorch/tests/10_tokenization
mv test_progressive_integration.py test_progressive_integration_OLD.py
mv test_progressive_integration_REFERENCE.py test_progressive_integration.py
```
### Option 2: Manual Edit

1. Delete all content in `test_progressive_integration.py`
2. Copy content from `test_progressive_integration_REFERENCE.py`
3. Save and commit
### Verify Fix

```bash
pytest tests/10_tokenization/test_progressive_integration.py -v
# Should see:
# - TestModule10TokenizationCore (not TestModule11TrainingCore)
# - Tests for CharTokenizer, BPETokenizer
# - Integration tests with Embedding, DataLoader
```
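
As a quick pre-flight check, grep can confirm the file now defines the tokenization test classes (the class names come from the corrected skeleton above):

```bash
grep -n "^class Test" tests/10_tokenization/test_progressive_integration.py
# Expect: TestModule10TokenizationCore, TestTokenizationIntegration, TestTokenizationEdgeCases
```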
## Summary

- **Current Status:** CRITICAL - Wrong module tested (Module 11 instead of 10)
- **Root Cause:** Copy-paste error from the Module 11 template
- **Impact:** 0% integration test coverage for Module 10
- **Fix:** Replace with the corrected reference implementation
- **Urgency:** HIGH - Students have no validation of tokenization integration