# TinyTorch Testing Standards

## 🎯 Core Testing Philosophy

**Test immediately, test simply, test educationally.**

Testing in TinyTorch serves two purposes:
- **Verification**: Ensure the code works
- **Education**: Help students understand what they built
## 📋 Testing Patterns

### The Immediate Testing Pattern

**MANDATORY**: Test immediately after each implementation, not at the end.

```python
# ✅ CORRECT: Implementation followed by immediate test
class Tensor:
    def __init__(self, data):
        self.data = data

# Test Tensor creation immediately
def test_tensor_creation():
    t = Tensor([1, 2, 3])
    assert t.data == [1, 2, 3], "Tensor should store data"
    print("✅ Tensor creation works")

test_tensor_creation()

# ❌ WRONG: All tests grouped at the end
# [100 lines of implementations]
# [Then all tests at the bottom]
```
### Simple Assertion Testing

Use simple assertions, not complex frameworks.

```python
# ✅ GOOD: Simple and clear
def test_forward_pass():
    model = SimpleMLP()
    x = Tensor(np.random.randn(32, 784))
    output = model.forward(x)
    assert output.shape == (32, 10), f"Expected (32, 10), got {output.shape}"
    print("✅ Forward pass shapes correct")

# ❌ BAD: Over-engineered
class TestMLPForwardPass(unittest.TestCase):
    def setUp(self):
        self.model = SimpleMLP()

    def test_forward_pass_shape_validation_with_mock_data(self):
        ...  # 50 lines of test setup
```
### Educational Test Messages

Tests should teach, not just verify.

```python
# ✅ GOOD: Educational
def test_backpropagation():
    # Create simple network: 2 inputs → 2 hidden → 1 output
    net = TwoLayerNet(2, 2, 1)

    # Forward pass with XOR data
    x = Tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = Tensor([[0], [1], [1], [0]])
    output = net.forward(x)
    loss = mse_loss(output, y)
    print(f"Initial loss: {loss.data:.4f}")
    print("This high loss shows the network hasn't learned XOR yet")

    # Backward pass
    loss.backward()

    # Check gradients exist
    assert net.w1.grad is not None, "Gradients should be computed"
    print("✅ Backpropagation computed gradients")
    print("The network can now learn from its mistakes!")

# ❌ BAD: Just verification
def test_backprop():
    net = TwoLayerNet(2, 2, 1)
    # ... minimal test
    assert net.w1.grad is not None
    # No educational value
```
## 🧪 Performance Testing

### Baseline Comparisons

Always test against a clear baseline.

```python
def test_model_performance():
    # 1. Test random baseline
    random_model = create_random_network()
    random_acc = evaluate(random_model, test_data)
    print(f"Random network accuracy: {random_acc:.1%}")

    # 2. Test trained model
    trained_model = load_trained_model()
    trained_acc = evaluate(trained_model, test_data)
    print(f"Trained network accuracy: {trained_acc:.1%}")

    # 3. Show improvement
    improvement = trained_acc / random_acc
    print(f"Improvement: {improvement:.1f}× better than random")
    assert trained_acc > random_acc * 2, "Should be at least 2× better than random"
```
### Honest Performance Reporting

```python
# ✅ GOOD: Report actual measurements
def test_training_performance():
    start_time = time.time()
    accuracy = train_model(epochs=10)
    train_time = time.time() - start_time

    print(f"Achieved accuracy: {accuracy:.1%}")
    print(f"Training time: {train_time:.1f} seconds")
    print(f"Status: {'✅ PASS' if accuracy > 0.5 else '❌ FAIL'}")

# ❌ BAD: Theoretical claims
def test_training():
    # ... training code
    print("Can achieve 60-70% with proper tuning")  # Unverified claim
```
## 🔍 Test Organization

### Test Placement

```python
# Module structure with immediate tests
# module_name.py

# Part 1: Core implementation
class Tensor:
    ...

# Immediate test
test_tensor_creation()

# Part 2: Operations
def add(a, b):
    ...

# Immediate test
test_addition()

# Part 3: Advanced features
def backward():
    ...

# Immediate test
test_backward()

# At the end: Run all tests when executed directly
if __name__ == "__main__":
    print("Running all tests...")
    test_tensor_creation()
    test_addition()
    test_backward()
    print("✅ All tests passed!")
```
## ⚠️ Common Testing Mistakes

- **Grouping all tests at the end**
  - Loses educational flow
  - Students don't see immediate verification
- **Over-complicated test frameworks**
  - Obscures what's being tested
  - Adds unnecessary complexity
- **Testing without teaching**
  - Misses the opportunity to reinforce concepts
  - No educational value
- **Unverified performance claims**
  - Damages credibility
  - Misleads students
## 📝 Test Documentation

```python
def test_attention_mechanism():
    """
    Test that attention correctly weighs different positions.

    This test demonstrates the key insight of attention:
    the model learns what to focus on.
    """
    # Create simple sequence
    sequence = Tensor([[1, 0, 0],   # Position 0: important
                       [0, 0, 0],   # Position 1: padding
                       [0, 0, 1]])  # Position 2: important

    attention_weights = compute_attention(sequence)

    # Check that important positions get more weight
    assert attention_weights[0] > attention_weights[1]
    assert attention_weights[2] > attention_weights[1]

    print("✅ Attention focuses on important positions")
    print(f"Weights: {attention_weights}")
    print("Notice how padding (position 1) gets less attention")
```
## 🔧 Module Integration Testing

### Three-Tier Testing Strategy

TinyTorch uses a comprehensive testing approach:
- **Unit Tests**: Individual module functionality (in the modules themselves)
- **Module Integration Tests**: Inter-module compatibility (`tests/integration/`)
- **System Integration Tests**: End-to-end examples (`examples/`)
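One way these tiers can be laid out on disk is sketched below. Only `tests/integration/test_module_integration.py` and `examples/` are named elsewhere in this document; the remaining file names are illustrative placeholders.

```
modules/                              # each module carries its own inline unit tests
    tensor_module.py
    autograd_module.py
tests/
    integration/
        test_module_integration.py    # inter-module compatibility checks
examples/
    train_simple_model.py             # end-to-end system integration
```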
### Module Integration Tests Explained

**Purpose**: Test that modules work TOGETHER, not just individually.

**What Integration Tests Cover**:
- Data flows correctly between modules
- Import paths don't conflict
- Modules can consume each other's outputs
- Training pipelines work end-to-end
- Optimization modules integrate with core modules

**Example Integration Test**:
```python
def test_tensor_autograd_integration():
    """Test that the tensor and autograd modules work together."""
    from tinytorch.core.tensor import Tensor
    from tinytorch.core.autograd import Variable

    # Test data flow between modules
    t = Tensor([1.0, 2.0, 3.0])
    v = Variable(t, requires_grad=True)

    # Test that autograd can handle tensor operations
    result = v * 2
    assert result.data.tolist() == [2.0, 4.0, 6.0]
    print("✅ Tensor + Autograd integration working")

def test_training_pipeline_integration():
    """Test that the complete training pipeline works."""
    from tinytorch.utils.data import DataLoader, SimpleDataset
    from tinytorch.nn import Linear
    from tinytorch.core.optimizers import SGD

    # Test that data → model → loss → optimizer works end-to-end
    dataset = SimpleDataset([(i, i * 2) for i in range(10)])
    dataloader = DataLoader(dataset, batch_size=2)
    model = Linear(1, 1)
    optimizer = SGD([model.weight], lr=0.01)

    # Integration test: does one full training step execute?
    for batch_data, batch_labels in dataloader:
        output = model(batch_data)
        loss = mse_loss(output, batch_labels)
        loss.backward()
        optimizer.step()
        break  # Just test one iteration

    print("✅ Training pipeline integration working")
```
### Running Integration Tests

```bash
# Run module integration tests
python tests/integration/test_module_integration.py

# Expected output:
# ✅ Core Module Integration
# ✅ Training Pipeline Integration
# ✅ Optimization Module Integration
# ✅ Import Compatibility
# ✅ Cross-Module Data Flow
```
### Integration Test Categories

- **Core Module Integration**: tensor + autograd + layers
- **Training Pipeline Integration**: data + models + optimizers + training
- **Optimization Module Integration**: profiler + quantization + pruning with core
- **Import Compatibility**: All import paths work without conflicts (see the sketch below)
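The import-compatibility category can be covered with a very small test. The sketch below is illustrative rather than a required implementation; the module list simply mirrors the import paths used in the examples above and should be adjusted to whatever the package actually exports.

```python
import importlib

def test_import_compatibility():
    """Check that the public import paths used in this guide all resolve."""
    module_paths = [
        "tinytorch.core.tensor",      # paths taken from the examples above;
        "tinytorch.core.autograd",    # adjust to match the real package layout
        "tinytorch.core.optimizers",
        "tinytorch.nn",
        "tinytorch.utils.data",
    ]
    for path in module_paths:
        importlib.import_module(path)  # raises ImportError if a path is broken
    print("✅ Import compatibility working")

test_import_compatibility()
```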
### Critical Integration Points

- **Data Flow**: Tensor objects work across module boundaries
- **Interface Compatibility**: Module APIs match expectations
- **Training Workflows**: Complete training pipelines execute
- **Performance Integration**: Optimizations preserve correctness (see the sketch below)
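The last point deserves its own test: an optimization pass should not silently change what the model predicts. The sketch below is a hypothetical example, not part of the TinyTorch API; `quantize_model` stands in for whatever transformation the optimization modules provide, and the tolerance is an arbitrary choice.

```python
import numpy as np

def test_optimization_preserves_correctness():
    """An optimized model should stay close to the original's outputs."""
    model = load_trained_model()
    x = Tensor(np.random.randn(8, 784))

    baseline = model.forward(x)
    optimized_model = quantize_model(model)  # hypothetical optimization helper
    optimized = optimized_model.forward(x)

    # Compare within a small numerical tolerance rather than exact equality
    assert np.allclose(baseline.data, optimized.data, atol=1e-2), \
        "Optimization changed model outputs beyond tolerance"
    print("✅ Optimization preserved correctness")
```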
## 📋 Testing Checklist

### Before Any Commit

- Modified module unit tests pass
- Integration tests pass (90%+ success rate)
- At least one example still works
- No import errors in package structure

### Module Completion Requirements

- Unit tests in module pass
- Integration tests with other modules pass
- Module exports correctly to package
- Module works in training pipeline
## 🎯 Remember

**Tests are teaching tools, not just verification tools.**

Every test should help a student understand:
- What the code does
- Why it matters
- How to verify it works
- What success looks like
- How modules work together (integration focus)