mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-30 09:21:44 -05:00
Module 11: Tokenization - Text Processing for Language Models
Overview
This module implements the text processing systems that convert raw text into the sequences of integer token IDs that neural networks consume. You'll build character-level and subword tokenizers from scratch and examine the trade-off between vocabulary size and sequence length, which determines how much memory and compute a model needs.
What You'll Learn
Core Implementations
- Character Tokenizer: Simple character-level tokenization with special tokens
- BPE Tokenizer: Byte Pair Encoding for efficient subword units
- Vocabulary Management: Bidirectional mappings between text and indices
- Padding & Truncation: Batch processing utilities for uniform sequences
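As a rough illustration of how the pieces listed above fit together, here is a minimal sketch of a character tokenizer with special tokens, bidirectional vocabulary mappings, and a padding/truncation helper. The names `SketchCharTokenizer` and `pad_or_truncate` and the `<pad>`/`<unk>` tokens are assumptions made for this example; the module's actual `CharTokenizer` may be structured differently.

```python
# Illustrative sketch only; not the TinyTorch CharTokenizer implementation.
PAD, UNK = "<pad>", "<unk>"  # assumed special tokens


class SketchCharTokenizer:
    def __init__(self, corpus):
        # Special tokens take the lowest indices; characters follow in sorted order.
        symbols = [PAD, UNK] + sorted(set("".join(corpus)))
        self.token_to_id = {s: i for i, s in enumerate(symbols)}
        self.id_to_token = {i: s for s, i in self.token_to_id.items()}

    def encode(self, text):
        # Characters never seen in the corpus map to the <unk> index.
        unk = self.token_to_id[UNK]
        return [self.token_to_id.get(ch, unk) for ch in text]

    def decode(self, ids):
        # Padding is skipped so that padded sequences decode to the original text.
        tokens = (self.id_to_token[i] for i in ids)
        return "".join(t for t in tokens if t != PAD)


def pad_or_truncate(ids, length, pad_id=0):
    """Return exactly `length` ids: cut long sequences, right-pad short ones."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


tok = SketchCharTokenizer(["hello world"])
batch = [pad_or_truncate(tok.encode(t), 8) for t in ["hello", "hello world"]]
print(batch)  # two rows of equal length, ready to stack into a tensor
```

Reserving index 0 for padding is a common convention because it lets downstream code mask padded positions with a simple equality check.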
ML Systems Concepts
- Memory Efficiency: How vocabulary size affects model parameters
- Performance Optimization: Tokenization throughput and caching strategies
- Scaling Trade-offs: Vocabulary size vs sequence length vs compute
- Production Patterns: Efficient text processing for large-scale systems
Performance Engineering
- Tokenization Profiling: Measuring speed and memory usage
- Cache Optimization: Reducing repeated tokenization overhead
- Batch Processing: Efficient handling of multiple texts
- Scaling Analysis: Understanding performance with large texts
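One possible way to approach the profiling topics above uses only the standard library: `time.perf_counter` for throughput and `tracemalloc` for peak allocation. The helper `profile_encode` and the stand-in `byte_encode` function are hypothetical names for this sketch; any callable that maps a string to a list of IDs could be passed instead.

```python
import time
import tracemalloc


def profile_encode(encode, texts):
    """Return (characters per second, peak bytes allocated) for one pass over texts."""
    total_chars = sum(len(t) for t in texts)
    tracemalloc.start()
    start = time.perf_counter()
    encoded = [encode(t) for t in texts]  # kept alive so the peak includes the output
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return total_chars / max(elapsed, 1e-9), peak


def byte_encode(text):
    # Stand-in encoder (an assumption for this sketch): one ID per UTF-8 byte.
    return list(text.encode("utf-8"))


texts = ["the quick brown fox jumps over the lazy dog"] * 1000
chars_per_sec, peak_bytes = profile_encode(byte_encode, texts)
print(f"{chars_per_sec:,.0f} chars/s, peak {peak_bytes / 1024:.1f} KiB")
```

Note that `tracemalloc` adds overhead of its own, so for careful throughput numbers it is better to measure time and memory in separate runs.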
Key Learning Outcomes
By completing this module, you'll understand:
- Text-to-Numbers Pipeline: How raw text becomes neural network input
- Tokenization Strategies: Character vs subword vs word-level approaches
- Systems Trade-offs: Vocabulary size impacts on memory and compute
- Performance Engineering: Optimizing text processing for production
- Language Model Foundation: How tokenization affects model capabilities
Files in This Module
- tokenization_dev.py: Main implementation file with all tokenizers
- tokenization_dev.ipynb: Jupyter notebook (auto-generated)
- module.yaml: Module configuration and metadata
- README.md: This documentation file
Usage Example
from tinytorch.core.tokenization import CharTokenizer, BPETokenizer
# Character-level tokenization
char_tokenizer = CharTokenizer()
tokens = char_tokenizer.encode("Hello world!")
text = char_tokenizer.decode(tokens)
# BPE tokenization
bpe_tokenizer = BPETokenizer(vocab_size=1000)
bpe_tokenizer.train(["Hello world", "World hello", "Hello hello world"])
tokens = bpe_tokenizer.encode("Hello world!")
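To give a feel for what training a BPE tokenizer involves, the following is a minimal sketch of the classic merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. The functions `learn_bpe_merges`, `merge_pair`, and `apply_merges` are hypothetical names for this illustration and are not claimed to match the internals of `BPETokenizer.train`.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(texts, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split words."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    words = Counter(tuple(w) for t in texts for w in t.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties resolve to the first pair counted
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def apply_merges(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low low low lower lowest"], num_merges=2)
print(merges)                          # [('l', 'o'), ('lo', 'w')]
print(apply_merges("lowest", merges))  # ['low', 'e', 's', 't']
```

Each learned merge adds one entry to the vocabulary, which is how the target vocabulary size controls the balance between vocabulary growth and shorter token sequences.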
Integration with TinyTorch
This module exports to tinytorch.core.tokenization and provides the text processing foundation for:
- Embedding layers (Module 12) - Converting tokens to vectors
- Language models (Module 14+) - Processing text sequences
- Training pipelines - Efficient batch text processing
Systems Engineering Focus
This module emphasizes the systems engineering aspects of tokenization:
Performance Characteristics
- Character tokenization: Small vocab (~256), long sequences
- BPE tokenization: Medium vocab (~50k), shorter sequences
- Memory scaling: O(vocab_size × embedding_dim) for embedding tables
- Attention scaling: O(sequence_length²) for self-attention in transformer models, so shorter token sequences reduce compute
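The scaling relationships above can be made concrete with a little arithmetic. In this sketch the embedding dimension (768), 4-byte float32 parameters, and the example sequence lengths (1,000 characters versus 250 subword tokens for the same text) are assumptions chosen purely for illustration.

```python
def embedding_table_bytes(vocab_size, embedding_dim, bytes_per_param=4):
    """Memory for an embedding table: one vector of embedding_dim per vocabulary entry."""
    return vocab_size * embedding_dim * bytes_per_param


def attention_score_entries(sequence_length):
    """Number of pairwise attention scores per head for one sequence."""
    return sequence_length ** 2


embedding_dim = 768  # assumed model width
for name, vocab_size in [("char", 256), ("bpe", 50_000)]:
    mib = embedding_table_bytes(vocab_size, embedding_dim) / 2**20
    print(f"{name:>4}: vocab={vocab_size:>6}  embedding table = {mib:.2f} MiB")

# Assumed lengths for the same text: 1,000 characters vs 250 subword tokens.
ratio = attention_score_entries(1000) / attention_score_entries(250)
print(f"attention score entries, char vs subword: {ratio:.0f}x")  # 16x
```

Under these assumptions the larger vocabulary costs roughly 146 MiB of embedding parameters instead of under 1 MiB, while the 4x shorter sequence cuts the attention score matrix by 16x.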
Production Considerations
- Tokenization can become a bottleneck in training pipelines
- Efficient string processing is critical for high-throughput systems
- Caching tokenized output avoids recomputing results for repeated texts and can provide significant speedups
- Vocabulary size affects model download size and memory usage
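As one example of a caching strategy, the standard library's `functools.lru_cache` can memoize an encode function so that repeated texts are tokenized only once. The `slow_encode` function below is a stand-in assumed for this sketch; in practice it would be replaced by a real tokenizer's encode call.

```python
from functools import lru_cache


def slow_encode(text):
    # Stand-in for an expensive tokenizer (an assumption): one ID per UTF-8 byte.
    return tuple(text.encode("utf-8"))


@lru_cache(maxsize=10_000)
def cached_encode(text):
    # Returning an immutable tuple keeps cached results safe from accidental mutation.
    return slow_encode(text)


texts = ["hello world", "hello world", "goodbye", "hello world"]
ids = [list(cached_encode(t)) for t in texts]

info = cached_encode.cache_info()
print(f"hits={info.hits} misses={info.misses}")  # hits=2 misses=2
```

A bounded `maxsize` keeps memory use predictable; whether caching pays off depends on how often texts actually repeat in the workload.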
Prerequisites
- Module 02: Tensor (for basic data structures)
- Understanding of string processing and algorithms
Estimated Time
4-5 hours including implementation, testing, and analysis
Next Steps
After completing this module, you'll be ready for:
- Module 12: Embeddings - Converting tokens to dense vector representations
- Module 13: Attention - Processing sequences with attention mechanisms
- Module 14: Transformers - Complete language model architectures