Module 11: Tokenization - Text Processing for Language Models
Overview
This module implements the fundamental text processing systems that convert raw text into the numerical sequences neural networks consume. You'll build character-level and subword tokenizers from scratch and explore the critical trade-off between vocabulary size and sequence length that shapes model performance.
What You'll Learn
Core Implementations
- Character Tokenizer: Simple character-level tokenization with special tokens
- BPE Tokenizer: Byte Pair Encoding for efficient subword units
- Vocabulary Management: Bidirectional mappings between text and indices
- Padding & Truncation: Batch processing utilities for uniform sequences
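To make the first two bullets concrete, here is a minimal character-level tokenizer sketch with a reserved padding token. The class and method names (`TinyCharTokenizer`, `build_vocab`) are illustrative only, not the module's actual API:

```python
class TinyCharTokenizer:
    """A toy character-level tokenizer sketch (names are illustrative)."""

    def __init__(self):
        # Reserve index 0 for a hypothetical <pad> special token.
        self.char_to_id = {"<pad>": 0}
        self.id_to_char = {0: "<pad>"}

    def build_vocab(self, texts):
        # Assign each new character the next free index.
        for text in texts:
            for ch in text:
                if ch not in self.char_to_id:
                    idx = len(self.char_to_id)
                    self.char_to_id[ch] = idx
                    self.id_to_char[idx] = ch

    def encode(self, text):
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids)

tok = TinyCharTokenizer()
tok.build_vocab(["hello world"])
ids = tok.encode("hello")
assert tok.decode(ids) == "hello"  # encode/decode round-trips losslessly
```

The bidirectional `char_to_id` / `id_to_char` mappings are the vocabulary-management pattern the third bullet refers to: every tokenizer needs both directions to encode inputs and decode model outputs.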
ML Systems Concepts
- Memory Efficiency: How vocabulary size affects model parameters
- Performance Optimization: Tokenization throughput and caching strategies
- Scaling Trade-offs: Vocabulary size vs sequence length vs compute
- Production Patterns: Efficient text processing for large-scale systems
Performance Engineering
- Tokenization Profiling: Measuring speed and memory usage
- Cache Optimization: Reducing repeated tokenization overhead
- Batch Processing: Efficient handling of multiple texts
- Scaling Analysis: Understanding performance with large texts
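A profiling pass can be as simple as timing throughput over a repeated corpus. This sketch uses a whitespace split as a stand-in for a real tokenizer, purely to show the measurement pattern:

```python
import time

def tokenize(text):
    # Stand-in for a real tokenizer: whitespace split (illustrative only).
    return text.split()

# Repeat one sentence to simulate a corpus of 10,000 documents.
corpus = ["the quick brown fox jumps over the lazy dog"] * 10_000

start = time.perf_counter()
total_tokens = sum(len(tokenize(t)) for t in corpus)
elapsed = time.perf_counter() - start

print(f"{total_tokens} tokens in {elapsed:.4f}s "
      f"({total_tokens / elapsed:,.0f} tokens/s)")
```

Swapping in your own tokenizer's `encode` gives a directly comparable tokens-per-second figure, which is the number that matters when tokenization sits in a training pipeline's hot path.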
Key Learning Outcomes
By completing this module, you'll understand:
- Text-to-Numbers Pipeline: How raw text becomes neural network input
- Tokenization Strategies: Character vs subword vs word-level approaches
- Systems Trade-offs: Vocabulary size impacts on memory and compute
- Performance Engineering: Optimizing text processing for production
- Language Model Foundation: How tokenization affects model capabilities
Files in This Module
- tokenization_dev.py - Main implementation file with all tokenizers
- tokenization_dev.ipynb - Jupyter notebook (auto-generated)
- module.yaml - Module configuration and metadata
- README.md - This documentation file
Usage Example
```python
from tinytorch.core.tokenization import CharTokenizer, BPETokenizer

# Character-level tokenization
char_tokenizer = CharTokenizer()
tokens = char_tokenizer.encode("Hello world!")
text = char_tokenizer.decode(tokens)

# BPE tokenization
bpe_tokenizer = BPETokenizer(vocab_size=1000)
bpe_tokenizer.train(["Hello world", "World hello", "Hello hello world"])
tokens = bpe_tokenizer.encode("Hello world!")
```
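The core of BPE training is a loop that counts adjacent symbol pairs and merges the most frequent one. Here is a toy sketch of a single merge step; the helper names (`most_frequent_pair`, `merge_pair`) are illustrative, not the module's API:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the top pair."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(tuple(out))
    return merged

# Each word starts as a tuple of single characters.
words = [tuple("hug"), tuple("pug"), tuple("hugs")]
pair = most_frequent_pair(words)   # ("u", "g") appears 3 times
words = merge_pair(words, pair)    # ("h", "ug"), ("p", "ug"), ("h", "ug", "s")
```

A full trainer simply repeats this step until the vocabulary reaches `vocab_size`, recording each merge so `encode` can replay them in order on new text.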
Integration with TinyTorch
This module exports to tinytorch.core.tokenization and provides the text processing foundation for:
- Embedding layers (Module 12) - Converting tokens to vectors
- Language models (Module 14+) - Processing text sequences
- Training pipelines - Efficient batch text processing
Systems Engineering Focus
This module emphasizes the systems engineering aspects of tokenization:
Performance Characteristics
- Character tokenization: Small vocab (~256), long sequences
- BPE tokenization: Medium vocab (~50k), shorter sequences
- Memory scaling: O(vocab_size × embedding_dim) for embedding tables
- Attention scaling: O(sequence_length²) for transformer models
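These scaling laws are easy to check with back-of-envelope arithmetic. The sketch below assumes float32 parameters (4 bytes each) and an embedding dimension of 512, both hypothetical choices for illustration:

```python
def embedding_bytes(vocab_size, embedding_dim, bytes_per_param=4):
    # Embedding table size: O(vocab_size * embedding_dim), float32 assumed.
    return vocab_size * embedding_dim * bytes_per_param

def attention_scores(seq_len):
    # One attention matrix holds seq_len^2 scores per head.
    return seq_len ** 2

# Character vocab (~256) vs BPE vocab (~50k) at embedding_dim = 512:
char_mb = embedding_bytes(256, 512) / 1e6     # ~0.5 MB
bpe_mb = embedding_bytes(50_000, 512) / 1e6   # ~102 MB

# But BPE's shorter sequences pay off quadratically in attention:
# "Hello world!" is 12 characters vs. perhaps 3 BPE tokens.
char_attn = attention_scores(12)  # 144 scores
bpe_attn = attention_scores(3)    # 9 scores
```

The two functions make the trade-off explicit: a larger vocabulary costs linearly in embedding memory but buys a quadratic saving in attention compute.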
Production Considerations
- Tokenization can become a bottleneck in training pipelines
- Efficient string processing is critical for high-throughput systems
- Caching strategies provide significant speedups for repeated texts
- Vocabulary size affects model download size and memory usage
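The caching point above can be demonstrated with Python's standard `functools.lru_cache`; the codepoint-based stand-in tokenizer here is illustrative only:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_encode(text):
    # Stand-in tokenizer: character codepoints (illustrative only).
    # Returns a tuple so the result is hashable and safely cacheable.
    return tuple(ord(ch) for ch in text)

# Repeated texts (common in logs, chat, and templated data) hit the cache.
for _ in range(1000):
    cached_encode("hello world")

info = cached_encode.cache_info()
print(info)  # 999 hits, 1 miss: only the first call did real work
```

Note the cache key is the raw string, so this only helps when identical texts recur; for corpora of mostly unique documents, batch processing matters more than caching.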
Prerequisites
- Module 02: Tensor (for basic data structures)
- Understanding of string processing and algorithms
Estimated Time
4-5 hours including implementation, testing, and analysis
Next Steps
After completing this module, you'll be ready for:
- Module 12: Embeddings - Converting tokens to dense vector representations
- Module 13: Attention - Processing sequences with attention mechanisms
- Module 14: Transformers - Complete language model architectures