# Module 11: Tokenization - Text Processing for Language Models

## Overview
This module implements the text processing systems that convert raw text into sequences of integer token IDs that neural networks can consume. You'll build character-level and subword tokenizers from scratch and examine the trade-off between vocabulary size and sequence length, which affects model performance.

## What You'll Learn

### Core Implementations
- **Character Tokenizer**: Simple character-level tokenization with special tokens
- **BPE Tokenizer**: Byte Pair Encoding for efficient subword units
- **Vocabulary Management**: Bidirectional mappings between text and indices
- **Padding & Truncation**: Batch processing utilities for uniform sequences
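
As a rough illustration of the first and third items, here is a minimal sketch of a character-level tokenizer with a bidirectional vocabulary. The class name `SketchCharTokenizer` and the `<PAD>` / `<UNK>` special tokens are assumptions made for this example; they are not taken from TinyTorch's `CharTokenizer`.

```python
class SketchCharTokenizer:
    """Hypothetical character-level tokenizer; not TinyTorch's CharTokenizer."""

    PAD, UNK = "<PAD>", "<UNK>"  # assumed special tokens

    def __init__(self, corpus):
        # Reserve the lowest IDs for special tokens, then add sorted characters.
        symbols = [self.PAD, self.UNK] + sorted(set(corpus))
        self.token_to_id = {tok: i for i, tok in enumerate(symbols)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text):
        # Characters missing from the vocabulary map to the <UNK> ID.
        unk = self.token_to_id[self.UNK]
        return [self.token_to_id.get(ch, unk) for ch in text]

    def decode(self, ids):
        # Skip padding IDs; every other ID is looked up in the reverse mapping.
        tokens = (self.id_to_token[i] for i in ids)
        return "".join(tok for tok in tokens if tok != self.PAD)


tok = SketchCharTokenizer("hello world")
ids = tok.encode("hello")
round_trip = tok.decode(ids)
```

Keeping both `token_to_id` and `id_to_token` makes encoding and decoding a single dictionary lookup per character, at the cost of storing the vocabulary twice.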

### ML Systems Concepts
- **Memory Efficiency**: How vocabulary size affects model parameters
- **Performance Optimization**: Tokenization throughput and caching strategies
- **Scaling Trade-offs**: Vocabulary size vs sequence length vs compute
- **Production Patterns**: Efficient text processing for large-scale systems

### Performance Engineering
- **Tokenization Profiling**: Measuring speed and memory usage
- **Cache Optimization**: Reducing repeated tokenization overhead
- **Batch Processing**: Efficient handling of multiple texts
- **Scaling Analysis**: Understanding performance with large texts
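
One way to explore profiling and caching is sketched below using only the standard library. The `slow_encode` function is a stand-in tokenizer invented for this example, and `throughput` is a hypothetical helper; neither is part of TinyTorch.

```python
import time
from functools import lru_cache


def slow_encode(text):
    """Stand-in tokenizer: map each character to its Unicode code point."""
    return tuple(ord(ch) for ch in text)


@lru_cache(maxsize=1024)
def cached_encode(text):
    # Return a tuple so callers cannot mutate the cached value.
    return slow_encode(text)


def throughput(encode_fn, texts):
    """Return tokens produced per second for one pass over texts."""
    start = time.perf_counter()
    total_tokens = sum(len(encode_fn(t)) for t in texts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


# A workload with heavy repetition: only two distinct strings.
texts = ["hello world", "tokenization", "hello world"] * 1000
uncached_rate = throughput(slow_encode, texts)
cached_rate = throughput(cached_encode, texts)
info = cached_encode.cache_info()  # hits, misses, maxsize, currsize
```

Whether the cache helps depends on how expensive the tokenizer is and how often texts repeat, so it is worth comparing `uncached_rate` and `cached_rate` on a realistic workload rather than assuming a speedup.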

## Key Learning Outcomes

By completing this module, you'll understand:

1. **Text-to-Numbers Pipeline**: How raw text becomes neural network input
2. **Tokenization Strategies**: Character vs subword vs word-level approaches
3. **Systems Trade-offs**: Vocabulary size impacts on memory and compute
4. **Performance Engineering**: Optimizing text processing for production
5. **Language Model Foundation**: How tokenization affects model capabilities
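
To make the character vs word-level comparison concrete, the sketch below measures vocabulary size and sequence length for the same toy corpus under both schemes. The helper names and the corpus are made up for illustration; subword tokenizers such as BPE typically land between these two extremes.

```python
def char_tokens(text):
    """Split text into individual characters (spaces included)."""
    return list(text)


def word_tokens(text):
    """Split text on whitespace."""
    return text.split()


corpus = "the cat sat on the mat the cat ran"

# For each scheme, record (vocabulary size, sequence length).
stats = {}
for name, tokenize in [("char", char_tokens), ("word", word_tokens)]:
    tokens = tokenize(corpus)
    stats[name] = (len(set(tokens)), len(tokens))

for name, (vocab_size, seq_len) in stats.items():
    print(f"{name}: vocab={vocab_size}, sequence length={seq_len}")
```

On this corpus the character scheme needs 11 symbols but 34 tokens, while the word scheme needs only 9 tokens over a 6-word vocabulary. On a realistic corpus the word vocabulary grows far faster than the character vocabulary, which is the trade-off outcome 3 refers to.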

## Files in This Module

- `tokenization_dev.py` - Main implementation file with all tokenizers
- `tokenization_dev.ipynb` - Jupyter notebook (auto-generated)
- `module.yaml` - Module configuration and metadata
- `README.md` - This documentation file

## Usage Example

```python
from tinytorch.core.tokenization import CharTokenizer, BPETokenizer

# Character-level tokenization: encode text to tokens, then decode back to text
char_tokenizer = CharTokenizer()
tokens = char_tokenizer.encode("Hello world!")
text = char_tokenizer.decode(tokens)

# BPE tokenization: train on a small corpus, then encode new text
bpe_tokenizer = BPETokenizer(vocab_size=1000)
bpe_tokenizer.train(["Hello world", "World hello", "Hello hello world"])
tokens = bpe_tokenizer.encode("Hello world!")
```
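
For intuition about what a BPE training step does, here is a minimal, self-contained sketch of merge learning. It is not the module's `BPETokenizer`: the function names `learn_bpe_merges`, `apply_merge`, and `bpe_encode` are hypothetical, ties are broken lexicographically as an arbitrary choice, and encoding simply replays the merges in the order they were learned.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of pair with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each distinct word becomes a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def bpe_encode(word, merges):
    """Split a word into characters, then replay the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
pieces = bpe_encode("lowly", merges)
```

With this toy corpus the two learned merges are `('l', 'o')` and `('lo', 'w')`, so the unseen word `"lowly"` is split into `['low', 'l', 'y']`: the frequent prefix becomes one unit while the rest falls back to characters.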

## Integration with TinyTorch

This module exports to `tinytorch.core.tokenization` and provides the text processing foundation for:
- **Embedding layers** (Module 12) - Converting tokens to vectors
- **Language models** (Module 14+) - Processing text sequences
- **Training pipelines** - Efficient batch text processing
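
Downstream layers generally expect every sequence in a batch to have the same length. The sketch below shows one way to pad and truncate token ID lists; the function name `pad_and_truncate`, the choice of `pad_id=0`, right-side padding, and the returned mask are all assumptions for this example rather than the module's actual utilities.

```python
def pad_and_truncate(sequences, max_len, pad_id=0):
    """Return (batch, mask) where every row has exactly max_len entries.

    mask holds 1 for real tokens and 0 for padding positions.
    """
    batch, mask = [], []
    for seq in sequences:
        kept = list(seq[:max_len])             # truncate long sequences
        n_pad = max_len - len(kept)
        batch.append(kept + [pad_id] * n_pad)  # right-pad short sequences
        mask.append([1] * len(kept) + [0] * n_pad)
    return batch, mask


batch, mask = pad_and_truncate([[5, 6, 7], [8], [1, 2, 3, 4, 5]], max_len=4)
# batch == [[5, 6, 7, 0], [8, 0, 0, 0], [1, 2, 3, 4]]
# mask  == [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
```

Returning a mask alongside the padded batch lets later stages ignore padding positions instead of treating `pad_id` as a real token.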

## Systems Engineering Focus

This module emphasizes the systems engineering aspects of tokenization:

### Performance Characteristics
- **Character tokenization**: Small vocab (~256), long sequences
- **BPE tokenization**: Medium vocab (~50k), shorter sequences
- **Memory scaling**: O(vocab_size × embedding_dim) for embedding tables
- **Attention scaling**: O(sequence_length²) for transformer models
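
The scaling rules above can be turned into a quick back-of-the-envelope calculation. The embedding dimension of 512, 4-byte float32 parameters, and the 4x sequence-length ratio between character and BPE tokenization are illustrative assumptions, not measurements from this module.

```python
def embedding_table_bytes(vocab_size, embedding_dim, bytes_per_param=4):
    """Memory for an embedding table: vocab_size x embedding_dim parameters."""
    return vocab_size * embedding_dim * bytes_per_param


def attention_score_entries(seq_len):
    """Entries in one attention score matrix: seq_len x seq_len."""
    return seq_len * seq_len


MIB = 2 ** 20
EMBEDDING_DIM = 512  # assumed for illustration

char_mib = embedding_table_bytes(256, EMBEDDING_DIM) / MIB
bpe_mib = embedding_table_bytes(50_000, EMBEDDING_DIM) / MIB

# Assume the same text needs 1024 character tokens or 256 BPE tokens.
attention_ratio = attention_score_entries(1024) / attention_score_entries(256)

print(f"char embedding table: {char_mib:.2f} MiB")
print(f"BPE embedding table:  {bpe_mib:.2f} MiB")
print(f"char attention cost is {attention_ratio:.0f}x the BPE cost")
```

Under these assumptions the character-level table is about 0.5 MiB versus roughly 98 MiB for BPE, while the 4x longer character sequence costs 16x more attention score entries. A larger vocabulary buys shorter sequences at the price of a bigger embedding table.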

### Production Considerations
- Tokenization can become a bottleneck in training pipelines
- Efficient string processing is critical for high-throughput systems
- Caching strategies provide significant speedups for repeated texts
- Vocabulary size affects model download size and memory usage

## Prerequisites
- Module 02: Tensor (for basic data structures)
- Understanding of string processing and algorithms

## Estimated Time
4-5 hours including implementation, testing, and analysis

## Next Steps
After completing this module, you'll be ready for:
- **Module 12: Embeddings** - Converting tokens to dense vector representations
- **Module 13: Attention** - Processing sequences with attention mechanisms
- **Module 14: Transformers** - Complete language model architectures