TinyTorch/modules/10_tokenization/module.yaml
Vijay Janapa Reddi 45a9cef548 Major reorganization: Remove setup module, renumber all modules, add tito setup command and numeric shortcuts
- Removed 01_setup module (archived to archive/setup_module)
- Renumbered all modules: tensor is now 01, activations is 02, etc.
- Added tito setup command for environment setup and package installation
- Added numeric shortcuts: tito 01, tito 02, etc. for quick module access
- Fixed view command to find dev files correctly
- Updated module dependencies and references
- Improved user experience: immediate ML learning instead of boring setup
2025-09-28 07:02:08 -04:00


description: Text processing systems that convert raw text into numerical sequences
  for language models
estimated_time: 4-5 hours
exports:
- CharTokenizer
- BPETokenizer
- TokenizationProfiler
- OptimizedTokenizer
learning_objectives:
- Implement character-level tokenization with special token handling
- Build BPE (Byte Pair Encoding) tokenizer for subword units
- 'Understand tokenization trade-offs: vocabulary size vs sequence length'
- Optimize tokenization performance for production systems
- Analyze how tokenization affects model memory and training efficiency
ml_systems_focus: Text processing pipelines, tokenization throughput, memory-efficient
  vocabulary management
name: Tokenization
next_modules:
- 12_embeddings
number: 11
prerequisites:
- 02_tensor
systems_concepts:
- Memory efficiency of token representations
- Vocabulary size vs model size tradeoffs
- Tokenization throughput optimization
- String processing performance
- Cache-friendly text processing patterns
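
For readers following along, here is a minimal sketch of what the CharTokenizer export might look like. The class name comes from the exports list above, but the constructor and method signatures are assumptions, not the module's actual API; it shows the first learning objective, character-level tokenization with special token handling.

# A minimal sketch, assuming CharTokenizer exposes encode/decode;
# the real module's constructor and special-token set may differ.
class CharTokenizer:
    def __init__(self, corpus: str):
        # Reserve special tokens first so their ids stay stable as the
        # character vocabulary grows.
        self.specials = ["<pad>", "<unk>", "<bos>", "<eos>"]
        chars = sorted(set(corpus))
        self.vocab = {tok: i for i, tok in enumerate(self.specials + chars)}
        self.inverse = {i: tok for tok, i in self.vocab.items()}

    def encode(self, text: str) -> list[int]:
        # Unknown characters map to <unk> instead of raising.
        unk = self.vocab["<unk>"]
        return [self.vocab.get(ch, unk) for ch in text]

    def decode(self, ids: list[int]) -> str:
        # Drop special tokens on the way back out.
        toks = [self.inverse[i] for i in ids]
        return "".join(t for t in toks if t not in self.specials)

tok = CharTokenizer("hello world")
assert tok.decode(tok.encode("hello")) == "hello"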
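The BPE objective centers on one loop: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat until the merge budget is spent. Below is a self-contained sketch of that training loop; the function name train_bpe and its signature are illustrative, not exports of this module.

# Core BPE training loop: repeatedly merge the most frequent adjacent
# symbol pair across the corpus. Words start as tuples of characters.
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        rewritten = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            rewritten[tuple(out)] += freq
        corpus = rewritten
    return merges

print(train_bpe(["low", "lower", "lowest"], num_merges=3))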
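The systems-focused objectives reduce to two numbers per tokenizer: compression (characters per token) and throughput (characters per second). This is a hedged sketch of the kind of measurement a TokenizationProfiler might perform; the function name and return fields here are assumptions.

import time

def profile(tokenizer, text: str) -> dict:
    # Works with anything exposing encode(), e.g. the CharTokenizer sketch above.
    start = time.perf_counter()
    ids = tokenizer.encode(text)
    elapsed = time.perf_counter() - start
    return {
        "tokens": len(ids),
        "chars_per_token": len(text) / max(len(ids), 1),   # compression
        "chars_per_sec": len(text) / max(elapsed, 1e-9),   # throughput
    }

A larger subword vocabulary raises chars_per_token (shorter sequences, less compute per example) at the cost of a bigger embedding table, which is the vocabulary-size-vs-sequence-length trade-off named in the learning objectives.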