Mirror of https://github.com/MLSysBook/TinyTorch.git, synced 2026-05-01 08:25:22 -05:00
- Removed 01_setup module (archived to archive/setup_module)
- Renumbered all modules: tensor is now 01, activations is 02, etc.
- Added tito setup command for environment setup and package installation
- Added numeric shortcuts: tito 01, tito 02, etc. for quick module access
- Fixed view command to find dev files correctly
- Updated module dependencies and references
- Improved user experience: immediate ML learning instead of boring setup
29 lines · 989 B · YAML
description: Text processing systems that convert raw text into numerical sequences
  for language models
estimated_time: 4-5 hours
exports:
- CharTokenizer
- BPETokenizer
- TokenizationProfiler
- OptimizedTokenizer
learning_objectives:
- Implement character-level tokenization with special token handling
- Build BPE (Byte Pair Encoding) tokenizer for subword units
- 'Understand tokenization trade-offs: vocabulary size vs sequence length'
- Optimize tokenization performance for production systems
- Analyze how tokenization affects model memory and training efficiency
ml_systems_focus: Text processing pipelines, tokenization throughput, memory-efficient
  vocabulary management
name: Tokenization
next_modules:
- 12_embeddings
number: 11
prerequisites:
- 02_tensor
systems_concepts:
- Memory efficiency of token representations
- Vocabulary size vs model size tradeoffs
- Tokenization throughput optimization
- String processing performance
- Cache-friendly text processing patterns
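The metadata above names the module's exports but not their code, which lives elsewhere in the repository. As rough orientation for the first learning objective, here is a minimal sketch of character-level tokenization with special token handling. Only the CharTokenizer name comes from this file; the encode/decode methods and the particular special tokens are illustrative assumptions, not the module's actual API.

# Minimal sketch of a character-level tokenizer with special tokens.
# CharTokenizer is the exported name; the method names and special
# tokens below are illustrative assumptions, not TinyTorch's API.

class CharTokenizer:
    def __init__(self, corpus: str, specials=("<pad>", "<unk>", "<bos>", "<eos>")):
        # Special tokens claim the lowest ids so they stay stable
        # regardless of the corpus contents.
        self.vocab = {tok: i for i, tok in enumerate(specials)}
        for ch in sorted(set(corpus)):
            self.vocab.setdefault(ch, len(self.vocab))
        self.inverse = {i: tok for tok, i in self.vocab.items()}
        self.unk_id = self.vocab["<unk>"]

    def encode(self, text: str) -> list[int]:
        # Unknown characters map to <unk> instead of raising.
        return [self.vocab.get(ch, self.unk_id) for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.inverse.get(i, "<unk>") for i in ids)

tok = CharTokenizer("hello world")
assert tok.decode(tok.encode("hello")) == "hello"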
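The BPE objective is where the vocabulary-size vs sequence-length trade-off shows up concretely: merging frequent character pairs into subword units grows the vocabulary but shortens encoded sequences, which changes both embedding-table memory and per-step compute. A minimal sketch of the core BPE training loop follows; the function name and signature are invented for illustration and are not the module's BPETokenizer API.

# Minimal sketch of the core BPE training loop: repeatedly merge the
# most frequent adjacent symbol pair. Illustrates the algorithm behind
# the module's BPETokenizer; names here are hypothetical.

from collections import Counter

def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Represent each word as a sequence of single-character symbols.
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere: replace (a, b) with the fused symbol.
        fused = "".join(best)
        for seq in seqs:
            i = 0
            while i < len(seq) - 1:
                if (seq[i], seq[i + 1]) == best:
                    seq[i:i + 2] = [fused]
                else:
                    i += 1
    return merges

print(learn_bpe_merges(["low", "lower", "lowest"] * 3, num_merges=3))

Each merge adds one entry to the vocabulary while reducing the token count of every sequence it applies to, which is the trade-off the learning objective asks students to reason about.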