docs: Improve tokenization module with enhanced ASCII diagrams
Following module developer guidelines, added comprehensive visual diagrams:

1. Text-to-Numbers Pipeline (Introduction):
   - Added full boxed diagram showing 4-step tokenization process
   - Clear visual flow from human text to numerical IDs
   - Each step explained inline with the diagram

2. Character Tokenization Process:
   - Step-by-step vocabulary building visualization
   - Shows corpus → unique chars → vocab with IDs
   - Encoding process with ID lookup visualization
   - Decoding process with reverse lookup
   - All in clear nested boxes

3. BPE Training Algorithm:
   - Comprehensive 4-step process with nested boxes
   - Pair frequency analysis with bar charts (████)
   - Before/After merge visualizations
   - Iteration examples showing vocabulary growth
   - Final results with key insights

4. Memory Layout for Embedding Tables:
   - Visual bars showing relative memory sizes
   - Character (204KB) vs BPE-50K (102MB) vs Word-100K (204MB)
   - Shows fp32/fp16/int8 precision trade-offs
   - Real production model examples (GPT-2/3, BERT, T5, LLaMA)
   - Clear table format for comparison

Educational improvements:
- More visual, less text-heavy
- Clearer step-by-step flows
- Better intuition building
- Production context throughout
- Following module developer ASCII diagram patterns

Students now see:
- HOW tokenization works (not just WHAT)
- WHY different strategies exist
- WHAT the memory implications are
- HOW production models make these choices
@@ -79,23 +79,40 @@ Neural networks operate on numbers, but humans communicate with text. Tokenizati

### The Text-to-Numbers Challenge

Consider the sentence: "Hello, world!"

Consider the sentence: "Hello, world!" - how do we turn this into numbers a neural network can process?

```
Human Text: "Hello, world!"
         ↓
   [Tokenization]
         ↓
Numerical IDs: [72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 33]

┌──────────────────────────────────────────────────────────────────┐
│  TOKENIZATION PIPELINE: Text → Numbers                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Input (Human Text): "Hello, world!"                             │
│         │                                                        │
│         ├─ Step 1: Split into tokens                             │
│         │    ['H','e','l','l','o',',', ...]                      │
│         │                                                        │
│         ├─ Step 2: Map to vocabulary IDs                         │
│         │    [72, 101, 108, 108, 111, ...]                       │
│         │                                                        │
│         ├─ Step 3: Handle unknowns                               │
│         │    Unknown chars → special <UNK> token                 │
│         │                                                        │
│         └─ Step 4: Enable decoding                               │
│              IDs → original text                                 │
│                                                                  │
│  Output (Token IDs): [72, 101, 108, 108, 111, 44, 32, ...]       │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

### The Four-Step Process

How do we represent this for a neural network? We need to:
1. **Split text into tokens** - meaningful units like words, subwords, or characters
2. **Map tokens to integers** - create a vocabulary that assigns unique IDs
3. **Handle unknown text** - deal with words not seen during training
4. **Enable reconstruction** - convert numbers back to readable text

How do we represent text for a neural network? We need a systematic pipeline:

**1. Split text into tokens** - Break text into meaningful units (words, subwords, or characters)
**2. Map tokens to integers** - Create a vocabulary that assigns each token a unique ID
**3. Handle unknown text** - Deal gracefully with tokens not seen during training
**4. Enable reconstruction** - Convert numbers back to readable text for interpretation
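
In plain Python, these four steps fit in a few lines. The sketch below uses a tiny hand-built vocabulary purely for illustration; it is not the API you will build in this module:

```python
# Rough sketch of the four steps with a toy, hand-built vocabulary
# (illustrative only; not this module's actual tokenizer API).
text = "hello"

# Steps 1-2: split into character tokens and map them through a vocabulary
vocab = {"<UNK>": 0, "h": 1, "e": 2, "l": 3, "o": 4}
ids = [vocab.get(ch, vocab["<UNK>"]) for ch in text]   # Step 3: unseen chars -> <UNK>
print(ids)                                              # [1, 2, 3, 3, 4]

# Step 4: decode by reversing the mapping
id_to_token = {i: tok for tok, i in vocab.items()}
print("".join(id_to_token[i] for i in ids))             # hello
```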

### Why This Matters

@@ -116,15 +133,59 @@ Different tokenization approaches make different trade-offs between vocabulary s

**Approach**: Each character gets its own token

```
Text: "Hello world"
         ↓
Tokens: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
         ↓
IDs: [8, 5, 12, 12, 15, 0, 23, 15, 18, 12, 4]

┌──────────────────────────────────────────────────────────────────┐
│  CHARACTER TOKENIZATION PROCESS                                  │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Step 1: Build Vocabulary from Unique Characters                 │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  Corpus: ["hello", "world"]                             │      │
│  │                    ↓                                    │      │
│  │  Unique chars: ['h', 'e', 'l', 'o', 'w', 'r', 'd']      │      │
│  │                    ↓                                    │      │
│  │  Vocabulary: ['<UNK>','h','e','l','o','w','r','d']      │      │
│  │  IDs:            0    1   2   3   4   5   6   7         │      │
│  └────────────────────────────────────────────────────────┘      │
│                                                                  │
│  Step 2: Encode Text Character by Character                      │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  Text: "hello"                                          │      │
│  │                                                         │      │
│  │  'h' → 1   (lookup in vocabulary)                       │      │
│  │  'e' → 2                                                │      │
│  │  'l' → 3                                                │      │
│  │  'l' → 3                                                │      │
│  │  'o' → 4                                                │      │
│  │                                                         │      │
│  │  Result: [1, 2, 3, 3, 4]                                │      │
│  └────────────────────────────────────────────────────────┘      │
│                                                                  │
│  Step 3: Decode by Reversing ID Lookup                           │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  IDs: [1, 2, 3, 3, 4]                                   │      │
│  │                                                         │      │
│  │  1 → 'h'   (reverse lookup)                             │      │
│  │  2 → 'e'                                                │      │
│  │  3 → 'l'                                                │      │
│  │  3 → 'l'                                                │      │
│  │  4 → 'o'                                                │      │
│  │                                                         │      │
│  │  Result: "hello"                                        │      │
│  └────────────────────────────────────────────────────────┘      │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
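
A minimal sketch of Step 1 (building the vocabulary from a corpus) might look like the snippet below; `build_char_vocab` is a hypothetical helper, not the exact implementation in this module:

```python
# Sketch of vocabulary construction from a corpus (illustrative only).
def build_char_vocab(corpus):
    tokens = ["<UNK>"]                 # reserve ID 0 for unknown characters
    for text in corpus:
        for ch in text:
            if ch not in tokens:       # keep first-seen order, skip duplicates
                tokens.append(ch)
    return {ch: i for i, ch in enumerate(tokens)}

vocab = build_char_vocab(["hello", "world"])
# {'<UNK>': 0, 'h': 1, 'e': 2, 'l': 3, 'o': 4, 'w': 5, 'r': 6, 'd': 7}
print([vocab[ch] for ch in "hello"])   # [1, 2, 3, 3, 4] -- matches Step 2 above
```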

**Pros**: Small vocabulary (~100), handles any text, no unknown tokens
**Cons**: Long sequences (1 char = 1 token), limited semantic understanding

**Pros**:
- Small vocabulary (~100 chars)
- Handles any text perfectly
- No unknown tokens (every character can be mapped)
- Simple implementation

**Cons**:
- Long sequences (1 character = 1 token)
- Limited semantic understanding (no word boundaries)
- More compute (longer sequences to process)

### Word-Level Tokenization
**Approach**: Each word gets its own token
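
In its simplest form this is just a whitespace split plus a dictionary lookup (an illustrative sketch with a toy vocabulary, not a production word tokenizer):

```python
# Toy word-level tokenization: split on whitespace, then map each word to an ID,
# falling back to <UNK> for words never seen during training (illustrative only).
word_vocab = {"<UNK>": 0, "hello": 1, "world": 2}
tokens = "hello world".split()                                   # ['hello', 'world']
ids = [word_vocab.get(w, word_vocab["<UNK>"]) for w in tokens]   # [1, 2]
```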
@@ -477,38 +538,84 @@ Character tokenization provides a simple, robust foundation for text processing.
"""
### Byte Pair Encoding (BPE) Tokenizer

BPE is the secret sauce behind modern language models. It learns to merge frequent character pairs, creating subword units that balance vocabulary size with sequence length.
BPE is the secret sauce behind modern language models (GPT, BERT, etc.). It learns to merge frequent character pairs, creating subword units that balance vocabulary size with sequence length.

```
BPE Training Process:

Step 1: Start with character vocabulary
  Text: ["hello", "hello", "help"]
  Initial tokens: [['h','e','l','l','o</w>'], ['h','e','l','l','o</w>'], ['h','e','l','p</w>']]

Step 2: Count character pairs
  ('h','e'): 3 times  ← Most frequent!
  ('e','l'): 3 times
  ('l','l'): 2 times
  ('l','o'): 2 times
  ('l','p'): 1 time

Step 3: Merge most frequent pair
  Merge ('h','e') → 'he'
  Tokens: [['he','l','l','o</w>'], ['he','l','l','o</w>'], ['he','l','p</w>']]
  Vocab: ['h','e','l','o','p','</w>','he']  ← New token added

Step 4: Repeat until target vocabulary size
  Next merge: ('l','l') → 'll'
  Tokens: [['he','ll','o</w>'], ['he','ll','o</w>'], ['he','l','p</w>']]
  Vocab: ['h','e','l','o','p','</w>','he','ll']  ← Growing vocabulary

Final result:
  Text "hello" → ['he', 'll', 'o</w>'] → 3 tokens (vs 5 characters)
  Text "help"  → ['he', 'l', 'p</w>']  → 3 tokens (vs 4 characters)

┌───────────────────────────────────────────────────────────────────────────┐
│  BPE TRAINING ALGORITHM: Learning Subword Units                           │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  STEP 1: Initialize with Character Vocabulary                             │
│  ┌──────────────────────────────────────────────────────────────┐         │
│  │  Training Data: ["hello", "hello", "help"]                   │         │
│  │                                                              │         │
│  │  Initial Tokens (with end-of-word markers):                  │         │
│  │    ['h','e','l','l','o</w>']   (hello)                       │         │
│  │    ['h','e','l','l','o</w>']   (hello)                       │         │
│  │    ['h','e','l','p</w>']       (help)                        │         │
│  │                                                              │         │
│  │  Starting Vocab: ['h', 'e', 'l', 'o', 'p', '</w>']           │         │
│  │                    ↑ All unique characters                   │         │
│  └──────────────────────────────────────────────────────────────┘         │
│                                                                           │
│  STEP 2: Count All Adjacent Pairs                                         │
│  ┌──────────────────────────────────────────────────────────────┐         │
│  │  Pair Frequency Analysis:                                    │         │
│  │                                                              │         │
│  │    ('h', 'e'):  ██████  3 occurrences  ← MOST FREQUENT!      │         │
│  │    ('e', 'l'):  ██████  3 occurrences                        │         │
│  │    ('l', 'l'):  ████    2 occurrences                        │         │
│  │    ('l', 'o'):  ████    2 occurrences                        │         │
│  │    ('o', '<'):  ████    2 occurrences                        │         │
│  │    ('l', 'p'):  ██      1 occurrence                         │         │
│  │    ('p', '<'):  ██      1 occurrence                         │         │
│  └──────────────────────────────────────────────────────────────┘         │
│                                                                           │
│  STEP 3: Merge Most Frequent Pair                                         │
│  ┌──────────────────────────────────────────────────────────────┐         │
│  │  Merge Operation: ('h', 'e') → 'he'                          │         │
│  │                                                              │         │
│  │  BEFORE:                        AFTER:                       │         │
│  │  ['h','e','l','l','o</w>']  →   ['he','l','l','o</w>']       │         │
│  │  ['h','e','l','l','o</w>']  →   ['he','l','l','o</w>']       │         │
│  │  ['h','e','l','p</w>']      →   ['he','l','p</w>']           │         │
│  │                                                              │         │
│  │  Updated Vocab: ['h','e','l','o','p','</w>', 'he']           │         │
│  │                                              ↑ NEW TOKEN!    │         │
│  └──────────────────────────────────────────────────────────────┘         │
│                                                                           │
│  STEP 4: Repeat Until Target Vocab Size Reached                           │
│  ┌──────────────────────────────────────────────────────────────┐         │
│  │  Iteration 2: Next most frequent is ('l', 'l')               │         │
│  │  Merge ('l','l') → 'll'                                      │         │
│  │                                                              │         │
│  │  ['he','l','l','o</w>']  →  ['he','ll','o</w>']              │         │
│  │  ['he','l','l','o</w>']  →  ['he','ll','o</w>']              │         │
│  │  ['he','l','p</w>']      →  ['he','l','p</w>']               │         │
│  │                                                              │         │
│  │  Updated Vocab: ['h','e','l','o','p','</w>','he','ll']       │         │
│  │                                                  ↑ NEW!      │         │
│  │                                                              │         │
│  │  Continue merging until vocab_size target...                 │         │
│  └──────────────────────────────────────────────────────────────┘         │
│                                                                           │
│  FINAL RESULTS:                                                           │
│  ┌──────────────────────────────────────────────────────────────┐         │
│  │  Trained BPE can now encode efficiently:                     │         │
│  │                                                              │         │
│  │  "hello" → ['he', 'll', 'o</w>']  = 3 tokens (vs 5 chars)    │         │
│  │  "help"  → ['he', 'l', 'p</w>']   = 3 tokens (vs 4 chars)    │         │
│  │                                                              │         │
│  │  💡 Key Insight: BPE automatically discovers:                │         │
│  │     - Common prefixes ('he')                                 │         │
│  │     - Morphological patterns ('ll')                          │         │
│  │     - Natural word boundaries (</w>)                         │         │
│  └──────────────────────────────────────────────────────────────┘         │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
```

BPE discovers natural word boundaries and common patterns automatically!
**Why BPE Works**: By starting with characters and iteratively merging frequent pairs, BPE discovers the natural statistical patterns in language. Common words become single tokens, rare words split into recognizable subword pieces!
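
To make the loop concrete, here is a toy sketch of a single training iteration (pair counting plus one merge). It only illustrates the idea and is not the BPE tokenizer you will implement below:

```python
from collections import Counter

# Toy sketch of one BPE training iteration (Steps 2-3 above); illustrative only.
corpus = [["h", "e", "l", "l", "o</w>"],
          ["h", "e", "l", "l", "o</w>"],
          ["h", "e", "l", "p</w>"]]

# Step 2: count adjacent symbol pairs across the corpus
pairs = Counter((a, b) for word in corpus for a, b in zip(word, word[1:]))
best = pairs.most_common(1)[0][0]          # ('h', 'e') appears 3 times

# Step 3: merge every occurrence of the most frequent pair into one symbol
def merge(word, pair):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])   # fuse the pair into a new token
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

corpus = [merge(word, best) for word in corpus]
print(corpus[0])   # ['he', 'l', 'l', 'o</w>']
# Repeating Steps 2-3 keeps adding merged symbols until the target vocab size.
```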
"""
# %% nbgrader={"grade": false, "grade_id": "bpe-tokenizer", "solution": true}
@@ -1080,11 +1187,57 @@ ChatGPT: ~100K tokens with extended vocabulary

**Memory implications for embedding tables**:
```
Tokenizer      Vocab Size    Embed Dim    Parameters    Memory (fp32)
Character      100           512          51K           204 KB
BPE-1K         1,000         512          512K          2.0 MB
BPE-50K        50,000        512          25.6M         102.4 MB
Word-100K      100,000       512          51.2M         204.8 MB

┌─────────────────────────────────────────────────────────────────────┐
│  EMBEDDING TABLE MEMORY: Vocabulary Size × Embedding Dimension      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  CHARACTER TOKENIZER (Vocab: 100)                                   │
│  ┌────────────────────────────┐                                     │
│  │ 100 × 512 = 51,200 params  │  Memory: 204 KB                     │
│  │ ████                       │  ↑ Tiny embedding table!            │
│  └────────────────────────────┘                                     │
│                                                                     │
│  BPE-SMALL (Vocab: 1,000)                                           │
│  ┌────────────────────────────┐                                     │
│  │ 1K × 512 = 512K params     │  Memory: 2.0 MB                     │
│  │ ██████████                 │  ↑ Still manageable                 │
│  └────────────────────────────┘                                     │
│                                                                     │
│  BPE-LARGE (Vocab: 50,000)  ← MOST PRODUCTION MODELS                │
│  ┌────────────────────────────────────────────────────────┐         │
│  │ 50K × 512 = 25.6M params                                │         │
│  │ ████████████████████████████████████████████████       │         │
│  │                                                         │         │
│  │ Memory: 102.4 MB (fp32)                                 │         │
│  │          51.2 MB (fp16)  ← Half precision saves 50%     │         │
│  │          25.6 MB (int8)  ← Quantization saves 75%       │         │
│  └────────────────────────────────────────────────────────┘         │
│                                                                     │
│  WORD-LEVEL (Vocab: 100,000)                                        │
│  ┌────────────────────────────────────────────────────────┐         │
│  │ 100K × 512 = 51.2M params                               │         │
│  │ ████████████████████████████████████████████████████   │         │
│  │                                                         │         │
│  │ Memory: 204.8 MB (fp32)  ← Often too large!             │         │
│  │         102.4 MB (fp16)                                 │         │
│  └────────────────────────────────────────────────────────┘         │
│                                                                     │
│  💡 Key Trade-off:                                                  │
│     Larger vocab → Shorter sequences → Less compute                 │
│     BUT larger vocab → More embedding memory → Harder to train      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Real-World Production Examples:
┌─────────────┬──────────────┬───────────────┬──────────────────┐
│ Model       │ Vocab Size   │ Embed Dim     │ Embed Memory     │
├─────────────┼──────────────┼───────────────┼──────────────────┤
│ GPT-2       │ 50,257       │ 1,600         │ 321 MB           │
│ GPT-3       │ 50,257       │ 12,288        │ 2.4 GB           │
│ BERT        │ 30,522       │ 768           │ 94 MB            │
│ T5          │ 32,128       │ 512           │ 66 MB            │
│ LLaMA-7B    │ 32,000       │ 4,096         │ 524 MB           │
└─────────────┴──────────────┴───────────────┴──────────────────┘
```
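
These figures are just vocab_size × embed_dim × bytes per parameter. A quick sanity check in plain Python (the helper name is invented for illustration):

```python
# Back-of-the-envelope check of the embedding-table sizes above (pure arithmetic).
def embed_memory_mb(vocab_size, embed_dim, bytes_per_param=4):   # 4 bytes = fp32
    return vocab_size * embed_dim * bytes_per_param / 1e6

print(embed_memory_mb(100, 512))         # ~0.2 MB    (character tokenizer)
print(embed_memory_mb(50_000, 512))      # 102.4 MB   (BPE-50K, fp32)
print(embed_memory_mb(50_000, 512, 2))   # 51.2 MB    (fp16 halves it)
print(embed_memory_mb(50_257, 1_600))    # ~321.6 MB  (GPT-2-scale table)
```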
"""