docs: Improve tokenization module with enhanced ASCII diagrams

Following module developer guidelines, added comprehensive visual diagrams:

1. Text-to-Numbers Pipeline (Introduction):
   - Added full boxed diagram showing 4-step tokenization process
   - Clear visual flow from human text to numerical IDs
   - Each step explained inline with the diagram

2. Character Tokenization Process:
   - Step-by-step vocabulary building visualization
   - Shows corpus → unique chars → vocab with IDs
   - Encoding process with ID lookup visualization
   - Decoding process with reverse lookup
   - All in clear nested boxes

3. BPE Training Algorithm:
   - Comprehensive 4-step process with nested boxes
   - Pair frequency analysis with bar charts (████)
   - Before/After merge visualizations
   - Iteration examples showing vocabulary growth
   - Final results with key insights

4. Memory Layout for Embedding Tables:
   - Visual bars showing relative memory sizes
   - Character (204KB) vs BPE-50K (102MB) vs Word-100K (204MB)
   - Shows fp32/fp16/int8 precision trade-offs
   - Real production model examples (GPT-2/3, BERT, T5, LLaMA)
   - Clear table format for comparison

Educational improvements:
- More visual, less text-heavy
- Clearer step-by-step flows
- Better intuition building
- Production context throughout
- Following module developer ASCII diagram patterns

Students now see:
- HOW tokenization works (not just WHAT)
- WHY different strategies exist
- WHAT the memory implications are
- HOW production models make these choices
Author: Vijay Janapa Reddi
Date:   2025-10-24 10:51:00 -04:00
parent  0e997e4a10
commit  c6853d7550

@@ -79,23 +79,40 @@ Neural networks operate on numbers, but humans communicate with text. Tokenizati
### The Text-to-Numbers Challenge
Consider the sentence: "Hello, world!" - how do we turn this into numbers a neural network can process?
```
┌─────────────────────────────────────────────────────────────────┐
│ TOKENIZATION PIPELINE: Text → Numbers                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Input (Human Text): "Hello, world!"                             │
│       │                                                         │
│       ├─ Step 1: Split into tokens                              │
│       │    ['H','e','l','l','o',',', ...]                       │
│       │                                                         │
│       ├─ Step 2: Map to vocabulary IDs                          │
│       │    [72, 101, 108, 108, 111, ...]                        │
│       │                                                         │
│       ├─ Step 3: Handle unknowns                                │
│       │    Unknown chars → special <UNK> token                  │
│       │                                                         │
│       └─ Step 4: Enable decoding                                │
│            IDs → original text                                  │
│                                                                 │
│ Output (Token IDs): [72, 101, 108, 108, 111, 44, 32, ...]       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
### The Four-Step Process
How do we represent text for a neural network? We need a systematic pipeline (sketched in code after this list):
**1. Split text into tokens** - Break text into meaningful units (words, subwords, or characters)
**2. Map tokens to integers** - Create a vocabulary that assigns each token a unique ID
**3. Handle unknown text** - Deal gracefully with tokens not seen during training
**4. Enable reconstruction** - Convert numbers back to readable text for interpretation
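The IDs in the diagram above are just each character's Unicode code point, so Python's built-in `ord()` and `chr()` give a quick feel for steps 1, 2, and 4 before we build a real vocabulary (a real tokenizer needs step 3 because its vocabulary is finite):

```python
# Toy version of the pipeline: Unicode code points act as a ready-made vocabulary.
text = "Hello, world!"
ids = [ord(ch) for ch in text]          # Steps 1-2: split into chars, map to IDs
print(ids)                              # [72, 101, 108, 108, 111, 44, 32, 119, ...]
decoded = "".join(chr(i) for i in ids)  # Step 4: map IDs back to text
assert decoded == text
```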
### Why This Matters
@@ -116,15 +133,59 @@ Different tokenization approaches make different trade-offs between vocabulary s
**Approach**: Each character gets its own token
```
Text: "Hello world"
Tokens: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
IDs: [8, 5, 12, 12, 15, 0, 23, 15, 18, 12, 4]
┌──────────────────────────────────────────────────────────────────┐
│ CHARACTER TOKENIZATION PROCESS                                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│ Step 1: Build Vocabulary from Unique Characters                  │
│ ┌────────────────────────────────────────────────────────┐       │
│ │ Corpus: ["hello", "world"]                             │       │
│ │         ↓                                              │       │
│ │ Unique chars: ['h','e','l','o','w','r','d']            │       │
│ │         ↓                                              │       │
│ │ Vocabulary: ['<UNK>','h','e','l','o','w','r','d']      │       │
│ │ IDs:            0     1   2   3   4   5   6   7        │       │
│ └────────────────────────────────────────────────────────┘       │
│                                                                  │
│ Step 2: Encode Text Character by Character                       │
│ ┌────────────────────────────────────────────────────────┐       │
│ │ Text: "hello"                                          │       │
│ │                                                        │       │
│ │ 'h' → 1   (lookup in vocabulary)                       │       │
│ │ 'e' → 2                                                │       │
│ │ 'l' → 3                                                │       │
│ │ 'l' → 3                                                │       │
│ │ 'o' → 4                                                │       │
│ │                                                        │       │
│ │ Result: [1, 2, 3, 3, 4]                                │       │
│ └────────────────────────────────────────────────────────┘       │
│                                                                  │
│ Step 3: Decode by Reversing ID Lookup                            │
│ ┌────────────────────────────────────────────────────────┐       │
│ │ IDs: [1, 2, 3, 3, 4]                                   │       │
│ │                                                        │       │
│ │ 1 → 'h'   (reverse lookup)                             │       │
│ │ 2 → 'e'                                                │       │
│ │ 3 → 'l'                                                │       │
│ │ 3 → 'l'                                                │       │
│ │ 4 → 'o'                                                │       │
│ │                                                        │       │
│ │ Result: "hello"                                        │       │
│ └────────────────────────────────────────────────────────┘       │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
**Pros**:
- Small vocabulary (~100 chars)
- Handles any text perfectly
- No unknown tokens (every character can be mapped)
- Simple implementation
**Cons**:
- Long sequences (1 character = 1 token)
- Limited semantic understanding (no word boundaries)
- More compute (longer sequences to process)
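The entire process above fits in a small class. Here is a minimal sketch; the `CharTokenizer` name and its methods are illustrative placeholders rather than this module's required interface (IDs are assigned in order of first appearance, matching the diagram):

```python
class CharTokenizer:
    def __init__(self, corpus):
        # Step 1: build the vocabulary, reserving ID 0 for <UNK>
        self.token_to_id = {"<UNK>": 0}
        for word in corpus:
            for ch in word:
                if ch not in self.token_to_id:
                    self.token_to_id[ch] = len(self.token_to_id)
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        # Step 2: per-character ID lookup, falling back to <UNK>
        return [self.token_to_id.get(ch, 0) for ch in text]

    def decode(self, ids):
        # Step 3: reverse lookup to reconstruct the text
        return "".join(self.id_to_token[i] for i in ids)

tok = CharTokenizer(["hello", "world"])
print(tok.encode("hello"))              # [1, 2, 3, 3, 4], as in the diagram
print(tok.decode(tok.encode("hello")))  # "hello"
print(tok.encode("hi!"))                # [1, 0, 0] -- 'i' and '!' map to <UNK>
```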
### Word-Level Tokenization
**Approach**: Each word gets its own token
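A minimal sketch of the idea, using an assumed toy vocabulary for illustration; it also exposes the approach's core weakness, the out-of-vocabulary problem:

```python
# Toy word-level tokenizer: one ID per whole word (vocabulary is assumed)
vocab = {"<UNK>": 0, "hello": 1, "world": 2}

def encode_words(text):
    # Split on whitespace; any unseen word collapses to <UNK>
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]

print(encode_words("hello world"))  # [1, 2] -- one token per word
print(encode_words("hello there"))  # [1, 0] -- "there" is out-of-vocabulary
```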
@@ -477,38 +538,84 @@ Character tokenization provides a simple, robust foundation for text processing.
"""
### Byte Pair Encoding (BPE) Tokenizer
BPE is the secret sauce behind modern language models (the GPT family and many others). It learns to merge frequent character pairs, creating subword units that balance vocabulary size with sequence length.
```
┌───────────────────────────────────────────────────────────────────────────┐
│ BPE TRAINING ALGORITHM: Learning Subword Units                            │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│ STEP 1: Initialize with Character Vocabulary                              │
│ ┌──────────────────────────────────────────────────────────────┐          │
│ │ Training Data: ["hello", "hello", "help"]                   │          │
│ │                                                             │          │
│ │ Initial Tokens (with end-of-word markers):                  │          │
│ │   ['h','e','l','l','o</w>']   (hello)                       │          │
│ │   ['h','e','l','l','o</w>']   (hello)                       │          │
│ │   ['h','e','l','p</w>']       (help)                        │          │
│ │                                                             │          │
│ │ Starting Vocab: ['h', 'e', 'l', 'o', 'p', '</w>']           │          │
│ │                 ↑ All unique characters                     │          │
│ └──────────────────────────────────────────────────────────────┘          │
│                                                                           │
│ STEP 2: Count All Adjacent Pairs                                          │
│ ┌──────────────────────────────────────────────────────────────┐          │
│ │ Pair Frequency Analysis:                                    │          │
│ │                                                             │          │
│ │ ('h', 'e'):      ██████  3 occurrences ← MOST FREQUENT!     │          │
│ │ ('e', 'l'):      ██████  3 occurrences                      │          │
│ │ ('l', 'l'):      ████    2 occurrences                      │          │
│ │ ('l', 'o</w>'):  ████    2 occurrences                      │          │
│ │ ('l', 'p</w>'):  ██      1 occurrence                       │          │
│ └──────────────────────────────────────────────────────────────┘          │
│                                                                           │
│ STEP 3: Merge Most Frequent Pair                                          │
│ ┌──────────────────────────────────────────────────────────────┐          │
│ │ Merge Operation: ('h', 'e') → 'he'                          │          │
│ │                                                             │          │
│ │ BEFORE:                      AFTER:                         │          │
│ │ ['h','e','l','l','o</w>']  →  ['he','l','l','o</w>']        │          │
│ │ ['h','e','l','l','o</w>']  →  ['he','l','l','o</w>']        │          │
│ │ ['h','e','l','p</w>']      →  ['he','l','p</w>']            │          │
│ │                                                             │          │
│ │ Updated Vocab: ['h','e','l','o','p','</w>','he']            │          │
│ │                                            ↑ NEW TOKEN!     │          │
│ └──────────────────────────────────────────────────────────────┘          │
│                                                                           │
│ STEP 4: Repeat Until Target Vocab Size Reached                            │
│ ┌──────────────────────────────────────────────────────────────┐          │
│ │ Iteration 2: most frequent pair is now ('he', 'l')          │          │
│ │ Merge ('he', 'l') → 'hel'                                   │          │
│ │                                                             │          │
│ │ ['he','l','l','o</w>']  →  ['hel','l','o</w>']              │          │
│ │ ['he','l','l','o</w>']  →  ['hel','l','o</w>']              │          │
│ │ ['he','l','p</w>']      →  ['hel','p</w>']                  │          │
│ │                                                             │          │
│ │ Updated Vocab: ['h','e','l','o','p','</w>','he','hel']      │          │
│ │                                                 ↑ NEW!      │          │
│ │                                                             │          │
│ │ Continue merging until vocab_size target...                 │          │
│ └──────────────────────────────────────────────────────────────┘          │
│                                                                           │
│ FINAL RESULTS:                                                            │
│ ┌──────────────────────────────────────────────────────────────┐          │
│ │ Trained BPE can now encode efficiently:                     │          │
│ │                                                             │          │
│ │ "hello" → ['hel','l','o</w>']  = 3 tokens (vs 5 chars)      │          │
│ │ "help"  → ['hel','p</w>']      = 2 tokens (vs 4 chars)      │          │
│ │                                                             │          │
│ │ 💡 Key Insight: BPE automatically discovers:                │          │
│ │   - Common prefixes ('he' → 'hel')                          │          │
│ │   - Shared stems across words ('hel')                       │          │
│ │   - Natural word boundaries (</w>)                          │          │
│ └──────────────────────────────────────────────────────────────┘          │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
```
**Why BPE Works**: By starting with characters and iteratively merging frequent pairs, BPE discovers the natural statistical patterns in language. Common words become single tokens, rare words split into recognizable subword pieces!
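The training loop traced in the diagram fits in a few dozen lines. Below is a minimal sketch under simplifying assumptions (no regex pre-tokenization, byte-level fallback, or frequency tie-breaking rules, unlike production BPE); `train_bpe` is an illustrative name, not this module's API:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # STEP 1: split each word into characters, fusing </w> onto the last one
    words = [list(word[:-1]) + [word[-1] + "</w>"] for word in corpus]
    merges = []
    for _ in range(num_merges):
        # STEP 2: count every adjacent token pair across the corpus
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # STEP 3: pick the most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:                    # STEP 3 (cont.): apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words                  # STEP 4: repeat until the merge budget is spent
    return merges, words

merges, words = train_bpe(["hello", "hello", "help"], num_merges=2)
print(merges)  # [('h', 'e'), ('he', 'l')]
print(words)   # [['hel', 'l', 'o</w>'], ['hel', 'l', 'o</w>'], ['hel', 'p</w>']]
```

On the toy corpus this reproduces the merges traced above: ('h', 'e') first, then ('he', 'l').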
"""
# %% nbgrader={"grade": false, "grade_id": "bpe-tokenizer", "solution": true}
@@ -1080,11 +1187,57 @@ ChatGPT: ~100K tokens with extended vocabulary
**Memory implications for embedding tables**:
```
┌─────────────────────────────────────────────────────────────────────┐
│ EMBEDDING TABLE MEMORY: Vocabulary Size × Embedding Dimension       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│ CHARACTER TOKENIZER (Vocab: 100)                                    │
│ ┌────────────────────────────┐                                      │
│ │ 100 × 512 = 51,200 params  │  Memory: 204 KB                      │
│ │ ████                       │  ↑ Tiny embedding table!             │
│ └────────────────────────────┘                                      │
│                                                                     │
│ BPE-SMALL (Vocab: 1,000)                                            │
│ ┌────────────────────────────┐                                      │
│ │ 1,000 × 512 = 512K params  │  Memory: 2.0 MB                      │
│ │ ██████████                 │  ↑ Still manageable                  │
│ └────────────────────────────┘                                      │
│                                                                     │
│ BPE-LARGE (Vocab: 50,000)  ← MOST PRODUCTION MODELS                 │
│ ┌────────────────────────────────────────────────────────┐          │
│ │ 50,000 × 512 = 25.6M params                            │          │
│ │ ████████████████████████████████████████████████       │          │
│ │                                                        │          │
│ │ Memory: 102.4 MB (fp32)                                │          │
│ │         51.2 MB (fp16)  ← Half precision saves 50%     │          │
│ │         25.6 MB (int8)  ← Quantization saves 75%       │          │
│ └────────────────────────────────────────────────────────┘          │
│                                                                     │
│ WORD-LEVEL (Vocab: 100,000)                                         │
│ ┌────────────────────────────────────────────────────────┐          │
│ │ 100,000 × 512 = 51.2M params                           │          │
│ │ ████████████████████████████████████████████████████   │          │
│ │                                                        │          │
│ │ Memory: 204.8 MB (fp32)  ← Often too large!            │          │
│ │         102.4 MB (fp16)                                │          │
│ └────────────────────────────────────────────────────────┘          │
│                                                                     │
│ 💡 Key Trade-off:                                                   │
│    Larger vocab → Shorter sequences → Less compute                  │
│    BUT larger vocab → More embedding memory → Harder to train       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Real-World Production Examples:

┌─────────────┬──────────────┬───────────────┬──────────────────┐
│ Model       │ Vocab Size   │ Embed Dim     │ Embed Memory     │
├─────────────┼──────────────┼───────────────┼──────────────────┤
│ GPT-2       │ 50,257       │ 1,600         │ 321 MB           │
│ GPT-3       │ 50,257       │ 12,288        │ 2.4 GB           │
│ BERT        │ 30,522       │ 768           │ 94 MB            │
│ T5          │ 32,128       │ 512           │ 66 MB            │
│ LLaMA-7B    │ 32,000       │ 4,096         │ 524 MB           │
└─────────────┴──────────────┴───────────────┴──────────────────┘
```
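The arithmetic behind every row above is simply vocab_size × embed_dim × bytes per value. A small sketch (the function and dictionary names are illustrative, not a library API):

```python
# Bytes per parameter for common storage precisions
BYTES_PER = {"fp32": 4, "fp16": 2, "int8": 1}

def embed_memory_mb(vocab_size, embed_dim, dtype="fp32"):
    # Embedding table memory in (decimal) megabytes
    return vocab_size * embed_dim * BYTES_PER[dtype] / 1e6

print(embed_memory_mb(50_000, 512))           # 102.4  -> BPE-50K row (fp32)
print(embed_memory_mb(50_000, 512, "fp16"))   # 51.2   -> half precision
print(embed_memory_mb(50_257, 1_600))         # ~321.6 -> GPT-2 row
```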
"""