fix: Adjust ASCII diagram spacing for consistent alignment

2026-04-28 07:17:33 -05:00 · 2025-10-24 17:50:48 -04:00
parent c6853d7550
commit bde003d908
1 changed file with 136 additions and 136 deletions
--- a/modules/source/10_tokenization/tokenization_dev.py
+++ b/modules/source/10_tokenization/tokenization_dev.py
@@ -85,23 +85,23 @@ Consider the sentence: "Hello, world!" - how do we turn this into numbers a neur
 ┌─────────────────────────────────────────────────────────────────┐
 │  TOKENIZATION PIPELINE: Text → Numbers                          │
 ├─────────────────────────────────────────────────────────────────┤
-│                                                                  │
+│                                                                 │
 │  Input (Human Text):     "Hello, world!"                        │
-│           │                                                      │
-│           ├─ Step 1: Split into tokens                         │
-│           │         ['H','e','l','l','o',',', ...']            │
-│           │                                                      │
-│           ├─ Step 2: Map to vocabulary IDs                     │
-│           │         [72, 101, 108, 108, 111, ...]              │
-│           │                                                      │
-│           ├─ Step 3: Handle unknowns                           │
-│           │         Unknown chars → special <UNK> token        │
-│           │                                                      │
-│           └─ Step 4: Enable decoding                           │
+│           │                                                     │
+│           ├─ Step 1: Split into tokens                          │
+│           │         ['H','e','l','l','o',',', ...]              │
+│           │                                                     │
+│           ├─ Step 2: Map to vocabulary IDs                      │
+│           │         [72, 101, 108, 108, 111, ...]               │
+│           │                                                     │
+│           ├─ Step 3: Handle unknowns                            │
+│           │         Unknown chars → special <UNK> token         │
+│           │                                                     │
+│           └─ Step 4: Enable decoding                            │
 │                     IDs → original text                         │
-│                                                                  │
-│  Output (Token IDs):  [72, 101, 108, 108, 111, 44, 32, ...]    │
-│                                                                  │
+│                                                                 │
+│  Output (Token IDs):  [72, 101, 108, 108, 111, 44, 32, ...]     │
+│                                                                 │
 └─────────────────────────────────────────────────────────────────┘
 ```
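
The four pipeline steps in the diagram above can be sketched in a few lines. This is a minimal illustration, not the module's implementation: the names `char_to_id`, `id_to_char`, `encode`, and `decode` are hypothetical, and reserving ID 0 for `<UNK>` is an assumption. The vocabulary maps printable ASCII characters to their code points, which is what produces the IDs shown in the diagram (72 for 'H', 101 for 'e', and so on).

```python
import string

UNK_TOKEN = "<UNK>"
UNK_ID = 0  # assumption: ID 0 is reserved for unknown characters

# Step 2 needs a vocabulary: here, printable ASCII mapped to code points.
char_to_id = {ch: ord(ch) for ch in string.printable}
id_to_char = {i: ch for ch, i in char_to_id.items()}
id_to_char[UNK_ID] = UNK_TOKEN


def encode(text):
    """Steps 1-3: split into characters, map to IDs, fall back to <UNK>."""
    tokens = list(text)  # Step 1: one token per character
    return [char_to_id.get(ch, UNK_ID) for ch in tokens]  # Steps 2 and 3


def decode(ids):
    """Step 4: map IDs back to characters."""
    return "".join(id_to_char.get(i, UNK_TOKEN) for i in ids)


ids = encode("Hello, world!")
print(ids[:7])      # [72, 101, 108, 108, 111, 44, 32]
print(decode(ids))  # Hello, world!
```

Characters outside the vocabulary (for example 'é' in this sketch) encode to `UNK_ID`, so decoding is lossless only for text made entirely of known characters.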

@@ -133,47 +133,47 @@ Different tokenization approaches make different trade-offs between vocabulary s
 **Approach**: Each character gets its own token

 ```
-┌──────────────────────────────────────────────────────────────────┐
-│ CHARACTER TOKENIZATION PROCESS                                   │
-├──────────────────────────────────────────────────────────────────┤
-│                                                                   │
-│  Step 1: Build Vocabulary from Unique Characters                 │
-│  ┌────────────────────────────────────────────────────────┐     │
-│  │ Corpus: ["hello", "world"]                             │     │
-│  │                ↓                                        │     │
-│  │ Unique chars: ['h', 'e', 'l', 'o', 'w', 'r', 'd']    │     │
-│  │                ↓                                        │     │
-│  │ Vocabulary:  ['<UNK>','h','e','l','o','w','r','d']   │     │
-│  │ IDs:            0      1   2   3   4   5   6   7      │     │
-│  └────────────────────────────────────────────────────────┘     │
-│                                                                   │
-│  Step 2: Encode Text Character by Character                      │
-│  ┌────────────────────────────────────────────────────────┐     │
-│  │  Text: "hello"                                         │     │
-│  │                                                         │     │
-│  │   'h' → 1    (lookup in vocabulary)                   │     │
-│  │   'e' → 2                                              │     │
-│  │   'l' → 3                                              │     │
-│  │   'l' → 3                                              │     │
-│  │   'o' → 4                                              │     │
-│  │                                                         │     │
-│  │  Result: [1, 2, 3, 3, 4]                              │     │
-│  └────────────────────────────────────────────────────────┘     │
-│                                                                   │
-│  Step 3: Decode by Reversing ID Lookup                           │
-│  ┌────────────────────────────────────────────────────────┐     │
-│  │  IDs: [1, 2, 3, 3, 4]                                 │     │
-│  │                                                         │     │
-│  │   1 → 'h'    (reverse lookup)                         │     │
-│  │   2 → 'e'                                              │     │
-│  │   3 → 'l'                                              │     │
-│  │   3 → 'l'                                              │     │
-│  │   4 → 'o'                                              │     │
-│  │                                                         │     │
-│  │  Result: "hello"                                       │     │
-│  └────────────────────────────────────────────────────────┘     │
-│                                                                   │
-└──────────────────────────────────────────────────────────────────┘
+┌──────────────────────────────────────────────────────────────┐
+│ CHARACTER TOKENIZATION PROCESS                               │
+├──────────────────────────────────────────────────────────────┤
+│                                                              │
+│  Step 1: Build Vocabulary from Unique Characters             │
+│  ┌────────────────────────────────────────────────────────┐  │
+│  │ Corpus: ["hello", "world"]                             │  │
+│  │                ↓                                       │  │
+│  │ Unique chars: ['h', 'e', 'l', 'o', 'w', 'r', 'd']      │  │
+│  │                ↓                                       │  │
+│  │ Vocabulary:  ['<UNK>','h','e','l','o','w','r','d']     │  │
+│  │ IDs:            0      1   2   3   4   5   6   7       │  │
+│  └────────────────────────────────────────────────────────┘  │
+│                                                              │
+│  Step 2: Encode Text Character by Character                  │
+│  ┌────────────────────────────────────────────────────────┐  │
+│  │  Text: "hello"                                         │  │
+│  │                                                        │  │
+│  │   'h' → 1    (lookup in vocabulary)                    │  │
+│  │   'e' → 2                                              │  │
+│  │   'l' → 3                                              │  │
+│  │   'l' → 3                                              │  │
+│  │   'o' → 4                                              │  │
+│  │                                                        │  │
+│  │  Result: [1, 2, 3, 3, 4]                               │  │
+│  └────────────────────────────────────────────────────────┘  │
+│                                                              │
+│  Step 3: Decode by Reversing ID Lookup                       │
+│  ┌────────────────────────────────────────────────────────┐  │
+│  │  IDs: [1, 2, 3, 3, 4]                                  │  │
+│  │                                                        │  │
+│  │   1 → 'h'    (reverse lookup)                          │  │
+│  │   2 → 'e'                                              │  │
+│  │   3 → 'l'                                              │  │
+│  │   3 → 'l'                                              │  │
+│  │   4 → 'o'                                              │  │
+│  │                                                        │  │
+│  │  Result: "hello"                                       │  │
+│  └────────────────────────────────────────────────────────┘  │
+│                                                              │
+└──────────────────────────────────────────────────────────────┘
 ```
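
The three steps in the diagram above (build vocabulary, encode, decode) can be sketched as follows. The class name `SketchCharTokenizer` is hypothetical and is not the class defined in this module. Two assumptions are made to match the diagram: `<UNK>` sits at ID 0, and characters receive IDs in order of first appearance in the corpus.

```python
class SketchCharTokenizer:
    """Illustrative character tokenizer; not the module's own class."""

    UNK = "<UNK>"

    def __init__(self, corpus):
        # Step 1: vocabulary = <UNK> plus unique characters, in first-seen order.
        self.vocab = [self.UNK]
        for text in corpus:
            for ch in text:
                if ch not in self.vocab:
                    self.vocab.append(ch)
        self.char_to_id = {ch: i for i, ch in enumerate(self.vocab)}

    def encode(self, text):
        # Step 2: look up each character; unknown characters map to ID 0.
        return [self.char_to_id.get(ch, 0) for ch in text]

    def decode(self, ids):
        # Step 3: reverse lookup; out-of-range IDs decode to <UNK>.
        return "".join(
            self.vocab[i] if 0 <= i < len(self.vocab) else self.UNK for i in ids
        )


tok = SketchCharTokenizer(["hello", "world"])
print(tok.vocab)                # ['<UNK>', 'h', 'e', 'l', 'o', 'w', 'r', 'd']
print(tok.encode("hello"))      # [1, 2, 3, 3, 4]
print(tok.decode([1, 2, 3, 3, 4]))  # hello
print(tok.encode("help"))       # [1, 2, 3, 0]  ('p' is not in the vocabulary)
```

The list-based membership check is quadratic in vocabulary size, which is fine for a toy corpus; a real implementation would more likely collect characters in a set or dict.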

 **Pros**: 
@@ -544,74 +544,74 @@ BPE is the secret sauce behind modern language models (GPT, BERT, etc.). It lear
 ┌───────────────────────────────────────────────────────────────────────────┐
 │ BPE TRAINING ALGORITHM: Learning Subword Units                            │
 ├───────────────────────────────────────────────────────────────────────────┤
-│                                                                            │
+│                                                                           │
 │ STEP 1: Initialize with Character Vocabulary                              │
-│ ┌──────────────────────────────────────────────────────────────┐         │
-│ │ Training Data: ["hello", "hello", "help"]                    │         │
-│ │                                                               │         │
-│ │ Initial Tokens (with end-of-word markers):                   │         │
-│ │   ['h','e','l','l','o</w>']    (hello)                      │         │
-│ │   ['h','e','l','l','o</w>']    (hello)                      │         │
-│ │   ['h','e','l','p</w>']        (help)                       │         │
-│ │                                                               │         │
-│ │ Starting Vocab: ['h', 'e', 'l', 'o', 'p', '</w>']          │         │
-│ │                   ↑ All unique characters                    │         │
-│ └──────────────────────────────────────────────────────────────┘         │
-│                                                                            │
+│ ┌──────────────────────────────────────────────────────────────┐          │
+│ │ Training Data: ["hello", "hello", "help"]                    │          │
+│ │                                                              │          │
+│ │ Initial Tokens (with end-of-word markers):                   │          │
+│ │   ['h','e','l','l','o</w>']    (hello)                       │          │
+│ │   ['h','e','l','l','o</w>']    (hello)                       │          │
+│ │   ['h','e','l','p</w>']        (help)                        │          │
+│ │                                                              │          │
+│ │ Starting Vocab: ['h', 'e', 'l', 'o', 'p', '</w>']            │          │
+│ │                   ↑ All unique characters                    │          │
+│ └──────────────────────────────────────────────────────────────┘          │
+│                                                                           │
 │ STEP 2: Count All Adjacent Pairs                                          │
-│ ┌──────────────────────────────────────────────────────────────┐         │
-│ │ Pair Frequency Analysis:                                      │         │
-│ │                                                               │         │
-│ │   ('h', 'e'): ██████  3 occurrences  ← MOST FREQUENT!       │         │
-│ │   ('e', 'l'): ██████  3 occurrences                         │         │
-│ │   ('l', 'l'): ████    2 occurrences                         │         │
-│ │   ('l', 'o'): ████    2 occurrences                         │         │
-│ │   ('o', '<'): ████    2 occurrences                         │         │
-│ │   ('l', 'p'): ██      1 occurrence                          │         │
-│ │   ('p', '<'): ██      1 occurrence                          │         │
-│ └──────────────────────────────────────────────────────────────┘         │
-│                                                                            │
+│ ┌──────────────────────────────────────────────────────────────┐          │
+│ │ Pair Frequency Analysis:                                     │          │
+│ │                                                              │          │
+│ │   ('h', 'e'): ██████  3 occurrences  ← MOST FREQUENT!        │          │
+│ │   ('e', 'l'): ██████  3 occurrences                          │          │
+│ │   ('l', 'l'): ████    2 occurrences                          │          │
+│ │   ('l', 'o'): ████    2 occurrences                          │          │
+│ │   ('o', '<'): ████    2 occurrences                          │          │
+│ │   ('l', 'p'): ██      1 occurrence                           │          │
+│ │   ('p', '<'): ██      1 occurrence                           │          │
+│ └──────────────────────────────────────────────────────────────┘          │
+│                                                                           │
 │ STEP 3: Merge Most Frequent Pair                                          │
-│ ┌──────────────────────────────────────────────────────────────┐         │
-│ │ Merge Operation: ('h', 'e') → 'he'                          │         │
-│ │                                                               │         │
-│ │ BEFORE:                          AFTER:                       │         │
-│ │   ['h','e','l','l','o</w>']  →  ['he','l','l','o</w>']    │         │
-│ │   ['h','e','l','l','o</w>']  →  ['he','l','l','o</w>']    │         │
-│ │   ['h','e','l','p</w>']      →  ['he','l','p</w>']        │         │
-│ │                                                               │         │
-│ │ Updated Vocab: ['h','e','l','o','p','</w>', 'he']           │         │
-│ │                                              ↑ NEW TOKEN!    │         │
-│ └──────────────────────────────────────────────────────────────┘         │
-│                                                                            │
+│ ┌──────────────────────────────────────────────────────────────┐          │
+│ │ Merge Operation: ('h', 'e') → 'he'                           │          │
+│ │                                                              │          │
+│ │ BEFORE:                          AFTER:                      │          │
+│ │   ['h','e','l','l','o</w>']  →  ['he','l','l','o</w>']       │          │
+│ │   ['h','e','l','l','o</w>']  →  ['he','l','l','o</w>']       │          │
+│ │   ['h','e','l','p</w>']      →  ['he','l','p</w>']           │          │
+│ │                                                              │          │
+│ │ Updated Vocab: ['h','e','l','o','p','</w>', 'he']            │          │
+│ │                                              ↑ NEW TOKEN!    │          │
+│ └──────────────────────────────────────────────────────────────┘          │
+│                                                                           │
 │ STEP 4: Repeat Until Target Vocab Size Reached                            │
-│ ┌──────────────────────────────────────────────────────────────┐         │
-│ │ Iteration 2: Next most frequent is ('l', 'l')               │         │
-│ │ Merge ('l','l') → 'll'                                       │         │
-│ │                                                               │         │
-│ │   ['he','l','l','o</w>']     →  ['he','ll','o</w>']       │         │
-│ │   ['he','l','l','o</w>']     →  ['he','ll','o</w>']       │         │
-│ │   ['he','l','p</w>']         →  ['he','l','p</w>']        │         │
-│ │                                                               │         │
-│ │ Updated Vocab: ['h','e','l','o','p','</w>','he','ll']       │         │
-│ │                                                  ↑ NEW!      │         │
-│ │                                                               │         │
-│ │ Continue merging until vocab_size target...                  │         │
-│ └──────────────────────────────────────────────────────────────┘         │
-│                                                                            │
-│ FINAL RESULTS:                                                             │
-│ ┌──────────────────────────────────────────────────────────────┐         │
-│ │ Trained BPE can now encode efficiently:                      │         │
-│ │                                                               │         │
-│ │ "hello" → ['he', 'll', 'o</w>']  = 3 tokens (vs 5 chars)   │         │
-│ │ "help"  → ['he', 'l', 'p</w>']   = 3 tokens (vs 4 chars)   │         │
-│ │                                                               │         │
-│ │ 💡 Key Insight: BPE automatically discovers:                 │         │
-│ │    - Common prefixes ('he')                                  │         │
-│ │    - Morphological patterns ('ll')                           │         │
-│ │    - Natural word boundaries (</w>)                          │         │
-│ └──────────────────────────────────────────────────────────────┘         │
-│                                                                            │
+│ ┌──────────────────────────────────────────────────────────────┐          │
+│ │ Iteration 2: Next most frequent is ('l', 'l')                │          │
+│ │ Merge ('l','l') → 'll'                                       │          │
+│ │                                                              │          │
+│ │   ['he','l','l','o</w>']     →  ['he','ll','o</w>']          │          │
+│ │   ['he','l','l','o</w>']     →  ['he','ll','o</w>']          │          │
+│ │   ['he','l','p</w>']         →  ['he','l','p</w>']           │          │
+│ │                                                              │          │
+│ │ Updated Vocab: ['h','e','l','o','p','</w>','he','ll']        │          │
+│ │                                                  ↑ NEW!      │          │
+│ │                                                              │          │
+│ │ Continue merging until vocab_size target...                  │          │
+│ └──────────────────────────────────────────────────────────────┘          │
+│                                                                           │
+│ FINAL RESULTS:                                                            │
+│ ┌──────────────────────────────────────────────────────────────┐          │
+│ │ Trained BPE can now encode efficiently:                      │          │
+│ │                                                              │          │
+│ │ "hello" → ['he', 'll', 'o</w>']  = 3 tokens (vs 5 chars)     │          │
+│ │ "help"  → ['he', 'l', 'p</w>']   = 3 tokens (vs 4 chars)     │          │
+│ │                                                              │          │
+│ │  Key Insights: BPE automatically discovers:                  │          │
+│ │    - Common prefixes ('he')                                  │          │
+│ │    - Morphological patterns ('ll')                           │          │
+│ │    - Natural word boundaries (</w>)                          │          │
+│ └──────────────────────────────────────────────────────────────┘          │
+│                                                                           │
 └───────────────────────────────────────────────────────────────────────────┘
 ```
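
The training loop in the diagram above (count adjacent pairs, merge the most frequent, repeat) can be sketched as below. All function names are hypothetical and this is not the module's implementation. It assumes non-empty words, fuses the end-of-word marker onto the last character as the diagram's token lists do (`'o</w>'`), and breaks ties by first occurrence. One caveat: under strict frequency counting the second merge on this corpus is `('he', 'l')` with 3 occurrences, not `('l', 'l')` with 2, so the diagram's iteration 2 should be read as illustrative rather than as the exact output of this loop.

```python
from collections import Counter


def to_symbols(word):
    """Split a non-empty word into characters; the last one carries </w>."""
    return list(word[:-1]) + [word[-1] + "</w>"]


def count_pairs(words):
    """Count every adjacent symbol pair across all words."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    return counts


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus, num_merges):
    """Return the learned merge list and the corpus after applying it."""
    words = [to_symbols(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(words)
        if not counts:
            break  # every word is a single symbol; nothing left to merge
        best = max(counts, key=counts.get)  # ties: first pair encountered
        merges.append(best)
        words = [merge_pair(s, best) for s in words]
    return merges, words


merges, words = train_bpe(["hello", "hello", "help"], num_merges=2)
print(merges)  # [('h', 'e'), ('he', 'l')]
print(words)   # [['hel', 'l', 'o</w>'], ['hel', 'l', 'o</w>'], ['hel', 'p</w>']]
```

This sketch stops after a fixed number of merges; stopping at a target vocabulary size, as in Step 4 of the diagram, amounts to choosing `num_merges` as the target size minus the initial character vocabulary size.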

@@ -1190,42 +1190,42 @@ ChatGPT:     ~100K tokens with extended vocabulary
 ┌─────────────────────────────────────────────────────────────────────┐
 │ EMBEDDING TABLE MEMORY: Vocabulary Size × Embedding Dimension       │
 ├─────────────────────────────────────────────────────────────────────┤
-│                                                                      │
-│ CHARACTER TOKENIZER (Vocab: 100)                                     │
+│                                                                     │
+│ CHARACTER TOKENIZER (Vocab: 100)                                    │
 │ ┌────────────────────────────┐                                      │
 │ │  100 × 512 = 51,200 params │     Memory: 204 KB                   │
 │ │  ████                      │     ↑ Tiny embedding table!          │
 │ └────────────────────────────┘                                      │
-│                                                                      │
-│ BPE-SMALL (Vocab: 1,000)                                             │
+│                                                                     │
+│ BPE-SMALL (Vocab: 1,000)                                            │
 │ ┌────────────────────────────┐                                      │
 │ │  1K × 512 = 512K params    │     Memory: 2.0 MB                   │
 │ │  ██████████                │     ↑ Still manageable               │
 │ └────────────────────────────┘                                      │
-│                                                                      │
+│                                                                     │
 │ BPE-LARGE (Vocab: 50,000) ← MOST PRODUCTION MODELS                  │
 │ ┌────────────────────────────────────────────────────────┐          │
 │ │  50K × 512 = 25.6M params                              │          │
 │ │  ████████████████████████████████████████████████      │          │
-│ │                                                         │          │
+│ │                                                        │          │
 │ │  Memory: 102.4 MB (fp32)                               │          │
 │ │          51.2 MB (fp16)    ← Half precision saves 50%  │          │
 │ │          25.6 MB (int8)    ← Quantization saves 75%    │          │
 │ └────────────────────────────────────────────────────────┘          │
-│                                                                      │
-│ WORD-LEVEL (Vocab: 100,000)                                          │
+│                                                                     │
+│ WORD-LEVEL (Vocab: 100,000)                                         │
 │ ┌────────────────────────────────────────────────────────┐          │
 │ │  100K × 512 = 51.2M params                             │          │
 │ │  ████████████████████████████████████████████████████  │          │
-│ │                                                         │          │
-│ │  Memory: 204.8 MB (fp32)  ← Often too large!          │          │
+│ │                                                        │          │
+│ │  Memory: 204.8 MB (fp32)  ← Often too large!           │          │
 │ │          102.4 MB (fp16)                               │          │
 │ └────────────────────────────────────────────────────────┘          │
-│                                                                      │
-│ 💡 Key Trade-off:                                                    │
-│    Larger vocab → Shorter sequences → Less compute                   │
-│    BUT larger vocab → More embedding memory → Harder to train        │
-│                                                                      │
+│                                                                     │
+│  Key Trade-off:                                                     │
+│    Larger vocab → Shorter sequences → Less compute                  │
+│    BUT larger vocab → More embedding memory → Harder to train       │
+│                                                                     │
 └─────────────────────────────────────────────────────────────────────┘

 Real-World Production Examples:
@@ -1233,7 +1233,7 @@ Real-World Production Examples:
 │   Model     │  Vocab Size  │  Embed Dim    │  Embed Memory    │
 ├─────────────┼──────────────┼───────────────┼──────────────────┤
 │  GPT-2      │    50,257    │     1,600     │     321 MB       │
-│  GPT-3      │    50,257    │    12,288     │    2.4 GB        │
+│  GPT-3      │    50,257    │    12,288     │     2.4 GB       │
 │  BERT       │    30,522    │       768     │      94 MB       │
 │  T5         │    32,128    │       512     │      66 MB       │
 │  LLaMA-7B   │    32,000    │     4,096     │     524 MB       │
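
The memory figures in the embedding diagram and in the table rows above follow from one formula: vocabulary size × embedding dimension × bytes per parameter. The sketch below reproduces them; the function names are hypothetical, and it assumes decimal megabytes (1 MB = 10^6 bytes) and fp32 storage for the table column, which is what the listed values are consistent with.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def embedding_memory_bytes(vocab_size, embed_dim, dtype="fp32"):
    """Size of a vocab_size x embed_dim embedding table in bytes."""
    return vocab_size * embed_dim * BYTES_PER_PARAM[dtype]


def to_mb(num_bytes):
    """Convert bytes to decimal megabytes (assumption: 1 MB = 10**6 bytes)."""
    return num_bytes / 1e6


# BPE-LARGE row of the diagram: 50K x 512
for dtype in ("fp32", "fp16", "int8"):
    print(dtype, to_mb(embedding_memory_bytes(50_000, 512, dtype)))
# fp32 102.4
# fp16 51.2
# int8 25.6

# Table rows shown above (fp32)
for name, vocab, dim in [
    ("GPT-2", 50_257, 1_600),
    ("BERT", 30_522, 768),
    ("T5", 32_128, 512),
    ("LLaMA-7B", 32_000, 4_096),
]:
    print(name, round(to_mb(embedding_memory_bytes(vocab, dim)), 1))
# GPT-2 321.6
# BERT 93.8
# T5 65.8
# LLaMA-7B 524.3
```

The same formula gives roughly 2.47 GB for the GPT-3 row (50,257 × 12,288 × 4 bytes), which the table lists as 2.4 GB.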