mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-03-12 19:23:36 -05:00
- Added typing imports (List, Dict, Tuple, Optional, Set) to export section
- Fixed NameError: name 'List' is not defined
- Fixed milestone copilot references from SimpleTokenizer to CharTokenizer
- Verified transformer learning: 99.1% loss decrease in 500 steps

Training results:
- Initial loss: 3.555
- Final loss: 0.031
- Training time: 52.1s for 500 steps
- Gradient flow: All 21 parameters receiving gradients
- Model: 1-layer GPT with 32d embeddings, 4 heads
1634 lines
72 KiB
Plaintext
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "c20728c2",
"metadata": {},
"outputs": [],
"source": [
"#| default_exp text.tokenization\n",
"#| export\n",
"\n",
"import numpy as np\n",
"from typing import List, Dict, Tuple, Optional, Set\n",
"import json\n",
"import re\n",
"from collections import defaultdict, Counter"
]
},
{
"cell_type": "markdown",
"id": "b005926e",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"# Module 10: Tokenization - Converting Text to Numbers\n",
"\n",
"Welcome to Module 10! Today you'll build tokenization - the bridge that converts human-readable text into numerical representations that machine learning models can process.\n",
"\n",
"## 🔗 Prerequisites & Progress\n",
"**You've Built**: Neural networks, layers, training loops, and data loading\n",
"**You'll Build**: Text tokenization systems (character and BPE-based)\n",
"**You'll Enable**: Text processing for language models and NLP tasks\n",
"\n",
"**Connection Map**:\n",
"```\n",
"DataLoader → Tokenization → Embeddings\n",
"(batching) (text→numbers) (learnable representations)\n",
"```\n",
"\n",
"## Learning Objectives\n",
"By the end of this module, you will:\n",
"1. Implement character-based tokenization for simple text processing\n",
"2. Build a BPE (Byte Pair Encoding) tokenizer for efficient text representation\n",
"3. Understand vocabulary management and encoding/decoding operations\n",
"4. Create the foundation for text processing in neural networks\n",
"\n",
"Let's get started!"
]
},
{
"cell_type": "markdown",
"id": "d5b93d34",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 📦 Where This Code Lives in the Final Package\n",
"\n",
"**Learning Side:** You work in `modules/10_tokenization/tokenization_dev.py` \n",
"**Building Side:** Code exports to `tinytorch.text.tokenization`\n",
"\n",
"```python\n",
"# How to use this module:\n",
"from tinytorch.text.tokenization import Tokenizer, CharTokenizer, BPETokenizer\n",
"```\n",
"\n",
"**Why this matters:**\n",
"- **Learning:** Complete tokenization system in one focused module for deep understanding\n",
"- **Production:** Proper organization like Hugging Face's tokenizers with all text processing together\n",
"- **Consistency:** All tokenization operations and vocabulary management in text.tokenization\n",
"- **Integration:** Works seamlessly with embeddings and data loading for complete NLP pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c89f5e86",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from typing import List, Dict, Tuple, Optional, Set\n",
"import json\n",
"import re\n",
"from collections import defaultdict, Counter\n",
"\n",
"# Import only Module 01 (Tensor) - this module has minimal dependencies\n",
"from tinytorch.core.tensor import Tensor"
]
},
{
"cell_type": "markdown",
"id": "c139104c",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 1. Introduction - Why Tokenization?\n",
"\n",
"Neural networks operate on numbers, but humans communicate with text. Tokenization is the crucial bridge that converts text into numerical sequences that models can process.\n",
"\n",
"### The Text-to-Numbers Challenge\n",
"\n",
"Consider the sentence: \"Hello, world!\" - how do we turn this into numbers a neural network can process?\n",
"\n",
"```\n",
"┌─────────────────────────────────────────────────────────────────┐\n",
"│ TOKENIZATION PIPELINE: Text → Numbers                           │\n",
"├─────────────────────────────────────────────────────────────────┤\n",
"│                                                                 │\n",
"│ Input (Human Text): \"Hello, world!\"                             │\n",
"│   │                                                             │\n",
"│   ├─ Step 1: Split into tokens                                  │\n",
"│   │ ['H','e','l','l','o',',', ... ]                             │\n",
"│   │                                                             │\n",
"│   ├─ Step 2: Map to vocabulary IDs                              │\n",
"│   │ [72, 101, 108, 108, 111, ...]                               │\n",
"│   │                                                             │\n",
"│   ├─ Step 3: Handle unknowns                                    │\n",
"│   │ Unknown chars → special <UNK> token                         │\n",
"│   │                                                             │\n",
"│   └─ Step 4: Enable decoding                                    │\n",
"│     IDs → original text                                         │\n",
"│                                                                 │\n",
"│ Output (Token IDs): [72, 101, 108, 108, 111, 44, 32, ...]       │\n",
"│                                                                 │\n",
"└─────────────────────────────────────────────────────────────────┘\n",
"```\n",
"\n",
"### The Four-Step Process\n",
"\n",
"How do we represent text for a neural network? We need a systematic pipeline:\n",
"\n",
"**1. Split text into tokens** - Break text into meaningful units (words, subwords, or characters)\n",
"**2. Map tokens to integers** - Create a vocabulary that assigns each token a unique ID\n",
"**3. Handle unknown text** - Deal gracefully with tokens not seen during training\n",
"**4. Enable reconstruction** - Convert numbers back to readable text for interpretation\n",
"\n",
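"In plain Python, these four steps can be sketched with a toy vocabulary (illustrative only - the real tokenizer classes come later in this module):\n",
"\n",
"```python\n",
"vocab = ['<UNK>', 'H', 'e', 'l', 'o']            # Step 2: token → ID by position\n",
"char_to_id = {c: i for i, c in enumerate(vocab)}\n",
"ids = [char_to_id.get(c, 0) for c in 'Hello!']   # Steps 1-3: split, map, unknowns → 0\n",
"text = ''.join(vocab[i] for i in ids)            # Step 4: decode back to text\n",
"print(ids)   # [1, 2, 3, 3, 4, 0]\n",
"print(text)  # Hello<UNK>\n",
"```\n",
"\n",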
"### Why This Matters\n",
"\n",
"The choice of tokenization strategy dramatically affects:\n",
"- **Model performance** - How well the model understands text\n",
"- **Vocabulary size** - Memory requirements for embedding tables\n",
"- **Computational efficiency** - Sequence length affects processing time\n",
"- **Robustness** - How well the model handles new/rare words"
]
},
{
"cell_type": "markdown",
"id": "2446a382",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 2. Foundations - Tokenization Strategies\n",
"\n",
"Different tokenization approaches make different trade-offs between vocabulary size, sequence length, and semantic understanding.\n",
"\n",
"### Character-Level Tokenization\n",
"**Approach**: Each character gets its own token\n",
"\n",
"```\n",
"┌──────────────────────────────────────────────────────────────┐\n",
"│ CHARACTER TOKENIZATION PROCESS                               │\n",
"├──────────────────────────────────────────────────────────────┤\n",
"│                                                              │\n",
"│ Step 1: Build Vocabulary from Unique Characters              │\n",
"│ ┌────────────────────────────────────────────────────────┐   │\n",
"│ │ Corpus: [\"hello\", \"world\"]                             │   │\n",
"│ │   ↓                                                    │   │\n",
"│ │ Unique chars: ['h', 'e', 'l', 'o', 'w', 'r', 'd']      │   │\n",
"│ │   ↓                                                    │   │\n",
"│ │ Vocabulary: ['<UNK>','h','e','l','o','w','r','d']      │   │\n",
"│ │ IDs:            0    1   2   3   4   5   6   7         │   │\n",
"│ └────────────────────────────────────────────────────────┘   │\n",
"│                                                              │\n",
"│ Step 2: Encode Text Character by Character                   │\n",
"│ ┌────────────────────────────────────────────────────────┐   │\n",
"│ │ Text: \"hello\"                                          │   │\n",
"│ │                                                        │   │\n",
"│ │ 'h' → 1   (lookup in vocabulary)                       │   │\n",
"│ │ 'e' → 2                                                │   │\n",
"│ │ 'l' → 3                                                │   │\n",
"│ │ 'l' → 3                                                │   │\n",
"│ │ 'o' → 4                                                │   │\n",
"│ │                                                        │   │\n",
"│ │ Result: [1, 2, 3, 3, 4]                                │   │\n",
"│ └────────────────────────────────────────────────────────┘   │\n",
"│                                                              │\n",
"│ Step 3: Decode by Reversing ID Lookup                        │\n",
"│ ┌────────────────────────────────────────────────────────┐   │\n",
"│ │ IDs: [1, 2, 3, 3, 4]                                   │   │\n",
"│ │                                                        │   │\n",
"│ │ 1 → 'h'   (reverse lookup)                             │   │\n",
"│ │ 2 → 'e'                                                │   │\n",
"│ │ 3 → 'l'                                                │   │\n",
"│ │ 3 → 'l'                                                │   │\n",
"│ │ 4 → 'o'                                                │   │\n",
"│ │                                                        │   │\n",
"│ │ Result: \"hello\"                                        │   │\n",
"│ └────────────────────────────────────────────────────────┘   │\n",
"│                                                              │\n",
"└──────────────────────────────────────────────────────────────┘\n",
"```\n",
"\n",
"**Pros**: \n",
"- Small vocabulary (~100 chars)\n",
"- Handles any text perfectly\n",
"- No unknown tokens (every character can be mapped)\n",
"- Simple implementation\n",
"\n",
"**Cons**: \n",
"- Long sequences (1 character = 1 token)\n",
"- Limited semantic understanding (no word boundaries)\n",
"- More compute (longer sequences to process)\n",
"\n",
"### Word-Level Tokenization\n",
"**Approach**: Each word gets its own token\n",
"\n",
"```\n",
"Text: \"Hello world\"\n",
"  ↓\n",
"Tokens: ['Hello', 'world']\n",
"  ↓\n",
"IDs: [5847, 1254]\n",
"```\n",
"\n",
"**Pros**: Semantic meaning preserved, shorter sequences\n",
"**Cons**: Huge vocabularies (100K+), many unknown tokens\n",
"\n",
"### Subword Tokenization (BPE)\n",
"**Approach**: Learn frequent character pairs, build subword units\n",
"\n",
"```\n",
"Text: \"tokenization\"\n",
"  ↓ Character level\n",
"Initial: ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']\n",
"  ↓ Learn frequent pairs\n",
"Merged: ['to', 'ken', 'ization']\n",
"  ↓\n",
"IDs: [142, 1847, 2341]\n",
"```\n",
"\n",
"**Pros**: Balance between vocabulary size and sequence length\n",
"**Cons**: More complex training process\n",
"\n",
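"The heart of BPE training - counting adjacent pairs and merging the most frequent one - can be sketched with `collections.Counter` (toy corpus, illustrative only):\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"words = [['h','e','l','l','o</w>'], ['h','e','l','l','o</w>'], ['h','e','l','p</w>']]\n",
"pair_counts = Counter()\n",
"for tokens in words:\n",
"    for pair in zip(tokens, tokens[1:]):   # every adjacent pair in every word\n",
"        pair_counts[pair] += 1\n",
"print(pair_counts.most_common(1))  # [(('h', 'e'), 3)]\n",
"```\n",
"\n",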
"### Strategy Comparison\n",
"\n",
"```\n",
"Text: \"tokenization\" (12 characters)\n",
"\n",
"Character: ['t','o','k','e','n','i','z','a','t','i','o','n'] → 12 tokens, vocab ~100\n",
"Word: ['tokenization'] → 1 token, vocab 100K+\n",
"BPE: ['token','ization'] → 2 tokens, vocab 10-50K\n",
"```\n",
"\n",
"The sweet spot for most applications is BPE with 10K-50K vocabulary size."
]
},
{
"cell_type": "markdown",
"id": "7b6f7e01",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 3. Implementation - Building Tokenization Systems\n",
"\n",
"Let's implement tokenization systems from simple character-based to sophisticated BPE. We'll start with the base interface and work our way up to advanced algorithms."
]
},
{
"cell_type": "markdown",
"id": "6da9d664",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### Base Tokenizer Interface\n",
"\n",
"All tokenizers need to provide two core operations: encoding text to numbers and decoding numbers back to text. Let's define the common interface.\n",
"\n",
"```\n",
"Tokenizer Interface:\n",
"  encode(text) → [id1, id2, id3, ...]\n",
"  decode([id1, id2, id3, ...]) → text\n",
"```\n",
"\n",
"This ensures consistent behavior across different tokenization strategies."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07703775",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "base-tokenizer",
"solution": true
}
},
"outputs": [],
"source": [
"#| export\n",
"class Tokenizer:\n",
"    \"\"\"\n",
"    Base tokenizer class providing the interface for all tokenizers.\n",
"\n",
"    This defines the contract that all tokenizers must follow:\n",
"    - encode(): text → list of token IDs\n",
"    - decode(): list of token IDs → text\n",
"    \"\"\"\n",
"\n",
"    def encode(self, text: str) -> List[int]:\n",
"        \"\"\"\n",
"        Convert text to a list of token IDs.\n",
"\n",
"        TODO: Implement encoding logic in subclasses\n",
"\n",
"        APPROACH:\n",
"        1. Subclasses will override this method\n",
"        2. Return list of integer token IDs\n",
"\n",
"        EXAMPLE:\n",
"        >>> tokenizer = CharTokenizer(['a', 'b', 'c'])\n",
"        >>> tokenizer.encode(\"abc\")\n",
"        [1, 2, 3]\n",
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" raise NotImplementedError(\"Subclasses must implement encode()\")\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
" def decode(self, tokens: List[int]) -> str:\n",
|
||
" \"\"\"\n",
|
||
" Convert list of token IDs back to text.\n",
|
||
"\n",
|
||
" TODO: Implement decoding logic in subclasses\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Subclasses will override this method\n",
|
||
" 2. Return reconstructed text string\n",
|
||
"\n",
|
||
" EXAMPLE:\n",
|
||
" >>> tokenizer = CharTokenizer(['a', 'b', 'c'])\n",
|
||
"        >>> tokenizer.decode([1, 2, 3])\n",
"        \"abc\"\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        raise NotImplementedError(\"Subclasses must implement decode()\")\n",
"        ### END SOLUTION"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "66f5edec",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-base-tokenizer",
"locked": true,
"points": 5
}
},
"outputs": [],
"source": [
"def test_unit_base_tokenizer():\n",
"    \"\"\"🔬 Test base tokenizer interface.\"\"\"\n",
"    print(\"🔬 Unit Test: Base Tokenizer Interface...\")\n",
"\n",
"    # Test that base class defines the interface\n",
"    tokenizer = Tokenizer()\n",
"\n",
"    # Should raise NotImplementedError for both methods\n",
"    try:\n",
"        tokenizer.encode(\"test\")\n",
"        assert False, \"encode() should raise NotImplementedError\"\n",
"    except NotImplementedError:\n",
"        pass\n",
"\n",
"    try:\n",
"        tokenizer.decode([1, 2, 3])\n",
"        assert False, \"decode() should raise NotImplementedError\"\n",
"    except NotImplementedError:\n",
"        pass\n",
"\n",
"    print(\"✅ Base tokenizer interface works correctly!\")\n",
"\n",
"test_unit_base_tokenizer()"
]
},
{
"cell_type": "markdown",
"id": "472f18d8",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### Character-Level Tokenizer\n",
"\n",
"The simplest tokenization approach: each character becomes a token. This gives us perfect coverage of any text but produces long sequences.\n",
"\n",
"```\n",
"Character Tokenization Process:\n",
"\n",
"Step 1: Build vocabulary from unique characters\n",
"Text corpus: [\"hello\", \"world\"]\n",
"Unique chars: ['h', 'e', 'l', 'o', 'w', 'r', 'd']\n",
"Vocabulary: ['<UNK>', 'h', 'e', 'l', 'o', 'w', 'r', 'd']  # <UNK> for unknown\n",
"                0     1    2    3    4    5    6    7\n",
"\n",
"Step 2: Encode text character by character\n",
"Text: \"hello\"\n",
"  'h' → 1\n",
"  'e' → 2\n",
"  'l' → 3\n",
"  'l' → 3\n",
"  'o' → 4\n",
"Result: [1, 2, 3, 3, 4]\n",
"\n",
"Step 3: Decode by looking up each ID\n",
"IDs: [1, 2, 3, 3, 4]\n",
"  1 → 'h'\n",
"  2 → 'e'\n",
"  3 → 'l'\n",
"  3 → 'l'\n",
"  4 → 'o'\n",
"Result: \"hello\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8413441a",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "char-tokenizer",
"solution": true
}
},
"outputs": [],
"source": [
"#| export\n",
"class CharTokenizer(Tokenizer):\n",
"    \"\"\"\n",
"    Character-level tokenizer that treats each character as a separate token.\n",
"\n",
"    This is the simplest tokenization approach - every character in the\n",
"    vocabulary gets its own unique ID.\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, vocab: Optional[List[str]] = None):\n",
"        \"\"\"\n",
"        Initialize character tokenizer.\n",
"\n",
"        TODO: Set up vocabulary mappings\n",
"\n",
"        APPROACH:\n",
"        1. Store vocabulary list\n",
"        2. Create char→id and id→char mappings\n",
"        3. Handle special tokens (unknown character)\n",
"\n",
"        EXAMPLE:\n",
"        >>> tokenizer = CharTokenizer(['a', 'b', 'c'])\n",
"        >>> tokenizer.vocab_size\n",
"        4  # 3 chars + 1 unknown token\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if vocab is None:\n",
"            vocab = []\n",
"\n",
"        # Add special unknown token\n",
"        self.vocab = ['<UNK>'] + vocab\n",
"        self.vocab_size = len(self.vocab)\n",
"\n",
"        # Create bidirectional mappings\n",
"        self.char_to_id = {char: idx for idx, char in enumerate(self.vocab)}\n",
"        self.id_to_char = {idx: char for idx, char in enumerate(self.vocab)}\n",
"\n",
"        # Store unknown token ID\n",
"        self.unk_id = 0\n",
"        ### END SOLUTION\n",
"\n",
"    def build_vocab(self, corpus: List[str]) -> None:\n",
"        \"\"\"\n",
"        Build vocabulary from a corpus of text.\n",
"\n",
"        TODO: Extract unique characters and build vocabulary\n",
"\n",
"        APPROACH:\n",
"        1. Collect all unique characters from corpus\n",
"        2. Sort for consistent ordering\n",
"        3. Rebuild mappings with new vocabulary\n",
"\n",
"        HINTS:\n",
"        - Use set() to find unique characters\n",
"        - Join all texts then convert to set\n",
"        - Don't forget the <UNK> token\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Collect all unique characters\n",
"        all_chars = set()\n",
"        for text in corpus:\n",
"            all_chars.update(text)\n",
"\n",
"        # Sort for consistent ordering\n",
"        unique_chars = sorted(list(all_chars))\n",
"\n",
"        # Rebuild vocabulary with <UNK> token first\n",
"        self.vocab = ['<UNK>'] + unique_chars\n",
"        self.vocab_size = len(self.vocab)\n",
"\n",
"        # Rebuild mappings\n",
"        self.char_to_id = {char: idx for idx, char in enumerate(self.vocab)}\n",
"        self.id_to_char = {idx: char for idx, char in enumerate(self.vocab)}\n",
"        ### END SOLUTION\n",
"\n",
"    def encode(self, text: str) -> List[int]:\n",
"        \"\"\"\n",
"        Encode text to list of character IDs.\n",
"\n",
"        TODO: Convert each character to its vocabulary ID\n",
"\n",
"        APPROACH:\n",
"        1. Iterate through each character in text\n",
"        2. Look up character ID in vocabulary\n",
"        3. Use unknown token ID for unseen characters\n",
"\n",
"        EXAMPLE:\n",
"        >>> tokenizer = CharTokenizer(['h', 'e', 'l', 'o'])\n",
"        >>> tokenizer.encode(\"hello\")\n",
"        [1, 2, 3, 3, 4]  # maps to h,e,l,l,o\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        tokens = []\n",
"        for char in text:\n",
"            tokens.append(self.char_to_id.get(char, self.unk_id))\n",
"        return tokens\n",
"        ### END SOLUTION\n",
"\n",
"    def decode(self, tokens: List[int]) -> str:\n",
"        \"\"\"\n",
"        Decode list of token IDs back to text.\n",
"\n",
"        TODO: Convert each token ID back to its character\n",
"\n",
"        APPROACH:\n",
"        1. Look up each token ID in vocabulary\n",
"        2. Join characters into string\n",
"        3. Handle invalid token IDs gracefully\n",
"\n",
"        EXAMPLE:\n",
"        >>> tokenizer = CharTokenizer(['h', 'e', 'l', 'o'])\n",
"        >>> tokenizer.decode([1, 2, 3, 3, 4])\n",
"        \"hello\"\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        chars = []\n",
"        for token_id in tokens:\n",
"            # Use unknown token for invalid IDs\n",
"            char = self.id_to_char.get(token_id, '<UNK>')\n",
"            chars.append(char)\n",
"        return ''.join(chars)\n",
"        ### END SOLUTION"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5268f9a8",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-char-tokenizer",
"locked": true,
"points": 15
}
},
"outputs": [],
"source": [
"def test_unit_char_tokenizer():\n",
"    \"\"\"🔬 Test character tokenizer implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: Character Tokenizer...\")\n",
"\n",
"    # Test basic functionality\n",
"    vocab = ['h', 'e', 'l', 'o', ' ', 'w', 'r', 'd']\n",
"    tokenizer = CharTokenizer(vocab)\n",
"\n",
"    # Test vocabulary setup\n",
"    assert tokenizer.vocab_size == 9  # 8 chars + UNK\n",
"    assert tokenizer.vocab[0] == '<UNK>'\n",
"    assert 'h' in tokenizer.char_to_id\n",
"\n",
"    # Test encoding\n",
"    text = \"hello\"\n",
"    tokens = tokenizer.encode(text)\n",
"    expected = [1, 2, 3, 3, 4]  # h,e,l,l,o (based on actual vocab order)\n",
"    assert tokens == expected, f\"Expected {expected}, got {tokens}\"\n",
"\n",
"    # Test decoding\n",
"    decoded = tokenizer.decode(tokens)\n",
"    assert decoded == text, f\"Expected '{text}', got '{decoded}'\"\n",
"\n",
"    # Test unknown character handling\n",
"    tokens_with_unk = tokenizer.encode(\"hello!\")\n",
"    assert tokens_with_unk[-1] == 0  # '!' should map to <UNK>\n",
"\n",
"    # Test vocabulary building\n",
"    corpus = [\"hello world\", \"test text\"]\n",
"    tokenizer.build_vocab(corpus)\n",
"    assert 't' in tokenizer.char_to_id\n",
"    assert 'x' in tokenizer.char_to_id\n",
"\n",
"    print(\"✅ Character tokenizer works correctly!\")\n",
"\n",
"test_unit_char_tokenizer()"
]
},
{
"cell_type": "markdown",
"id": "389f7a3a",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🧪 Character Tokenizer Analysis\n",
"Character tokenization provides a simple, robust foundation for text processing. The key insight is that with a small vocabulary (typically <100 characters), we can represent any text without unknown tokens.\n",
"\n",
"**Trade-offs**:\n",
"- **Pro**: No out-of-vocabulary issues, handles any language\n",
"- **Con**: Long sequences (1 char = 1 token), limited semantic understanding\n",
"- **Use case**: When robustness is more important than efficiency"
]
},
{
"cell_type": "markdown",
"id": "246bba99",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### Byte Pair Encoding (BPE) Tokenizer\n",
"\n",
"BPE is the secret sauce behind modern language models (GPT uses BPE; BERT uses the closely related WordPiece). It learns to merge frequent character pairs, creating subword units that balance vocabulary size with sequence length.\n",
"\n",
"```\n",
"┌───────────────────────────────────────────────────────────────────────────┐\n",
"│ BPE TRAINING ALGORITHM: Learning Subword Units                            │\n",
"├───────────────────────────────────────────────────────────────────────────┤\n",
"│                                                                           │\n",
"│ STEP 1: Initialize with Character Vocabulary                              │\n",
"│ ┌──────────────────────────────────────────────────────────────┐          │\n",
"│ │ Training Data: [\"hello\", \"hello\", \"help\"]                    │          │\n",
"│ │                                                              │          │\n",
"│ │ Initial Tokens (with end-of-word markers):                   │          │\n",
"│ │ ['h','e','l','l','o</w>']   (hello)                          │          │\n",
"│ │ ['h','e','l','l','o</w>']   (hello)                          │          │\n",
"│ │ ['h','e','l','p</w>']       (help)                           │          │\n",
"│ │                                                              │          │\n",
"│ │ Starting Vocab: ['h', 'e', 'l', 'o</w>', 'p</w>']           │          │\n",
"│ │                   ↑ All unique symbols                       │          │\n",
"│ └──────────────────────────────────────────────────────────────┘          │\n",
"│                                                                           │\n",
"│ STEP 2: Count All Adjacent Pairs                                          │\n",
"│ ┌──────────────────────────────────────────────────────────────┐          │\n",
"│ │ Pair Frequency Analysis:                                     │          │\n",
"│ │                                                              │          │\n",
"│ │ ('h', 'e'): ██████ 3 occurrences  ← MOST FREQUENT!           │          │\n",
"│ │ ('e', 'l'): ██████ 3 occurrences                             │          │\n",
"│ │ ('l', 'l'): ████ 2 occurrences                               │          │\n",
"│ │ ('l', 'o</w>'): ████ 2 occurrences                           │          │\n",
"│ │ ('l', 'p</w>'): ██ 1 occurrence                              │          │\n",
"│ └──────────────────────────────────────────────────────────────┘          │\n",
"│                                                                           │\n",
"│ STEP 3: Merge Most Frequent Pair                                          │\n",
"│ ┌──────────────────────────────────────────────────────────────┐          │\n",
"│ │ Merge Operation: ('h', 'e') → 'he'                           │          │\n",
"│ │                                                              │          │\n",
"│ │ BEFORE:                       AFTER:                         │          │\n",
"│ │ ['h','e','l','l','o</w>'] →   ['he','l','l','o</w>']         │          │\n",
"│ │ ['h','e','l','l','o</w>'] →   ['he','l','l','o</w>']         │          │\n",
"│ │ ['h','e','l','p</w>']     →   ['he','l','p</w>']             │          │\n",
"│ │                                                              │          │\n",
"│ │ Updated Vocab: ['h','e','l','o</w>','p</w>', 'he']           │          │\n",
"│ │                                            ↑ NEW TOKEN!      │          │\n",
"│ └──────────────────────────────────────────────────────────────┘          │\n",
"│                                                                           │\n",
"│ STEP 4: Repeat Until Target Vocab Size Reached                            │\n",
"│ ┌──────────────────────────────────────────────────────────────┐          │\n",
"│ │ Iteration 2: Next most frequent is ('l', 'l')                │          │\n",
"│ │ Merge ('l','l') → 'll'                                       │          │\n",
"│ │                                                              │          │\n",
"│ │ ['he','l','l','o</w>'] →   ['he','ll','o</w>']               │          │\n",
"│ │ ['he','l','l','o</w>'] →   ['he','ll','o</w>']               │          │\n",
"│ │ ['he','l','p</w>']     →   ['he','l','p</w>']                │          │\n",
"│ │                                                              │          │\n",
"│ │ Updated Vocab: ['h','e','l','o</w>','p</w>','he','ll']       │          │\n",
"│ │                                                  ↑ NEW!      │          │\n",
"│ │                                                              │          │\n",
"│ │ Continue merging until vocab_size target...                  │          │\n",
"│ └──────────────────────────────────────────────────────────────┘          │\n",
"│                                                                           │\n",
"│ FINAL RESULTS:                                                            │\n",
"│ ┌──────────────────────────────────────────────────────────────┐          │\n",
"│ │ Trained BPE can now encode efficiently:                      │          │\n",
"│ │                                                              │          │\n",
"│ │ \"hello\" → ['he', 'll', 'o</w>'] = 3 tokens (vs 5 chars)      │          │\n",
"│ │ \"help\"  → ['he', 'l', 'p</w>'] = 3 tokens (vs 4 chars)       │          │\n",
"│ │                                                              │          │\n",
"│ │ Key Insights: BPE automatically discovers:                   │          │\n",
"│ │ - Common prefixes ('he')                                     │          │\n",
"│ │ - Morphological patterns ('ll')                              │          │\n",
"│ │ - Natural word boundaries (</w>)                             │          │\n",
"│ └──────────────────────────────────────────────────────────────┘          │\n",
"│                                                                           │\n",
"└───────────────────────────────────────────────────────────────────────────┘\n",
"```\n",
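"\n",
"At encode time, the learned merges are replayed in order as simple pair replacements. A minimal sketch (hypothetical merge list, not the trained tokenizer):\n",
"\n",
"```python\n",
"merges = [('h', 'e'), ('l', 'l')]   # learned during training, in order\n",
"tokens = ['h', 'e', 'l', 'l', 'o</w>']\n",
"for a, b in merges:\n",
"    out, i = [], 0\n",
"    while i < len(tokens):\n",
"        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):\n",
"            out.append(a + b)       # merge the pair into one token\n",
"            i += 2\n",
"        else:\n",
"            out.append(tokens[i])\n",
"            i += 1\n",
"    tokens = out\n",
"print(tokens)  # ['he', 'll', 'o</w>']\n",
"```\n",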
"\n",
"**Why BPE Works**: By starting with characters and iteratively merging frequent pairs, BPE discovers the natural statistical patterns in language. Common words become single tokens, rare words split into recognizable subword pieces!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0190c2fc",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "bpe-tokenizer",
"solution": true
}
},
"outputs": [],
"source": [
"#| export\n",
"class BPETokenizer(Tokenizer):\n",
"    \"\"\"\n",
"    Byte Pair Encoding (BPE) tokenizer that learns subword units.\n",
"\n",
"    BPE works by:\n",
"    1. Starting with character-level vocabulary\n",
"    2. Finding most frequent character pairs\n",
"    3. Merging frequent pairs into single tokens\n",
"    4. Repeating until desired vocabulary size\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, vocab_size: int = 1000):\n",
"        \"\"\"\n",
"        Initialize BPE tokenizer.\n",
"\n",
"        TODO: Set up basic tokenizer state\n",
"\n",
"        APPROACH:\n",
"        1. Store target vocabulary size\n",
"        2. Initialize empty vocabulary and merge rules\n",
"        3. Set up mappings for encoding/decoding\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        self.vocab_size = vocab_size\n",
"        self.vocab = []\n",
"        self.merges = []  # Ordered list of merged pairs\n",
"        self.token_to_id = {}\n",
"        self.id_to_token = {}\n",
"        ### END SOLUTION\n",
"\n",
"    def _get_word_tokens(self, word: str) -> List[str]:\n",
"        \"\"\"\n",
"        Convert word to list of characters with end-of-word marker.\n",
"\n",
"        TODO: Tokenize word into character sequence\n",
"\n",
"        APPROACH:\n",
"        1. Split word into characters\n",
"        2. Add </w> marker to last character\n",
"        3. Return list of tokens\n",
"\n",
"        EXAMPLE:\n",
"        >>> tokenizer._get_word_tokens(\"hello\")\n",
"        ['h', 'e', 'l', 'l', 'o</w>']\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if not word:\n",
"            return []\n",
"\n",
"        tokens = list(word)\n",
"        tokens[-1] += '</w>'  # Mark end of word\n",
"        return tokens\n",
"        ### END SOLUTION\n",
"\n",
"    def _get_pairs(self, word_tokens: List[str]) -> Set[Tuple[str, str]]:\n",
"        \"\"\"\n",
"        Get all adjacent pairs from word tokens.\n",
"\n",
"        TODO: Extract all consecutive character pairs\n",
"\n",
"        APPROACH:\n",
"        1. Iterate through adjacent tokens\n",
"        2. Create pairs of consecutive tokens\n",
"        3. Return set of unique pairs\n",
"\n",
"        EXAMPLE:\n",
"        >>> tokenizer._get_pairs(['h', 'e', 'l', 'l', 'o</w>'])\n",
"        {('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o</w>')}\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        pairs = set()\n",
"        for i in range(len(word_tokens) - 1):\n",
"            pairs.add((word_tokens[i], word_tokens[i + 1]))\n",
"        return pairs\n",
"        ### END SOLUTION\n",
"\n",
"    def train(self, corpus: List[str], vocab_size: Optional[int] = None) -> None:\n",
"        \"\"\"\n",
"        Train BPE on corpus to learn merge rules.\n",
"\n",
"        TODO: Implement BPE training algorithm\n",
"\n",
"        APPROACH:\n",
"        1. Build initial character vocabulary\n",
"        2. Count word frequencies in corpus\n",
"        3. Iteratively merge most frequent pairs\n",
"        4. Build final vocabulary and mappings\n",
"\n",
"        HINTS:\n",
"        - Start with character-level tokens\n",
"        - Use frequency counts to guide merging\n",
"        - Stop when vocabulary reaches target size\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if vocab_size:\n",
"            self.vocab_size = vocab_size\n",
"\n",
"        # Count word frequencies\n",
"        word_freq = Counter(corpus)\n",
"\n",
"        # Initialize vocabulary with characters\n",
"        vocab = set()\n",
"        word_tokens = {}\n",
"\n",
"        for word in word_freq:\n",
"            tokens = self._get_word_tokens(word)\n",
"            word_tokens[word] = tokens\n",
"            vocab.update(tokens)\n",
"\n",
"        # Convert to sorted list for consistency\n",
"        self.vocab = sorted(list(vocab))\n",
"\n",
"        # Add special tokens\n",
"        if '<UNK>' not in self.vocab:\n",
"            self.vocab = ['<UNK>'] + self.vocab\n",
"\n",
"        # Learn merges\n",
"        self.merges = []\n",
"\n",
"        while len(self.vocab) < self.vocab_size:\n",
"            # Count all pairs across all words\n",
"            pair_counts = Counter()\n",
"\n",
"            for word, freq in word_freq.items():\n",
"                tokens = word_tokens[word]\n",
"                pairs = self._get_pairs(tokens)\n",
"                for pair in pairs:\n",
"                    pair_counts[pair] += freq\n",
"\n",
"            if not pair_counts:\n",
"                break\n",
"\n",
"            # Get most frequent pair\n",
"            best_pair = pair_counts.most_common(1)[0][0]\n",
"\n",
"            # Merge this pair in all words\n",
"            for word in word_tokens:\n",
"                tokens = word_tokens[word]\n",
"                new_tokens = []\n",
"                i = 0\n",
"                while i < len(tokens):\n",
"                    if (i < len(tokens) - 1 and\n",
"                        tokens[i] == best_pair[0] and\n",
"                        tokens[i + 1] == best_pair[1]):\n",
"                        # Merge pair\n",
"                        new_tokens.append(best_pair[0] + best_pair[1])\n",
"                        i += 2\n",
"                    else:\n",
"                        new_tokens.append(tokens[i])\n",
"                        i += 1\n",
"                word_tokens[word] = new_tokens\n",
"\n",
"            # Add merged token to vocabulary\n",
"            merged_token = best_pair[0] + best_pair[1]\n",
"            self.vocab.append(merged_token)\n",
"            self.merges.append(best_pair)\n",
"\n",
"        # Build final mappings\n",
"        self._build_mappings()\n",
"        ### END SOLUTION\n",
"\n",
"    def _build_mappings(self):\n",
"        \"\"\"Build token-to-ID and ID-to-token mappings.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        self.token_to_id = {token: idx for idx, token in enumerate(self.vocab)}\n",
"        self.id_to_token = {idx: token for idx, token in enumerate(self.vocab)}\n",
"        ### END SOLUTION\n",
"\n",
"    def _apply_merges(self, tokens: List[str]) -> List[str]:\n",
"        \"\"\"\n",
"        Apply learned merge rules to token sequence.\n",
"\n",
"        TODO: Apply BPE merges to token list\n",
"\n",
"        APPROACH:\n",
"        1. Start with character-level tokens\n",
"        2. Apply each merge rule in order\n",
"        3. Continue until no more merges possible\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if not self.merges:\n",
"            return tokens\n",
"\n",
"        for merge_pair in self.merges:\n",
"            new_tokens = []\n",
"            i = 0\n",
"            while i < len(tokens):\n",
"                if (i < len(tokens) - 1 and\n",
"                    tokens[i] == merge_pair[0] and\n",
"                    tokens[i + 1] == merge_pair[1]):\n",
"                    # Apply merge\n",
"                    new_tokens.append(merge_pair[0] + merge_pair[1])\n",
"                    i += 2\n",
"                else:\n",
"                    new_tokens.append(tokens[i])\n",
"                    i += 1\n",
"            tokens = new_tokens\n",
"\n",
|
||
" return tokens\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
" def encode(self, text: str) -> List[int]:\n",
|
||
" \"\"\"\n",
|
||
" Encode text using BPE.\n",
|
||
"\n",
|
||
" TODO: Apply BPE encoding to text\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Split text into words\n",
|
||
" 2. Convert each word to character tokens\n",
|
||
" 3. Apply BPE merges\n",
|
||
" 4. Convert to token IDs\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" if not self.vocab:\n",
|
||
" return []\n",
|
||
"\n",
|
||
" # Simple word splitting (could be more sophisticated)\n",
|
||
" words = text.split()\n",
|
||
" all_tokens = []\n",
|
||
"\n",
|
||
" for word in words:\n",
|
||
" # Get character-level tokens\n",
|
||
" word_tokens = self._get_word_tokens(word)\n",
|
||
"\n",
|
||
" # Apply BPE merges\n",
|
||
" merged_tokens = self._apply_merges(word_tokens)\n",
|
||
"\n",
|
||
" all_tokens.extend(merged_tokens)\n",
|
||
"\n",
|
||
" # Convert to IDs\n",
|
||
" token_ids = []\n",
|
||
" for token in all_tokens:\n",
|
||
" token_ids.append(self.token_to_id.get(token, 0)) # 0 = <UNK>\n",
|
||
"\n",
|
||
" return token_ids\n",
|
||
" ### END SOLUTION\n",
|
||
"\n",
|
||
" def decode(self, tokens: List[int]) -> str:\n",
|
||
" \"\"\"\n",
|
||
" Decode token IDs back to text.\n",
|
||
"\n",
|
||
" TODO: Convert token IDs back to readable text\n",
|
||
"\n",
|
||
" APPROACH:\n",
|
||
" 1. Convert IDs to tokens\n",
|
||
" 2. Join tokens together\n",
|
||
" 3. Clean up word boundaries and markers\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" if not self.id_to_token:\n",
|
||
" return \"\"\n",
|
||
"\n",
|
||
" # Convert IDs to tokens\n",
|
||
" token_strings = []\n",
|
||
" for token_id in tokens:\n",
|
||
" token = self.id_to_token.get(token_id, '<UNK>')\n",
|
||
" token_strings.append(token)\n",
|
||
"\n",
|
||
" # Join and clean up\n",
|
||
" text = ''.join(token_strings)\n",
|
||
"\n",
|
||
" # Replace end-of-word markers with spaces\n",
|
||
" text = text.replace('</w>', ' ')\n",
|
||
"\n",
|
||
" # Clean up extra spaces\n",
|
||
" text = ' '.join(text.split())\n",
|
||
"\n",
|
||
" return text\n",
|
||
" ### END SOLUTION"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f7bd31f",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-bpe-tokenizer",
"locked": true,
"points": 20
}
},
"outputs": [],
"source": [
"def test_unit_bpe_tokenizer():\n",
"    \"\"\"🔬 Test BPE tokenizer implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: BPE Tokenizer...\")\n",
"\n",
"    # Test basic functionality with simple corpus\n",
"    corpus = [\"hello\", \"world\", \"hello\", \"hell\"]  # \"hell\" and \"hello\" share a prefix\n",
"    tokenizer = BPETokenizer(vocab_size=20)\n",
"    tokenizer.train(corpus)\n",
"\n",
"    # Check that vocabulary was built\n",
"    assert len(tokenizer.vocab) > 0\n",
"    assert '<UNK>' in tokenizer.vocab\n",
"\n",
"    # Test helper functions\n",
"    word_tokens = tokenizer._get_word_tokens(\"test\")\n",
"    assert word_tokens[-1].endswith('</w>'), \"Should have end-of-word marker\"\n",
"\n",
"    pairs = tokenizer._get_pairs(['h', 'e', 'l', 'l', 'o</w>'])\n",
"    assert ('h', 'e') in pairs\n",
"    assert ('l', 'l') in pairs\n",
"\n",
"    # Test encoding/decoding\n",
"    text = \"hello\"\n",
"    tokens = tokenizer.encode(text)\n",
"    assert isinstance(tokens, list)\n",
"    assert all(isinstance(t, int) for t in tokens)\n",
"\n",
"    decoded = tokenizer.decode(tokens)\n",
"    assert isinstance(decoded, str)\n",
"\n",
"    # Round-trip on training data should work well\n",
"    for word in corpus:\n",
"        tokens = tokenizer.encode(word)\n",
"        decoded = tokenizer.decode(tokens)\n",
"        # Allow some flexibility due to BPE merging\n",
"        assert len(decoded.strip()) > 0\n",
"\n",
"    print(\"✅ BPE tokenizer works correctly!\")\n",
"\n",
"test_unit_bpe_tokenizer()"
]
},
{
"cell_type": "markdown",
"id": "3baf97cf",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🧪 BPE Tokenizer Analysis\n",
"\n",
"BPE strikes a balance between vocabulary size and sequence length. By learning frequent subword patterns, it can handle new words through decomposition while keeping sequences reasonably short.\n",
"\n",
"```\n",
"BPE Merging Visualization:\n",
"\n",
"Original: \"tokenization\" → ['t','o','k','e','n','i','z','a','t','i','o','n','</w>']\n",
"                              ↓ Merge frequent pairs\n",
"Step 1: ('t','o') is frequent → ['to','k','e','n','i','z','a','t','i','o','n','</w>']\n",
"Step 2: ('i','o') is frequent → ['to','k','e','n','i','z','a','t','io','n','</w>']\n",
"Step 3: ('io','n') is frequent → ['to','k','e','n','i','z','a','t','ion','</w>']\n",
"Step 4: ('to','k') is frequent → ['tok','e','n','i','z','a','t','ion','</w>']\n",
"                              ↓ Continue merging...\n",
"Final: \"tokenization\" → ['token','ization</w>']  # 2 tokens vs 13 character-level symbols!\n",
"```\n",
"\n",
"**Key insights**:\n",
"- **Adaptive vocabulary**: Learns from data, not hand-crafted\n",
"- **Subword robustness**: Handles rare/new words through decomposition\n",
"- **Efficiency trade-off**: Larger vocabulary → shorter sequences → faster processing\n",
"- **Morphological awareness**: Naturally discovers prefixes, suffixes, roots"
]
},
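{
"cell_type": "markdown",
"id": "f0e1d2c3",
"metadata": {},
"source": [
"A quick usage sketch of the `BPETokenizer` built above (illustrative only: exact token splits depend on the training corpus):\n",
"\n",
"```python\n",
"corpus = [\"low\", \"lower\", \"lowest\", \"low\"]\n",
"bpe = BPETokenizer(vocab_size=30)\n",
"bpe.train(corpus)            # learns merges such as ('l', 'o') -> 'lo'\n",
"ids = bpe.encode(\"lowest\")   # list of int token IDs\n",
"text = bpe.decode(ids)       # back to readable text after </w> cleanup\n",
"```\n",
"\n",
"Unseen words still encode: they fall back to smaller learned subwords, or `<UNK>` for characters never seen in training."
]
},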
{
"cell_type": "markdown",
"id": "0b06184b",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 4. Integration - Bringing It Together\n",
"\n",
"Now let's build utility functions that make tokenization easy to use in practice. These tools will help you tokenize datasets, analyze performance, and choose the right strategy.\n",
"\n",
"```\n",
"Tokenization Workflow:\n",
"\n",
"1. Choose Strategy → 2. Train Tokenizer → 3. Process Dataset → 4. Analyze Results\n",
"        ↓                     ↓                     ↓                     ↓\n",
"    char/bpe           corpus training        batch encoding        stats/metrics\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8899f6cd",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "tokenization-utils",
"solution": true
}
},
"outputs": [],
"source": [
"def create_tokenizer(strategy: str = \"char\", vocab_size: int = 1000, corpus: Optional[List[str]] = None) -> Tokenizer:\n",
"    \"\"\"\n",
"    Factory function to create and train tokenizers.\n",
"\n",
"    TODO: Create appropriate tokenizer based on strategy\n",
"\n",
"    APPROACH:\n",
"    1. Check strategy type\n",
"    2. Create appropriate tokenizer class\n",
"    3. Train on corpus if provided\n",
"    4. Return configured tokenizer\n",
"\n",
"    EXAMPLE:\n",
"    >>> corpus = [\"hello world\", \"test text\"]\n",
"    >>> tokenizer = create_tokenizer(\"char\", corpus=corpus)\n",
"    >>> tokens = tokenizer.encode(\"hello\")\n",
"    \"\"\"\n",
"    ### BEGIN SOLUTION\n",
"    if strategy == \"char\":\n",
"        tokenizer = CharTokenizer()\n",
"        if corpus:\n",
"            tokenizer.build_vocab(corpus)\n",
"    elif strategy == \"bpe\":\n",
"        tokenizer = BPETokenizer(vocab_size=vocab_size)\n",
"        if corpus:\n",
"            tokenizer.train(corpus, vocab_size)\n",
"    else:\n",
"        raise ValueError(f\"Unknown tokenization strategy: {strategy}\")\n",
"\n",
"    return tokenizer\n",
"    ### END SOLUTION\n",
"\n",
"def tokenize_dataset(texts: List[str], tokenizer: Tokenizer, max_length: Optional[int] = None) -> List[List[int]]:\n",
"    \"\"\"\n",
"    Tokenize a dataset with optional length limits.\n",
"\n",
"    TODO: Tokenize all texts with consistent preprocessing\n",
"\n",
"    APPROACH:\n",
"    1. Encode each text with the tokenizer\n",
"    2. Apply max_length truncation if specified\n",
"    3. Return list of tokenized sequences\n",
"\n",
"    HINTS:\n",
"    - Handle empty texts gracefully\n",
"    - Truncate from the end if too long\n",
"    \"\"\"\n",
"    ### BEGIN SOLUTION\n",
"    tokenized = []\n",
"    for text in texts:\n",
"        tokens = tokenizer.encode(text)\n",
"\n",
"        # Apply length limit\n",
"        if max_length and len(tokens) > max_length:\n",
"            tokens = tokens[:max_length]\n",
"\n",
"        tokenized.append(tokens)\n",
"\n",
"    return tokenized\n",
"    ### END SOLUTION\n",
"\n",
"def analyze_tokenization(texts: List[str], tokenizer: Tokenizer) -> Dict[str, float]:\n",
"    \"\"\"\n",
"    Analyze tokenization statistics.\n",
"\n",
"    TODO: Compute useful statistics about tokenization\n",
"\n",
"    APPROACH:\n",
"    1. Tokenize all texts\n",
"    2. Compute sequence length statistics\n",
"    3. Calculate compression ratio\n",
"    4. Return analysis dictionary\n",
"    \"\"\"\n",
"    ### BEGIN SOLUTION\n",
"    all_tokens = []\n",
"    total_chars = 0\n",
"    tokenized_lengths = []\n",
"\n",
"    for text in texts:\n",
"        tokens = tokenizer.encode(text)\n",
"        all_tokens.extend(tokens)\n",
"        tokenized_lengths.append(len(tokens))\n",
"        total_chars += len(text)\n",
"\n",
"    # Calculate statistics (reuse the lengths gathered above; no need to re-encode)\n",
"    stats = {\n",
"        'vocab_size': tokenizer.vocab_size if hasattr(tokenizer, 'vocab_size') else len(tokenizer.vocab),\n",
"        'avg_sequence_length': np.mean(tokenized_lengths) if tokenized_lengths else 0,\n",
"        'max_sequence_length': max(tokenized_lengths) if tokenized_lengths else 0,\n",
"        'total_tokens': len(all_tokens),\n",
"        'compression_ratio': total_chars / len(all_tokens) if all_tokens else 0,\n",
"        'unique_tokens': len(set(all_tokens))\n",
"    }\n",
"\n",
"    return stats\n",
"    ### END SOLUTION"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4a23373",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-tokenization-utils",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_tokenization_utils():\n",
"    \"\"\"🔬 Test tokenization utility functions.\"\"\"\n",
"    print(\"🔬 Unit Test: Tokenization Utils...\")\n",
"\n",
"    # Test tokenizer factory\n",
"    corpus = [\"hello world\", \"test text\", \"more examples\"]\n",
"\n",
"    char_tokenizer = create_tokenizer(\"char\", corpus=corpus)\n",
"    assert isinstance(char_tokenizer, CharTokenizer)\n",
"    assert char_tokenizer.vocab_size > 0\n",
"\n",
"    bpe_tokenizer = create_tokenizer(\"bpe\", vocab_size=50, corpus=corpus)\n",
"    assert isinstance(bpe_tokenizer, BPETokenizer)\n",
"\n",
"    # Test dataset tokenization\n",
"    texts = [\"hello\", \"world\", \"test\"]\n",
"    tokenized = tokenize_dataset(texts, char_tokenizer, max_length=10)\n",
"    assert len(tokenized) == len(texts)\n",
"    assert all(len(seq) <= 10 for seq in tokenized)\n",
"\n",
"    # Test analysis\n",
"    stats = analyze_tokenization(texts, char_tokenizer)\n",
"    assert 'vocab_size' in stats\n",
"    assert 'avg_sequence_length' in stats\n",
"    assert 'compression_ratio' in stats\n",
"    assert stats['total_tokens'] > 0\n",
"\n",
"    print(\"✅ Tokenization utils work correctly!\")\n",
"\n",
"test_unit_tokenization_utils()"
]
},
{
"cell_type": "markdown",
"id": "2771ad8d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 5. Systems Analysis - Tokenization Trade-offs\n",
"\n",
"Understanding the performance implications of different tokenization strategies is crucial for building efficient NLP systems."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58050b9b",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "tokenization-analysis",
"solution": true
}
},
"outputs": [],
"source": [
"def analyze_tokenization_strategies():\n",
"    \"\"\"📊 Compare different tokenization strategies on various texts.\"\"\"\n",
"    print(\"📊 Analyzing Tokenization Strategies...\")\n",
"\n",
"    # Create test corpus with different text types\n",
"    corpus = [\n",
"        \"Hello world\",\n",
"        \"The quick brown fox jumps over the lazy dog\",\n",
"        \"Machine learning is transforming artificial intelligence\",\n",
"        \"Tokenization is fundamental to natural language processing\",\n",
"        \"Subword units balance vocabulary size and sequence length\"\n",
"    ]\n",
"\n",
"    # Test different strategies\n",
"    strategies = [\n",
"        (\"Character\", create_tokenizer(\"char\", corpus=corpus)),\n",
"        (\"BPE-100\", create_tokenizer(\"bpe\", vocab_size=100, corpus=corpus)),\n",
"        (\"BPE-500\", create_tokenizer(\"bpe\", vocab_size=500, corpus=corpus))\n",
"    ]\n",
"\n",
"    print(f\"{'Strategy':<12} {'Vocab':<8} {'Avg Len':<8} {'Compression':<12} {'Unique':<10}\")\n",
"    print(\"-\" * 60)\n",
"\n",
"    for name, tokenizer in strategies:\n",
"        stats = analyze_tokenization(corpus, tokenizer)\n",
"\n",
"        print(f\"{name:<12} {stats['vocab_size']:<8} \"\n",
"              f\"{stats['avg_sequence_length']:<8.1f} \"\n",
"              f\"{stats['compression_ratio']:<12.2f} \"\n",
"              f\"{stats['unique_tokens']:<10}\")\n",
"\n",
"    print(\"\\n💡 Key Insights:\")\n",
"    print(\"- Character tokenization: Small vocab, long sequences, perfect coverage\")\n",
"    print(\"- BPE: a larger vocab buys shorter sequences\")\n",
"    print(\"- Higher compression ratio = more characters per token = more efficient processing\")\n",
"\n",
"analyze_tokenization_strategies()"
]
},
{
"cell_type": "markdown",
"id": "11fc9711",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 📊 Performance Analysis: Vocabulary Size vs Sequence Length\n",
"\n",
"The fundamental trade-off in tokenization creates a classic systems engineering challenge:\n",
"\n",
"```\n",
"Tokenization Trade-off Spectrum:\n",
"\n",
"Character          BPE-Small          BPE-Large          Word-Level\n",
"vocab: ~100     →  vocab: ~1K      →  vocab: ~50K     →  vocab: ~100K+\n",
"seq: very long  →  seq: long       →  seq: medium     →  seq: short\n",
"memory: low     →  memory: med     →  memory: high    →  memory: very high\n",
"compute: high   →  compute: med    →  compute: low    →  compute: very low\n",
"coverage: 100%  →  coverage: 99%   →  coverage: 95%   →  coverage: <80%\n",
"```\n",
"\n",
"**Character tokenization (vocab ~100)**:\n",
"- Pro: Universal coverage, simple implementation, small embedding table\n",
"- Con: Long sequences (high compute), limited semantic units\n",
"- Use case: Morphologically rich languages, robust preprocessing\n",
"\n",
"**BPE tokenization (vocab 10K-50K)**:\n",
"- Pro: Balanced efficiency, handles morphology, good coverage\n",
"- Con: Training complexity, domain-specific vocabularies\n",
"- Use case: Most modern language models (GPT, BERT family)\n",
"\n",
"**Real-world scaling examples**:\n",
"```\n",
"GPT-3/4: ~50K BPE tokens, avg 3-4 chars/token\n",
"BERT: ~30K WordPiece tokens, avg 4-5 chars/token\n",
"T5: ~32K SentencePiece tokens, handles 100+ languages\n",
"ChatGPT: ~100K tokens with extended vocabulary\n",
"```\n",
"\n",
"**Memory implications for embedding tables**:\n",
"```\n",
"EMBEDDING TABLE MEMORY = vocab_size × embed_dim × bytes per parameter\n",
"\n",
"CHARACTER TOKENIZER (vocab 100):\n",
"  100 × 512 = 51,200 params → 204 KB (fp32)    ← tiny embedding table\n",
"\n",
"BPE-SMALL (vocab 1,000):\n",
"  1K × 512 = 512K params → 2.0 MB (fp32)       ← still manageable\n",
"\n",
"BPE-LARGE (vocab 50,000) ← most production models:\n",
"  50K × 512 = 25.6M params → 102.4 MB (fp32)\n",
"                              51.2 MB (fp16)   ← half precision saves 50%\n",
"                              25.6 MB (int8)   ← quantization saves 75%\n",
"\n",
"WORD-LEVEL (vocab 100,000):\n",
"  100K × 512 = 51.2M params → 204.8 MB (fp32)  ← often too large!\n",
"                               102.4 MB (fp16)\n",
"\n",
"Key trade-off:\n",
"  Larger vocab → shorter sequences → less compute\n",
"  BUT larger vocab → more embedding memory → harder to train\n",
"\n",
"Real-World Production Examples:\n",
"┌─────────────┬──────────────┬───────────────┬──────────────────┐\n",
"│ Model       │ Vocab Size   │ Embed Dim     │ Embed Memory     │\n",
"├─────────────┼──────────────┼───────────────┼──────────────────┤\n",
"│ GPT-2       │ 50,257       │ 1,600         │ 321 MB           │\n",
"│ GPT-3       │ 50,257       │ 12,288        │ 2.4 GB           │\n",
"│ BERT        │ 30,522       │ 768           │ 94 MB            │\n",
"│ T5          │ 32,128       │ 512           │ 66 MB            │\n",
"│ LLaMA-7B    │ 32,000       │ 4,096         │ 524 MB           │\n",
"└─────────────┴──────────────┴───────────────┴──────────────────┘\n",
"```"
]
},
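{
"cell_type": "markdown",
"id": "b7a6c5d4",
"metadata": {},
"source": [
"The memory figures above all come from one multiplication: parameters = vocab_size × embed_dim, bytes = parameters × bytes per parameter. A minimal sketch (the `embed_table_mb` helper is just for illustration):\n",
"\n",
"```python\n",
"def embed_table_mb(vocab_size, embed_dim, bytes_per_param=4):\n",
"    # fp32 = 4 bytes/param, fp16 = 2, int8 = 1\n",
"    return vocab_size * embed_dim * bytes_per_param / 1e6\n",
"\n",
"embed_table_mb(50_000, 512)     # 102.4 MB, the BPE-large row above\n",
"embed_table_mb(50_000, 512, 2)  # 51.2 MB in fp16\n",
"```"
]
},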
{
"cell_type": "markdown",
"id": "a403fac4",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 6. Module Integration Test\n",
"\n",
"Let's test our complete tokenization system to ensure everything works together."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e0168d9",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-module",
"locked": true,
"points": 20
}
},
"outputs": [],
"source": [
"def test_module():\n",
"    \"\"\"\n",
"    Comprehensive test of entire tokenization module.\n",
"\n",
"    This final test runs before module summary to ensure:\n",
"    - All unit tests pass\n",
"    - Functions work together correctly\n",
"    - Module is ready for integration with TinyTorch\n",
"    \"\"\"\n",
"    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
"    print(\"=\" * 50)\n",
"\n",
"    # Run all unit tests\n",
"    print(\"Running unit tests...\")\n",
"    test_unit_base_tokenizer()\n",
"    test_unit_char_tokenizer()\n",
"    test_unit_bpe_tokenizer()\n",
"    test_unit_tokenization_utils()\n",
"\n",
"    print(\"\\nRunning integration scenarios...\")\n",
"\n",
"    # Test realistic tokenization workflow\n",
"    print(\"🔬 Integration Test: Complete tokenization pipeline...\")\n",
"\n",
"    # Create training corpus\n",
"    training_corpus = [\n",
"        \"Natural language processing\",\n",
"        \"Machine learning models\",\n",
"        \"Neural networks learn\",\n",
"        \"Tokenization enables text processing\",\n",
"        \"Embeddings represent meaning\"\n",
"    ]\n",
"\n",
"    # Train different tokenizers\n",
"    char_tokenizer = create_tokenizer(\"char\", corpus=training_corpus)\n",
"    bpe_tokenizer = create_tokenizer(\"bpe\", vocab_size=200, corpus=training_corpus)\n",
"\n",
"    # Test on new text\n",
"    test_text = \"Neural language models\"\n",
"\n",
"    # Test character tokenization\n",
"    char_tokens = char_tokenizer.encode(test_text)\n",
"    char_decoded = char_tokenizer.decode(char_tokens)\n",
"    assert char_decoded == test_text, \"Character round-trip failed\"\n",
"\n",
"    # Test BPE tokenization (may not be exact due to subword splits)\n",
"    bpe_tokens = bpe_tokenizer.encode(test_text)\n",
"    bpe_decoded = bpe_tokenizer.decode(bpe_tokens)\n",
"    assert len(bpe_decoded.strip()) > 0, \"BPE decoding failed\"\n",
"\n",
"    # Test dataset processing\n",
"    test_dataset = [\"hello world\", \"tokenize this\", \"neural networks\"]\n",
"    char_dataset = tokenize_dataset(test_dataset, char_tokenizer, max_length=20)\n",
"    bpe_dataset = tokenize_dataset(test_dataset, bpe_tokenizer, max_length=10)\n",
"\n",
"    assert len(char_dataset) == len(test_dataset)\n",
"    assert len(bpe_dataset) == len(test_dataset)\n",
"    assert all(len(seq) <= 20 for seq in char_dataset)\n",
"    assert all(len(seq) <= 10 for seq in bpe_dataset)\n",
"\n",
"    # Test analysis functions\n",
"    char_stats = analyze_tokenization(test_dataset, char_tokenizer)\n",
"    bpe_stats = analyze_tokenization(test_dataset, bpe_tokenizer)\n",
"\n",
"    assert char_stats['vocab_size'] > 0\n",
"    assert bpe_stats['vocab_size'] > 0\n",
"    assert char_stats['compression_ratio'] < bpe_stats['compression_ratio']  # BPE should compress better\n",
"\n",
"    print(\"✅ End-to-end tokenization pipeline works!\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 50)\n",
"    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
"    print(\"Run: tito module complete 10\")\n",
"\n",
"# Call the comprehensive test\n",
"test_module()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2761d570",
"metadata": {},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
"    print(\"🚀 Running Tokenization module...\")\n",
"    test_module()\n",
"    print(\"✅ Module validation complete!\")"
]
},
{
"cell_type": "markdown",
"id": "92d46fdb",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Thinking: Text Processing Foundations\n",
"\n",
"### Question 1: Vocabulary Size vs Memory\n",
"You implemented tokenizers with different vocabulary sizes.\n",
"If you have a BPE tokenizer with vocab_size=50,000 and embed_dim=512:\n",
"- How many parameters are in the embedding table? _____ million\n",
"- If using float32, how much memory does this embedding table require? _____ MB\n",
"\n",
"### Question 2: Sequence Length Trade-offs\n",
"Your character tokenizer produces longer sequences than BPE.\n",
"For the text \"machine learning\" (16 characters):\n",
"- Character tokenizer produces ~16 tokens\n",
"- BPE tokenizer might produce ~3-4 tokens\n",
"If processing batch_size=32 with max_length=512:\n",
"- Character model needs _____ total tokens per batch\n",
"- BPE model needs _____ total tokens per batch\n",
"- Which requires more memory during training? _____\n",
"\n",
"### Question 3: Tokenization Coverage\n",
"Your BPE tokenizer handles unknown words by decomposing into subwords.\n",
"- Why is this better than word-level tokenization for real applications? _____\n",
"- What happens to model performance when many tokens map to `<UNK>`? _____\n",
"- How does vocabulary size affect the number of unknown decompositions? _____"
]
},
{
"cell_type": "markdown",
"id": "0bb8fde5",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🎯 MODULE SUMMARY: Tokenization\n",
"\n",
"Congratulations! You've built a complete tokenization system for converting text to numerical representations!\n",
"\n",
"### Key Accomplishments\n",
"- Built character-level tokenizer with perfect text coverage\n",
"- Implemented BPE tokenizer that learns efficient subword representations\n",
"- Created vocabulary management and encoding/decoding systems\n",
"- Discovered the vocabulary size vs sequence length trade-off\n",
"- All tests pass ✅ (validated by `test_module()`)\n",
"\n",
"### Ready for Next Steps\n",
"Your tokenization implementation enables text processing for language models.\n",
"Export with: `tito module complete 10`\n",
"\n",
"**Next**: Module 11 will add learnable embeddings that convert your token IDs into rich vector representations!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}