mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-01 05:07:31 -05:00
2011 lines
83 KiB
Python
2011 lines
83 KiB
Python
# ---
|
||
# jupyter:
|
||
# jupytext:
|
||
# text_representation:
|
||
# extension: .py
|
||
# format_name: percent
|
||
# format_version: '1.3'
|
||
# jupytext_version: 1.17.1
|
||
# ---
|
||
|
||
# %% [markdown]
|
||
"""
|
||
# Tokenization - Text Processing for Language Models
|
||
|
||
Welcome to the Tokenization module! You'll implement the fundamental text processing systems that convert raw text into numerical sequences that neural networks can understand.
|
||
|
||
## Learning Goals
|
||
- Systems understanding: How tokenization affects model performance, memory usage, and computational efficiency
|
||
- Core implementation skill: Build character and subword tokenizers from scratch
|
||
- Pattern recognition: Understand how tokenization choices impact model capacity and training dynamics
|
||
- Framework connection: See how your implementations match production tokenization systems
|
||
- Performance insight: Learn how tokenization throughput affects training pipeline efficiency
|
||
|
||
## Build -> Use -> Reflect
|
||
1. **Build**: Character tokenizer and basic BPE (Byte Pair Encoding) implementation
|
||
2. **Use**: Process real text and observe how different tokenization strategies affect sequence length
|
||
3. **Reflect**: How does tokenization choice determine model efficiency and language understanding?
|
||
|
||
## What You'll Achieve
|
||
By the end of this module, you'll understand:
|
||
- Deep technical understanding of how text becomes numbers that models can process
|
||
- Practical capability to implement tokenizers that handle real text data efficiently
|
||
- Systems insight into how vocabulary size affects memory usage and model performance
|
||
- Performance consideration of how tokenization speed affects overall training throughput
|
||
- Connection to production systems like GPT's tokenizers and their design trade-offs
|
||
|
||
## Systems Reality Check
|
||
TIP **Production Context**: Modern language models use sophisticated tokenizers (GPT's tiktoken, SentencePiece) - your implementation reveals the algorithmic foundations
|
||
SPEED **Performance Note**: Tokenization can become a bottleneck in training pipelines - efficient string processing is critical for high-throughput training
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "tokenization-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
#| default_exp core.tokenization
|
||
|
||
#| export
|
||
import os
|
||
import sys
|
||
import re
|
||
import json
|
||
from typing import List, Dict, Tuple, Optional, Union
|
||
from collections import Counter, defaultdict
|
||
|
||
# Import our Tensor class - try from package first, then from local module
|
||
try:
|
||
from tinytorch.core.tensor import Tensor
|
||
except ImportError:
|
||
# For development, import from local tensor module
|
||
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
|
||
from tensor_dev import Tensor
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "tokenization-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
print("🔤 TinyTorch Tokenization Module")
|
||
print("Ready to build text processing systems!")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## PACKAGE Where This Code Lives in the Final Package
|
||
|
||
**Learning Side:** You work in `modules/source/11_tokenization/tokenization_dev.py`
|
||
**Building Side:** Code exports to `tinytorch.core.tokenization`
|
||
|
||
```python
|
||
# Final package structure:
|
||
from tinytorch.core.tokenization import CharTokenizer, BPETokenizer
|
||
from tinytorch.core.tensor import Tensor # Foundation
|
||
from tinytorch.core.embeddings import Embedding # Next module
|
||
```
|
||
|
||
**Why this matters:**
|
||
- **Learning:** Focused modules for deep understanding
|
||
- **Production:** Proper organization like Hugging Face's tokenizers
|
||
- **Consistency:** All tokenization tools live together in `core.tokenization`
|
||
- **Integration:** Works seamlessly with embeddings and language models
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## What is Tokenization?
|
||
|
||
### The Problem: Text to Numbers
|
||
Neural networks work with numbers, but we want to process text:
|
||
|
||
```
|
||
"Hello world!" -> [15496, 995, 0] # Numbers the model can understand
|
||
```
|
||
|
||
### 🔤 Visual Tokenization Flow
|
||
```
|
||
Raw Text -> Tokenization Strategy -> Token IDs -> Neural Network Input
|
||
|
||
"Hello world!"
|
||
v
|
||
+-------------------------+
|
||
| Tokenization Process |
|
||
| +---------------------+|
|
||
| | Split into tokens ||
|
||
| +---------------------+|
|
||
| v |
|
||
| +---------------------+|
|
||
| | Map to vocabulary ||
|
||
| +---------------------+|
|
||
+-------------------------+
|
||
v
|
||
[15496, 995, 0]
|
||
v
|
||
Neural Network
|
||
```
|
||
|
||
### 📊 Tokenization Strategy Comparison
|
||
```
|
||
Strategy | Vocab Size | Sequence Length | Use Case
|
||
--------------+------------+-----------------+-----------------
|
||
Character | ~256 | Long | Simple/Debug
|
||
Subword (BPE) | ~50,000 | Medium | Production
|
||
Word-level | ~100,000+ | Short | Specialized
|
||
```
|
||
|
||
### TARGET Systems Trade-offs Visualization
|
||
```
|
||
Memory Usage Impact
|
||
v
|
||
+-------------------------+
|
||
| Vocabulary Size |---> Embedding Table Memory
|
||
| | vocab_size * embed_dim * 4 bytes
|
||
+-------------------------+
|
||
v
|
||
+-------------------------+
|
||
| Sequence Length |---> Attention Memory
|
||
| | O(sequence_length²)
|
||
+-------------------------+
|
||
v
|
||
+-------------------------+
|
||
| Tokenization Speed |---> Training Throughput
|
||
| | tokens/second pipeline
|
||
+-------------------------+
|
||
|
||
Key Insight: Tokenization choices create cascading effects throughout ML systems!
|
||
```
|
||
|
||
### MAGNIFY Character vs Subword vs Word Example
|
||
```
|
||
Input: "The tokenization process"
|
||
|
||
Character-level:
|
||
['T','h','e',' ','t','o','k','e','n','i','z','a','t','i','o','n',' ','p','r','o','c','e','s','s']
|
||
v (24 tokens, vocab ~256)
|
||
|
||
Subword (BPE):
|
||
['The', 'token', 'ization', 'process']
|
||
v (4 tokens, vocab ~50k)
|
||
|
||
Word-level:
|
||
['The', 'tokenization', 'process']
|
||
v (3 tokens, vocab ~100k+)
|
||
|
||
Trade-off: Smaller vocab = Longer sequences = More computation
|
||
Larger vocab = More parameters = More memory
|
||
```
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Character Tokenizer Implementation
|
||
|
||
Let's start with the simplest tokenizer: character-level. Every character becomes a token.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "char-tokenizer", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class CharTokenizer:
|
||
"""
|
||
Character-level tokenizer that converts text to character tokens.
|
||
|
||
Simple but effective for understanding tokenization fundamentals.
|
||
Used in character-level language models and as baseline for comparison.
|
||
"""
|
||
|
||
def __init__(self, special_tokens: Optional[Dict[str, int]] = None):
|
||
"""
|
||
Initialize character tokenizer with optional special tokens.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Initialize character-to-index and index-to-character mappings
|
||
2. Add standard special tokens (PAD, UNK, BOS, EOS)
|
||
3. Build vocabulary from printable ASCII characters
|
||
4. Add any additional special tokens provided
|
||
|
||
DESIGN DECISIONS:
|
||
- Use ASCII characters (32-126) for basic English text
|
||
- Reserve indices 0-3 for special tokens
|
||
- Build bidirectional mappings for efficiency
|
||
|
||
Args:
|
||
special_tokens: Optional dict of special token name -> index
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Initialize mappings
|
||
self.char_to_idx = {}
|
||
self.idx_to_char = {}
|
||
self.vocab_size = 0
|
||
|
||
# Standard special tokens
|
||
default_special = {
|
||
'<PAD>': 0, # Padding token
|
||
'<UNK>': 1, # Unknown token
|
||
'<BOS>': 2, # Beginning of sequence
|
||
'<EOS>': 3 # End of sequence
|
||
}
|
||
|
||
# Merge with user-provided special tokens
|
||
if special_tokens is None:
|
||
special_tokens = {}
|
||
all_special = {**default_special, **special_tokens}
|
||
|
||
# Add special tokens first
|
||
for token, idx in all_special.items():
|
||
self.char_to_idx[token] = idx
|
||
self.idx_to_char[idx] = token
|
||
self.vocab_size = max(self.vocab_size, idx + 1)
|
||
|
||
# Add printable ASCII characters (space to ~)
|
||
next_idx = self.vocab_size
|
||
for i in range(32, 127): # ASCII printable characters
|
||
char = chr(i)
|
||
if char not in self.char_to_idx:
|
||
self.char_to_idx[char] = next_idx
|
||
self.idx_to_char[next_idx] = char
|
||
next_idx += 1
|
||
|
||
self.vocab_size = next_idx
|
||
### END SOLUTION
|
||
|
||
def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
|
||
"""
|
||
Convert text to list of token indices.
|
||
|
||
TODO: Implement text encoding.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Optionally add beginning-of-sequence token
|
||
2. Convert each character to its index
|
||
3. Handle unknown characters with UNK token
|
||
4. Optionally add end-of-sequence token
|
||
5. Return list of integers
|
||
|
||
EXAMPLE:
|
||
tokenizer = CharTokenizer()
|
||
tokens = tokenizer.encode("Hi!")
|
||
# Returns: [2, 72, 105, 33, 3] (BOS, H, i, !, EOS)
|
||
|
||
Args:
|
||
text: Input text string
|
||
add_special_tokens: Whether to add BOS/EOS tokens
|
||
|
||
Returns:
|
||
List of token indices
|
||
"""
|
||
### BEGIN SOLUTION
|
||
tokens = []
|
||
|
||
# Add beginning of sequence token
|
||
if add_special_tokens:
|
||
tokens.append(self.char_to_idx['<BOS>'])
|
||
|
||
# Convert each character
|
||
for char in text:
|
||
if char in self.char_to_idx:
|
||
tokens.append(self.char_to_idx[char])
|
||
else:
|
||
# Unknown character - use UNK token
|
||
tokens.append(self.char_to_idx['<UNK>'])
|
||
|
||
# Add end of sequence token
|
||
if add_special_tokens:
|
||
tokens.append(self.char_to_idx['<EOS>'])
|
||
|
||
return tokens
|
||
### END SOLUTION
|
||
|
||
def decode(self, tokens: List[int], skip_special_tokens: bool = True) -> str:
|
||
"""
|
||
Convert list of token indices back to text.
|
||
|
||
TODO: Implement token decoding.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Convert each token index to its character
|
||
2. Optionally skip special tokens (PAD, UNK, BOS, EOS)
|
||
3. Join characters into string
|
||
4. Return decoded text
|
||
|
||
EXAMPLE:
|
||
tokenizer = CharTokenizer()
|
||
text = tokenizer.decode([2, 72, 105, 33, 3])
|
||
# Returns: "Hi!" (BOS and EOS removed)
|
||
|
||
Args:
|
||
tokens: List of token indices
|
||
skip_special_tokens: Whether to exclude special tokens
|
||
|
||
Returns:
|
||
Decoded text string
|
||
"""
|
||
### BEGIN SOLUTION
|
||
special_tokens = {'<PAD>', '<UNK>', '<BOS>', '<EOS>'}
|
||
chars = []
|
||
|
||
for token_idx in tokens:
|
||
if token_idx in self.idx_to_char:
|
||
char = self.idx_to_char[token_idx]
|
||
# Skip special tokens if requested
|
||
if skip_special_tokens and char in special_tokens:
|
||
continue
|
||
chars.append(char)
|
||
else:
|
||
# Unknown token index
|
||
if not skip_special_tokens:
|
||
chars.append('<UNK>')
|
||
|
||
return ''.join(chars)
|
||
### END SOLUTION
|
||
|
||
def pad_sequences(self, sequences: List[List[int]], max_length: Optional[int] = None) -> List[List[int]]:
|
||
"""
|
||
Pad sequences to uniform length for batch processing.
|
||
|
||
This function is PROVIDED to show padding implementation.
|
||
Essential for creating batches of text data.
|
||
"""
|
||
if not sequences:
|
||
return []
|
||
|
||
if max_length is None:
|
||
max_length = max(len(seq) for seq in sequences)
|
||
|
||
pad_token = self.char_to_idx['<PAD>']
|
||
padded = []
|
||
|
||
for sequence in sequences:
|
||
if len(sequence) >= max_length:
|
||
# Truncate if too long
|
||
padded.append(sequence[:max_length])
|
||
else:
|
||
# Pad if too short
|
||
padding_needed = max_length - len(sequence)
|
||
padded_sequence = sequence + [pad_token] * padding_needed
|
||
padded.append(padded_sequence)
|
||
|
||
return padded
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Test Your Character Tokenizer Implementation
|
||
|
||
Once you implement the CharTokenizer encode and decode methods above, run this cell to test it:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-char-tokenizer-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_char_tokenizer():
|
||
"""Unit test for the character tokenizer."""
|
||
print("🔬 Unit Test: Character Tokenizer...")
|
||
|
||
# Create tokenizer
|
||
tokenizer = CharTokenizer()
|
||
|
||
# Test basic encoding
|
||
text = "Hi!"
|
||
tokens = tokenizer.encode(text, add_special_tokens=False)
|
||
expected_chars = ['H', 'i', '!']
|
||
|
||
assert len(tokens) == len(expected_chars), f"Expected {len(expected_chars)} tokens, got {len(tokens)}"
|
||
|
||
# Test decoding
|
||
decoded = tokenizer.decode(tokens, skip_special_tokens=True)
|
||
assert decoded == text, f"Expected '{text}', got '{decoded}'"
|
||
|
||
# Test with special tokens
|
||
tokens_with_special = tokenizer.encode(text, add_special_tokens=True)
|
||
assert len(tokens_with_special) == len(tokens) + 2, "Should add BOS and EOS tokens"
|
||
assert tokens_with_special[0] == tokenizer.char_to_idx['<BOS>'], "First token should be BOS"
|
||
assert tokens_with_special[-1] == tokenizer.char_to_idx['<EOS>'], "Last token should be EOS"
|
||
|
||
# Test vocabulary size (4 special + 95 ASCII = 99 total)
|
||
assert tokenizer.vocab_size >= 99, "Should have at least 99 tokens (4 special + 95 ASCII)"
|
||
|
||
# Test unknown character handling
|
||
unknown_tokens = tokenizer.encode("🚀", add_special_tokens=False) # Emoji not in ASCII
|
||
assert unknown_tokens[0] == tokenizer.char_to_idx['<UNK>'], "Should use UNK token for unknown chars"
|
||
|
||
# Test padding
|
||
sequences = [[1, 2, 3], [4, 5]]
|
||
padded = tokenizer.pad_sequences(sequences, max_length=4)
|
||
assert len(padded[0]) == 4, "First sequence should be padded to length 4"
|
||
assert len(padded[1]) == 4, "Second sequence should be padded to length 4"
|
||
assert padded[1][-1] == tokenizer.char_to_idx['<PAD>'], "Should use PAD token for padding"
|
||
|
||
print("PASS Character tokenizer tests passed!")
|
||
print(f"PASS Vocabulary size: {tokenizer.vocab_size}")
|
||
print(f"PASS Encode/decode cycle works correctly")
|
||
print(f"PASS Special tokens handled properly")
|
||
print(f"PASS Padding functionality works")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Basic BPE (Byte Pair Encoding) Tokenizer
|
||
|
||
Now let's implement a simplified version of BPE, the subword tokenization algorithm used in GPT and many modern language models.
|
||
|
||
### 🧩 BPE Algorithm Visualization
|
||
```
|
||
Step 1: Start with characters
|
||
"hello" -> ['h', 'e', 'l', 'l', 'o', '</w>']
|
||
|
||
Step 2: Count adjacent pairs
|
||
('l', 'l'): 1 occurrence <- Most frequent pair
|
||
|
||
Step 3: Merge most frequent pair
|
||
['h', 'e', 'l', 'l', 'o', '</w>'] -> ['h', 'e', 'll', 'o', '</w>']
|
||
|
||
Step 4: Repeat until vocabulary target reached
|
||
Next iteration might merge ('e', 'll') -> 'ell' if frequent enough
|
||
|
||
BPE Training Process:
|
||
+-----------------+ +-----------------+ +-----------------+
|
||
| Character Vocab | ---> | Count Pairs | ---> | Merge Most |
|
||
| a, b, c, d... | | (a,b): 5 | | Frequent Pair |
|
||
+-----------------+ | (c,d): 3 | | (a,b) -> ab |
|
||
^ | (e,f): 1 | +-----------------+
|
||
| +-----------------+ |
|
||
| |
|
||
+------------------- Repeat Until Target <---------+
|
||
```
|
||
|
||
### PROGRESS BPE Learning Process Example
|
||
```
|
||
Initial: "hello" = ['h', 'e', 'l', 'l', 'o', '</w>']
|
||
|
||
Iteration 1:
|
||
Pairs: (h,e):1, (e,l):1, (l,l):1, (l,o):1, (o,</w>):1
|
||
Merge: (l,l) -> 'll'
|
||
Result: ['h', 'e', 'll', 'o', '</w>']
|
||
|
||
Iteration 2:
|
||
Pairs: (h,e):1, (e,ll):1, (ll,o):1, (o,</w>):1
|
||
Merge: Most frequent (if any occur >1 time)
|
||
Continue until vocab_size reached...
|
||
|
||
Key Insight: BPE learns common subword patterns from data!
|
||
```
|
||
|
||
### TARGET BPE Benefits
|
||
```
|
||
Traditional Tokenization Problems:
|
||
FAIL "unhappiness" -> UNK (unknown word)
|
||
FAIL "supercalifragilisticexpialidocious" -> UNK
|
||
|
||
BPE Solution:
|
||
PASS "unhappiness" -> ['un', 'happy', 'ness'] (recognizable parts)
|
||
PASS "supercali..." -> ['super', 'cal', 'i', 'frag', ...] (graceful degradation)
|
||
|
||
Memory Efficiency:
|
||
Character: 26 vocab * 512 embed_dim = 13,312 parameters
|
||
BPE-50k: 50,000 vocab * 512 embed_dim = 25,600,000 parameters
|
||
Trade-off: More parameters, shorter sequences (faster attention)
|
||
```
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "bpe-tokenizer", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class BPETokenizer:
|
||
"""
|
||
Basic Byte Pair Encoding (BPE) tokenizer implementation.
|
||
|
||
Learns subword units by iteratively merging the most frequent
|
||
character pairs. This creates a vocabulary that balances
|
||
sequence length and vocabulary size.
|
||
"""
|
||
|
||
def __init__(self, vocab_size: int = 1000):
|
||
"""
|
||
Initialize BPE tokenizer.
|
||
|
||
Args:
|
||
vocab_size: Target vocabulary size (includes special tokens)
|
||
"""
|
||
self.vocab_size = vocab_size
|
||
self.char_to_idx = {}
|
||
self.idx_to_char = {}
|
||
self.merges = [] # List of (pair, new_token) merges learned during training
|
||
self.trained = False
|
||
|
||
# Initialize with special tokens
|
||
special_tokens = ['<PAD>', '<UNK>', '<BOS>', '<EOS>']
|
||
for i, token in enumerate(special_tokens):
|
||
self.char_to_idx[token] = i
|
||
self.idx_to_char[i] = token
|
||
|
||
def _get_word_tokens(self, text: str) -> List[List[str]]:
|
||
"""
|
||
Convert text to list of words, where each word is a list of characters.
|
||
|
||
This function is PROVIDED to handle text preprocessing.
|
||
"""
|
||
# Simple whitespace tokenization, then character splitting
|
||
words = text.lower().split()
|
||
word_tokens = []
|
||
|
||
for word in words:
|
||
# Add end-of-word marker to distinguish word boundaries
|
||
word_chars = list(word) + ['</w>']
|
||
word_tokens.append(word_chars)
|
||
|
||
return word_tokens
|
||
|
||
def _get_pair_counts(self, word_tokens: List[List[str]]) -> Dict[Tuple[str, str], int]:
|
||
"""
|
||
Count frequency of adjacent token pairs.
|
||
|
||
TODO: Implement pair counting for BPE merge selection.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Initialize empty count dictionary
|
||
2. For each word (list of tokens):
|
||
- For each adjacent pair of tokens
|
||
- Count how many times this pair appears
|
||
3. Return dictionary of (token1, token2) -> count
|
||
|
||
EXAMPLE:
|
||
word_tokens = [['h', 'e', 'l', 'l', 'o', '</w>'], ['h', 'i', '</w>']]
|
||
pairs = _get_pair_counts(word_tokens)
|
||
# Returns: {('h', 'e'): 1, ('e', 'l'): 1, ('l', 'l'): 1, ('l', 'o'): 1, ('o', '</w>'): 1, ('h', 'i'): 1, ('i', '</w>'): 1}
|
||
|
||
ALGORITHM INSIGHT:
|
||
This is the core of BPE learning - we find the most frequent adjacent pairs
|
||
to merge. High-frequency pairs indicate common subword patterns in the language.
|
||
|
||
Args:
|
||
word_tokens: List of words, each word is list of tokens
|
||
|
||
Returns:
|
||
Dictionary mapping token pairs to their counts
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Use defaultdict for efficient counting - avoids key existence checks
|
||
pair_counts = defaultdict(int)
|
||
|
||
# Iterate through all words in the corpus
|
||
for word in word_tokens:
|
||
# Count adjacent pairs in this word
|
||
# Range(len(word) - 1) ensures we don't go out of bounds
|
||
for i in range(len(word) - 1):
|
||
pair = (word[i], word[i + 1]) # Create tuple for dictionary key
|
||
pair_counts[pair] += 1 # Increment count for this pair
|
||
|
||
# Convert to regular dict for consistent return type
|
||
return dict(pair_counts)
|
||
### END SOLUTION
|
||
|
||
def _merge_pair(self, word_tokens: List[List[str]], pair: Tuple[str, str], new_token: str) -> List[List[str]]:
|
||
"""
|
||
Replace all occurrences of a token pair with a new merged token.
|
||
|
||
TODO: Implement pair merging for BPE vocabulary building.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Create new list to store updated words
|
||
2. For each word:
|
||
- Scan through tokens looking for the target pair
|
||
- When found, replace pair with new_token
|
||
- Continue until no more pairs in this word
|
||
3. Return updated word tokens
|
||
|
||
EXAMPLE:
|
||
word_tokens = [['h', 'e', 'l', 'l', 'o', '</w>']]
|
||
pair = ('l', 'l')
|
||
new_token = 'll'
|
||
result = _merge_pair(word_tokens, pair, new_token)
|
||
# Returns: [['h', 'e', 'll', 'o', '</w>']]
|
||
|
||
EFFICIENCY NOTE:
|
||
This operation is performed many times during BPE training. Each merge
|
||
creates a more compact representation, trading vocabulary size for sequence length.
|
||
|
||
Args:
|
||
word_tokens: List of words (each word is list of tokens)
|
||
pair: The token pair to merge
|
||
new_token: The new token to replace the pair
|
||
|
||
Returns:
|
||
Updated word tokens with pairs merged
|
||
"""
|
||
### BEGIN SOLUTION
|
||
updated_words = []
|
||
|
||
# Process each word independently
|
||
for word in word_tokens:
|
||
new_word = []
|
||
i = 0
|
||
|
||
# Scan through word looking for target pair
|
||
while i < len(word):
|
||
# Check if current position has the target pair
|
||
# Must check bounds to avoid index errors
|
||
if (i < len(word) - 1 and
|
||
word[i] == pair[0] and
|
||
word[i + 1] == pair[1]):
|
||
# Found the pair - replace with merged token
|
||
new_word.append(new_token)
|
||
i += 2 # Skip both tokens in the pair (important!)
|
||
else:
|
||
# No pair match - keep current token unchanged
|
||
new_word.append(word[i])
|
||
i += 1 # Move to next token
|
||
|
||
# Add processed word to results
|
||
updated_words.append(new_word)
|
||
|
||
return updated_words
|
||
### END SOLUTION
|
||
|
||
def train(self, texts: List[str]) -> None:
|
||
"""
|
||
Train BPE tokenizer on a corpus of texts.
|
||
|
||
This function is PROVIDED to show the complete BPE training algorithm.
|
||
Students implement the helper functions above.
|
||
"""
|
||
print(f"Training BPE tokenizer (target vocab size: {self.vocab_size})...")
|
||
|
||
# Step 1: Convert texts to word tokens (character level initially)
|
||
all_word_tokens = []
|
||
for text in texts:
|
||
word_tokens = self._get_word_tokens(text)
|
||
all_word_tokens.extend(word_tokens)
|
||
|
||
# Step 2: Build initial character vocabulary
|
||
all_chars = set()
|
||
for word in all_word_tokens:
|
||
all_chars.update(word)
|
||
|
||
# Add characters to vocabulary (after special tokens)
|
||
next_idx = len(self.char_to_idx)
|
||
for char in sorted(all_chars):
|
||
if char not in self.char_to_idx:
|
||
self.char_to_idx[char] = next_idx
|
||
self.idx_to_char[next_idx] = char
|
||
next_idx += 1
|
||
|
||
# Step 3: Iteratively merge most frequent pairs
|
||
current_word_tokens = all_word_tokens
|
||
|
||
while len(self.char_to_idx) < self.vocab_size:
|
||
# Count all adjacent pairs
|
||
pair_counts = self._get_pair_counts(current_word_tokens)
|
||
|
||
if not pair_counts:
|
||
print("No more pairs to merge!")
|
||
break
|
||
|
||
# Find most frequent pair
|
||
most_frequent_pair = max(pair_counts, key=pair_counts.get)
|
||
most_frequent_count = pair_counts[most_frequent_pair]
|
||
|
||
if most_frequent_count < 2:
|
||
print("No pairs occur more than once - stopping merge process")
|
||
break
|
||
|
||
# Create new merged token
|
||
new_token = most_frequent_pair[0] + most_frequent_pair[1]
|
||
|
||
# Add to vocabulary
|
||
self.char_to_idx[new_token] = len(self.char_to_idx)
|
||
self.idx_to_char[len(self.idx_to_char)] = new_token
|
||
|
||
# Record this merge for later encoding
|
||
self.merges.append((most_frequent_pair, new_token))
|
||
|
||
# Apply merge to all words
|
||
current_word_tokens = self._merge_pair(current_word_tokens, most_frequent_pair, new_token)
|
||
|
||
if len(self.char_to_idx) % 100 == 0:
|
||
print(f" Vocabulary size: {len(self.char_to_idx)}, Last merge: {most_frequent_pair} -> '{new_token}' (count: {most_frequent_count})")
|
||
|
||
self.trained = True
|
||
print(f"Training complete! Final vocabulary size: {len(self.char_to_idx)}")
|
||
print(f"Learned {len(self.merges)} merges")
|
||
|
||
def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
|
||
"""
|
||
Encode text using trained BPE tokenizer.
|
||
|
||
This function is PROVIDED to show BPE encoding process.
|
||
"""
|
||
if not self.trained:
|
||
raise ValueError("Tokenizer must be trained before encoding!")
|
||
|
||
# Convert to word tokens (character level initially)
|
||
word_tokens = self._get_word_tokens(text)
|
||
|
||
# Apply all learned merges in order
|
||
for pair, new_token in self.merges:
|
||
word_tokens = self._merge_pair(word_tokens, pair, new_token)
|
||
|
||
# Convert tokens to indices
|
||
tokens = []
|
||
if add_special_tokens:
|
||
tokens.append(self.char_to_idx['<BOS>'])
|
||
|
||
for word in word_tokens:
|
||
for token in word:
|
||
if token in self.char_to_idx:
|
||
tokens.append(self.char_to_idx[token])
|
||
else:
|
||
tokens.append(self.char_to_idx['<UNK>'])
|
||
|
||
if add_special_tokens:
|
||
tokens.append(self.char_to_idx['<EOS>'])
|
||
|
||
return tokens
|
||
|
||
def decode(self, tokens: List[int], skip_special_tokens: bool = True) -> str:
|
||
"""
|
||
Decode tokens back to text.
|
||
|
||
This function is PROVIDED to show BPE decoding process.
|
||
"""
|
||
special_tokens = {'<PAD>', '<UNK>', '<BOS>', '<EOS>'}
|
||
token_strings = []
|
||
|
||
for token_idx in tokens:
|
||
if token_idx in self.idx_to_char:
|
||
token_str = self.idx_to_char[token_idx]
|
||
if skip_special_tokens and token_str in special_tokens:
|
||
continue
|
||
token_strings.append(token_str)
|
||
|
||
# Join tokens and handle word boundaries
|
||
result = ''.join(token_strings)
|
||
result = result.replace('</w>', ' ') # Replace end-of-word markers with spaces
|
||
|
||
return result.strip()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Test Your BPE Implementation
|
||
|
||
Once you implement the BPE helper methods above, run this cell to test it:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-bpe-tokenizer-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_bpe_tokenizer():
|
||
"""Unit test for the BPE tokenizer."""
|
||
print("🔬 Unit Test: BPE Tokenizer...")
|
||
|
||
# Create BPE tokenizer
|
||
bpe = BPETokenizer(vocab_size=50) # Small vocab for testing
|
||
|
||
# Test training data
|
||
training_texts = [
|
||
"hello world hello",
|
||
"world hello world",
|
||
"hello hello world world"
|
||
]
|
||
|
||
# Test training
|
||
bpe.train(training_texts)
|
||
|
||
# Verify training completed
|
||
assert bpe.trained, "Tokenizer should be marked as trained"
|
||
assert len(bpe.char_to_idx) >= 10, "Should have reasonable vocabulary size"
|
||
assert len(bpe.merges) > 0, "Should have learned some merges"
|
||
|
||
# Test encoding
|
||
test_text = "hello world"
|
||
tokens = bpe.encode(test_text, add_special_tokens=False)
|
||
assert len(tokens) > 0, "Should produce some tokens"
|
||
assert all(isinstance(t, int) for t in tokens), "All tokens should be integers"
|
||
|
||
# Test decoding
|
||
decoded = bpe.decode(tokens, skip_special_tokens=True)
|
||
# Should be similar to original (might have different spacing due to </w> markers)
|
||
assert "hello" in decoded.lower(), "Should contain 'hello'"
|
||
assert "world" in decoded.lower(), "Should contain 'world'"
|
||
|
||
# Test with special tokens
|
||
tokens_with_special = bpe.encode(test_text, add_special_tokens=True)
|
||
assert len(tokens_with_special) == len(tokens) + 2, "Should add BOS and EOS"
|
||
assert tokens_with_special[0] == bpe.char_to_idx['<BOS>'], "First should be BOS"
|
||
assert tokens_with_special[-1] == bpe.char_to_idx['<EOS>'], "Last should be EOS"
|
||
|
||
# Test helper functions
|
||
word_tokens = [['h', 'e', 'l', 'l', 'o']]
|
||
pair_counts = bpe._get_pair_counts(word_tokens)
|
||
assert ('l', 'l') in pair_counts, "Should find the 'll' pair"
|
||
assert pair_counts[('l', 'l')] == 1, "Should count 'll' pair once"
|
||
|
||
# Test merge function
|
||
merged = bpe._merge_pair(word_tokens, ('l', 'l'), 'll')
|
||
assert 'll' in merged[0], "Should contain merged token 'll'"
|
||
# After merging 'll' from ['h', 'e', 'l', 'l', 'o'], we get ['h', 'e', 'll', 'o']
|
||
# Count individual 'l' characters - should be 0 since they were merged into 'll'
|
||
individual_l_count = sum(1 for token in merged[0] if token == 'l')
|
||
assert individual_l_count == 0, f"Should have no individual 'l' tokens after merge, got {individual_l_count}"
|
||
|
||
print("PASS BPE tokenizer tests passed!")
|
||
print(f"PASS Trained vocabulary size: {len(bpe.char_to_idx)}")
|
||
print(f"PASS Learned {len(bpe.merges)} merges")
|
||
print(f"PASS Encode/decode cycle works")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## TARGET ML Systems: Performance Analysis & Tokenization Efficiency
|
||
|
||
Now let's develop systems engineering skills by analyzing tokenization performance and understanding how tokenization choices affect downstream ML system efficiency.
|
||
|
||
### **Learning Outcome**: *"I understand how tokenization affects model memory, training speed, and language understanding"*
|
||
|
||
### MAGNIFY Systems Insights Functions
|
||
|
||
The next few implementations include **executable analysis functions** that help you discover key insights about tokenization performance and memory scaling. These aren't just code - they're interactive learning tools that reveal how tokenization choices affect real ML systems.
|
||
|
||
### 📊 What We'll Measure
|
||
```
|
||
Performance Metrics:
|
||
+-----------------+ +-----------------+ +-----------------+
|
||
| Tokenization | | Memory Usage | | Scaling |
|
||
| Speed | | Analysis | | Behavior |
|
||
| | | | | |
|
||
| • tokens/sec | | • vocab memory | | • time complexity|
|
||
| • chars/sec | | • sequence mem | | • space complexity|
|
||
| • compression | | • total footprint| | • bottleneck ID |
|
||
+-----------------+ +-----------------+ +-----------------+
|
||
```
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "tokenization-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
import time
|
||
|
||
class TokenizationProfiler:
|
||
"""
|
||
Performance profiling toolkit for tokenization systems.
|
||
|
||
Helps ML engineers understand computational costs and optimize
|
||
text processing pipelines for production deployment.
|
||
"""
|
||
|
||
def __init__(self):
|
||
self.results = {}
|
||
|
||
def measure_tokenization_speed(self, tokenizer, texts: List[str], tokenizer_name: str) -> Dict:
|
||
"""
|
||
Measure tokenization throughput and efficiency.
|
||
|
||
TODO: Implement tokenization speed measurement.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Record start time
|
||
2. Tokenize all texts
|
||
3. Record end time and calculate metrics
|
||
4. Calculate tokens per second, characters per second
|
||
5. Return comprehensive performance metrics
|
||
|
||
METRICS TO CALCULATE:
|
||
- Total time (seconds)
|
||
- Texts per second
|
||
- Characters per second
|
||
- Average tokens per text
|
||
- Average sequence length
|
||
|
||
Args:
|
||
tokenizer: Tokenizer instance (CharTokenizer or BPETokenizer)
|
||
texts: List of texts to tokenize
|
||
tokenizer_name: Name for reporting
|
||
|
||
Returns:
|
||
Dictionary with performance metrics
|
||
"""
|
||
### BEGIN SOLUTION
|
||
start_time = time.time()
|
||
|
||
# Tokenize all texts
|
||
all_tokens = []
|
||
total_chars = 0
|
||
|
||
for text in texts:
|
||
tokens = tokenizer.encode(text, add_special_tokens=False)
|
||
all_tokens.append(tokens)
|
||
total_chars += len(text)
|
||
|
||
end_time = time.time()
|
||
|
||
# Calculate metrics
|
||
total_time = end_time - start_time
|
||
total_texts = len(texts)
|
||
total_tokens = sum(len(tokens) for tokens in all_tokens)
|
||
|
||
metrics = {
|
||
'tokenizer_name': tokenizer_name,
|
||
'total_time_sec': total_time,
|
||
'total_texts': total_texts,
|
||
'total_characters': total_chars,
|
||
'total_tokens': total_tokens,
|
||
'texts_per_second': total_texts / total_time if total_time > 0 else 0,
|
||
'chars_per_second': total_chars / total_time if total_time > 0 else 0,
|
||
'tokens_per_second': total_tokens / total_time if total_time > 0 else 0,
|
||
'avg_tokens_per_text': total_tokens / total_texts if total_texts > 0 else 0,
|
||
'avg_sequence_length': total_tokens / total_texts if total_texts > 0 else 0,
|
||
'compression_ratio': total_chars / total_tokens if total_tokens > 0 else 0
|
||
}
|
||
|
||
return metrics
|
||
### END SOLUTION
|
||
|
||
def compare_tokenizers(self, texts: List[str]) -> Dict:
|
||
"""
|
||
Compare performance of different tokenization strategies.
|
||
|
||
This function is PROVIDED to show comprehensive comparison.
|
||
"""
|
||
print("MAGNIFY TOKENIZER COMPARISON")
|
||
print("=" * 50)
|
||
|
||
# Create tokenizers
|
||
char_tokenizer = CharTokenizer()
|
||
|
||
# Train small BPE tokenizer
|
||
bpe_tokenizer = BPETokenizer(vocab_size=200)
|
||
bpe_tokenizer.train(texts[:10]) # Train on subset for speed
|
||
|
||
tokenizers = [
|
||
(char_tokenizer, "Character"),
|
||
(bpe_tokenizer, "BPE")
|
||
]
|
||
|
||
results = {}
|
||
|
||
# Test each tokenizer
|
||
for tokenizer, name in tokenizers:
|
||
metrics = self.measure_tokenization_speed(tokenizer, texts, name)
|
||
results[name] = metrics
|
||
|
||
print(f"\n📊 {name} Tokenizer:")
|
||
print(f" Speed: {metrics['texts_per_second']:.1f} texts/sec")
|
||
print(f" Throughput: {metrics['chars_per_second']:.0f} chars/sec")
|
||
print(f" Avg sequence length: {metrics['avg_sequence_length']:.1f} tokens")
|
||
print(f" Compression ratio: {metrics['compression_ratio']:.2f} chars/token")
|
||
print(f" Vocabulary size: {tokenizer.vocab_size}")
|
||
|
||
return results
|
||
|
||
def analyze_memory_scaling(self, tokenizer, text_lengths: List[int]) -> Dict:
|
||
"""
|
||
Analyze how tokenization memory scales with text length.
|
||
|
||
This function is PROVIDED to demonstrate scaling analysis.
|
||
"""
|
||
print(f"\nMAGNIFY MEMORY SCALING ANALYSIS")
|
||
print("=" * 40)
|
||
|
||
scaling_results = []
|
||
|
||
for length in text_lengths:
|
||
# Create text of specified length
|
||
test_text = "Hello world! " * (length // 13 + 1)
|
||
test_text = test_text[:length]
|
||
|
||
# Measure tokenization
|
||
start_time = time.time()
|
||
tokens = tokenizer.encode(test_text, add_special_tokens=False)
|
||
end_time = time.time()
|
||
|
||
# Calculate metrics
|
||
time_taken = end_time - start_time
|
||
memory_chars = len(test_text) * 4 # Approximate char memory (bytes)
|
||
memory_tokens = len(tokens) * 4 # Approximate token memory (bytes)
|
||
|
||
result = {
|
||
'text_length': length,
|
||
'num_tokens': len(tokens),
|
||
'time_ms': time_taken * 1000,
|
||
'memory_chars_bytes': memory_chars,
|
||
'memory_tokens_bytes': memory_tokens,
|
||
'total_memory_bytes': memory_chars + memory_tokens
|
||
}
|
||
|
||
scaling_results.append(result)
|
||
print(f" {length:>6} chars -> {len(tokens):>4} tokens ({time_taken*1000:.2f}ms)")
|
||
|
||
# Analyze scaling pattern
|
||
if len(scaling_results) >= 2:
|
||
small = scaling_results[0]
|
||
large = scaling_results[-1]
|
||
|
||
length_ratio = large['text_length'] / small['text_length']
|
||
time_ratio = large['time_ms'] / small['time_ms']
|
||
memory_ratio = large['total_memory_bytes'] / small['total_memory_bytes']
|
||
|
||
print(f"\nPROGRESS Scaling Analysis:")
|
||
print(f" Text length increased {length_ratio:.1f}x")
|
||
print(f" Time increased {time_ratio:.1f}x")
|
||
print(f" Memory increased {memory_ratio:.1f}x")
|
||
print(f" Scaling pattern: {'Linear' if abs(time_ratio - length_ratio) < 1 else 'Non-linear'}")
|
||
|
||
return scaling_results
|
||
|
||
def analyze_tokenization_impact():
|
||
"""
|
||
Comprehensive analysis of how tokenization affects downstream ML systems.
|
||
|
||
This function is PROVIDED to show systems-level thinking.
|
||
"""
|
||
print("TARGET TOKENIZATION IMPACT ON ML SYSTEMS")
|
||
print("=" * 60)
|
||
|
||
# Sample texts for analysis
|
||
sample_texts = [
|
||
"The quick brown fox jumps over the lazy dog.",
|
||
"Machine learning models process tokenized text efficiently.",
|
||
"Byte pair encoding balances vocabulary size and sequence length.",
|
||
"Transformer models use attention mechanisms for sequence processing.",
|
||
"Production systems require fast tokenization for real-time inference."
|
||
]
|
||
|
||
# Create tokenizers
|
||
char_tokenizer = CharTokenizer()
|
||
bpe_tokenizer = BPETokenizer(vocab_size=100)
|
||
bpe_tokenizer.train(sample_texts * 3) # Train with more data
|
||
|
||
print("\n📊 TOKENIZATION COMPARISON:")
|
||
print(f"{'Strategy':<12} {'Vocab Size':<10} {'Avg Tokens':<10} {'Memory Impact':<15}")
|
||
print("-" * 60)
|
||
|
||
for tokenizer, name in [(char_tokenizer, "Character"), (bpe_tokenizer, "BPE")]:
|
||
# Analyze average sequence length
|
||
total_tokens = 0
|
||
for text in sample_texts:
|
||
tokens = tokenizer.encode(text, add_special_tokens=False)
|
||
total_tokens += len(tokens)
|
||
|
||
avg_tokens = total_tokens / len(sample_texts)
|
||
|
||
# Calculate memory impact
|
||
# Embedding table: vocab_size * embedding_dim * 4 bytes (float32)
|
||
embedding_dim = 256 # Typical small model
|
||
embedding_memory_mb = (tokenizer.vocab_size * embedding_dim * 4) / (1024 * 1024)
|
||
|
||
# Sequence memory: batch_size * seq_length * hidden_dim * 4 bytes
|
||
batch_size = 32
|
||
hidden_dim = 256
|
||
sequence_memory_mb = (batch_size * avg_tokens * hidden_dim * 4) / (1024 * 1024)
|
||
|
||
total_memory = embedding_memory_mb + sequence_memory_mb
|
||
|
||
print(f"{name:<12} {tokenizer.vocab_size:<10} {avg_tokens:<10.1f} {total_memory:<15.1f}MB")
|
||
|
||
print(f"\nTIP KEY INSIGHTS:")
|
||
print(f" 🔤 Character tokenizer: Small vocabulary, long sequences")
|
||
print(f" 🧩 BPE tokenizer: Medium vocabulary, shorter sequences")
|
||
print(f" PROGRESS Memory scaling: O(vocab_size * embed_dim + seq_len * batch_size)")
|
||
print(f" SPEED Attention complexity: O(seq_len²) - shorter sequences = faster attention")
|
||
print(f" 🏭 Production trade-off: Vocabulary size vs sequence length vs compute")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Test: Tokenization Performance Analysis
|
||
|
||
Let's test our tokenization profiler with realistic performance scenarios.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "test-tokenization-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
def test_tokenization_profiler():
|
||
"""Test tokenization profiler with various scenarios."""
|
||
print("🔬 Unit Test: Tokenization Performance Profiler...")
|
||
|
||
profiler = TokenizationProfiler()
|
||
|
||
# Create test data
|
||
test_texts = [
|
||
"Hello world!",
|
||
"This is a test sentence.",
|
||
"Tokenization speed matters for ML systems."
|
||
]
|
||
|
||
# Test with character tokenizer
|
||
char_tokenizer = CharTokenizer()
|
||
metrics = profiler.measure_tokenization_speed(char_tokenizer, test_texts, "Character")
|
||
|
||
# Verify metrics structure
|
||
expected_keys = ['tokenizer_name', 'total_time_sec', 'total_texts', 'total_characters',
|
||
'total_tokens', 'texts_per_second', 'chars_per_second', 'tokens_per_second',
|
||
'avg_tokens_per_text', 'avg_sequence_length', 'compression_ratio']
|
||
|
||
for key in expected_keys:
|
||
assert key in metrics, f"Missing metric: {key}"
|
||
assert isinstance(metrics[key], (int, float, str)), f"Invalid metric type for {key}"
|
||
|
||
# Verify reasonable values
|
||
assert metrics['total_texts'] == len(test_texts), "Should count texts correctly"
|
||
assert metrics['total_characters'] > 0, "Should count characters"
|
||
assert metrics['total_tokens'] > 0, "Should count tokens"
|
||
assert metrics['texts_per_second'] > 0, "Should measure throughput"
|
||
|
||
print("PASS Basic profiling functionality test passed")
|
||
|
||
# Test comparison
|
||
comparison_results = profiler.compare_tokenizers(test_texts)
|
||
assert isinstance(comparison_results, dict), "Should return comparison results"
|
||
assert len(comparison_results) >= 1, "Should test at least one tokenizer"
|
||
|
||
print("PASS Tokenizer comparison test passed")
|
||
|
||
# Test scaling analysis
|
||
scaling_results = profiler.analyze_memory_scaling(char_tokenizer, [50, 100])
|
||
assert isinstance(scaling_results, list), "Should return scaling results"
|
||
assert len(scaling_results) == 2, "Should test both sizes"
|
||
|
||
for result in scaling_results:
|
||
assert 'text_length' in result, "Should include text length"
|
||
assert 'num_tokens' in result, "Should include token count"
|
||
assert result['num_tokens'] > 0, "Should produce tokens"
|
||
|
||
print("PASS Scaling analysis test passed")
|
||
print("TARGET Tokenization Profiler: All tests passed!")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 📊 Systems Analysis: Tokenization Impact on Model Architecture
|
||
|
||
Let's analyze how different tokenization strategies affect real ML system design choices.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "tokenization-systems-analysis", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
def analyze_tokenization_systems_impact():
|
||
"""
|
||
Analyze how tokenization affects ML system design and performance.
|
||
|
||
This analysis helps students understand the connection between
|
||
tokenization choices and downstream system architecture decisions.
|
||
"""
|
||
print("🏗️ TOKENIZATION SYSTEMS IMPACT ANALYSIS")
|
||
print("=" * 60)
|
||
|
||
# Example model configurations
|
||
model_configs = {
|
||
'Small Model': {'embed_dim': 128, 'hidden_dim': 256, 'batch_size': 16},
|
||
'Medium Model': {'embed_dim': 256, 'hidden_dim': 512, 'batch_size': 32},
|
||
'Large Model': {'embed_dim': 512, 'hidden_dim': 1024, 'batch_size': 64}
|
||
}
|
||
|
||
# Sample text for analysis
|
||
sample_text = "The transformer architecture revolutionized natural language processing through self-attention mechanisms."
|
||
|
||
# Create tokenizers
|
||
char_tokenizer = CharTokenizer()
|
||
bpe_tokenizer = BPETokenizer(vocab_size=500)
|
||
bpe_tokenizer.train([sample_text] * 10)
|
||
|
||
tokenizers = [
|
||
(char_tokenizer, "Character"),
|
||
(bpe_tokenizer, "BPE-500")
|
||
]
|
||
|
||
print(f"\n📋 ANALYSIS FOR TEXT: '{sample_text[:50]}...'")
|
||
print(f" Original length: {len(sample_text)} characters")
|
||
|
||
for tokenizer, tok_name in tokenizers:
|
||
tokens = tokenizer.encode(sample_text, add_special_tokens=False)
|
||
|
||
print(f"\n🔤 {tok_name} Tokenization:")
|
||
print(f" Vocabulary size: {tokenizer.vocab_size:,}")
|
||
print(f" Sequence length: {len(tokens)} tokens")
|
||
print(f" Compression ratio: {len(sample_text)/len(tokens):.2f} chars/token")
|
||
|
||
print(f"\n💾 Memory Analysis:")
|
||
for model_name, config in model_configs.items():
|
||
# Embedding table memory
|
||
embed_memory = tokenizer.vocab_size * config['embed_dim'] * 4 / (1024**2) # MB
|
||
|
||
# Sequence processing memory (attention)
|
||
seq_memory = config['batch_size'] * len(tokens) * config['hidden_dim'] * 4 / (1024**2) # MB
|
||
|
||
# Attention memory (O(N²))
|
||
attention_memory = config['batch_size'] * len(tokens)**2 * 4 / (1024**2) # MB
|
||
|
||
total_memory = embed_memory + seq_memory + attention_memory
|
||
|
||
print(f" {model_name}: {total_memory:.1f}MB total")
|
||
print(f" Embedding: {embed_memory:.1f}MB, Sequence: {seq_memory:.1f}MB, Attention: {attention_memory:.1f}MB")
|
||
|
||
print(f"\nTARGET KEY SYSTEM DESIGN INSIGHTS:")
|
||
print(f" 1. Vocabulary Size Trade-offs:")
|
||
print(f" - Larger vocab = more parameters = more memory")
|
||
print(f" - Smaller vocab = longer sequences = more compute")
|
||
print(f" 2. Sequence Length Impact:")
|
||
print(f" - Attention complexity: O(sequence_length²)")
|
||
print(f" - Memory scales quadratically with sequence length")
|
||
print(f" 3. Production Considerations:")
|
||
print(f" - Character tokenization: Simple but inefficient")
|
||
print(f" - BPE tokenization: Balanced approach used in GPT/BERT")
|
||
print(f" - Vocabulary size affects model download size")
|
||
print(f" 4. Hardware Implications:")
|
||
print(f" - GPU memory limits sequence length")
|
||
print(f" - Batch size limited by attention memory")
|
||
|
||
# Analysis function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## MAGNIFY Interactive Systems Insights
|
||
|
||
Let's build intuition about tokenization through hands-on analysis. These functions reveal how tokenization choices cascade through ML systems.
|
||
"""
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Ensure your tokenizers are complete before running
|
||
|
||
# THINK PREDICTION: Which tokenizer will use more memory - character or BPE? Why?
|
||
# Your guess: _______
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT #1: Vocabulary Size vs Memory Trade-offs
|
||
def analyze_tokenization_memory_impact():
|
||
"""Analyze how vocabulary size affects model memory usage."""
|
||
try:
|
||
print("MAGNIFY TOKENIZATION MEMORY IMPACT ANALYSIS")
|
||
print("=" * 50)
|
||
|
||
# Create tokenizers with different vocabulary sizes
|
||
char_tokenizer = CharTokenizer()
|
||
|
||
# Train small BPE for comparison
|
||
bpe_small = BPETokenizer(vocab_size=500)
|
||
bpe_large = BPETokenizer(vocab_size=2000)
|
||
|
||
sample_texts = [
|
||
"The quick brown fox jumps over the lazy dog",
|
||
"Machine learning models process tokenized text",
|
||
"Transformers use attention mechanisms effectively"
|
||
] * 3 # Repeat for training data
|
||
|
||
bpe_small.train(sample_texts)
|
||
bpe_large.train(sample_texts)
|
||
|
||
tokenizers = [
|
||
(char_tokenizer, "Character"),
|
||
(bpe_small, "BPE-500"),
|
||
(bpe_large, "BPE-2000")
|
||
]
|
||
|
||
test_text = "The transformer architecture revolutionized natural language processing."
|
||
embed_dim = 256 # Typical embedding dimension
|
||
|
||
print(f"\nAnalyzing text: '{test_text}'")
|
||
print(f"Text length: {len(test_text)} characters")
|
||
|
||
for tokenizer, name in tokenizers:
|
||
tokens = tokenizer.encode(test_text, add_special_tokens=False)
|
||
|
||
# Calculate memory requirements
|
||
vocab_size = tokenizer.vocab_size
|
||
seq_length = len(tokens)
|
||
|
||
# Embedding table memory (parameters)
|
||
embedding_memory_mb = (vocab_size * embed_dim * 4) / (1024 * 1024)
|
||
|
||
# Sequence memory for single sample (activations)
|
||
sequence_memory_kb = (seq_length * embed_dim * 4) / 1024
|
||
|
||
# Attention memory O(N²) for single sample
|
||
attention_memory_kb = (seq_length * seq_length * 4) / 1024
|
||
|
||
print(f"\n📊 {name} Tokenizer:")
|
||
print(f" Vocabulary size: {vocab_size:,}")
|
||
print(f" Sequence length: {seq_length} tokens")
|
||
print(f" Compression ratio: {len(test_text)/seq_length:.2f} chars/token")
|
||
print(f" Embedding table: {embedding_memory_mb:.1f} MB")
|
||
print(f" Sequence memory: {sequence_memory_kb:.1f} KB")
|
||
print(f" Attention memory: {attention_memory_kb:.1f} KB")
|
||
|
||
total_per_sample = sequence_memory_kb + attention_memory_kb
|
||
print(f" Total per sample: {total_per_sample:.1f} KB")
|
||
|
||
print(f"\nTIP KEY INSIGHTS:")
|
||
print(f" • Vocabulary size directly affects model parameters")
|
||
print(f" • Sequence length affects computation (attention is O(N²))")
|
||
print(f" • Character tokenization: Small vocab, long sequences")
|
||
print(f" • BPE tokenization: Large vocab, shorter sequences")
|
||
print(f" • Production trade-off: Parameters vs computation")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in memory analysis: {e}")
|
||
print("Make sure both tokenizers are implemented correctly")
|
||
|
||
# Run the analysis
|
||
analyze_tokenization_memory_impact()
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Ensure BPE merge functions are working
|
||
|
||
# THINK PREDICTION: How does tokenization speed scale with text length?
|
||
# Linear? Quadratic? Your guess: _______
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT #2: Tokenization Speed Scaling Analysis
|
||
def analyze_tokenization_speed_scaling():
|
||
"""Measure how tokenization performance scales with input size."""
|
||
try:
|
||
print("\nMAGNIFY TOKENIZATION SPEED SCALING ANALYSIS")
|
||
print("=" * 50)
|
||
|
||
char_tokenizer = CharTokenizer()
|
||
text_lengths = [100, 500, 1000, 2000, 5000]
|
||
|
||
print(f"Testing scaling with text lengths: {text_lengths}")
|
||
|
||
char_times = []
|
||
|
||
for length in text_lengths:
|
||
# Create text of specified length
|
||
test_text = "The quick brown fox jumps over the lazy dog. " * (length // 44 + 1)
|
||
test_text = test_text[:length]
|
||
|
||
# Measure character tokenization time
|
||
start_time = time.time()
|
||
char_tokens = char_tokenizer.encode(test_text, add_special_tokens=False)
|
||
char_time = time.time() - start_time
|
||
|
||
char_times.append(char_time)
|
||
|
||
print(f" {length:>5} chars -> {len(char_tokens):>5} tokens in {char_time*1000:.2f}ms")
|
||
|
||
# Analyze scaling pattern
|
||
if len(char_times) >= 2:
|
||
print(f"\nPROGRESS Scaling Analysis:")
|
||
for i in range(1, len(text_lengths)):
|
||
length_ratio = text_lengths[i] / text_lengths[0]
|
||
time_ratio = char_times[i] / char_times[0] if char_times[0] > 0 else 0
|
||
|
||
print(f" {text_lengths[i]:>5} chars: {length_ratio:.1f}x length -> {time_ratio:.1f}x time")
|
||
|
||
# Calculate approximate complexity
|
||
avg_scaling = sum(char_times[i]/char_times[0] / (text_lengths[i]/text_lengths[0])
|
||
for i in range(1, len(text_lengths)) if char_times[0] > 0) / (len(text_lengths) - 1)
|
||
|
||
print(f"\nTARGET SCALING INSIGHTS:")
|
||
print(f" • Character tokenization: ~O(N) time complexity")
|
||
print(f" • Average scaling factor: {avg_scaling:.2f} (1.0 = perfect linear)")
|
||
if avg_scaling < 1.2:
|
||
print(f" • Performance: Excellent linear scaling")
|
||
elif avg_scaling < 2.0:
|
||
print(f" • Performance: Good scaling with minor overhead")
|
||
else:
|
||
print(f" • Performance: Scaling overhead detected")
|
||
|
||
print(f" • Memory usage: O(N) with input length")
|
||
print(f" • Production implication: Tokenization speed rarely bottlenecks training")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in scaling analysis: {e}")
|
||
print("Make sure character tokenizer is implemented correctly")
|
||
|
||
# Run the scaling analysis
|
||
analyze_tokenization_speed_scaling()
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: All tokenization systems working
|
||
|
||
# THINK PREDICTION: For a 7B parameter model, what percentage of memory is vocabulary?
|
||
# Your estimate: _______%
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT #3: Production Model Memory Breakdown
|
||
def analyze_production_memory_breakdown():
|
||
"""Analyze vocabulary memory in production-scale language models."""
|
||
try:
|
||
print("\nMAGNIFY PRODUCTION MODEL MEMORY BREAKDOWN")
|
||
print("=" * 50)
|
||
|
||
# Model configurations based on real systems
|
||
models = {
|
||
'GPT-Small': {'params': 117_000_000, 'vocab': 50257, 'embed_dim': 768},
|
||
'GPT-Medium': {'params': 345_000_000, 'vocab': 50257, 'embed_dim': 1024},
|
||
'GPT-Large': {'params': 774_000_000, 'vocab': 50257, 'embed_dim': 1280},
|
||
'LLaMA-7B': {'params': 7_000_000_000, 'vocab': 32000, 'embed_dim': 4096}
|
||
}
|
||
|
||
print(f"{'Model':<12} {'Total Params':<12} {'Vocab Params':<12} {'Vocab %':<8} {'Vocab Memory'}")
|
||
print("-" * 70)
|
||
|
||
for model_name, config in models.items():
|
||
total_params = config['params']
|
||
vocab_size = config['vocab']
|
||
embed_dim = config['embed_dim']
|
||
|
||
# Vocabulary parameters (embedding table)
|
||
vocab_params = vocab_size * embed_dim
|
||
vocab_percentage = (vocab_params / total_params) * 100
|
||
|
||
# Memory in MB (float32)
|
||
vocab_memory_mb = (vocab_params * 4) / (1024 * 1024)
|
||
|
||
print(f"{model_name:<12} {total_params/1e6:>8.0f}M {vocab_params/1e6:>8.1f}M {vocab_percentage:>6.1f}% {vocab_memory_mb:>8.0f}MB")
|
||
|
||
print(f"\nTARGET PRODUCTION INSIGHTS:")
|
||
print(f" • Small models (100M): Vocabulary is ~20-30% of parameters")
|
||
print(f" • Large models (7B+): Vocabulary is ~1-2% of parameters")
|
||
print(f" • Vocabulary memory scales with vocab_size * embed_dim")
|
||
print(f" • GPT uses 50k vocabulary, LLaMA uses 32k (efficiency optimization)")
|
||
|
||
# Calculate tokenization efficiency comparison
|
||
print(f"\n📊 TOKENIZATION EFFICIENCY COMPARISON:")
|
||
char_vocab = 256
|
||
char_embed = 512
|
||
char_memory = (char_vocab * char_embed * 4) / (1024 * 1024)
|
||
|
||
gpt_vocab = 50257
|
||
gpt_embed = 768
|
||
gpt_memory = (gpt_vocab * gpt_embed * 4) / (1024 * 1024)
|
||
|
||
print(f" Character tokenizer: {char_memory:.1f} MB vocabulary")
|
||
print(f" GPT tokenizer: {gpt_memory:.1f} MB vocabulary")
|
||
print(f" Memory ratio: {gpt_memory/char_memory:.0f}x more memory for BPE")
|
||
|
||
# But compute advantage
|
||
sample_text = "The transformer architecture revolutionized NLP"
|
||
char_tokens = len(sample_text) # Approximate character count
|
||
gpt_tokens = char_tokens // 4 # Approximate GPT tokenization (4 chars/token)
|
||
|
||
print(f"\nSPEED COMPUTE EFFICIENCY:")
|
||
print(f" Sample text: '{sample_text}'")
|
||
print(f" Character tokens: ~{char_tokens}")
|
||
print(f" GPT tokens: ~{gpt_tokens}")
|
||
print(f" Attention complexity: O(N²)")
|
||
print(f" Character attention: O({char_tokens}²) = {char_tokens**2:,} operations")
|
||
print(f" GPT attention: O({gpt_tokens}²) = {gpt_tokens**2:,} operations")
|
||
print(f" Compute reduction: {(char_tokens**2)/(gpt_tokens**2):.1f}x faster attention")
|
||
|
||
print(f"\nTIP TRADE-OFF SUMMARY:")
|
||
print(f" • BPE uses {gpt_memory/char_memory:.0f}x more vocabulary memory")
|
||
print(f" • BPE provides {(char_tokens**2)/(gpt_tokens**2):.1f}x faster attention computation")
|
||
print(f" • Production systems choose BPE for compute efficiency")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in production analysis: {e}")
|
||
print("Error in memory calculation - check model configurations")
|
||
|
||
# Run the production analysis
|
||
analyze_production_memory_breakdown()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## ROCKET Advanced: Tokenization Efficiency Techniques
|
||
|
||
Production tokenization systems use several optimization techniques. Let's implement a few key ones:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "tokenization-optimizations", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
#| export
|
||
class OptimizedTokenizer:
|
||
"""
|
||
Production-optimized tokenizer with caching and batch processing.
|
||
|
||
Demonstrates optimization techniques used in real ML systems:
|
||
- Caching for repeated texts
|
||
- Batch processing for efficiency
|
||
- Memory-efficient encoding
|
||
"""
|
||
|
||
def __init__(self, base_tokenizer):
|
||
"""Initialize with a base tokenizer and optimization features."""
|
||
self.base_tokenizer = base_tokenizer
|
||
self.encode_cache = {}
|
||
self.decode_cache = {}
|
||
self.cache_hits = 0
|
||
self.cache_misses = 0
|
||
|
||
def encode_with_cache(self, text: str, add_special_tokens: bool = True) -> List[int]:
|
||
"""
|
||
Encode text with caching for repeated inputs.
|
||
|
||
This optimization is critical for production systems where
|
||
the same texts are processed repeatedly.
|
||
"""
|
||
cache_key = (text, add_special_tokens)
|
||
|
||
if cache_key in self.encode_cache:
|
||
self.cache_hits += 1
|
||
return self.encode_cache[cache_key]
|
||
|
||
# Cache miss - compute and cache result
|
||
self.cache_misses += 1
|
||
tokens = self.base_tokenizer.encode(text, add_special_tokens)
|
||
self.encode_cache[cache_key] = tokens
|
||
|
||
return tokens
|
||
|
||
def batch_encode(self, texts: List[str], add_special_tokens: bool = True,
|
||
pad_to_max: bool = True) -> List[List[int]]:
|
||
"""
|
||
Efficiently encode multiple texts as a batch.
|
||
|
||
This function is PROVIDED to show batch processing optimization.
|
||
"""
|
||
# Encode all texts
|
||
token_sequences = []
|
||
for text in texts:
|
||
tokens = self.encode_with_cache(text, add_special_tokens)
|
||
token_sequences.append(tokens)
|
||
|
||
# Pad to uniform length if requested
|
||
if pad_to_max and hasattr(self.base_tokenizer, 'pad_sequences'):
|
||
token_sequences = self.base_tokenizer.pad_sequences(token_sequences)
|
||
|
||
return token_sequences
|
||
|
||
def get_cache_stats(self) -> Dict:
|
||
"""Get caching performance statistics."""
|
||
total_requests = self.cache_hits + self.cache_misses
|
||
hit_rate = self.cache_hits / total_requests if total_requests > 0 else 0
|
||
|
||
return {
|
||
'cache_hits': self.cache_hits,
|
||
'cache_misses': self.cache_misses,
|
||
'total_requests': total_requests,
|
||
'hit_rate': hit_rate,
|
||
'cache_size': len(self.encode_cache)
|
||
}
|
||
|
||
def demonstrate_production_optimizations():
|
||
"""
|
||
Demonstrate production-level tokenization optimizations.
|
||
|
||
This function is PROVIDED to show real-world optimization techniques.
|
||
"""
|
||
print("ROCKET PRODUCTION TOKENIZATION OPTIMIZATIONS")
|
||
print("=" * 60)
|
||
|
||
# Create optimized tokenizer
|
||
base_tokenizer = CharTokenizer()
|
||
optimized_tokenizer = OptimizedTokenizer(base_tokenizer)
|
||
|
||
# Test data with repeated texts (common in production)
|
||
test_texts = [
|
||
"Hello world!",
|
||
"Machine learning is amazing.",
|
||
"Hello world!", # Repeated
|
||
"Tokenization performance matters.",
|
||
"Hello world!", # Repeated again
|
||
"Machine learning is amazing.", # Repeated
|
||
]
|
||
|
||
print(f"📊 Testing with {len(test_texts)} texts ({len(set(test_texts))} unique)")
|
||
|
||
# Measure performance without caching
|
||
start_time = time.time()
|
||
tokens_no_cache = []
|
||
for text in test_texts:
|
||
tokens = base_tokenizer.encode(text, add_special_tokens=False)
|
||
tokens_no_cache.append(tokens)
|
||
no_cache_time = time.time() - start_time
|
||
|
||
# Measure performance with caching
|
||
start_time = time.time()
|
||
tokens_with_cache = []
|
||
for text in test_texts:
|
||
tokens = optimized_tokenizer.encode_with_cache(text, add_special_tokens=False)
|
||
tokens_with_cache.append(tokens)
|
||
cache_time = time.time() - start_time
|
||
|
||
# Test batch encoding
|
||
start_time = time.time()
|
||
batch_tokens = optimized_tokenizer.batch_encode(test_texts, add_special_tokens=False, pad_to_max=True)
|
||
batch_time = time.time() - start_time
|
||
|
||
# Report results
|
||
cache_stats = optimized_tokenizer.get_cache_stats()
|
||
|
||
print(f"\nSPEED PERFORMANCE COMPARISON:")
|
||
print(f" No caching: {no_cache_time*1000:.2f}ms")
|
||
print(f" With caching: {cache_time*1000:.2f}ms ({(no_cache_time/cache_time):.1f}x speedup)")
|
||
print(f" Batch processing: {batch_time*1000:.2f}ms")
|
||
|
||
print(f"\nPROGRESS CACHE PERFORMANCE:")
|
||
print(f" Hit rate: {cache_stats['hit_rate']*100:.1f}%")
|
||
print(f" Cache hits: {cache_stats['cache_hits']}")
|
||
print(f" Cache misses: {cache_stats['cache_misses']}")
|
||
print(f" Cache size: {cache_stats['cache_size']} entries")
|
||
|
||
print(f"\nTARGET PRODUCTION INSIGHTS:")
|
||
print(f" - Caching provides significant speedup for repeated texts")
|
||
print(f" - Batch processing enables vectorized operations")
|
||
print(f" - Memory-efficient encoding reduces allocation overhead")
|
||
print(f" - Cache hit rates >80% common in production systems")
|
||
|
||
# Function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Comprehensive Testing & Integration
|
||
|
||
Let's run comprehensive tests to ensure all tokenization functionality works correctly:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "test-tokenization-comprehensive", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
def test_tokenization_comprehensive():
|
||
"""Comprehensive test suite for all tokenization functionality."""
|
||
print("TEST Comprehensive Tokenization Tests...")
|
||
|
||
# Test 1: Character tokenizer edge cases
|
||
print(" Testing character tokenizer edge cases...")
|
||
char_tokenizer = CharTokenizer()
|
||
|
||
# Empty string
|
||
empty_tokens = char_tokenizer.encode("", add_special_tokens=True)
|
||
assert len(empty_tokens) == 2, "Empty string should have BOS and EOS tokens"
|
||
|
||
# Single character
|
||
single_tokens = char_tokenizer.encode("A", add_special_tokens=False)
|
||
assert len(single_tokens) == 1, "Single character should produce one token"
|
||
|
||
# Special characters
|
||
special_text = "!@#$%"
|
||
special_tokens = char_tokenizer.encode(special_text, add_special_tokens=False)
|
||
assert len(special_tokens) == len(special_text), "Should handle special characters"
|
||
|
||
# Round-trip encoding/decoding
|
||
original = "Hello, World! 123"
|
||
tokens = char_tokenizer.encode(original, add_special_tokens=False)
|
||
decoded = char_tokenizer.decode(tokens, skip_special_tokens=True)
|
||
assert decoded == original, "Round-trip should preserve text"
|
||
|
||
print(" PASS Character tokenizer edge cases passed")
|
||
|
||
# Test 2: BPE tokenizer robustness
|
||
print(" Testing BPE tokenizer robustness...")
|
||
bpe_tokenizer = BPETokenizer(vocab_size=100)
|
||
|
||
# Train with diverse data
|
||
training_data = [
|
||
"hello world",
|
||
"the quick brown fox",
|
||
"machine learning systems",
|
||
"neural network training",
|
||
"hello hello world world" # Repeated patterns for merging
|
||
]
|
||
|
||
bpe_tokenizer.train(training_data)
|
||
assert bpe_tokenizer.trained, "BPE should be trained"
|
||
|
||
# Test encoding various texts
|
||
test_cases = [
|
||
"hello world",
|
||
"new unseen text",
|
||
"machine learning",
|
||
"" # Empty string
|
||
]
|
||
|
||
for test_text in test_cases:
|
||
if test_text: # Skip empty string for basic tests
|
||
tokens = bpe_tokenizer.encode(test_text, add_special_tokens=False)
|
||
decoded = bpe_tokenizer.decode(tokens, skip_special_tokens=True)
|
||
# BPE decoding might have slightly different spacing due to word boundaries
|
||
assert test_text.replace(" ", "") in decoded.replace(" ", ""), f"BPE round-trip failed for '{test_text}'"
|
||
|
||
print(" PASS BPE tokenizer robustness passed")
|
||
|
||
# Test 3: Memory efficiency with large texts
|
||
print(" Testing memory efficiency...")
|
||
large_text = "This is a test sentence. " * 1000 # ~25k characters
|
||
|
||
start_time = time.time()
|
||
char_tokens = char_tokenizer.encode(large_text, add_special_tokens=False)
|
||
char_time = time.time() - start_time
|
||
|
||
assert len(char_tokens) > 20000, "Should handle large texts"
|
||
assert char_time < 1.0, "Should tokenize large text quickly"
|
||
|
||
print(" PASS Memory efficiency tests passed")
|
||
|
||
# Test 4: Integration with optimization features
|
||
print(" Testing optimization features...")
|
||
optimized = OptimizedTokenizer(char_tokenizer)
|
||
|
||
# Test caching
|
||
test_text = "Repeated text for caching test"
|
||
tokens1 = optimized.encode_with_cache(test_text)
|
||
tokens2 = optimized.encode_with_cache(test_text) # Should hit cache
|
||
|
||
assert tokens1 == tokens2, "Cached results should be identical"
|
||
|
||
cache_stats = optimized.get_cache_stats()
|
||
assert cache_stats['cache_hits'] > 0, "Should have cache hits"
|
||
assert cache_stats['hit_rate'] > 0, "Should have positive hit rate"
|
||
|
||
# Test batch processing
|
||
batch_texts = ["text one", "text two", "text three"]
|
||
batch_results = optimized.batch_encode(batch_texts, pad_to_max=True)
|
||
|
||
assert len(batch_results) == len(batch_texts), "Batch size should match input"
|
||
assert all(len(seq) == len(batch_results[0]) for seq in batch_results), "All sequences should be padded to same length"
|
||
|
||
print(" PASS Optimization features tests passed")
|
||
|
||
print("PASS All comprehensive tokenization tests passed!")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Main Execution Block
|
||
|
||
All tokenization tests and demonstrations are run from here when the module is executed directly:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "tokenization-main", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
if __name__ == "__main__":
|
||
print("🔤 Starting TinyTorch Tokenization Module...")
|
||
print("="*60)
|
||
|
||
# Run all unit tests
|
||
print("\nTEST UNIT TESTS")
|
||
print("-" * 30)
|
||
test_unit_char_tokenizer()
|
||
test_unit_bpe_tokenizer()
|
||
test_tokenization_profiler()
|
||
|
||
# Run comprehensive integration tests
|
||
print("\n🔧 INTEGRATION TESTS")
|
||
print("-" * 30)
|
||
test_tokenization_comprehensive()
|
||
|
||
# Performance analysis
|
||
print("\n" + "="*60)
|
||
print("MAGNIFY TOKENIZATION PERFORMANCE ANALYSIS")
|
||
print("="*60)
|
||
|
||
# Create test data
|
||
sample_texts = [
|
||
"The transformer architecture has revolutionized natural language processing.",
|
||
"Machine learning models require efficient tokenization for text processing.",
|
||
"Character-level tokenization produces long sequences but small vocabularies.",
|
||
"Byte pair encoding balances vocabulary size with sequence length efficiency.",
|
||
"Production systems need fast tokenization to maintain training throughput."
|
||
]
|
||
|
||
print(f"\nTesting with {len(sample_texts)} sample texts...")
|
||
|
||
# Performance comparison
|
||
profiler = TokenizationProfiler()
|
||
comparison_results = profiler.compare_tokenizers(sample_texts)
|
||
|
||
# Systems impact analysis
|
||
analyze_tokenization_systems_impact()
|
||
|
||
# Production optimizations demonstration
|
||
demonstrate_production_optimizations()
|
||
|
||
print("\n" + "="*60)
|
||
print("TARGET TOKENIZATION MODULE COMPLETE!")
|
||
print("="*60)
|
||
print("PASS All tokenization tests passed!")
|
||
print("PASS Systems insights analysis complete!")
|
||
print("PASS Performance profiling successful!")
|
||
print("ROCKET Ready for embedding layer integration!")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## THINK ML Systems Thinking: Interactive Questions
|
||
|
||
Now that you've built the text processing foundation for language models, let's connect this work to broader ML systems challenges. These questions help you think critically about how tokenization scales to production language processing systems.
|
||
|
||
Take time to reflect thoughtfully on each question - your insights will help you understand how tokenization connects to real-world ML systems engineering.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 1: Vocabulary Size vs Model Performance Analysis
|
||
|
||
**Context**: Your tokenization implementations show how vocabulary size affects both model parameters and sequence processing. In your CharTokenizer, you observed small vocabulary (~99 tokens) but long sequences. In your BPE implementation, you created larger vocabularies (~500-2000 tokens) with shorter sequences.
|
||
|
||
**Computational Assessment**: Analyze the memory and computational trade-offs in your tokenization implementations. Given a text corpus where your CharTokenizer produces average sequences of 200 tokens and your BPE tokenizer produces average sequences of 50 tokens, calculate the total memory requirements for a model with 256-dimensional embeddings processing batches of 32 sequences. Compare the embedding table memory, sequence processing memory, and attention computation complexity (O(N²)) for both approaches. Which tokenization strategy would be more efficient for training large language models and why?
|
||
|
||
Consider: embedding parameters, attention complexity, batch processing memory, and training throughput implications.
|
||
|
||
*Target length: 200-400 words with calculations*
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "question-1-tokenization-strategy", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
|
||
"""
|
||
YOUR REFLECTION ON TOKENIZATION STRATEGY AND PERFORMANCE TRADE-OFFS:
|
||
|
||
TODO: Replace this text with your thoughtful response about multilingual tokenization strategy design.
|
||
|
||
Consider addressing:
|
||
- How would you design a tokenization strategy for 50+ languages within a 100k token limit?
|
||
- What approaches would you use to handle different scripts and morphological complexity?
|
||
- How would you optimize for both cross-lingual transfer and computational efficiency?
|
||
- What trade-offs would you make between vocabulary sharing and language-specific optimization?
|
||
- How would you ensure consistent quality across languages with different characteristics?
|
||
|
||
Write a strategic analysis connecting your tokenization implementations to real multilingual system challenges.
|
||
|
||
GRADING RUBRIC (Instructor Use):
|
||
- Demonstrates understanding of multilingual tokenization challenges (3 points)
|
||
- Designs practical approaches to vocabulary size and language coverage (3 points)
|
||
- Addresses cross-lingual transfer and efficiency considerations (2 points)
|
||
- Shows systems thinking about production language model constraints (2 points)
|
||
- Clear strategic reasoning with multilingual optimization insights (bonus points for comprehensive understanding)
|
||
"""
|
||
|
||
### BEGIN SOLUTION
|
||
# Student response area - instructor will replace this section during grading setup
|
||
# This is a manually graded question requiring strategic analysis of multilingual tokenization
|
||
# Students should demonstrate understanding of cross-lingual efficiency and performance trade-offs
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 2: BPE Training Complexity and Optimization
|
||
|
||
**Context**: Your BPE implementation performs iterative pair merging to build subword vocabularies. The `_get_pair_counts()` and `_merge_pair()` functions you implemented process the entire corpus in each iteration. You observed that BPE training can be computationally expensive as vocabulary size increases.
|
||
|
||
**Computational Assessment**: Analyze the computational complexity of your BPE training algorithm. If you have a corpus with C characters, V target vocabulary size, and your algorithm performs V-k merging iterations (where k is initial character vocabulary), calculate the time complexity of the complete training process. Compare the efficiency of training BPE vocabularies of 1000, 5000, and 50000 tokens on a 1GB text corpus. Design specific optimizations to your `_get_pair_counts()` and `_merge_pair()` implementations that would reduce training time while maintaining tokenization quality.
|
||
|
||
Consider: algorithm complexity, data structure choices, memory usage during training, and practical optimization strategies.
|
||
|
||
*Target length: 200-400 words with complexity analysis*
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "question-2-pipeline-integration", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
|
||
"""
|
||
YOUR REFLECTION ON TOKENIZATION PIPELINE INTEGRATION:
|
||
|
||
TODO: Replace this text with your thoughtful response about large-scale tokenization pipeline design.
|
||
|
||
Consider addressing:
|
||
- How would you architect parallel tokenization for processing 1TB of text daily?
|
||
- What caching strategies would you implement for repeated text patterns?
|
||
- How would you handle storage optimization and I/O bottleneck minimization?
|
||
- What approaches would you use to maintain consistency across distributed training?
|
||
- How would you design the system to handle dynamic vocabulary updates?
|
||
|
||
Write an architectural analysis connecting your tokenization implementations to large-scale training infrastructure.
|
||
|
||
GRADING RUBRIC (Instructor Use):
|
||
- Shows understanding of large-scale tokenization pipeline challenges (3 points)
|
||
- Designs practical approaches to parallel processing and caching (3 points)
|
||
- Addresses distributed training and consistency requirements (2 points)
|
||
- Demonstrates systems thinking about training infrastructure optimization (2 points)
|
||
- Clear architectural reasoning with scalability insights (bonus points for comprehensive system design)
|
||
"""
|
||
|
||
### BEGIN SOLUTION
|
||
# Student response area - instructor will replace this section during grading setup
|
||
# This is a manually graded question requiring understanding of large-scale pipeline integration
|
||
# Students should demonstrate knowledge of distributed training and infrastructure optimization
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 3: Tokenization Efficiency in Production Systems
|
||
|
||
**Context**: Your OptimizedTokenizer implementation includes caching mechanisms that you tested with repeated text processing. You observed significant speedup for cache hits but also noted memory overhead for storing cached results. Production systems must balance caching benefits with memory constraints.
|
||
|
||
**Computational Assessment**: Design a caching strategy for your tokenization system that optimizes for production deployment with 10GB memory budget. Given that your character tokenization produces ~4 bytes per token and typical text repeats with 60% cache hit rate, calculate the optimal cache size that maximizes throughput while staying within memory limits. Analyze how cache eviction policies (LRU, LFU, or TTL-based) would affect performance for different workload patterns: academic paper processing (high repetition), social media feeds (medium repetition), and novel literature (low repetition). Propose specific modifications to your encode_with_cache() method that would adapt cache behavior based on workload characteristics.
|
||
|
||
Consider: memory allocation, cache eviction algorithms, workload patterns, and adaptive optimization strategies.
|
||
|
||
*Target length: 200-400 words with memory calculations*
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "question-3-dynamic-tokenization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
|
||
"""
|
||
YOUR REFLECTION ON DYNAMIC TOKENIZATION AND ADAPTIVE SYSTEMS:
|
||
|
||
TODO: Replace this text with your thoughtful response about adaptive tokenization system design.
|
||
|
||
Consider addressing:
|
||
- How would you design vocabulary expansion for incorporating new domain terminology?
|
||
- What strategies would you use to preserve existing token embeddings during updates?
|
||
- How would you maintain tokenization consistency during model evolution?
|
||
- What approaches would minimize retraining overhead for vocabulary changes?
|
||
- How would you balance stability and adaptability in production systems?
|
||
|
||
Write a design analysis connecting your tokenization work to adaptive language model systems.
|
||
|
||
GRADING RUBRIC (Instructor Use):
|
||
- Understands dynamic tokenization challenges and adaptation requirements (3 points)
|
||
- Designs practical approaches to vocabulary evolution and embedding preservation (3 points)
|
||
- Addresses consistency and backward compatibility considerations (2 points)
|
||
- Shows systems thinking about continuous adaptation in production (2 points)
|
||
- Clear design reasoning with adaptive system insights (bonus points for innovative approaches)
|
||
"""
|
||
|
||
### BEGIN SOLUTION
|
||
# Student response area - instructor will replace this section during grading setup
|
||
# This is a manually graded question requiring understanding of adaptive tokenization systems
|
||
# Students should demonstrate knowledge of vocabulary evolution and continuous learning challenges
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 4: Out-of-Vocabulary Handling and System Robustness
|
||
|
||
**Context**: Your tokenization implementations handle unknown characters and tokens through UNK tokens. In your CharTokenizer, characters outside ASCII range become UNK. In your BPETokenizer, text not seen during training falls back to character-level processing. Production systems must gracefully handle diverse, evolving text inputs.
|
||
|
||
**Computational Assessment**: Analyze the robustness of your tokenization systems when processing multilingual and noisy text. Calculate the UNK token rate for processing text containing 20% non-ASCII characters using your CharTokenizer versus a trained BPE tokenizer. Design an enhanced fallback strategy that combines character-level, BPE subword, and whole-word tokenization to minimize information loss. Quantify how UNK token rates affect downstream model performance by estimating the impact on embedding quality when 15% of tokens are UNK versus 2% UNK. Propose specific modifications to your encode() methods that would improve out-of-vocabulary handling without significantly increasing vocabulary size.
|
||
|
||
Consider: fallback hierarchies, information preservation, embedding quality, vocabulary efficiency, and multilingual robustness.
|
||
|
||
*Target length: 200-400 words with impact analysis*
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "question-4-oov-handling", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
|
||
"""
|
||
YOUR ANALYSIS ON OUT-OF-VOCABULARY HANDLING AND SYSTEM ROBUSTNESS:
|
||
|
||
TODO: Replace this text with your computational assessment of OOV handling strategies.
|
||
|
||
Consider addressing:
|
||
- How would you calculate UNK token rates for different text types?
|
||
- What fallback strategies would minimize information loss in your implementations?
|
||
- How do UNK token rates affect downstream model performance quantitatively?
|
||
- What modifications to your encode() methods would improve robustness?
|
||
- How would you design vocabulary expansion to handle evolving text patterns?
|
||
|
||
Write a technical analysis connecting your tokenization implementations to real multilingual robustness challenges.
|
||
|
||
GRADING RUBRIC (Instructor Use):
|
||
- Quantifies UNK token rates and their impact on system performance (3 points)
|
||
- Designs practical fallback strategies building on existing implementations (3 points)
|
||
- Analyzes downstream effects on embedding quality and model performance (2 points)
|
||
- Proposes concrete improvements to existing encode() methods (2 points)
|
||
- Clear technical reasoning with robustness engineering insights (bonus points for comprehensive analysis)
|
||
"""
|
||
|
||
### BEGIN SOLUTION
|
||
# Student response area - instructor will replace this section during grading setup
|
||
# This is a manually graded question requiring understanding of OOV handling and system robustness
|
||
# Students should demonstrate knowledge of tokenization robustness and multilingual challenges
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## TARGET MODULE SUMMARY: Tokenization
|
||
|
||
Congratulations! You have successfully implemented comprehensive tokenization systems for language processing:
|
||
|
||
### PASS What You Have Built
|
||
- **Character Tokenizer**: Simple character-level tokenization with special token handling
|
||
- **BPE Tokenizer**: Subword tokenization using Byte Pair Encoding algorithm
|
||
- **Vocabulary Management**: Efficient mapping between text and numerical representations
|
||
- **Padding & Truncation**: Batch processing utilities for uniform sequence lengths
|
||
- **Performance Optimization**: Caching and batch processing for production efficiency
|
||
- **🆕 Memory Efficiency**: Optimized string processing and token caching systems
|
||
- **🆕 Systems Analysis**: Comprehensive performance profiling and scaling analysis
|
||
|
||
### PASS Key Learning Outcomes
|
||
- **Understanding**: How text becomes numbers that neural networks can process
|
||
- **Implementation**: Built character and subword tokenizers from scratch
|
||
- **Systems Insight**: How tokenization affects model memory, performance, and capabilities
|
||
- **Performance Engineering**: Measured and optimized tokenization throughput
|
||
- **Production Context**: Understanding real-world tokenization challenges and solutions
|
||
|
||
### PASS Technical Mastery
|
||
- **Character Tokenization**: Simple but interpretable text processing
|
||
- **BPE Algorithm**: Iterative pair merging for subword discovery
|
||
- **Vocabulary Trade-offs**: Balancing vocabulary size vs sequence length
|
||
- **Memory Optimization**: Efficient caching and batch processing techniques
|
||
- **🆕 Performance Analysis**: Measuring tokenization impact on downstream systems
|
||
|
||
### PASS Professional Skills Developed
|
||
- **Algorithm Implementation**: Building complex text processing systems
|
||
- **Performance Engineering**: Optimizing for speed and memory efficiency
|
||
- **Systems Thinking**: Understanding tokenization's role in ML pipelines
|
||
- **Production Optimization**: Caching, batching, and scalability techniques
|
||
|
||
### PASS Ready for Next Steps
|
||
Your tokenization systems are now ready to power:
|
||
- **Embedding Layers**: Converting tokens to dense vector representations
|
||
- **Language Models**: Processing text for transformer architectures
|
||
- **Production Systems**: Efficient text processing pipelines
|
||
- **🧠 Text Understanding**: Foundation for natural language processing
|
||
|
||
### LINK Connection to Real ML Systems
|
||
Your implementations mirror production systems:
|
||
- **GPT Tokenizers**: Modern language models use sophisticated BPE variants
|
||
- **SentencePiece**: Unigram language model tokenization used in many systems
|
||
- **Hugging Face Tokenizers**: Production-optimized tokenization libraries
|
||
- **Industry Applications**: Every language model relies on efficient tokenization
|
||
|
||
### TARGET The Power of Text Processing
|
||
You have unlocked the bridge between human language and machine understanding:
|
||
- **Before**: Text was just strings of characters
|
||
- **After**: Text becomes structured numerical sequences for neural networks
|
||
|
||
**Next Module**: Embeddings - Converting your tokens into rich vector representations that capture semantic meaning!
|
||
|
||
Your tokenization systems are the first step in language understanding. Now let's build the embeddings that give tokens meaning!
|
||
""" |