# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.17.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

#| default_exp text.tokenization
#| export

# %% [markdown]
"""
# Module 10: Tokenization - Converting Text to Numbers

Welcome to Module 10! Today you'll build tokenization - the bridge that converts
human-readable text into numerical representations that machine learning models
can process.

## 🔗 Prerequisites & Progress

**You've Built**: Neural networks, layers, training loops, and data loading
**You'll Build**: Text tokenization systems (character and BPE-based)
**You'll Enable**: Text processing for language models and NLP tasks

**Connection Map**:
```
DataLoader  →  Tokenization   →  Embeddings
(batching)     (text→numbers)    (learnable representations)
```

## Learning Objectives

By the end of this module, you will:
1. Implement character-based tokenization for simple text processing
2. Build a BPE (Byte Pair Encoding) tokenizer for efficient text representation
3. Understand vocabulary management and encoding/decoding operations
4. Create the foundation for text processing in neural networks

Let's get started!
"""

# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in modules/10_tokenization/tokenization_dev.py
**Building Side:** Code exports to tinytorch.text.tokenization

```python
# Final package structure:
from tinytorch.text.tokenization import Tokenizer, CharTokenizer, BPETokenizer  # This module
from tinytorch.core.tensor import Tensor  # Foundation (always needed)
from tinytorch.data.loader import DataLoader  # For text data batching
```

**Why this matters:**
- **Learning:** Complete tokenization system in one focused module for deep understanding
- **Production:** Proper organization like Hugging Face's tokenizers with all text processing together
- **Consistency:** All tokenization operations and vocabulary management in text.tokenization
- **Integration:** Works seamlessly with embeddings and data loading for complete NLP pipeline
"""

# %%
import numpy as np
from typing import List, Dict, Tuple, Optional, Set
import json
import re
from collections import defaultdict, Counter

# Import only Module 01 (Tensor) - this module has minimal dependencies
import sys
import os
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
from tensor_dev import Tensor

# %% [markdown]
"""
## 1. Introduction - Why Tokenization?

Neural networks operate on numbers, but humans communicate with text. Tokenization
is the crucial bridge that converts text into numerical sequences that models can
process.

### The Text-to-Numbers Challenge

Consider the sentence: "Hello, world!"

```
Human Text:    "Hello, world!"
        ↓ [Tokenization] ↓
Numerical IDs: [72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 33]
```

### The Four-Step Process

How do we represent this for a neural network? We need to:
1. **Split text into tokens** - meaningful units like words, subwords, or characters
2. **Map tokens to integers** - create a vocabulary that assigns unique IDs
3. **Handle unknown text** - deal with words not seen during training
4. **Enable reconstruction** - convert numbers back to readable text

### Why This Matters

The choice of tokenization strategy dramatically affects:
- **Model performance** - how well the model understands text
- **Vocabulary size** - memory requirements for embedding tables
- **Computational efficiency** - sequence length affects processing time
- **Robustness** - how well the model handles new/rare words
"""

# %% [markdown]
"""
## 2. Foundations - Tokenization Strategies

Different tokenization approaches make different trade-offs between vocabulary
size, sequence length, and semantic understanding.

### Character-Level Tokenization

**Approach**: Each character gets its own token
```
Text:   "Hello world"
          ↓
Tokens: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
          ↓
IDs:    [8, 5, 12, 12, 15, 0, 23, 15, 18, 12, 4]
```
**Pros**: Small vocabulary (~100), handles any text, no unknown tokens
**Cons**: Long sequences (1 char = 1 token), limited semantic understanding

### Word-Level Tokenization

**Approach**: Each word gets its own token
```
Text:   "Hello world"
          ↓
Tokens: ['Hello', 'world']
          ↓
IDs:    [5847, 1254]
```
**Pros**: Semantic meaning preserved, shorter sequences
**Cons**: Huge vocabularies (100K+), many unknown tokens

### Subword Tokenization (BPE)

**Approach**: Learn frequent character pairs, build subword units
```
Text: "tokenization"
        ↓ Character level
Initial: ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
        ↓ Learn frequent pairs
Merged:  ['to', 'ken', 'ization']
        ↓
IDs:     [142, 1847, 2341]
```
**Pros**: Balance between vocabulary size and sequence length
**Cons**: More complex training process

### Strategy Comparison

```
Text: "tokenization" (12 characters)

Character: ['t','o','k','e','n','i','z','a','t','i','o','n'] → 12 tokens, vocab ~100
Word:      ['tokenization']                                  → 1 token,  vocab 100K+
BPE:       ['token','ization']                               → 2 tokens, vocab 10-50K
```

The sweet spot for most applications is BPE with a 10K-50K vocabulary.
"""

# %% [markdown]
"""
## 3. Implementation - Building Tokenization Systems

Let's implement tokenization systems from simple character-based to sophisticated
BPE. We'll start with the base interface and work our way up to advanced algorithms.
"""

# %% [markdown]
"""
### Base Tokenizer Interface

All tokenizers need to provide two core operations: encoding text to numbers and
decoding numbers back to text. Let's define the common interface.

```
Tokenizer Interface:
    encode(text) → [id1, id2, id3, ...]
    decode([id1, id2, id3, ...]) → text
```

This ensures consistent behavior across different tokenization strategies.
"""

# %% nbgrader={"grade": false, "grade_id": "base-tokenizer", "solution": true}
class Tokenizer:
    """
    Base tokenizer class providing the interface for all tokenizers.

    This defines the contract that all tokenizers must follow:
    - encode(): text → list of token IDs
    - decode(): list of token IDs → text
    """

    def encode(self, text: str) -> List[int]:
        """
        Convert text to a list of token IDs.

        TODO: Implement encoding logic in subclasses

        APPROACH:
        1. Subclasses will override this method
        2. Return list of integer token IDs

        EXAMPLE:
        >>> tokenizer = CharTokenizer(['a', 'b', 'c'])
        >>> tokenizer.encode("abc")
        [1, 2, 3]  # ID 0 is reserved for <UNK>
        """
        ### BEGIN SOLUTION
        raise NotImplementedError("Subclasses must implement encode()")
        ### END SOLUTION

    def decode(self, tokens: List[int]) -> str:
        """
        Convert list of token IDs back to text.

        TODO: Implement decoding logic in subclasses

        APPROACH:
        1. Subclasses will override this method
        2. Return reconstructed text string

        EXAMPLE:
        >>> tokenizer = CharTokenizer(['a', 'b', 'c'])
        >>> tokenizer.decode([1, 2, 3])
        "abc"
        """
        ### BEGIN SOLUTION
        raise NotImplementedError("Subclasses must implement decode()")
        ### END SOLUTION

# %% nbgrader={"grade": true, "grade_id": "test-base-tokenizer", "locked": true, "points": 5}
def test_unit_base_tokenizer():
    """🔬 Test base tokenizer interface."""
    print("🔬 Unit Test: Base Tokenizer Interface...")

    # Test that base class defines the interface
    tokenizer = Tokenizer()

    # Should raise NotImplementedError for both methods
    try:
        tokenizer.encode("test")
        assert False, "encode() should raise NotImplementedError"
    except NotImplementedError:
        pass

    try:
        tokenizer.decode([1, 2, 3])
        assert False, "decode() should raise NotImplementedError"
    except NotImplementedError:
        pass

    print("✅ Base tokenizer interface works correctly!")

test_unit_base_tokenizer()

# %% [markdown]
"""
### Character-Level Tokenizer

The simplest tokenization approach: each character becomes a token. This gives us
perfect coverage of any text but produces long sequences.

```
Character Tokenization Process:

Step 1: Build vocabulary from unique characters
  Text corpus:  ["hello", "world"]
  Unique chars: ['h', 'e', 'l', 'o', 'w', 'r', 'd']
  Vocabulary:   ['<UNK>', 'h', 'e', 'l', 'o', 'w', 'r', 'd']  # <UNK> for unknown
                    0      1    2    3    4    5    6    7

Step 2: Encode text character by character
  Text: "hello"
  'h' → 1, 'e' → 2, 'l' → 3, 'l' → 3, 'o' → 4
  Result: [1, 2, 3, 3, 4]

Step 3: Decode by looking up each ID
  IDs: [1, 2, 3, 3, 4]
  1 → 'h', 2 → 'e', 3 → 'l', 3 → 'l', 4 → 'o'
  Result: "hello"
```
"""

# %% nbgrader={"grade": false, "grade_id": "char-tokenizer", "solution": true}
class CharTokenizer(Tokenizer):
    """
    Character-level tokenizer that treats each character as a separate token.

    This is the simplest tokenization approach - every character in the
    vocabulary gets its own unique ID.
    """

    def __init__(self, vocab: Optional[List[str]] = None):
        """
        Initialize character tokenizer.

        TODO: Set up vocabulary mappings

        APPROACH:
        1. Store vocabulary list
        2. Create char→id and id→char mappings
        3. Handle special tokens (unknown character)

        EXAMPLE:
        >>> tokenizer = CharTokenizer(['a', 'b', 'c'])
        >>> tokenizer.vocab_size
        4  # 3 chars + 1 unknown token
        """
        ### BEGIN SOLUTION
        if vocab is None:
            vocab = []

        # Add special unknown token
        self.vocab = ['<UNK>'] + vocab
        self.vocab_size = len(self.vocab)

        # Create bidirectional mappings
        self.char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
        self.id_to_char = {idx: char for idx, char in enumerate(self.vocab)}

        # Store unknown token ID
        self.unk_id = 0
        ### END SOLUTION

    def build_vocab(self, corpus: List[str]) -> None:
        """
        Build vocabulary from a corpus of text.

        TODO: Extract unique characters and build vocabulary

        APPROACH:
        1. Collect all unique characters from corpus
        2. Sort for consistent ordering
        3. Rebuild mappings with new vocabulary

        HINTS:
        - Use set() to find unique characters
        - Join all texts then convert to set
        - Don't forget the <UNK> token
        """
        ### BEGIN SOLUTION
        # Collect all unique characters
        all_chars = set()
        for text in corpus:
            all_chars.update(text)

        # Sort for consistent ordering
        unique_chars = sorted(list(all_chars))

        # Rebuild vocabulary with <UNK> token first
        self.vocab = ['<UNK>'] + unique_chars
        self.vocab_size = len(self.vocab)

        # Rebuild mappings
        self.char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
        self.id_to_char = {idx: char for idx, char in enumerate(self.vocab)}
        ### END SOLUTION

    def encode(self, text: str) -> List[int]:
        """
        Encode text to list of character IDs.

        TODO: Convert each character to its vocabulary ID

        APPROACH:
        1. Iterate through each character in text
        2. Look up character ID in vocabulary
        3. Use unknown token ID for unseen characters

        EXAMPLE:
        >>> tokenizer = CharTokenizer(['h', 'e', 'l', 'o'])
        >>> tokenizer.encode("hello")
        [1, 2, 3, 3, 4]  # maps to h,e,l,l,o
        """
        ### BEGIN SOLUTION
        tokens = []
        for char in text:
            tokens.append(self.char_to_id.get(char, self.unk_id))
        return tokens
        ### END SOLUTION

    def decode(self, tokens: List[int]) -> str:
        """
        Decode list of token IDs back to text.

        TODO: Convert each token ID back to its character

        APPROACH:
        1. Look up each token ID in vocabulary
        2. Join characters into string
        3. Handle invalid token IDs gracefully

        EXAMPLE:
        >>> tokenizer = CharTokenizer(['h', 'e', 'l', 'o'])
        >>> tokenizer.decode([1, 2, 3, 3, 4])
        "hello"
        """
        ### BEGIN SOLUTION
        chars = []
        for token_id in tokens:
            # Use unknown token for invalid IDs
            char = self.id_to_char.get(token_id, '<UNK>')
            chars.append(char)
        return ''.join(chars)
        ### END SOLUTION

# %% nbgrader={"grade": true, "grade_id": "test-char-tokenizer", "locked": true, "points": 15}
def test_unit_char_tokenizer():
    """🔬 Test character tokenizer implementation."""
    print("🔬 Unit Test: Character Tokenizer...")

    # Test basic functionality
    vocab = ['h', 'e', 'l', 'o', ' ', 'w', 'r', 'd']
    tokenizer = CharTokenizer(vocab)

    # Test vocabulary setup
    assert tokenizer.vocab_size == 9  # 8 chars + <UNK>
    assert tokenizer.vocab[0] == '<UNK>'
    assert 'h' in tokenizer.char_to_id

    # Test encoding
    text = "hello"
    tokens = tokenizer.encode(text)
    expected = [1, 2, 3, 3, 4]  # h,e,l,l,o (based on actual vocab order)
    assert tokens == expected, f"Expected {expected}, got {tokens}"

    # Test decoding
    decoded = tokenizer.decode(tokens)
    assert decoded == text, f"Expected '{text}', got '{decoded}'"

    # Test unknown character handling
    tokens_with_unk = tokenizer.encode("hello!")
    assert tokens_with_unk[-1] == 0  # '!' should map to <UNK>

    # Test vocabulary building
    corpus = ["hello world", "test text"]
    tokenizer.build_vocab(corpus)
    assert 't' in tokenizer.char_to_id
    assert 'x' in tokenizer.char_to_id

    print("✅ Character tokenizer works correctly!")

test_unit_char_tokenizer()

# %% [markdown]
"""
### 🧪 Character Tokenizer Analysis

Character tokenization provides a simple, robust foundation for text processing.
The key insight is that with a small vocabulary (typically <100 characters), we
can represent any text without unknown tokens.

**Trade-offs**:
- **Pro**: No out-of-vocabulary issues, handles any language
- **Con**: Long sequences (1 char = 1 token), limited semantic understanding
- **Use case**: When robustness is more important than efficiency
"""

# %% [markdown]
"""
### Byte Pair Encoding (BPE) Tokenizer

BPE is the secret sauce behind modern language models. It learns to merge frequent
character pairs, creating subword units that balance vocabulary size with sequence
length.

```
BPE Training Process:

Step 1: Start with character vocabulary
  Text: ["hello", "hello", "help"]
  Initial tokens: [['h','e','l','l','o</w>'], ['h','e','l','l','o</w>'], ['h','e','l','p</w>']]

Step 2: Count character pairs
  ('h','e'):     3 times  ← Most frequent!
  ('e','l'):     3 times
  ('l','l'):     2 times
  ('l','o</w>'): 2 times
  ('l','p</w>'): 1 time

Step 3: Merge most frequent pair
  Merge ('h','e') → 'he'
  Tokens: [['he','l','l','o</w>'], ['he','l','l','o</w>'], ['he','l','p</w>']]
  Vocab:  ['h','e','l','o</w>','p</w>','<UNK>','he']  ← New token added

Step 4: Repeat until target vocabulary size
  Next merge: ('l','l') → 'll'
  Tokens: [['he','ll','o</w>'], ['he','ll','o</w>'], ['he','l','p</w>']]
  Vocab:  ['h','e','l','o</w>','p</w>','<UNK>','he','ll']  ← Growing vocabulary

Final result:
  Text "hello" → ['he', 'll', 'o</w>'] → 3 tokens (vs 5 characters)
  Text "help"  → ['he', 'l', 'p</w>']  → 3 tokens (vs 4 characters)
```

BPE discovers natural word boundaries and common patterns automatically!
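
As a quick sanity check, the pair counts from Step 2 above can be reproduced in a
few lines with `collections.Counter` (a sketch using the same toy corpus; `'</w>'`
marks the end of each word):

```python
from collections import Counter

# Toy corpus from the walkthrough above, already split into character tokens
words = [['h', 'e', 'l', 'l', 'o</w>'],
         ['h', 'e', 'l', 'l', 'o</w>'],
         ['h', 'e', 'l', 'p</w>']]

# Count every adjacent token pair across all words
pair_counts = Counter()
for tokens in words:
    for a, b in zip(tokens, tokens[1:]):
        pair_counts[(a, b)] += 1

print(pair_counts.most_common(2))
# → [(('h', 'e'), 3), (('e', 'l'), 3)]
```

The most common pair is the one BPE merges first; repeating count-then-merge is
the entire training loop.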
""" # %% nbgrader={"grade": false, "grade_id": "bpe-tokenizer", "solution": true} class BPETokenizer(Tokenizer): """ Byte Pair Encoding (BPE) tokenizer that learns subword units. BPE works by: 1. Starting with character-level vocabulary 2. Finding most frequent character pairs 3. Merging frequent pairs into single tokens 4. Repeating until desired vocabulary size """ def __init__(self, vocab_size: int = 1000): """ Initialize BPE tokenizer. TODO: Set up basic tokenizer state APPROACH: 1. Store target vocabulary size 2. Initialize empty vocabulary and merge rules 3. Set up mappings for encoding/decoding """ ### BEGIN SOLUTION self.vocab_size = vocab_size self.vocab = [] self.merges = [] # List of (pair, new_token) merges self.token_to_id = {} self.id_to_token = {} ### END SOLUTION def _get_word_tokens(self, word: str) -> List[str]: """ Convert word to list of characters with end-of-word marker. TODO: Tokenize word into character sequence APPROACH: 1. Split word into characters 2. Add marker to last character 3. Return list of tokens EXAMPLE: >>> tokenizer._get_word_tokens("hello") ['h', 'e', 'l', 'l', 'o'] """ ### BEGIN SOLUTION if not word: return [] tokens = list(word) tokens[-1] += '' # Mark end of word return tokens ### END SOLUTION def _get_pairs(self, word_tokens: List[str]) -> Set[Tuple[str, str]]: """ Get all adjacent pairs from word tokens. TODO: Extract all consecutive character pairs APPROACH: 1. Iterate through adjacent tokens 2. Create pairs of consecutive tokens 3. Return set of unique pairs EXAMPLE: >>> tokenizer._get_pairs(['h', 'e', 'l', 'l', 'o']) {('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')} """ ### BEGIN SOLUTION pairs = set() for i in range(len(word_tokens) - 1): pairs.add((word_tokens[i], word_tokens[i + 1])) return pairs ### END SOLUTION def train(self, corpus: List[str], vocab_size: int = None) -> None: """ Train BPE on corpus to learn merge rules. TODO: Implement BPE training algorithm APPROACH: 1. Build initial character vocabulary 2. 
Count word frequencies in corpus 3. Iteratively merge most frequent pairs 4. Build final vocabulary and mappings HINTS: - Start with character-level tokens - Use frequency counts to guide merging - Stop when vocabulary reaches target size """ ### BEGIN SOLUTION if vocab_size: self.vocab_size = vocab_size # Count word frequencies word_freq = Counter(corpus) # Initialize vocabulary with characters vocab = set() word_tokens = {} for word in word_freq: tokens = self._get_word_tokens(word) word_tokens[word] = tokens vocab.update(tokens) # Convert to sorted list for consistency self.vocab = sorted(list(vocab)) # Add special tokens if '' not in self.vocab: self.vocab = [''] + self.vocab # Learn merges self.merges = [] while len(self.vocab) < self.vocab_size: # Count all pairs across all words pair_counts = Counter() for word, freq in word_freq.items(): tokens = word_tokens[word] pairs = self._get_pairs(tokens) for pair in pairs: pair_counts[pair] += freq if not pair_counts: break # Get most frequent pair best_pair = pair_counts.most_common(1)[0][0] # Merge this pair in all words for word in word_tokens: tokens = word_tokens[word] new_tokens = [] i = 0 while i < len(tokens): if (i < len(tokens) - 1 and tokens[i] == best_pair[0] and tokens[i + 1] == best_pair[1]): # Merge pair new_tokens.append(best_pair[0] + best_pair[1]) i += 2 else: new_tokens.append(tokens[i]) i += 1 word_tokens[word] = new_tokens # Add merged token to vocabulary merged_token = best_pair[0] + best_pair[1] self.vocab.append(merged_token) self.merges.append(best_pair) # Build final mappings self._build_mappings() ### END SOLUTION def _build_mappings(self): """Build token-to-ID and ID-to-token mappings.""" ### BEGIN SOLUTION self.token_to_id = {token: idx for idx, token in enumerate(self.vocab)} self.id_to_token = {idx: token for idx, token in enumerate(self.vocab)} ### END SOLUTION def _apply_merges(self, tokens: List[str]) -> List[str]: """ Apply learned merge rules to token sequence. 
TODO: Apply BPE merges to token list APPROACH: 1. Start with character-level tokens 2. Apply each merge rule in order 3. Continue until no more merges possible """ ### BEGIN SOLUTION if not self.merges: return tokens for merge_pair in self.merges: new_tokens = [] i = 0 while i < len(tokens): if (i < len(tokens) - 1 and tokens[i] == merge_pair[0] and tokens[i + 1] == merge_pair[1]): # Apply merge new_tokens.append(merge_pair[0] + merge_pair[1]) i += 2 else: new_tokens.append(tokens[i]) i += 1 tokens = new_tokens return tokens ### END SOLUTION def encode(self, text: str) -> List[int]: """ Encode text using BPE. TODO: Apply BPE encoding to text APPROACH: 1. Split text into words 2. Convert each word to character tokens 3. Apply BPE merges 4. Convert to token IDs """ ### BEGIN SOLUTION if not self.vocab: return [] # Simple word splitting (could be more sophisticated) words = text.split() all_tokens = [] for word in words: # Get character-level tokens word_tokens = self._get_word_tokens(word) # Apply BPE merges merged_tokens = self._apply_merges(word_tokens) all_tokens.extend(merged_tokens) # Convert to IDs token_ids = [] for token in all_tokens: token_ids.append(self.token_to_id.get(token, 0)) # 0 = return token_ids ### END SOLUTION def decode(self, tokens: List[int]) -> str: """ Decode token IDs back to text. TODO: Convert token IDs back to readable text APPROACH: 1. Convert IDs to tokens 2. Join tokens together 3. 
Clean up word boundaries and markers """ ### BEGIN SOLUTION if not self.id_to_token: return "" # Convert IDs to tokens token_strings = [] for token_id in tokens: token = self.id_to_token.get(token_id, '') token_strings.append(token) # Join and clean up text = ''.join(token_strings) # Replace end-of-word markers with spaces text = text.replace('', ' ') # Clean up extra spaces text = ' '.join(text.split()) return text ### END SOLUTION # %% nbgrader={"grade": true, "grade_id": "test-bpe-tokenizer", "locked": true, "points": 20} def test_unit_bpe_tokenizer(): """πŸ”¬ Test BPE tokenizer implementation.""" print("πŸ”¬ Unit Test: BPE Tokenizer...") # Test basic functionality with simple corpus corpus = ["hello", "world", "hello", "hell"] # "hell" and "hello" share prefix tokenizer = BPETokenizer(vocab_size=20) tokenizer.train(corpus) # Check that vocabulary was built assert len(tokenizer.vocab) > 0 assert '' in tokenizer.vocab # Test helper functions word_tokens = tokenizer._get_word_tokens("test") assert word_tokens[-1].endswith(''), "Should have end-of-word marker" pairs = tokenizer._get_pairs(['h', 'e', 'l', 'l', 'o']) assert ('h', 'e') in pairs assert ('l', 'l') in pairs # Test encoding/decoding text = "hello" tokens = tokenizer.encode(text) assert isinstance(tokens, list) assert all(isinstance(t, int) for t in tokens) decoded = tokenizer.decode(tokens) assert isinstance(decoded, str) # Test round-trip on training data should work well for word in corpus: tokens = tokenizer.encode(word) decoded = tokenizer.decode(tokens) # Allow some flexibility due to BPE merging assert len(decoded.strip()) > 0 print("βœ… BPE tokenizer works correctly!") test_unit_bpe_tokenizer() # %% [markdown] """ ### πŸ§ͺ BPE Tokenizer Analysis BPE provides a balance between vocabulary size and sequence length. By learning frequent subword patterns, it can handle new words through decomposition while maintaining reasonable sequence lengths. 
``` BPE Merging Visualization: Original: "tokenization" β†’ ['t','o','k','e','n','i','z','a','t','i','o','n',''] ↓ Merge frequent pairs Step 1: ('t','o') is frequent β†’ ['to','k','e','n','i','z','a','t','i','o','n',''] Step 2: ('i','o') is frequent β†’ ['to','k','e','n','io','z','a','t','io','n',''] Step 3: ('io','n') is frequent β†’ ['to','k','e','n','io','z','a','t','ion',''] Step 4: ('to','k') is frequent β†’ ['tok','e','n','io','z','a','t','ion',''] ↓ Continue merging... Final: "tokenization" β†’ ['token','ization'] # 2 tokens vs 13 characters! ``` **Key insights**: - **Adaptive vocabulary**: Learns from data, not hand-crafted - **Subword robustness**: Handles rare/new words through decomposition - **Efficiency trade-off**: Larger vocabulary β†’ shorter sequences β†’ faster processing - **Morphological awareness**: Naturally discovers prefixes, suffixes, roots """ # %% [markdown] """ ## 4. Integration - Bringing It Together Now let's build utility functions that make tokenization easy to use in practice. These tools will help you tokenize datasets, analyze performance, and choose the right strategy. ``` Tokenization Workflow: 1. Choose Strategy β†’ 2. Train Tokenizer β†’ 3. Process Dataset β†’ 4. Analyze Results ↓ ↓ ↓ ↓ char/bpe corpus training batch encoding stats/metrics ``` """ # %% nbgrader={"grade": false, "grade_id": "tokenization-utils", "solution": true} def create_tokenizer(strategy: str = "char", vocab_size: int = 1000, corpus: List[str] = None) -> Tokenizer: """ Factory function to create and train tokenizers. TODO: Create appropriate tokenizer based on strategy APPROACH: 1. Check strategy type 2. Create appropriate tokenizer class 3. Train on corpus if provided 4. 
Return configured tokenizer EXAMPLE: >>> corpus = ["hello world", "test text"] >>> tokenizer = create_tokenizer("char", corpus=corpus) >>> tokens = tokenizer.encode("hello") """ ### BEGIN SOLUTION if strategy == "char": tokenizer = CharTokenizer() if corpus: tokenizer.build_vocab(corpus) elif strategy == "bpe": tokenizer = BPETokenizer(vocab_size=vocab_size) if corpus: tokenizer.train(corpus, vocab_size) else: raise ValueError(f"Unknown tokenization strategy: {strategy}") return tokenizer ### END SOLUTION def tokenize_dataset(texts: List[str], tokenizer: Tokenizer, max_length: int = None) -> List[List[int]]: """ Tokenize a dataset with optional length limits. TODO: Tokenize all texts with consistent preprocessing APPROACH: 1. Encode each text with the tokenizer 2. Apply max_length truncation if specified 3. Return list of tokenized sequences HINTS: - Handle empty texts gracefully - Truncate from the end if too long """ ### BEGIN SOLUTION tokenized = [] for text in texts: tokens = tokenizer.encode(text) # Apply length limit if max_length and len(tokens) > max_length: tokens = tokens[:max_length] tokenized.append(tokens) return tokenized ### END SOLUTION def analyze_tokenization(texts: List[str], tokenizer: Tokenizer) -> Dict[str, float]: """ Analyze tokenization statistics. TODO: Compute useful statistics about tokenization APPROACH: 1. Tokenize all texts 2. Compute sequence length statistics 3. Calculate compression ratio 4. 
Return analysis dictionary """ ### BEGIN SOLUTION all_tokens = [] total_chars = 0 for text in texts: tokens = tokenizer.encode(text) all_tokens.extend(tokens) total_chars += len(text) # Calculate statistics tokenized_lengths = [len(tokenizer.encode(text)) for text in texts] stats = { 'vocab_size': tokenizer.vocab_size if hasattr(tokenizer, 'vocab_size') else len(tokenizer.vocab), 'avg_sequence_length': np.mean(tokenized_lengths), 'max_sequence_length': max(tokenized_lengths) if tokenized_lengths else 0, 'total_tokens': len(all_tokens), 'compression_ratio': total_chars / len(all_tokens) if all_tokens else 0, 'unique_tokens': len(set(all_tokens)) } return stats ### END SOLUTION # %% nbgrader={"grade": true, "grade_id": "test-tokenization-utils", "locked": true, "points": 10} def test_unit_tokenization_utils(): """πŸ”¬ Test tokenization utility functions.""" print("πŸ”¬ Unit Test: Tokenization Utils...") # Test tokenizer factory corpus = ["hello world", "test text", "more examples"] char_tokenizer = create_tokenizer("char", corpus=corpus) assert isinstance(char_tokenizer, CharTokenizer) assert char_tokenizer.vocab_size > 0 bpe_tokenizer = create_tokenizer("bpe", vocab_size=50, corpus=corpus) assert isinstance(bpe_tokenizer, BPETokenizer) # Test dataset tokenization texts = ["hello", "world", "test"] tokenized = tokenize_dataset(texts, char_tokenizer, max_length=10) assert len(tokenized) == len(texts) assert all(len(seq) <= 10 for seq in tokenized) # Test analysis stats = analyze_tokenization(texts, char_tokenizer) assert 'vocab_size' in stats assert 'avg_sequence_length' in stats assert 'compression_ratio' in stats assert stats['total_tokens'] > 0 print("βœ… Tokenization utils work correctly!") test_unit_tokenization_utils() # %% [markdown] """ ## 5. Systems Analysis - Tokenization Trade-offs Understanding the performance implications of different tokenization strategies is crucial for building efficient NLP systems. 
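
The memory side of these trade-offs is simple arithmetic: the embedding table a
tokenizer implies costs vocab_size × embed_dim parameters. A small sketch (the
512-dim, float32, decimal-megabyte figures are illustrative assumptions):

```python
def embedding_table_mb(vocab_size: int, embed_dim: int = 512,
                       bytes_per_param: int = 4) -> float:
    """Estimate embedding-table memory in MB (float32, decimal megabytes)."""
    return vocab_size * embed_dim * bytes_per_param / 1e6

for vocab in (100, 1_000, 50_000):
    print(f"vocab {vocab:>6}: {embedding_table_mb(vocab):.1f} MB")
# vocab 100 → 0.2 MB, vocab 1000 → 2.0 MB, vocab 50000 → 102.4 MB
```

Doubling the vocabulary doubles the table, while sequence length (and therefore
per-batch compute) moves in the opposite direction.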
""" # %% nbgrader={"grade": false, "grade_id": "tokenization-analysis", "solution": true} def analyze_tokenization_strategies(): """πŸ“Š Compare different tokenization strategies on various texts.""" print("πŸ“Š Analyzing Tokenization Strategies...") # Create test corpus with different text types corpus = [ "Hello world", "The quick brown fox jumps over the lazy dog", "Machine learning is transforming artificial intelligence", "Tokenization is fundamental to natural language processing", "Subword units balance vocabulary size and sequence length" ] # Test different strategies strategies = [ ("Character", create_tokenizer("char", corpus=corpus)), ("BPE-100", create_tokenizer("bpe", vocab_size=100, corpus=corpus)), ("BPE-500", create_tokenizer("bpe", vocab_size=500, corpus=corpus)) ] print(f"{'Strategy':<12} {'Vocab':<8} {'Avg Len':<8} {'Compression':<12} {'Coverage':<10}") print("-" * 60) for name, tokenizer in strategies: stats = analyze_tokenization(corpus, tokenizer) print(f"{name:<12} {stats['vocab_size']:<8} " f"{stats['avg_sequence_length']:<8.1f} " f"{stats['compression_ratio']:<12.2f} " f"{stats['unique_tokens']:<10}") print("\nπŸ’‘ Key Insights:") print("- Character tokenization: Small vocab, long sequences, perfect coverage") print("- BPE: Larger vocab trades off with shorter sequences") print("- Higher compression ratio = more characters per token = efficiency") analyze_tokenization_strategies() # %% [markdown] """ ### πŸ“Š Performance Analysis: Vocabulary Size vs Sequence Length The fundamental trade-off in tokenization creates a classic systems engineering challenge: ``` Tokenization Trade-off Spectrum: Character BPE-Small BPE-Large Word-Level vocab: ~100 β†’ vocab: ~1K β†’ vocab: ~50K β†’ vocab: ~100K+ seq: very long β†’ seq: long β†’ seq: medium β†’ seq: short memory: low β†’ memory: med β†’ memory: high β†’ memory: very high compute: high β†’ compute: med β†’ compute: low β†’ compute: very low coverage: 100% β†’ coverage: 99% β†’ coverage: 95% β†’ 
coverage: <80% ``` **Character tokenization (vocab ~100)**: - Pro: Universal coverage, simple implementation, small embedding table - Con: Long sequences (high compute), limited semantic units - Use case: Morphologically rich languages, robust preprocessing **BPE tokenization (vocab 10K-50K)**: - Pro: Balanced efficiency, handles morphology, good coverage - Con: Training complexity, domain-specific vocabularies - Use case: Most modern language models (GPT, BERT family) **Real-world scaling examples**: ``` GPT-3/4: ~50K BPE tokens, avg 3-4 chars/token BERT: ~30K WordPiece tokens, avg 4-5 chars/token T5: ~32K SentencePiece tokens, handles 100+ languages ChatGPT: ~100K tokens with extended vocabulary ``` **Memory implications for embedding tables**: ``` Tokenizer Vocab Size Embed Dim Parameters Memory (fp32) Character 100 512 51K 204 KB BPE-1K 1,000 512 512K 2.0 MB BPE-50K 50,000 512 25.6M 102.4 MB Word-100K 100,000 512 51.2M 204.8 MB ``` """ # %% [markdown] """ ## 6. Module Integration Test Let's test our complete tokenization system to ensure everything works together. """ # %% nbgrader={"grade": true, "grade_id": "test-module", "locked": true, "points": 20} def test_module(): """ Comprehensive test of entire tokenization module. 
This final test runs before module summary to ensure: - All unit tests pass - Functions work together correctly - Module is ready for integration with TinyTorch """ print("πŸ§ͺ RUNNING MODULE INTEGRATION TEST") print("=" * 50) # Run all unit tests print("Running unit tests...") test_unit_base_tokenizer() test_unit_char_tokenizer() test_unit_bpe_tokenizer() test_unit_tokenization_utils() print("\nRunning integration scenarios...") # Test realistic tokenization workflow print("πŸ”¬ Integration Test: Complete tokenization pipeline...") # Create training corpus training_corpus = [ "Natural language processing", "Machine learning models", "Neural networks learn", "Tokenization enables text processing", "Embeddings represent meaning" ] # Train different tokenizers char_tokenizer = create_tokenizer("char", corpus=training_corpus) bpe_tokenizer = create_tokenizer("bpe", vocab_size=200, corpus=training_corpus) # Test on new text test_text = "Neural language models" # Test character tokenization char_tokens = char_tokenizer.encode(test_text) char_decoded = char_tokenizer.decode(char_tokens) assert char_decoded == test_text, "Character round-trip failed" # Test BPE tokenization (may not be exact due to subword splits) bpe_tokens = bpe_tokenizer.encode(test_text) bpe_decoded = bpe_tokenizer.decode(bpe_tokens) assert len(bpe_decoded.strip()) > 0, "BPE decoding failed" # Test dataset processing test_dataset = ["hello world", "tokenize this", "neural networks"] char_dataset = tokenize_dataset(test_dataset, char_tokenizer, max_length=20) bpe_dataset = tokenize_dataset(test_dataset, bpe_tokenizer, max_length=10) assert len(char_dataset) == len(test_dataset) assert len(bpe_dataset) == len(test_dataset) assert all(len(seq) <= 20 for seq in char_dataset) assert all(len(seq) <= 10 for seq in bpe_dataset) # Test analysis functions char_stats = analyze_tokenization(test_dataset, char_tokenizer) bpe_stats = analyze_tokenization(test_dataset, bpe_tokenizer) assert char_stats['vocab_size'] 
> 0 assert bpe_stats['vocab_size'] > 0 assert char_stats['compression_ratio'] < bpe_stats['compression_ratio'] # BPE should compress better print("βœ… End-to-end tokenization pipeline works!") print("\n" + "=" * 50) print("πŸŽ‰ ALL TESTS PASSED! Module ready for export.") print("Run: tito module complete 10") # Call the comprehensive test test_module() # %% if __name__ == "__main__": print("πŸš€ Running Tokenization module...") test_module() print("βœ… Module validation complete!") # %% [markdown] """ ## πŸ€” ML Systems Thinking: Text Processing Foundations ### Question 1: Vocabulary Size vs Memory You implemented tokenizers with different vocabulary sizes. If you have a BPE tokenizer with vocab_size=50,000 and embed_dim=512: - How many parameters are in the embedding table? _____ million - If using float32, how much memory does this embedding table require? _____ MB ### Question 2: Sequence Length Trade-offs Your character tokenizer produces longer sequences than BPE. For the text "machine learning" (16 characters): - Character tokenizer produces ~16 tokens - BPE tokenizer might produce ~3-4 tokens If processing batch_size=32 with max_length=512: - Character model needs _____ total tokens per batch - BPE model needs _____ total tokens per batch - Which requires more memory during training? _____ ### Question 3: Tokenization Coverage Your BPE tokenizer handles unknown words by decomposing into subwords. - Why is this better than word-level tokenization for real applications? _____ - What happens to model performance when many tokens map to ? _____ - How does vocabulary size affect the number of unknown decompositions? _____ """ # %% [markdown] """ ## 🎯 MODULE SUMMARY: Tokenization Congratulations! You've built a complete tokenization system for converting text to numerical representations! 
### Key Accomplishments - Built character-level tokenizer with perfect text coverage - Implemented BPE tokenizer that learns efficient subword representations - Created vocabulary management and encoding/decoding systems - Discovered the vocabulary size vs sequence length trade-off - All tests pass βœ… (validated by `test_module()`) ### Ready for Next Steps Your tokenization implementation enables text processing for language models. Export with: `tito module complete 10` **Next**: Module 11 will add learnable embeddings that convert your token IDs into rich vector representations! """