fix: Add missing typing imports to Module 10 tokenization

Issue: CharTokenizer was failing with NameError: name 'List' is not defined
Root cause: typing imports were not marked with #| export

Fix:
- Added #| export directive to the typing import block in tokenization_dev.py
- Re-exported the module with 'tito export 10_tokenization'
- typing.List, Dict, Tuple, Optional, Set are now properly exported
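The fixed cell presumably takes the shape below (mirroring the hunk further down); the #| export marker is what tells the exporter to carry the cell into the generated module:

```python
#| export
# Without this directive the exporter skips the cell, so the generated
# module references List, Dict, etc. without ever importing them,
# which is exactly the NameError seen in CharTokenizer.
from typing import List, Dict, Tuple, Optional, Set
```

Note that #| export is a plain comment to the Python interpreter; only the exporter interprets it.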

Verification:
- CharTokenizer.build_vocab() builds the vocabulary without the NameError
- encode() and decode() round-trip correctly
- Tested on a Shakespeare sample text
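The verified behavior amounts to a character-level encode/decode round trip. A minimal sketch of that API (a hypothetical stand-in, not the actual Module 10 implementation):

```python
from typing import Dict, List

class CharTokenizer:
    """Minimal character-level tokenizer sketch for illustration."""

    def __init__(self) -> None:
        self.stoi: Dict[str, int] = {}  # char -> id
        self.itos: Dict[int, str] = {}  # id -> char

    def build_vocab(self, text: str) -> None:
        # Sorted set of unique characters gives a deterministic vocab.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text: str) -> List[int]:
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer()
tok.build_vocab("To be, or not to be")
assert tok.decode(tok.encode("to be")) == "to be"
```

Without the typing imports exported, the List/Dict annotations above are what raised the NameError at class-definition time.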

This fixes the integration with vaswani_shakespeare.py, which now uses
CharTokenizer from Module 10 instead of manual tokenization.
Vijay Janapa Reddi
2025-10-28 09:44:24 -04:00
parent 876d3406a0
commit 62636fa92a
3 changed files with 246 additions and 84 deletions

@@ -21,6 +21,16 @@ __all__ = ['Tokenizer', 'CharTokenizer', 'BPETokenizer']
#| default_exp text.tokenization
#| export
# %% ../../modules/source/10_tokenization/tokenization_dev.ipynb 3
import numpy as np
from typing import List, Dict, Tuple, Optional, Set
import json
import re
from collections import defaultdict, Counter
# Import only Module 01 (Tensor) - this module has minimal dependencies
from ..core.tensor import Tensor
# %% ../../modules/source/10_tokenization/tokenization_dev.ipynb 8
class Tokenizer:
"""