TinyTorch

mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-05-25 10:21:47 -05:00

Vijay Janapa Reddi fdeb707b02 docs: Improve tokenization module with enhanced ASCII diagrams

Following module developer guidelines, added comprehensive visual diagrams:

1. Text-to-Numbers Pipeline (Introduction):
   - Added full boxed diagram showing 4-step tokenization process
   - Clear visual flow from human text to numerical IDs
   - Each step explained inline with the diagram

2. Character Tokenization Process:
   - Step-by-step vocabulary building visualization
   - Shows corpus → unique chars → vocab with IDs
   - Encoding process with ID lookup visualization
   - Decoding process with reverse lookup
   - All in clear nested boxes

3. BPE Training Algorithm:
   - Comprehensive 4-step process with nested boxes
   - Pair frequency analysis with bar charts (████)
   - Before/After merge visualizations
   - Iteration examples showing vocabulary growth
   - Final results with key insights

4. Memory Layout for Embedding Tables:
   - Visual bars showing relative memory sizes
   - Character (204KB) vs BPE-50K (102MB) vs Word-100K (204MB)
   - Shows fp32/fp16/int8 precision trade-offs
   - Real production model examples (GPT-2/3, BERT, T5, LLaMA)
   - Clear table format for comparison
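
The memory figures in item 4 can be reproduced with simple arithmetic. The sketch below is an illustration, not code from the module: it assumes a 512-dimensional embedding, a 100-symbol character vocabulary, and decimal units (1 KB = 1000 bytes), none of which are stated in this commit. The helper names are hypothetical. Under those assumptions, the fp32 sizes come out at about 204.8 KB, 102.4 MB, and 204.8 MB, consistent with the rounded values quoted above.

```python
# Rough embedding-table memory estimate: vocab_size * embed_dim * bytes per parameter.
# ASSUMPTIONS (not stated in the commit): embed_dim = 512, character vocab = 100 symbols,
# decimal units (1 KB = 1000 bytes). All names here are hypothetical.

EMBED_DIM = 512
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}
VOCAB_SIZES = {"Character": 100, "BPE-50K": 50_000, "Word-100K": 100_000}


def embedding_table_bytes(vocab_size, embed_dim, bytes_per_param):
    """Return the size in bytes of a dense embedding table."""
    return vocab_size * embed_dim * bytes_per_param


def format_size(num_bytes):
    """Format a byte count using decimal KB/MB."""
    if num_bytes >= 1e6:
        return f"{num_bytes / 1e6:.1f}MB"
    return f"{num_bytes / 1e3:.1f}KB"


if __name__ == "__main__":
    for name, vocab_size in VOCAB_SIZES.items():
        sizes = [
            f"{precision}={format_size(embedding_table_bytes(vocab_size, EMBED_DIM, nbytes))}"
            for precision, nbytes in BYTES_PER_PARAM.items()
        ]
        print(f"{name:<10} " + "  ".join(sizes))
```

Halving the precision halves the table size, so the choice between fp32, fp16, and int8 scales every row of the comparison by the same factor.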

Educational improvements:
- More visual, less text-heavy
- Clearer step-by-step flows
- Better intuition building
- Production context throughout
- Consistent with the module developer ASCII diagram patterns

Students now see:
- HOW tokenization works (not just WHAT)
- WHY different strategies exist
- WHAT the memory implications are
- HOW production models make these choices
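
As a rough illustration of the "how" described above, the following minimal sketch shows character-level vocabulary building, encoding, and decoding, followed by a single BPE merge step. It is not taken from tokenization_dev.py; every function name is hypothetical, and the BPE part is simplified to one flat token sequence rather than a word-frequency table.

```python
from collections import Counter

# Hypothetical helpers for illustration only; not the TinyTorch module's API.


def build_char_vocab(corpus):
    """Corpus -> unique chars -> vocab with IDs (sorted for determinism)."""
    chars = sorted(set("".join(corpus)))
    return {ch: idx for idx, ch in enumerate(chars)}


def encode(text, char_to_id):
    """Look up the ID of each character; raises KeyError for unseen characters."""
    return [char_to_id[ch] for ch in text]


def decode(ids, char_to_id):
    """Reverse lookup from IDs back to characters."""
    id_to_char = {idx: ch for ch, idx in char_to_id.items()}
    return "".join(id_to_char[idx] for idx in ids)


def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one, or None."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    vocab = build_char_vocab(["hello"])
    ids = encode("hello", vocab)
    print(vocab, ids, decode(ids, vocab))

    tokens = list("aaabdaaabac")
    best = most_frequent_pair(tokens)
    print(best, merge_pair(tokens, best))
```

Repeating the count-and-merge step grows the vocabulary by one token per iteration, which is the trade-off between sequence length and vocabulary size that the BPE diagrams visualize.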

2025-10-24 17:51:11 -04:00

tokenization_dev.ipynb

feat: Complete transformer integration with milestones

2025-10-19 12:46:58 -04:00

tokenization_dev.py

docs: Improve tokenization module with enhanced ASCII diagrams

2025-10-24 17:51:11 -04:00