Mirror of https://github.com/MLSysBook/TinyTorch.git (synced 2026-05-25 10:21:47 -05:00)
Following module developer guidelines, added comprehensive visual diagrams:

1. Text-to-Numbers Pipeline (Introduction):
   - Added full boxed diagram showing 4-step tokenization process
   - Clear visual flow from human text to numerical IDs
   - Each step explained inline with the diagram

2. Character Tokenization Process:
   - Step-by-step vocabulary building visualization
   - Shows corpus → unique chars → vocab with IDs
   - Encoding process with ID lookup visualization
   - Decoding process with reverse lookup
   - All in clear nested boxes

3. BPE Training Algorithm:
   - Comprehensive 4-step process with nested boxes
   - Pair frequency analysis with bar charts (████)
   - Before/After merge visualizations
   - Iteration examples showing vocabulary growth
   - Final results with key insights

4. Memory Layout for Embedding Tables:
   - Visual bars showing relative memory sizes
   - Character (204KB) vs BPE-50K (102MB) vs Word-100K (204MB)
   - Shows fp32/fp16/int8 precision trade-offs
   - Real production model examples (GPT-2/3, BERT, T5, LLaMA)
   - Clear table format for comparison

Educational improvements:
- More visual, less text-heavy
- Clearer step-by-step flows
- Better intuition building
- Production context throughout
- Following module developer ASCII diagram patterns

Students now see:
- HOW tokenization works (not just WHAT)
- WHY different strategies exist
- WHAT the memory implications are
- HOW production models make these choices
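
The embedding-table sizes quoted above (204KB, 102MB, 204MB) can be reproduced with a short calculation. The sketch below is illustrative only and is not taken from the module: it assumes an embedding dimension of 512, a 100-symbol character vocabulary, and decimal units (1 KB = 1000 bytes), none of which are stated in this commit. The helper names `embedding_table_bytes` and `human` are hypothetical.

```python
# Assumptions (not stated in the commit): embedding dimension 512,
# a 100-symbol character vocabulary, decimal units (1 KB = 1000 bytes).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def embedding_table_bytes(vocab_size, embed_dim, precision="fp32"):
    """Bytes needed to store a vocab_size x embed_dim embedding table."""
    return vocab_size * embed_dim * BYTES_PER_PARAM[precision]


def human(n_bytes):
    """Format a byte count using decimal KB/MB."""
    for unit, scale in (("MB", 1_000_000), ("KB", 1_000)):
        if n_bytes >= scale:
            return f"{n_bytes / scale:.1f} {unit}"
    return f"{n_bytes} B"


EMBED_DIM = 512  # assumed; chosen because it matches the quoted sizes
VOCABS = {"Character": 100, "BPE-50K": 50_000, "Word-100K": 100_000}

for name, vocab_size in VOCABS.items():
    # One row per tokenization strategy, one column per precision.
    row = ", ".join(
        f"{p}: {human(embedding_table_bytes(vocab_size, EMBED_DIM, p))}"
        for p in BYTES_PER_PARAM
    )
    print(f"{name:<10} {row}")
```

Under these assumptions the fp32 column gives 204.8 KB, 102.4 MB, and 204.8 MB, which matches the figures in item 4. Each step down in precision (fp32 to fp16 to int8) halves the table size.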