Merge transformer-training into dev

Complete Milestone 05 - 2017 Transformer implementation Major Features: - TinyTalks interactive dashboard with rich CLI - Complete gradient flow fixes (13 tests passing) - Multiple training examples (5-min, 10-min, levels 1-2) - Milestone celebration card (perceptron style) - Comprehensive documentation Gradient Flow Fixes: - Fixed reshape, matmul (3D), embedding, sqrt, mean, sub, div, GELU - All transformer components now fully differentiable - Hybrid attention approach for educational clarity + gradients Training Results: - 10-min training: 96.6% loss improvement, 62.5% accuracy - 5-min training: 97.8% loss improvement, 66.7% accuracy - Working chatbot with coherent responses Files Added: - tinytalks_dashboard.py (main demo) - tinytalks_chatbot.py, tinytalks_dataset.py - level1_memorization.py, level2_patterns.py - Comprehensive docs and test suites Ready for student use 2>&1
2026-06-05 07:25:52 -05:00 · 2025-10-30 17:48:11 -04:00
parent ca93669fbc 330e1738db
commit 15d3ed5251
36 changed files with 7365 additions and 2240 deletions
--- a/tinytorch/text/tokenization.py
+++ b/tinytorch/text/tokenization.py
@@ -1,25 +1,14 @@
-# ╔═══════════════════════════════════════════════════════════════════════════════╗
-# ║                        🚨 CRITICAL WARNING 🚨                                ║
-# ║                     AUTOGENERATED! DO NOT EDIT!                              ║
-# ║                                                                               ║
-# ║  This file is AUTOMATICALLY GENERATED from source modules.                   ║
-# ║  ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported!            ║
-# ║                                                                               ║
-# ║  ✅ TO EDIT: modules/source/XX_tokenization/tokenization_dev.py     ║
-# ║  ✅ TO EXPORT: Run 'tito module complete <module_name>'                      ║
-# ║                                                                               ║
-# ║  🛡️ STUDENT PROTECTION: This file contains optimized implementations.        ║
-# ║     Editing it directly may break module functionality and training.         ║
-# ║                                                                               ║
-# ║  🎓 LEARNING TIP: Work in modules/source/ - that's where real development    ║
-# ║     happens! The tinytorch/ directory is just the compiled output.           ║
-# ╚═══════════════════════════════════════════════════════════════════════════════╝
+# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/10_tokenization/tokenization_dev.ipynb.
+
 # %% auto 0
 __all__ = ['Tokenizer', 'CharTokenizer', 'BPETokenizer']

 # %% ../../modules/source/10_tokenization/tokenization_dev.ipynb 0
-#| default_exp text.tokenization
-#| export
+import numpy as np
+from typing import List, Dict, Tuple, Optional, Set
+import json
+import re
+from collections import defaultdict, Counter

 # %% ../../modules/source/10_tokenization/tokenization_dev.ipynb 3
 import numpy as np