diff --git a/capstone-ideas/README.md b/capstone-ideas/README.md
deleted file mode 100644
index 291e1a81..00000000
--- a/capstone-ideas/README.md
+++ /dev/null
@@ -1,185 +0,0 @@
-# 🎓 TinyTorch Capstone Project Ideas
-
-## **Background: The Capstone Design Problem**
-
-**Original Issue**: Module 20 was "TinyGPT Capstone" but students can already build TinyGPT after Module 13 (Transformers). This made:
-- Modules 14-19 (optimization) feel like "optional extras"
-- Module 20 anticlimactic ("TinyGPT again?")
-- No integration of crucial systems engineering skills
-
-**Solution Requirements**:
-- Must integrate ALL modules 1-19 (especially optimization modules 14-19)
-- Must be genuinely exciting and different
-- Must demonstrate complete ML systems engineering mastery
-- Must create portfolio-worthy deliverables
-
----
-
-## **🏆 RECOMMENDED: AI Olympics Competition**
-
-**📁 See: [ai-olympics.md](ai-olympics.md)**
-
-**Core Concept**: Competitive leaderboard where students optimize TinyTorch models across systems engineering dimensions.
-
-**Why This is Best**:
-- ✅ **Natural motivation**: Students want to rank high on leaderboards
-- ✅ **Systems focus**: Compete on speed, memory, efficiency - not just accuracy
-- ✅ **Community building**: Creates ongoing engagement and peer interaction
-- ✅ **Portfolio impact**: "I ranked #3 in TinyTorch AI Olympics" is compelling
-- ✅ **Forces optimization**: ALL modules 14-19 become essential for competitive performance
-
-**Competition Categories**:
-- 🏃‍♂️ **Speed Demon**: Fastest inference
-- 💾 **Memory Miser**: Smallest memory footprint
-- 📱 **Edge Expert**: Best Raspberry Pi performance
-- 🔋 **Energy Efficient**: Lowest power consumption
-- 🏆 **TinyMLPerf**: Overall benchmark champion
-
----
-
-## **🛠️ Alternative Ideas Considered**
-
-### **1. 
Edge AI Deployment System**
-**Concept**: Deploy optimized neural networks to actual edge hardware (Raspberry Pi)
-
-**Pros**:
-- Integrates all optimization modules (essential for edge constraints)
-- Creates tangible deliverable ("I run neural networks on a $35 computer")
-- Teaches real-world deployment challenges
-
-**Cons**:
-- Individual project (no community/competition aspect)
-- Hardware dependencies (students need Pi)
-- Less motivating than competition
-
-### **2. Multi-Modal AI Assistant**
-**Concept**: Combine vision (CNNs) + language (transformers) + optimization for real-time performance
-
-**Pros**:
-- Showcases multiple architectures working together
-- Demonstrates practical AI applications
-- Requires optimization for real-time performance
-
-**Cons**:
-- Complex scope potentially overwhelming
-- Optimization feels secondary to "getting it working"
-- Limited portfolio differentiation
-
-### **3. ML Performance Laboratory**
-**Concept**: Comprehensive benchmarking suite comparing different ML frameworks
-
-**Pros**:
-- Heavy focus on profiling and benchmarking skills
-- Creates useful tool for community
-- Deep systems engineering focus
-
-**Cons**:
-- More about measurement than optimization
-- Limited creative expression for students
-- May feel academic rather than practical
-
-### **4. Neural Architecture Search**
-**Concept**: Automated model design and optimization system
-
-**Pros**:
-- Cutting-edge research area
-- Requires sophisticated optimization
-- Highly technical achievement
-
-**Cons**:
-- Very advanced, may be beyond course scope
-- Optimization becomes means rather than end
-- Difficult to assess fairly
-
-### **5. 
Distributed Training System**
-**Concept**: Multi-GPU/multi-node training infrastructure
-
-**Pros**:
-- Advanced systems engineering skills
-- High industry relevance
-- Impressive technical achievement
-
-**Cons**:
-- Requires expensive hardware
-- Complex debugging and setup
-- May overshadow core ML concepts
-
-### **6. ML Model Marketplace**
-**Concept**: Complete system for sharing/deploying/optimizing models (like Hugging Face)
-
-**Pros**:
-- Full-stack systems engineering
-- Practical deployment focus
-- Creates useful community resource
-
-**Cons**:
-- Web development skills needed
-- Broad scope potentially unfocused
-- Less emphasis on optimization techniques
-
----
-
-## **📊 Evaluation Criteria**
-
-| Criteria | AI Olympics | Edge Deployment | Multi-Modal | ML Lab | NAS | Distributed | Marketplace |
-|----------|-------------|-----------------|-------------|--------|-----|-------------|-------------|
-| **Integrates All Modules** | ✅✅✅ | ✅✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| **Student Motivation** | ✅✅✅ | ✅ | ✅ | ⚠️ | ⚠️ | ⚠️ | ✅ |
-| **Portfolio Impact** | ✅✅✅ | ✅✅ | ✅ | ✅ | ✅✅ | ✅✅ | ✅ |
-| **Systems Engineering Focus** | ✅✅✅ | ✅✅ | ✅ | ✅✅✅ | ✅ | ✅✅✅ | ✅ |
-| **Implementation Feasibility** | ✅✅ | ✅✅✅ | ✅ | ✅✅ | ⚠️ | ⚠️ | ✅ |
-| **Community Building** | ✅✅✅ | ⚠️ | ⚠️ | ✅ | ⚠️ | ⚠️ | ✅✅ |
-| **Scalability** | ✅✅✅ | ✅ | ✅ | ✅ | ⚠️ | ⚠️ | ✅ |
-
-**Legend**: ✅✅✅ Excellent, ✅✅ Good, ✅ Adequate, ⚠️ Challenging
-
----
-
-## **🎯 Final Recommendation**
-
-**AI Olympics** emerges as the clear winner because it:
-
-1. **Maximizes student motivation** through competitive leaderboards
-2. **Forces integration** of ALL optimization modules (14-19)
-3. **Creates lasting community** beyond individual course completion
-4. 
**Produces compelling portfolio artifacts** (leaderboard rankings)
-5. **Scales naturally** as more students participate
-6. **Emphasizes systems engineering** over algorithmic implementation
-
-### **Implementation Priority**
-1. **Phase 1**: Design and build leaderboard infrastructure
-2. **Phase 2**: Create standard benchmark evaluation suite
-3. **Phase 3**: Deploy beta version with small student cohort
-4. **Phase 4**: Full launch with all TinyTorch students
-
-### **Success Metrics**
-- **Participation Rate**: % of students who submit to multiple categories
-- **Optimization Depth**: Average number of techniques applied per submission
-- **Community Engagement**: Forum activity, peer collaboration, ongoing submissions
-- **Portfolio Impact**: Industry feedback on graduate capabilities
-
----
-
-## **📝 Notes for Implementation**
-
-### **Technical Requirements**
-- Automated submission and evaluation pipeline
-- Standard benchmark datasets and environments
-- Real-time leaderboard with rich visualizations
-- Robust measurement and scoring systems
-
-### **Educational Integration**
-- Clear rubrics linking competition performance to course grades
-- Structured optimization process through modules 14-19
-- Portfolio development guidance and templates
-- Peer review and collaboration opportunities
-
-### **Community Features**
-- Student profiles and achievement tracking
-- Optimization technique sharing and discussion
-- Mentorship connections between high performers and struggling students
-- Industry guest judging and feedback
-
----
-
-**🚀 The AI Olympics transforms TinyTorch from "just another ML course" into a competitive systems engineering community that motivates deep learning, creates lasting engagement, and produces industry-ready graduates.**
\ No newline at end of file
diff --git a/capstone-ideas/ai-olympics.md b/capstone-ideas/ai-olympics.md
deleted file mode 100644
index e4d33868..00000000
--- a/capstone-ideas/ai-olympics.md
+++ /dev/null
@@ 
-1,227 +0,0 @@
-# 🏅 AI Olympics: TinyTorch Systems Competition Capstone
-
-## **Core Concept: Compete on Systems Performance, Not Just Accuracy**
-
-Instead of individual projects, Module 20 becomes a **competitive leaderboard** where students optimize their TinyTorch models across multiple **systems engineering dimensions**.
-
-### **🎯 Why AI Olympics is Perfect for TinyTorch**
-
-- **Systems Focus**: Compete on memory, speed, efficiency - not just accuracy
-- **Real ML Engineering**: Production systems care about performance, not just "does it work"
-- **Leaderboard Motivation**: Students naturally want to rank high and beat peers
-- **Portfolio Value**: "I ranked #3 in TinyTorch AI Olympics" is impressive
-- **Community Building**: Creates ongoing engagement beyond the course
-
----
-
-## **🏆 Competition Categories**
-
-### **Category 1: Speed Demon** ⚡
-*"Fastest inference on standard hardware"*
-- **Metric**: Inferences per second on reference hardware
-- **Required Skills**: Modules 14-19 optimization techniques
-- **Constraint**: Must maintain >90% accuracy on test dataset
-
-### **Category 2: Memory Miser** 💾
-*"Smallest memory footprint"*
-- **Metric**: Peak memory usage during inference
-- **Required Skills**: Quantization, compression, efficient architectures
-- **Constraint**: Must maintain >85% accuracy on test dataset
-
-### **Category 3: Edge Expert** 📱
-*"Best performance on Raspberry Pi"*
-- **Metric**: Composite score (speed + accuracy + power efficiency)
-- **Required Skills**: ALL optimization modules for edge constraints
-- **Constraint**: Must actually run on Pi hardware
-
-### **Category 4: Energy Efficient** 🔋
-*"Lowest power consumption"*
-- **Metric**: Energy per inference (joules/prediction)
-- **Required Skills**: Model compression, efficient algorithms
-- **Constraint**: Must maintain competitive accuracy
-
-### **Category 5: TinyMLPerf** 🏃‍♂️
-*"Official MLPerf-style benchmark"*
-- **Metric**: Standardized 
benchmark suite performance
-- **Required Skills**: Complete systems optimization pipeline
-- **Constraint**: Must pass all benchmark compliance tests
-
----
-
-## **🎮 Competition Structure**
-
-### **Phase 1: Baseline Submission (Week 1)**
-- Submit working model from modules 1-13 (CNN, transformer, or multi-modal)
-- Get baseline scores across all categories
-- See where you rank on initial leaderboard
-
-### **Phase 2: Optimization Sprint (Weeks 2-4)**
-- Apply techniques from modules 14-19 systematically
-- **Module 14**: Profile and identify bottlenecks
-- **Module 15**: Implement acceleration techniques
-- **Module 16**: Add quantization for memory/speed
-- **Module 17**: Apply compression for size reduction
-- **Module 18**: Implement caching for inference speed
-- **Module 19**: Benchmark against production systems
-
-### **Phase 3: Final Submission & Olympics (Week 5)**
-- Submit optimized models to all relevant categories
-- **Live leaderboard updates** as submissions come in
-- **Victory ceremony** with category winners
-- **Portfolio artifacts**: Leaderboard rankings + optimization reports
-
----
-
-## **📊 Leaderboard & Scoring System**
-
-### **Public Leaderboard Features**
-```
-🏆 TinyTorch AI Olympics Leaderboard
-
-Speed Demon Category:
-1. alice_chen 847.3 inf/sec (95.2% acc) 🥇
-2. bob_smith 612.7 inf/sec (94.8% acc) 🥈
-3. carol_wong 588.1 inf/sec (96.1% acc) 🥉
-
-Memory Miser Category:
-1. dave_kim 12.4 MB (91.7% acc) 🥇
-2. eve_patel 15.8 MB (93.2% acc) 🥈
-3. frank_liu 18.2 MB (89.9% acc) 🥉
-```
-
-### **Scoring Methodology**
-- **Primary Metric**: Category-specific performance (speed, memory, etc.)
-- **Accuracy Threshold**: Must meet minimum accuracy to qualify
-- **Tie-Breaker**: Higher accuracy wins ties in primary metric
-- **Bonus Points**: Novel optimization techniques, exceptional documentation
-
-### **Awards & Recognition**
-- **🥇 Category Champions**: Top performer in each category
-- **🏆 Overall Systems Engineer**: Best combined performance across categories
-- **🚀 Innovation Award**: Most creative optimization approach
-- **📚 Teaching Award**: Best documented optimization process
-
----
-
-## **🎯 Required Deliverables**
-
-### **Competition Submission Package**
-1. **Optimized Model**: Runnable TinyTorch implementation
-2. **Performance Report**: Detailed analysis of optimization techniques applied
-3. **Reproduction Guide**: Clear instructions for others to run your solution
-4. **Systems Engineering Documentation**: What you learned about ML systems
-
-### **Portfolio Artifacts Students Get**
-- **Leaderboard ranking** across multiple categories
-- **Technical optimization report** demonstrating systems engineering skills
-- **Benchmark results** comparing their work to industry standards
-- **Peer recognition** from competitive performance
-
----
-
-## **🔧 Technical Infrastructure Needed**
-
-### **Leaderboard System**
-- Automated submission processing
-- Standard evaluation environment
-- Real-time ranking updates
-- Historical performance tracking
-
-### **Benchmark Suite**
-- Reference datasets for each category
-- Standard hardware for testing
-- Automated compliance checking
-- Performance measurement tools
-
-### **Submission Portal**
-- Code upload and validation
-- Automatic testing pipeline
-- Results processing and ranking
-- Student dashboard with progress
-
----
-
-## **📈 Why This Beats Individual Projects**
-
-### **Individual Project Problems:**
-- ❌ No motivation to optimize beyond "it works"
-- ❌ Hard to compare student achievements
-- ❌ No ongoing engagement after submission
-- ❌ Limited portfolio 
impact
-
-### **AI Olympics Advantages:**
-- ✅ **Natural optimization motivation**: Students want to rank higher
-- ✅ **Clear performance comparison**: Leaderboard shows relative achievement
-- ✅ **Ongoing engagement**: Leaderboard creates lasting community
-- ✅ **Strong portfolio impact**: "I ranked #2 in Memory Efficiency" is compelling
-
-### **Systems Engineering Focus:**
-- Forces students to care about **ALL** optimization dimensions
-- Makes modules 14-19 essential for competitive performance
-- Teaches that "getting it working" is only the beginning
-- Demonstrates real-world ML engineering priorities
-
----
-
-## **🚀 Implementation Timeline**
-
-### **Phase 1: Core Infrastructure (4 weeks)**
-- Build leaderboard system
-- Create benchmark evaluation suite
-- Set up automated testing pipeline
-- Design submission portal
-
-### **Phase 2: Beta Testing (2 weeks)**
-- Test with small group of students
-- Refine scoring methodology
-- Fix technical issues
-- Gather feedback and iterate
-
-### **Phase 3: Full Launch (Ongoing)**
-- Deploy for all TinyTorch students
-- Monitor and maintain leaderboard
-- Regular benchmark updates
-- Community management and awards
-
----
-
-## **🎓 Educational Impact**
-
-### **Learning Outcomes**
-Students learn that ML engineering is about:
-- **Systems performance**, not just algorithmic correctness
-- **Trade-offs** between speed, memory, accuracy, and power
-- **Optimization techniques** for real-world constraints
-- **Benchmarking and measurement** for objective evaluation
-- **Competition and collaboration** in technical communities
-
-### **Career Preparation**
-Students graduate with:
-- **Demonstrable systems optimization skills**
-- **Portfolio evidence of competitive performance**
-- **Experience with ML engineering trade-offs**
-- **Understanding of production ML constraints**
-- **Community connections** with other systems engineers
-
----
-
-## **💡 Future Extensions**
-
-### **Multi-Semester 
Competitions**
-- New benchmark challenges each semester
-- Evolving leaderboards with increasing difficulty
-- Alumni participation and mentorship
-
-### **Industry Integration**
-- Company-sponsored benchmark challenges
-- Internship opportunities for top performers
-- Guest judging from ML systems engineers
-
-### **Research Integration**
-- Novel optimization techniques become research contributions
-- Student innovations feed back into TinyTorch framework
-- Academic publications from exceptional submissions
-
----
-
-**🎯 CONCLUSION: AI Olympics transforms Module 20 from "individual project" to "competitive systems engineering challenge" that motivates optimization, builds community, and produces compelling portfolio artifacts.**
\ No newline at end of file
diff --git a/milestones/02_1969_xor_crisis/README.md b/milestones/02_1969_xor/README.md
similarity index 100%
rename from milestones/02_1969_xor_crisis/README.md
rename to milestones/02_1969_xor/README.md
diff --git a/milestones/02_1969_xor_crisis/xor_crisis.py b/milestones/02_1969_xor/xor_crisis.py
similarity index 100%
rename from milestones/02_1969_xor_crisis/xor_crisis.py
rename to milestones/02_1969_xor/xor_crisis.py
diff --git a/milestones/02_1969_xor_crisis/xor_solved.py b/milestones/02_1969_xor/xor_solved.py
similarity index 100%
rename from milestones/02_1969_xor_crisis/xor_solved.py
rename to milestones/02_1969_xor/xor_solved.py
diff --git a/milestones/03_1986_mlp_revival/README.md b/milestones/03_1986_mlp/README.md
similarity index 100%
rename from milestones/03_1986_mlp_revival/README.md
rename to milestones/03_1986_mlp/README.md
diff --git a/milestones/03_1986_mlp_revival/UPDATE_SUMMARY.md b/milestones/03_1986_mlp/UPDATE_SUMMARY.md
similarity index 100%
rename from milestones/03_1986_mlp_revival/UPDATE_SUMMARY.md
rename to milestones/03_1986_mlp/UPDATE_SUMMARY.md
diff --git a/milestones/03_1986_mlp_revival/data/digits_8x8.npz b/milestones/03_1986_mlp/data/digits_8x8.npz
similarity index 100%
rename from milestones/03_1986_mlp_revival/data/digits_8x8.npz rename to milestones/03_1986_mlp/data/digits_8x8.npz diff --git a/milestones/03_1986_mlp_revival/mlp_digits.py b/milestones/03_1986_mlp/mlp_digits.py similarity index 100% rename from milestones/03_1986_mlp_revival/mlp_digits.py rename to milestones/03_1986_mlp/mlp_digits.py diff --git a/milestones/03_1986_mlp_revival/mlp_mnist.py b/milestones/03_1986_mlp/mlp_mnist.py similarity index 100% rename from milestones/03_1986_mlp_revival/mlp_mnist.py rename to milestones/03_1986_mlp/mlp_mnist.py diff --git a/milestones/04_1998_cnn_revolution/README.md b/milestones/04_1998_cnn/README.md similarity index 100% rename from milestones/04_1998_cnn_revolution/README.md rename to milestones/04_1998_cnn/README.md diff --git a/milestones/04_1998_cnn_revolution/cnn_digits.py b/milestones/04_1998_cnn/cnn_digits.py similarity index 100% rename from milestones/04_1998_cnn_revolution/cnn_digits.py rename to milestones/04_1998_cnn/cnn_digits.py diff --git a/milestones/04_1998_cnn_revolution/lecun_cifar10.py b/milestones/04_1998_cnn/lecun_cifar10.py similarity index 100% rename from milestones/04_1998_cnn_revolution/lecun_cifar10.py rename to milestones/04_1998_cnn/lecun_cifar10.py diff --git a/milestones/05_2017_transformer_era/README.md b/milestones/05_2017_transformer/README.md similarity index 69% rename from milestones/05_2017_transformer_era/README.md rename to milestones/05_2017_transformer/README.md index f42e1461..a7098934 100644 --- a/milestones/05_2017_transformer_era/README.md +++ b/milestones/05_2017_transformer/README.md @@ -4,63 +4,20 @@ ## 🎯 What You'll Build -Three progressively impressive demos: +A character-level transformer trained on Shakespeare's works - the classic "hello world" of language modeling! 
-### Step 1: Quick Validation (5 minutes) -**File**: `step1_quick_validation.py` -**Goal**: Verify transformer pipeline works +### Shakespeare Text Generation +**File**: `vaswani_shakespeare.py` +**Goal**: Build a transformer that generates Shakespeare-style text ```bash -python step1_quick_validation.py -``` - -**What it does**: -- Trains on simple repeating text ("hello world") -- Proves modules 10-13 are connected correctly -- Quick sanity check before bigger demos - -**Success**: Generates "hello world" pattern - ---- - -### Step 2: TinyCoder (15 minutes) 🔥 -**File**: `step2_tinycoder.py` -**Goal**: Code completion like GitHub Copilot! - -```bash -python step2_tinycoder.py -``` - -**What it does**: -- Trains on YOUR TinyTorch Python code -- Learns code patterns (def, class, self, etc.) -- Generates syntactically valid Python completions - -**Demo**: -```python -Input: 'def forward(self, x):' -Output: 'def forward(self, x):\n return self.layer(x)' - -Input: 'import ' -Output: 'import numpy as np' -``` - -**Epic moment**: "I built GitHub Copilot!" - ---- - -### Step 3: Shakespeare (15 minutes) -**File**: `step3_shakespeare.py` -**Goal**: Traditional text generation demo - -```bash -python step3_shakespeare.py +python vaswani_shakespeare.py ``` **What it does**: - Downloads Tiny Shakespeare dataset -- Trains character-level transformer -- Generates Shakespeare-style text +- Trains character-level transformer (YOUR implementation!) +- Generates coherent Shakespeare-style text **Demo**: ``` @@ -69,8 +26,6 @@ Output: 'To be or not to be, that is the question Whether tis nobler in the mind to suffer...' ``` -**Classic**: Traditional "hello world" for language models - --- ## 🚀 Quick Start @@ -82,34 +37,18 @@ Complete these TinyTorch modules: - ✅ Module 12: Attention - ✅ Module 13: Transformers -### Run in Order +### Run the Example ```bash -# 1. Quick validation (5 min) -python step1_quick_validation.py - -# 2. 
Code completion (15 min) - THE EPIC ONE -python step2_tinycoder.py - -# 3. Shakespeare (15 min) - traditional demo -python step3_shakespeare.py +# Train transformer on Shakespeare (15-20 min) +python vaswani_shakespeare.py ``` --- -## 📊 What Each Demo Teaches - -| Demo | Dataset | Tokenizer | Time | Epic Factor | What You Learn | -|------|---------|-----------|------|-------------|----------------| -| **Step 1** | Simple text | CharTokenizer | 5 min | ⭐⭐ | Pipeline works | -| **Step 2** | TinyTorch code | BPETokenizer | 15 min | ⭐⭐⭐⭐⭐ | YOU built Copilot! | -| **Step 3** | Shakespeare | CharTokenizer | 15 min | ⭐⭐⭐⭐ | Language modeling | - ---- - ## 🎓 Learning Outcomes -After completing these milestones, you'll understand: +After completing this milestone, you'll understand: ### Technical Mastery - ✅ How tokenization bridges text and numbers @@ -248,11 +187,12 @@ model = TinyGPT( You've succeeded when: -**Step 1**: Model generates repeating pattern -**Step 2**: Code completions are syntactically valid -**Step 3**: Shakespeare text is coherent (even if not perfect) +✅ Model trains without errors +✅ Loss decreases over training epochs +✅ Generated Shakespeare text is coherent (even if not perfect) +✅ You can generate text with custom prompts -**Don't expect perfection!** Production models train for months on massive data. Your demos prove you understand the architecture! +**Don't expect perfection!** Production models train for months on massive data. Your demo proves you understand the architecture! --- @@ -285,4 +225,4 @@ The transformer architecture you implemented powers: --- -**Ready to generate some text?** Start with `step1_quick_validation.py`! \ No newline at end of file +**Ready to generate some text?** Run `python vaswani_shakespeare.py`! 
\ No newline at end of file diff --git a/milestones/05_2017_transformer_era/vaswani_shakespeare.py b/milestones/05_2017_transformer/vaswani_shakespeare.py similarity index 97% rename from milestones/05_2017_transformer_era/vaswani_shakespeare.py rename to milestones/05_2017_transformer/vaswani_shakespeare.py index fc75575e..573c5a75 100644 --- a/milestones/05_2017_transformer_era/vaswani_shakespeare.py +++ b/milestones/05_2017_transformer/vaswani_shakespeare.py @@ -23,12 +23,12 @@ MODULES EXERCISED IN THIS EXAMPLE: Transformer Architecture (Bottom to Top Flow): โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Output Logits โ”‚ - โ”‚ Vocabulary Predictions (1000) โ”‚ + โ”‚ Output Logits โ”‚ + โ”‚ Vocabulary Predictions (1000) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ฒ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Output Projection โ”‚ + โ”‚ Output Projection โ”‚ โ”‚ Module 04: vectors โ†’ vocabulary โ”‚ 
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ฒ @@ -39,41 +39,41 @@ Transformer Architecture (Bottom to Top Flow): โ–ฒ โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•— โ•‘ Transformer Block ร— 4 (Repeat) โ•‘ - โ•‘ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ•‘ - โ•‘ โ”‚ Layer Norm โ”‚ โ•‘ - โ•‘ โ”‚ Module 14: Post-FFN normalization โ”‚ โ•‘ - โ•‘ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•‘ + โ•‘ 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ•‘ + โ•‘ โ”‚ Layer Norm โ”‚ โ•‘ + โ•‘ โ”‚ Module 14: Post-FFN normalization โ”‚ โ•‘ + โ•‘ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•‘ โ•‘ โ–ฒ โ•‘ - โ•‘ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ•‘ - โ•‘ โ”‚ Feed Forward Network (FFN) โ”‚ โ•‘ - โ•‘ โ”‚ Module 04: Linear(128โ†’512) โ†’ ReLU โ†’ Linear(512โ†’128) โ”‚ โ•‘ - โ•‘ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•‘ + โ•‘ 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ•‘ + โ•‘ โ”‚ Feed Forward Network (FFN) โ”‚ โ•‘ + โ•‘ โ”‚ Module 04: Linear(128โ†’512) โ†’ ReLU โ†’ Linear(512โ†’128) โ”‚ โ•‘ + โ•‘ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•‘ โ•‘ โ–ฒ โ•‘ - โ•‘ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ•‘ - โ•‘ โ”‚ Layer Norm โ”‚ โ•‘ - โ•‘ โ”‚ Module 14: Post-attention normalization โ”‚ โ•‘ - โ•‘ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•‘ + โ•‘ 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ•‘ + โ•‘ โ”‚ Layer Norm โ”‚ โ•‘ + โ•‘ โ”‚ Module 14: Post-attention normalization โ”‚ โ•‘ + โ•‘ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•‘ โ•‘ โ–ฒ โ•‘ - โ•‘ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ•‘ - โ•‘ โ”‚ Multi-Head Self-Attention โ”‚ โ•‘ - โ•‘ โ”‚ Module 13: 8 heads ร— (QยทK^T/โˆšd_k)ยทV โ”‚ โ•‘ - โ•‘ โ”‚ Each head: 16-dim attention on 128-dim embeddings โ”‚ โ•‘ - โ•‘ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•‘ + โ•‘ 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ•‘ + โ•‘ โ”‚ Multi-Head Self-Attention โ”‚ โ•‘ + โ•‘ โ”‚ Module 13: 8 heads ร— (QยทK^T/โˆšd_k)ยทV โ”‚ โ•‘ + โ•‘ โ”‚ Each head: 16-dim attention on 128-dim embeddings โ”‚ โ•‘ + โ•‘ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•‘ โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• - โ–ฒ + โ–ฒ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Positional Encoding โ”‚ - โ”‚ Module 12: Add position information (sin/cos) โ”‚ + โ”‚ Positional Encoding โ”‚ + โ”‚ Module 12: Add position information (sin/cos) โ”‚ 
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
- ▲
+ ▲
  ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
- │ Token Embeddings │
+ │ Token Embeddings │
  │ Module 12: tokens → 128-dim vectors │
  └──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
- ▲
+ ▲
  ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
- │ Input Tokens │
- │ [token_1, token_2, ..., token_10] │
+ │ Input Tokens │
+ │ [token_1, token_2, ..., token_10] │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
  Key Insight: Attention allows each token to "look at" all other tokens
diff --git a/milestones/05_2017_transformer_era/step1_quick_validation.py b/milestones/05_2017_transformer_era/step1_quick_validation.py
deleted file mode 100644
index c4b75242..00000000
--- a/milestones/05_2017_transformer_era/step1_quick_validation.py
+++ /dev/null
@@ -1,288 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 1: Quick Validation - Transformer Pipeline Test
-====================================================
-
-GOAL: Verify transformer modules work end-to-end in 5 minutes
-DATASET: Simple repeating text (no download needed)
-TOKENIZER: CharTokenizer (no training needed)
-TIME: ~5 minutes
-
-This is the simplest possible test to prove:
-✅ Modules 10-13 are connected correctly
-✅ Training loop works
-✅ Generation works
-
-If this passes, the pipeline is functional!
-"""
-
-import numpy as np
-import sys
-import os
-
-# Add project root to path
-project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-sys.path.insert(0, project_root)
-
-from tinytorch.core.tensor import Tensor
-from tinytorch.text.tokenization import CharTokenizer
-from tinytorch.core.embeddings import Embedding, PositionalEncoding
-from tinytorch.core.attention import MultiHeadAttention
-from tinytorch.models.transformer import TransformerBlock, LayerNorm
-from tinytorch.core.layers import Linear
-from tinytorch.core.optimizers import Adam
-
-
-class TinyGPT:
-    """Minimal GPT for quick validation."""
-
-    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_length):
-        self.vocab_size = vocab_size
-        self.embed_dim = embed_dim
-
-        # Token + position embeddings
-        self.token_embedding = Embedding(vocab_size, embed_dim)
-        self.pos_encoding = PositionalEncoding(max_length, embed_dim)
-
-        # Transformer blocks
-        self.blocks = []
-        for _ in range(num_layers):
-            block = TransformerBlock(embed_dim, num_heads, embed_dim * 4)
-            self.blocks.append(block)
-
-        # Output projection
-        self.ln_f = LayerNorm(embed_dim)
-        self.head = Linear(embed_dim, vocab_size)
-
-    def forward(self, idx):
-        """Forward pass through the model."""
-        B, T = idx.shape
-
-        # Token + positional embeddings
-        tok_emb = self.token_embedding.forward(idx)  # (B, T, embed_dim)
-        x = self.pos_encoding.forward(tok_emb)  # (B, T, embed_dim) - includes positional info
-
-        # Transformer blocks
-        for block in self.blocks:
-            x = block(x)
-
-        # Output head
-        x = self.ln_f(x)
-        logits = self.head(x)  # (B, T, vocab_size)
-
-        return logits
-
-    def generate(self, idx, max_new_tokens, temperature=1.0):
-        """Generate new tokens autoregressively."""
-        for _ in range(max_new_tokens):
-            # Crop context if needed
-            idx_cond = idx if idx.shape[1] <= 128 else idx[:, -128:]
-
-            # Get predictions
-            logits = self.forward(idx_cond)
-
-            # Focus on last time step
-            logits = logits[:,
-1, :] / temperature  # (B, vocab_size)
-
-            # Sample from distribution (greedy for simplicity)
-            next_idx = np.argmax(logits.data, axis=-1, keepdims=True)
-
-            # Append to sequence
-            idx = Tensor(np.concatenate([idx.data, next_idx], axis=1))
-
-        return idx
-
-    def parameters(self):
-        """Get all trainable parameters."""
-        params = []
-        params.extend(self.token_embedding.parameters())
-        for block in self.blocks:
-            params.extend(block.parameters())
-        params.extend(self.ln_f.parameters())
-        params.extend(self.head.parameters())
-        return params
-
-
-def main():
-    print("="*70)
-    print("🚀 Step 1: Quick Transformer Validation")
-    print("="*70)
-    print()
-
-    # ========================================
-    # 1. Prepare simple repeating text
-    # ========================================
-    print("📝 Step 1: Preparing data...")
-    text = "hello world! " * 200  # Simple repeating pattern
-    print(f" Text length: {len(text)} characters")
-    print(f" Sample: '{text[:50]}...'")
-    print()
-
-    # ========================================
-    # 2. Tokenize (character-level)
-    # ========================================
-    print("🔤 Step 2: Tokenizing...")
-    tokenizer = CharTokenizer()
-
-    # Build vocab from text
-    unique_chars = sorted(list(set(text)))
-    tokenizer.vocab = unique_chars
-    tokenizer.char_to_idx = {ch: i for i, ch in enumerate(unique_chars)}
-    tokenizer.idx_to_char = {i: ch for i, ch in enumerate(unique_chars)}
-
-    # Encode text
-    data = tokenizer.encode(text)
-    vocab_size = len(tokenizer.vocab)
-
-    print(f" Vocabulary size: {vocab_size} unique characters")
-    print(f" Tokens: {data[:20]}...")
-    print(f" Vocab: {tokenizer.vocab}")
-    print()
-
-    # ========================================
-    # 3.
Create training batches
-    # ========================================
-    print("📦 Step 3: Creating batches...")
-    block_size = 32  # Context length
-    batch_size = 4
-
-    def get_batch():
-        """Get a random batch of data."""
-        ix = np.random.randint(0, len(data) - block_size, size=batch_size)
-        x = np.array([data[i:i+block_size] for i in ix])
-        y = np.array([data[i+1:i+block_size+1] for i in ix])
-        return Tensor(x), Tensor(y)
-
-    x_sample, y_sample = get_batch()
-    print(f" Batch size: {batch_size}")
-    print(f" Block size: {block_size}")
-    print(f" Input shape: {x_sample.shape}")
-    print(f" Target shape: {y_sample.shape}")
-    print()
-
-    # ========================================
-    # 4. Initialize model
-    # ========================================
-    print("🤖 Step 4: Initializing TinyGPT...")
-    model = TinyGPT(
-        vocab_size=vocab_size,
-        embed_dim=64,  # Small for fast training
-        num_heads=4,
-        num_layers=2,  # Just 2 layers
-        max_length=block_size
-    )
-
-    total_params = sum(p.data.size for p in model.parameters())
-    print(f" Model parameters: {total_params:,}")
-    print(f" Architecture: {len(model.blocks)} transformer blocks")
-    print()
-
-    # ========================================
-    # 5.
Train
-    # ========================================
-    print("🏋️ Step 5: Training (10 steps)...")
-    optimizer = Adam(model.parameters(), learning_rate=3e-4)
-
-    for step in range(10):
-        # Get batch
-        xb, yb = get_batch()
-
-        # Forward pass
-        logits = model.forward(xb)
-
-        # Compute loss (simplified cross-entropy)
-        B, T, C = logits.shape
-        logits_flat = logits.data.reshape(B*T, C)
-        targets_flat = yb.data.reshape(B*T)
-
-        # One-hot encode targets
-        targets_one_hot = np.zeros((B*T, C))
-        for i, t in enumerate(targets_flat):
-            targets_one_hot[i, int(t)] = 1.0
-
-        # MSE loss (simplified)
-        loss_value = np.mean((logits_flat - targets_one_hot) ** 2)
-
-        # Backward (simplified - just for demo)
-        # In real training, this would compute gradients
-
-        # Update (simplified)
-        # optimizer.step()
-        # optimizer.zero_grad()
-
-        if step % 2 == 0:
-            print(f" Step {step:2d}/10 | Loss: {loss_value:.4f}")
-
-    print()
-
-    # ========================================
-    # 6. Generate
-    # ========================================
-    print("✨ Step 6: Generating text...")
-
-    # Start with "hello"
-    context = "hello"
-    context_tokens = tokenizer.encode(context)
-    idx = Tensor(np.array([context_tokens]))
-
-    # Generate 20 new tokens
-    generated = model.generate(idx, max_new_tokens=20)
-
-    # Decode
-    output = tokenizer.decode(generated.data[0].tolist())
-
-    print(f" Input: '{context}'")
-    print(f" Generated: '{output}'")
-    print()
-
-    # ========================================
-    # 7.
Validation
-    # ========================================
-    print("="*70)
-    print("✅ Validation Results:")
-    print("="*70)
-
-    checks = []
-
-    # Check 1: Model initialized
-    checks.append(("Model initialization", total_params > 0))
-
-    # Check 2: Forward pass works
-    try:
-        test_logits = model.forward(xb)
-        checks.append(("Forward pass", test_logits.shape == (batch_size, block_size, vocab_size)))
-    except Exception as e:
-        checks.append(("Forward pass", False))
-        print(f" Error: {e}")
-
-    # Check 3: Generation works
-    checks.append(("Text generation", len(output) > len(context)))
-
-    # Check 4: Output is decodable
-    checks.append(("Output decodable", all(c in tokenizer.vocab for c in output)))
-
-    # Print results
-    for check_name, passed in checks:
-        status = "✅" if passed else "❌"
-        print(f"{status} {check_name}")
-
-    print()
-
-    if all(passed for _, passed in checks):
-        print("🎉 SUCCESS! Transformer pipeline is working!")
-        print()
-        print("Next steps:")
-        print(" → Run step2_tinycoder.py for code completion demo")
-        print(" → Run step3_shakespeare.py for text generation demo")
-    else:
-        print("⚠️ Some checks failed. Debug modules 10-13.")
-
-    print("="*70)
-
-
-if __name__ == "__main__":
-    main()
-
-
-
-
diff --git a/milestones/05_2017_transformer_era/step2_tinycoder.py b/milestones/05_2017_transformer_era/step2_tinycoder.py
deleted file mode 100644
index 65be6176..00000000
--- a/milestones/05_2017_transformer_era/step2_tinycoder.py
+++ /dev/null
@@ -1,338 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 2: TinyCoder - Code Autocompletion with Transformers
-==========================================================
-
-GOAL: Build GitHub Copilot using YOUR TinyTorch code
-DATASET: Your actual TinyTorch modules (already exists!)
-TOKENIZER: BPETokenizer (learns code patterns)
-TIME: ~15 minutes
-
-This demonstrates:
-✅ Transformer trained on real Python code
-✅ Generates syntactically valid completions
-✅ YOU built the tool you use daily!
-
-Epic moment: "IT'S COPILOT!"
-"""
-
-import numpy as np
-import sys
-import os
-import glob
-import re
-
-# Add project root to path
-project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-sys.path.insert(0, project_root)
-
-from tinytorch.core.tensor import Tensor
-from tinytorch.text.tokenization import BPETokenizer
-from tinytorch.core.embeddings import Embedding, PositionalEncoding
-from tinytorch.core.attention import MultiHeadAttention
-from tinytorch.models.transformer import TransformerBlock, LayerNorm
-from tinytorch.core.layers import Linear
-from tinytorch.core.optimizers import Adam
-
-
-class TinyCoder:
-    """Code completion transformer - like GitHub Copilot!"""
-
-    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_length):
-        self.vocab_size = vocab_size
-        self.embed_dim = embed_dim
-        self.max_length = max_length
-
-        # Token + position embeddings
-        self.token_embedding = Embedding(vocab_size, embed_dim)
-        self.pos_encoding = PositionalEncoding(max_length, embed_dim)
-
-        # Transformer blocks
-        self.blocks = []
-        for _ in range(num_layers):
-            block = TransformerBlock(embed_dim, num_heads, embed_dim * 4)
-            self.blocks.append(block)
-
-        # Output projection
-        self.ln_f = LayerNorm(embed_dim)
-        self.head = Linear(embed_dim, vocab_size)
-
-    def forward(self, idx):
-        """Forward pass through the model."""
-        B, T = idx.shape
-
-        # Token + positional embeddings
-        tok_emb = self.token_embedding.forward(idx)
-        x = self.pos_encoding.forward(tok_emb)
-
-        # Transformer blocks
-        for block in self.blocks:
-            x = block(x)
-
-        # Output head
-        x = self.ln_f(x)
-        logits = self.head(x)
-
-        return logits
-
-    def complete(self, tokenizer, prefix, max_new_tokens=20):
-        """
-        Complete code given a prefix.
-
-        Args:
-            tokenizer: BPETokenizer instance
-            prefix: String prefix to complete
-            max_new_tokens: How many tokens to generate
-
-        Returns:
-            Completed code string
-        """
-        # Encode prefix
-        tokens = tokenizer.encode(prefix)
-        idx = Tensor(np.array([tokens]))
-
-        # Generate
-        for _ in range(max_new_tokens):
-            # Crop if too long
-            idx_cond = idx if idx.shape[1] <= self.max_length else idx[:, -self.max_length:]
-
-            # Forward pass
-            logits = self.forward(idx_cond)
-
-            # Get next token (greedy)
-            next_token = np.argmax(logits.data[0, -1, :])
-
-            # Stop at newline for single-line completion
-            if tokenizer.decode([next_token]).strip() == '':
-                break
-
-            # Append
-            idx = Tensor(np.concatenate([idx.data, [[next_token]]], axis=1))
-
-        # Decode
-        full_output = tokenizer.decode(idx.data[0].tolist())
-
-        # Return only the new part
-        return full_output[len(prefix):]
-
-    def parameters(self):
-        """Get all trainable parameters."""
-        params = []
-        params.extend(self.token_embedding.parameters())
-        for block in self.blocks:
-            params.extend(block.parameters())
-        params.extend(self.ln_f.parameters())
-        params.extend(self.head.parameters())
-        return params
-
-
-def load_tinytorch_code():
-    """Load all Python code from TinyTorch modules."""
-    print("📂 Loading TinyTorch source code...")
-
-    # Find all Python module files
-    module_dir = os.path.join(project_root, "modules", "source")
-    python_files = []
-
-    # Get .py files from numbered module directories
-    for module_num in range(1, 14):  # Modules 01-13
-        pattern = os.path.join(module_dir, f"{module_num:02d}_*", "*_dev.py")
-        files = glob.glob(pattern)
-        python_files.extend(files)
-
-    print(f" Found {len(python_files)} module files")
-
-    # Read all code
-    all_code = []
-    total_lines = 0
-
-    for file_path in python_files:
-        try:
-            with open(file_path, 'r', encoding='utf-8') as f:
-                code = f.read()
-                all_code.append(code)
-                lines = code.count('\n')
-                total_lines += lines
-
-                module_name = os.path.basename(os.path.dirname(file_path))
-
print(f" ✓ {module_name}: {lines:,} lines")
-        except Exception as e:
-            print(f" ✗ Error reading {file_path}: {e}")
-
-    # Combine all code
-    combined_code = "\n\n# " + "="*50 + "\n\n".join(all_code)
-
-    print(f"\n Total: {total_lines:,} lines of Python code")
-    print(f" Characters: {len(combined_code):,}")
-
-    return combined_code
-
-
-def main():
-    print("="*70)
-    print("🤖 TinyCoder: Building GitHub Copilot with Transformers")
-    print("="*70)
-    print()
-    print("This trains a transformer on YOUR TinyTorch code to generate")
-    print("code completions - the same technology behind GitHub Copilot!")
-    print()
-
-    # ========================================
-    # 1. Load training data
-    # ========================================
-    code_corpus = load_tinytorch_code()
-    print()
-
-    # ========================================
-    # 2. Train BPE tokenizer
-    # ========================================
-    print("🔤 Training BPE tokenizer on code...")
-
-    vocab_size = 1000
-    tokenizer = BPETokenizer(vocab_size=vocab_size)
-
-    # Train tokenizer to learn code patterns
-    print(f" Learning {vocab_size} subword units from code...")
-    tokenizer.train(code_corpus)
-
-    # Show some learned tokens
-    print(f"\n Vocabulary size: {len(tokenizer.vocab)}")
-    print(f" Sample tokens:")
-
-    # Find interesting tokens (Python keywords, common patterns)
-    interesting = []
-    for token in list(tokenizer.vocab.keys())[:50]:
-        if any(keyword in token for keyword in ['def', 'class', 'import', 'self', 'return']):
-            interesting.append(token)
-
-    for token in interesting[:10]:
-        print(f" '{token}'")
-
-    # Encode the corpus
-    print(f"\n Tokenizing corpus...")
-    tokens = tokenizer.encode(code_corpus)
-    print(f" Total tokens: {len(tokens):,}")
-    print()
-
-    # ========================================
-    # 3.
Prepare training data
-    # ========================================
-    print("📦 Preparing training batches...")
-
-    block_size = 128  # Context length
-    batch_size = 4
-
-    def get_batch():
-        """Get a random batch of code."""
-        ix = np.random.randint(0, len(tokens) - block_size, size=batch_size)
-        x = np.array([tokens[i:i+block_size] for i in ix])
-        y = np.array([tokens[i+1:i+block_size+1] for i in ix])
-        return Tensor(x), Tensor(y)
-
-    print(f" Block size: {block_size} tokens")
-    print(f" Batch size: {batch_size} sequences")
-    print()
-
-    # ========================================
-    # 4. Initialize model
-    # ========================================
-    print("🏗️ Building TinyCoder model...")
-
-    model = TinyCoder(
-        vocab_size=vocab_size,
-        embed_dim=128,
-        num_heads=8,
-        num_layers=4,
-        max_length=block_size
-    )
-
-    total_params = sum(p.data.size for p in model.parameters())
-    print(f" Parameters: {total_params:,}")
-    print(f" Layers: {len(model.blocks)} transformer blocks")
-    print(f" Heads: 8 attention heads per block")
-    print()
-
-    # ========================================
-    # 5. Train
-    # ========================================
-    print("🏋️ Training on YOUR code (20 steps)...")
-    print(" (In production, this would be 1000s of steps)")
-    print()
-
-    optimizer = Adam(model.parameters(), learning_rate=3e-4)
-
-    for step in range(20):
-        # Get batch
-        xb, yb = get_batch()
-
-        # Forward
-        logits = model.forward(xb)
-
-        # Loss (simplified)
-        B, T, C = logits.shape
-        logits_flat = logits.data.reshape(B*T, C)
-        targets_flat = yb.data.reshape(B*T)
-
-        # One-hot
-        targets_one_hot = np.zeros((B*T, C))
-        for i, t in enumerate(targets_flat):
-            if 0 <= int(t) < C:
-                targets_one_hot[i, int(t)] = 1.0
-
-        loss_value = np.mean((logits_flat - targets_one_hot) ** 2)
-
-        if step % 5 == 0:
-            print(f" Step {step:3d}/20 | Loss: {loss_value:.4f}")
-
-    print()
-
-    # ========================================
-    # 6. Demo completions!
-    # ========================================
-    print("="*70)
-    print("✨ CODE COMPLETION DEMO")
-    print("="*70)
-    print()
-
-    demos = [
-        "import ",
-        "def forward(self, x):",
-        "class Linear:",
-        "self.",
-        "return ",
-    ]
-
-    for prompt in demos:
-        completion = model.complete(tokenizer, prompt, max_new_tokens=10)
-        print(f"Input: '{prompt}'")
-        print(f"Output: '{prompt}{completion}'")
-        print()
-
-    # ========================================
-    # 7. Success!
-    # ========================================
-    print("="*70)
-    print("🏆 SUCCESS! You Built GitHub Copilot!")
-    print("="*70)
-    print()
-    print("What you learned:")
-    print(" ✅ Transformers can learn code patterns")
-    print(" ✅ BPE tokenization captures syntax")
-    print(" ✅ Autoregressive generation produces valid code")
-    print(" ✅ This is THE SAME architecture as Copilot!")
-    print()
-    print("Production differences:")
-    print(" • Real Copilot: 12B+ parameters (you: ~100K)")
-    print(" • Real Copilot: Trained on billions of lines")
-    print(" • Real Copilot: GPU inference <50ms")
-    print(" • But the ARCHITECTURE is what YOU built!")
-    print()
-    print("="*70)
-
-
-if __name__ == "__main__":
-    main()
-
-
-
-
diff --git a/milestones/05_2017_transformer_era/step3_shakespeare.py b/milestones/05_2017_transformer_era/step3_shakespeare.py
deleted file mode 100644
index 9592a9ca..00000000
--- a/milestones/05_2017_transformer_era/step3_shakespeare.py
+++ /dev/null
@@ -1,349 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 3: TinyGPT - Shakespeare Text Generation
-=============================================
-
-GOAL: Traditional transformer demo - generate Shakespeare-style text
-DATASET: Tiny Shakespeare (1MB text file)
-TOKENIZER: CharTokenizer (character-level for simplicity)
-TIME: ~15 minutes
-
-This demonstrates:
-✅ Transformer learns language patterns
-✅ Generates coherent text in Shakespeare's style
-✅ Traditional "hello world" for language models
-
-Classic demo: "To be or not to be..."
-"""
-
-import numpy as np
-import sys
-import os
-import urllib.request
-
-# Add project root to path
-project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-sys.path.insert(0, project_root)
-
-from tinytorch.core.tensor import Tensor
-from tinytorch.text.tokenization import CharTokenizer
-from tinytorch.core.embeddings import Embedding, PositionalEncoding
-from tinytorch.core.attention import MultiHeadAttention
-from tinytorch.models.transformer import TransformerBlock, LayerNorm
-from tinytorch.core.layers import Linear
-from tinytorch.core.optimizers import Adam
-
-
-class TinyGPT:
-    """Shakespeare text generation transformer."""
-
-    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_length):
-        self.vocab_size = vocab_size
-        self.embed_dim = embed_dim
-        self.max_length = max_length
-
-        # Embeddings
-        self.token_embedding = Embedding(vocab_size, embed_dim)
-        self.pos_encoding = PositionalEncoding(max_length, embed_dim)
-
-        # Transformer blocks
-        self.blocks = []
-        for _ in range(num_layers):
-            block = TransformerBlock(embed_dim, num_heads, embed_dim * 4)
-            self.blocks.append(block)
-
-        # Output
-        self.ln_f = LayerNorm(embed_dim)
-        self.head = Linear(embed_dim, vocab_size)
-
-    def forward(self, idx):
-        """Forward pass."""
-        B, T = idx.shape
-
-        # Embeddings
-        tok_emb = self.token_embedding.forward(idx)
-        x = self.pos_encoding.forward(tok_emb)
-
-        # Transformer blocks
-        for block in self.blocks:
-            x = block(x)
-
-        # Output
-        x = self.ln_f(x)
-        logits = self.head(x)
-
-        return logits
-
-    def generate(self, tokenizer, start_text, max_new_tokens=100, temperature=0.8):
-        """
-        Generate text starting from start_text.
-
-        Args:
-            tokenizer: CharTokenizer instance
-            start_text: String to start generation from
-            max_new_tokens: How many characters to generate
-            temperature: Sampling temperature (higher = more random)
-
-        Returns:
-            Generated text string
-        """
-        # Encode start
-        tokens = tokenizer.encode(start_text)
-        idx = Tensor(np.array([tokens]))
-
-        # Generate
-        for _ in range(max_new_tokens):
-            # Crop if too long
-            idx_cond = idx if idx.shape[1] <= self.max_length else idx[:, -self.max_length:]
-
-            # Forward
-            logits = self.forward(idx_cond)
-
-            # Last token predictions
-            logits_last = logits.data[0, -1, :] / temperature
-
-            # Softmax
-            probs = np.exp(logits_last - np.max(logits_last))
-            probs = probs / np.sum(probs)
-
-            # Sample (or greedy if temperature very low)
-            if temperature < 0.1:
-                next_token = np.argmax(probs)
-            else:
-                next_token = np.random.choice(len(probs), p=probs)
-
-            # Append
-            idx = Tensor(np.concatenate([idx.data, [[next_token]]], axis=1))
-
-        # Decode
-        return tokenizer.decode(idx.data[0].tolist())
-
-    def parameters(self):
-        """Get all parameters."""
-        params = []
-        params.extend(self.token_embedding.parameters())
-        for block in self.blocks:
-            params.extend(block.parameters())
-        params.extend(self.ln_f.parameters())
-        params.extend(self.head.parameters())
-        return params
-
-
-def download_shakespeare():
-    """Download Tiny Shakespeare dataset."""
-    url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
-    data_dir = os.path.join(project_root, "milestones", "datasets")
-    os.makedirs(data_dir, exist_ok=True)
-
-    file_path = os.path.join(data_dir, "shakespeare.txt")
-
-    if os.path.exists(file_path):
-        print(f" ✓ Dataset already exists at {file_path}")
-    else:
-        print(f" Downloading from {url}...")
-        try:
-            urllib.request.urlretrieve(url, file_path)
-            print(f" ✓ Downloaded to {file_path}")
-        except Exception as e:
-            print(f" ✗ Download failed: {e}")
-            print(f" Please manually download from: {url}")
-            print(f" And
save to: {file_path}")
-            return None
-
-    # Read text
-    with open(file_path, 'r', encoding='utf-8') as f:
-        text = f.read()
-
-    return text
-
-
-def main():
-    print("="*70)
-    print("📜 TinyGPT: Shakespeare Text Generation")
-    print("="*70)
-    print()
-    print("Train a transformer on Shakespeare's works to generate")
-    print("authentic-sounding 16th century English!")
-    print()
-
-    # ========================================
-    # 1. Download dataset
-    # ========================================
-    print("📥 Step 1: Loading Shakespeare dataset...")
-    text = download_shakespeare()
-
-    if text is None:
-        print("Failed to load dataset. Exiting.")
-        return
-
-    print(f" Text length: {len(text):,} characters")
-    print(f" Sample:")
-    print(f" {text[:200]}...")
-    print()
-
-    # ========================================
-    # 2. Tokenize
-    # ========================================
-    print("🔤 Step 2: Tokenizing (character-level)...")
-
-    tokenizer = CharTokenizer()
-
-    # Build vocab
-    unique_chars = sorted(list(set(text)))
-    tokenizer.vocab = unique_chars
-    tokenizer.char_to_idx = {ch: i for i, ch in enumerate(unique_chars)}
-    tokenizer.idx_to_char = {i: ch for i, ch in enumerate(unique_chars)}
-
-    # Encode
-    data = tokenizer.encode(text)
-    vocab_size = len(tokenizer.vocab)
-
-    print(f" Vocabulary size: {vocab_size} unique characters")
-    print(f" Total tokens: {len(data):,}")
-    print(f" Characters: {tokenizer.vocab[:20]}...")
-    print()
-
-    # ========================================
-    # 3. Split train/val
-    # ========================================
-    print("📊 Step 3: Preparing data splits...")
-
-    n = len(data)
-    train_data = data[:int(n*0.9)]
-    val_data = data[int(n*0.9):]
-
-    print(f" Train: {len(train_data):,} tokens")
-    print(f" Val: {len(val_data):,} tokens")
-    print()
-
-    # ========================================
-    # 4.
Batching
-    # ========================================
-    block_size = 128
-    batch_size = 4
-
-    def get_batch(split='train'):
-        """Get a batch of data."""
-        data_split = train_data if split == 'train' else val_data
-        ix = np.random.randint(0, len(data_split) - block_size, size=batch_size)
-        x = np.array([data_split[i:i+block_size] for i in ix])
-        y = np.array([data_split[i+1:i+block_size+1] for i in ix])
-        return Tensor(x), Tensor(y)
-
-    # ========================================
-    # 5. Initialize model
-    # ========================================
-    print("🏗️ Step 4: Building TinyGPT...")
-
-    model = TinyGPT(
-        vocab_size=vocab_size,
-        embed_dim=128,
-        num_heads=8,
-        num_layers=4,
-        max_length=block_size
-    )
-
-    total_params = sum(p.data.size for p in model.parameters())
-    print(f" Parameters: {total_params:,}")
-    print(f" Architecture: {len(model.blocks)} transformer blocks")
-    print()
-
-    # ========================================
-    # 6. Train
-    # ========================================
-    print("🏋️ Step 5: Training on Shakespeare (50 steps)...")
-    print(" (In production, this would be 5000+ steps)")
-    print()
-
-    optimizer = Adam(model.parameters(), learning_rate=3e-4)
-
-    for step in range(50):
-        # Get batch
-        xb, yb = get_batch('train')
-
-        # Forward
-        logits = model.forward(xb)
-
-        # Loss (simplified)
-        B, T, C = logits.shape
-        logits_flat = logits.data.reshape(B*T, C)
-        targets_flat = yb.data.reshape(B*T)
-
-        # One-hot
-        targets_one_hot = np.zeros((B*T, C))
-        for i, t in enumerate(targets_flat):
-            targets_one_hot[i, int(t)] = 1.0
-
-        loss_value = np.mean((logits_flat - targets_one_hot) ** 2)
-
-        # Validation loss every 10 steps
-        if step % 10 == 0:
-            xb_val, yb_val = get_batch('val')
-            logits_val = model.forward(xb_val)
-
-            B_val, T_val, C_val = logits_val.shape
-            logits_val_flat = logits_val.data.reshape(B_val*T_val, C_val)
-            targets_val_flat = yb_val.data.reshape(B_val*T_val)
-
-            targets_val_one_hot = np.zeros((B_val*T_val, C_val))
-            for i, t in
enumerate(targets_val_flat):
-                targets_val_one_hot[i, int(t)] = 1.0
-
-            val_loss = np.mean((logits_val_flat - targets_val_one_hot) ** 2)
-
-            print(f" Step {step:3d}/50 | Train Loss: {loss_value:.4f} | Val Loss: {val_loss:.4f}")
-
-    print()
-
-    # ========================================
-    # 7. Generate!
-    # ========================================
-    print("="*70)
-    print("✨ SHAKESPEARE GENERATION")
-    print("="*70)
-    print()
-
-    prompts = [
-        "To be or not to be,",
-        "ROMEO:",
-        "First Citizen:",
-    ]
-
-    for prompt in prompts:
-        print(f"Prompt: '{prompt}'")
-        print("-" * 70)
-
-        generated = model.generate(tokenizer, prompt, max_new_tokens=100, temperature=0.8)
-
-        print(generated)
-        print()
-
-    # ========================================
-    # 8. Success!
-    # ========================================
-    print("="*70)
-    print("🎭 SUCCESS! You Built a Language Model!")
-    print("="*70)
-    print()
-    print("What you learned:")
-    print(" ✅ Transformers learn language patterns from data")
-    print(" ✅ Character-level models can generate coherent text")
-    print(" ✅ Temperature controls randomness in generation")
-    print(" ✅ This is the foundation of GPT, ChatGPT, etc!")
-    print()
-    print("Model architecture comparison:")
-    print(" • Your TinyGPT: ~100K parameters, 4 layers")
-    print(" • GPT-2: 117M parameters, 12 layers")
-    print(" • GPT-3: 175B parameters, 96 layers")
-    print(" • GPT-4: ~1.8T parameters, ~120 layers (estimated)")
-    print()
-    print("But the ARCHITECTURE is identical to what YOU built!")
-    print("="*70)
-
-
-if __name__ == "__main__":
-    main()
-
-
-
-
diff --git a/milestones/06_2024_systems_age/optimize_models.py b/milestones/06_2018_mlperf/optimize_models.py
similarity index 100%
rename from milestones/06_2024_systems_age/optimize_models.py
rename to milestones/06_2018_mlperf/optimize_models.py
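Note on the deleted training loops: all three step scripts label their loss "simplified", compute an MSE between raw logits and one-hot targets, and never call `optimizer.step()`. For comparison, here is a minimal sketch of the softmax cross-entropy that next-token prediction normally uses, on the same flattened `(B*T, C)` layout. It assumes only NumPy. The helper name `cross_entropy` and the sizes are illustrative and not part of TinyTorch.

```python
import numpy as np


def cross_entropy(logits_flat, targets_flat):
    """Mean softmax cross-entropy for (N, C) logits and N integer class ids."""
    # Subtract the per-row max before exponentiating for numerical stability.
    shifted = logits_flat - logits_flat.max(axis=1, keepdims=True)
    # Log-softmax: shifted logits minus the log of the row-wise partition sum.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(targets_flat.shape[0])
    # Pick out the log-probability assigned to each correct class.
    return -log_probs[rows, targets_flat].mean()


rng = np.random.default_rng(0)
B, T, C = 4, 32, 9  # illustrative sizes, not taken from a real run
logits_flat = rng.normal(size=(B * T, C))
targets_flat = rng.integers(0, C, size=B * T)

loss = cross_entropy(logits_flat, targets_flat)
# All-zero logits give a uniform distribution, so the loss should equal log(C).
uniform_loss = cross_entropy(np.zeros((B * T, C)), targets_flat)
print(f"random logits:  {loss:.4f}")
print(f"uniform logits: {uniform_loss:.4f} (log C = {np.log(C):.4f})")
```

The uniform-logits case is a useful sanity check: an untrained model should start near `log(C)`. The MSE-on-logits value in the deleted scripts has no such reference point.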