🏗️ Restructure milestones with decade-based naming

- Rename to clean, focused convention: 01_1957_perceptron, 02_1969_xor, etc.
- Drop dramatic language (crisis, revival, revolution, era)
- 06_2018_mlperf → 06_2020_scaling (matches GPT-3 scale era)
- Tells clear story: 1950s → 2020s ML evolution
- Each milestone represents major architectural/systems shift
- Remove redundant step1/2/3 files from transformer milestone
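
The naming convention described in this commit message could be checked mechanically. The sketch below is illustrative only and is not part of the commit: the regular expression, the `parse_milestone` and `decade` helpers, and the assumption that every name has the form two-digit order, four-digit year, lowercase slug are all hypothetical, inferred from the three names the message gives (`01_1957_perceptron`, `02_1969_xor`, `06_2020_scaling`).

```python
import re

# Assumed shape of a milestone directory name: NN_YYYY_slug.
# This pattern is a guess based on the examples in the commit message.
MILESTONE_NAME = re.compile(r"^(?P<order>\d{2})_(?P<year>\d{4})_(?P<slug>[a-z0-9_]+)$")

# Words the commit message says to drop from milestone names.
DROPPED_WORDS = {"crisis", "revival", "revolution", "era"}


def parse_milestone(name):
    """Return (order, year, slug) for a conforming name, else None."""
    match = MILESTONE_NAME.match(name)
    if match is None:
        return None
    slug = match.group("slug")
    # Reject names that still carry one of the dropped words.
    if DROPPED_WORDS & set(slug.split("_")):
        return None
    return int(match.group("order")), int(match.group("year")), slug


def decade(year):
    """Map a year to its decade label, e.g. 1957 -> '1950s'."""
    return f"{year // 10 * 10}s"


if __name__ == "__main__":
    for name in ["01_1957_perceptron", "02_1969_xor_crisis", "06_2020_scaling"]:
        parsed = parse_milestone(name)
        if parsed is None:
            print(f"{name}: does not follow the convention")
        else:
            order, year, slug = parsed
            print(f"{name}: milestone {order}, {decade(year)}, {slug}")
```

Under these assumptions, the old-style name `02_1969_xor_crisis` is rejected while the renamed directories pass, which is the distinction the rename is meant to enforce.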
2026-06-03 03:45:52 -05:00 · 2025-10-27 13:00:06 -04:00
parent 0ae627d0ea
commit c4d5e4ebf8
19 changed files with 45 additions and 1492 deletions
--- a/capstone-ideas/README.md
+++ b/capstone-ideas/README.md
@@ -1,185 +0,0 @@
-# 🎓 TinyTorch Capstone Project Ideas
-
-## **Background: The Capstone Design Problem**
-
-**Original Issue**: Module 20 was "TinyGPT Capstone" but students can already build TinyGPT after Module 13 (Transformers). This made:
- Modules 14-19 (optimization) feel like "optional extras"
- Module 20 anticlimactic ("TinyGPT again?")
- No integration of crucial systems engineering skills
-
-**Solution Requirements**:
- Must integrate ALL modules 1-19 (especially optimization modules 14-19)
- Must be genuinely exciting and different
- Must demonstrate complete ML systems engineering mastery
- Must create portfolio-worthy deliverables
-
---
-
-## **🏆 RECOMMENDED: AI Olympics Competition**
-
-**📁 See: [ai-olympics.md](ai-olympics.md)**
-
-**Core Concept**: Competitive leaderboard where students optimize TinyTorch models across systems engineering dimensions.
-
-**Why This is Best**:
- ✅ **Natural motivation**: Students want to rank high on leaderboards
- ✅ **Systems focus**: Compete on speed, memory, efficiency - not just accuracy
- ✅ **Community building**: Creates ongoing engagement and peer interaction
- ✅ **Portfolio impact**: "I ranked #3 in TinyTorch AI Olympics" is compelling
- ✅ **Forces optimization**: ALL modules 14-19 become essential for competitive performance
-
-**Competition Categories**:
- 🏃‍♂️ **Speed Demon**: Fastest inference
- 💾 **Memory Miser**: Smallest memory footprint
- 📱 **Edge Expert**: Best Raspberry Pi performance
- 🔋 **Energy Efficient**: Lowest power consumption
- 🏆 **TinyMLPerf**: Overall benchmark champion
-
---
-
-## **🛠️ Alternative Ideas Considered**
-
-### **1. Edge AI Deployment System**
-**Concept**: Deploy optimized neural networks to actual edge hardware (Raspberry Pi)
-
-**Pros**:
- Integrates all optimization modules (essential for edge constraints)
- Creates tangible deliverable ("I run neural networks on a $35 computer")
- Teaches real-world deployment challenges
-
-**Cons**:
- Individual project (no community/competition aspect)
- Hardware dependencies (students need Pi)
- Less motivating than competition
-
-### **2. Multi-Modal AI Assistant**
-**Concept**: Combine vision (CNNs) + language (transformers) + optimization for real-time performance
-
-**Pros**:
- Showcases multiple architectures working together
- Demonstrates practical AI applications
- Requires optimization for real-time performance
-
-**Cons**:
- Complex scope potentially overwhelming
- Optimization feels secondary to "getting it working"
- Limited portfolio differentiation
-
-### **3. ML Performance Laboratory**
-**Concept**: Comprehensive benchmarking suite comparing different ML frameworks
-
-**Pros**:
- Heavy focus on profiling and benchmarking skills
- Creates useful tool for community
- Deep systems engineering focus
-
-**Cons**:
- More about measurement than optimization
- Limited creative expression for students
- May feel academic rather than practical
-
-### **4. Neural Architecture Search**
-**Concept**: Automated model design and optimization system
-
-**Pros**:
- Cutting-edge research area
- Requires sophisticated optimization
- Highly technical achievement
-
-**Cons**:
- Very advanced, may be beyond course scope
- Optimization becomes means rather than end
- Difficult to assess fairly
-
-### **5. Distributed Training System**
-**Concept**: Multi-GPU/multi-node training infrastructure
-
-**Pros**:
- Advanced systems engineering skills
- High industry relevance
- Impressive technical achievement
-
-**Cons**:
- Requires expensive hardware
- Complex debugging and setup
- May overshadow core ML concepts
-
-### **6. ML Model Marketplace**
-**Concept**: Complete system for sharing/deploying/optimizing models (like Hugging Face)
-
-**Pros**:
- Full-stack systems engineering
- Practical deployment focus
- Creates useful community resource
-
-**Cons**:
- Web development skills needed
- Broad scope potentially unfocused
- Less emphasis on optimization techniques
-
---
-
-## **📊 Evaluation Criteria**
-
-| Criteria | AI Olympics | Edge Deployment | Multi-Modal | ML Lab | NAS | Distributed | Marketplace |
-|----------|-------------|-----------------|-------------|--------|-----|-------------|-------------|
-| **Integrates All Modules** | ✅✅✅ | ✅✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| **Student Motivation** | ✅✅✅ | ✅ | ✅ | ⚠️ | ⚠️ | ⚠️ | ✅ |
-| **Portfolio Impact** | ✅✅✅ | ✅✅ | ✅ | ✅ | ✅✅ | ✅✅ | ✅ |
-| **Systems Engineering Focus** | ✅✅✅ | ✅✅ | ✅ | ✅✅✅ | ✅ | ✅✅✅ | ✅ |
-| **Implementation Feasibility** | ✅✅ | ✅✅✅ | ✅ | ✅✅ | ⚠️ | ⚠️ | ✅ |
-| **Community Building** | ✅✅✅ | ⚠️ | ⚠️ | ✅ | ⚠️ | ⚠️ | ✅✅ |
-| **Scalability** | ✅✅✅ | ✅ | ✅ | ✅ | ⚠️ | ⚠️ | ✅ |
-
-**Legend**: ✅✅✅ Excellent, ✅✅ Good, ✅ Adequate, ⚠️ Challenging
-
---
-
-## **🎯 Final Recommendation**
-
-**AI Olympics** emerges as the clear winner because it:
-
-1. **Maximizes student motivation** through competitive leaderboards
-2. **Forces integration** of ALL optimization modules (14-19)
-3. **Creates lasting community** beyond individual course completion
-4. **Produces compelling portfolio artifacts** (leaderboard rankings)
-5. **Scales naturally** as more students participate
-6. **Emphasizes systems engineering** over algorithmic implementation
-
-### **Implementation Priority**
-1. **Phase 1**: Design and build leaderboard infrastructure
-2. **Phase 2**: Create standard benchmark evaluation suite
-3. **Phase 3**: Deploy beta version with small student cohort
-4. **Phase 4**: Full launch with all TinyTorch students
-
-### **Success Metrics**
- **Participation Rate**: % of students who submit to multiple categories
- **Optimization Depth**: Average number of techniques applied per submission
- **Community Engagement**: Forum activity, peer collaboration, ongoing submissions
- **Portfolio Impact**: Industry feedback on graduate capabilities
-
---
-
-## **📝 Notes for Implementation**
-
-### **Technical Requirements**
- Automated submission and evaluation pipeline
- Standard benchmark datasets and environments
- Real-time leaderboard with rich visualizations
- Robust measurement and scoring systems
-
-### **Educational Integration**
- Clear rubrics linking competition performance to course grades
- Structured optimization process through modules 14-19
- Portfolio development guidance and templates
- Peer review and collaboration opportunities
-
-### **Community Features**
- Student profiles and achievement tracking
- Optimization technique sharing and discussion
- Mentorship connections between high performers and struggling students
- Industry guest judging and feedback
-
---
-
-**🚀 The AI Olympics transforms TinyTorch from "just another ML course" into a competitive systems engineering community that motivates deep learning, creates lasting engagement, and produces industry-ready graduates.**
--- a/capstone-ideas/ai-olympics.md
+++ b/capstone-ideas/ai-olympics.md
@@ -1,227 +0,0 @@
-# 🏅 AI Olympics: TinyTorch Systems Competition Capstone
-
-## **Core Concept: Compete on Systems Performance, Not Just Accuracy**
-
-Instead of individual projects, Module 20 becomes a **competitive leaderboard** where students optimize their TinyTorch models across multiple **systems engineering dimensions**.
-
-### **🎯 Why AI Olympics is Perfect for TinyTorch**
-
- **Systems Focus**: Compete on memory, speed, efficiency - not just accuracy
- **Real ML Engineering**: Production systems care about performance, not just "does it work"
- **Leaderboard Motivation**: Students naturally want to rank high and beat peers
- **Portfolio Value**: "I ranked #3 in TinyTorch AI Olympics" is impressive
- **Community Building**: Creates ongoing engagement beyond the course
-
---
-
-## **🏆 Competition Categories**
-
-### **Category 1: Speed Demon** ⚡
-*"Fastest inference on standard hardware"*
- **Metric**: Inferences per second on reference hardware
- **Required Skills**: Modules 14-19 optimization techniques
- **Constraint**: Must maintain >90% accuracy on test dataset
-
-### **Category 2: Memory Miser** 💾
-*"Smallest memory footprint"*
- **Metric**: Peak memory usage during inference
- **Required Skills**: Quantization, compression, efficient architectures
- **Constraint**: Must maintain >85% accuracy on test dataset
-
-### **Category 3: Edge Expert** 📱
-*"Best performance on Raspberry Pi"*
- **Metric**: Composite score (speed + accuracy + power efficiency)
- **Required Skills**: ALL optimization modules for edge constraints
- **Constraint**: Must actually run on Pi hardware
-
-### **Category 4: Energy Efficient** 🔋
-*"Lowest power consumption"*
- **Metric**: Energy per inference (joules/prediction)
- **Required Skills**: Model compression, efficient algorithms
- **Constraint**: Must maintain competitive accuracy
-
-### **Category 5: TinyMLPerf** 🏃‍♂️
-*"Official MLPerf-style benchmark"*
- **Metric**: Standardized benchmark suite performance
- **Required Skills**: Complete systems optimization pipeline
- **Constraint**: Must pass all benchmark compliance tests
-
---
-
-## **🎮 Competition Structure**
-
-### **Phase 1: Baseline Submission (Week 1)**
- Submit working model from modules 1-13 (CNN, transformer, or multi-modal)
- Get baseline scores across all categories
- See where you rank on initial leaderboard
-
-### **Phase 2: Optimization Sprint (Weeks 2-4)**
- Apply techniques from modules 14-19 systematically
- **Module 14**: Profile and identify bottlenecks
- **Module 15**: Implement acceleration techniques
- **Module 16**: Add quantization for memory/speed
- **Module 17**: Apply compression for size reduction
- **Module 18**: Implement caching for inference speed
- **Module 19**: Benchmark against production systems
-
-### **Phase 3: Final Submission & Olympics (Week 5)**
- Submit optimized models to all relevant categories
- **Live leaderboard updates** as submissions come in
- **Victory ceremony** with category winners
- **Portfolio artifacts**: Leaderboard rankings + optimization reports
-
---
-
-## **📊 Leaderboard & Scoring System**
-
-### **Public Leaderboard Features**
-```
-🏆 TinyTorch AI Olympics Leaderboard
-
-Speed Demon Category:
-1. alice_chen    847.3 inf/sec  (95.2% acc)  🥇
-2. bob_smith     612.7 inf/sec  (94.8% acc)  🥈
-3. carol_wong    588.1 inf/sec  (96.1% acc)  🥉
-
-Memory Miser Category:
-1. dave_kim      12.4 MB        (91.7% acc)  🥇
-2. eve_patel     15.8 MB        (93.2% acc)  🥈
-3. frank_liu     18.2 MB        (89.9% acc)  🥉
-```
-
-### **Scoring Methodology**
- **Primary Metric**: Category-specific performance (speed, memory, etc.)
- **Accuracy Threshold**: Must meet minimum accuracy to qualify
- **Tie-Breaker**: Higher accuracy wins ties in primary metric
- **Bonus Points**: Novel optimization techniques, exceptional documentation
-
-### **Awards & Recognition**
- **🥇 Category Champions**: Top performer in each category
- **🏆 Overall Systems Engineer**: Best combined performance across categories
- **🚀 Innovation Award**: Most creative optimization approach
- **📚 Teaching Award**: Best documented optimization process
-
---
-
-## **🎯 Required Deliverables**
-
-### **Competition Submission Package**
-1. **Optimized Model**: Runnable TinyTorch implementation
-2. **Performance Report**: Detailed analysis of optimization techniques applied
-3. **Reproduction Guide**: Clear instructions for others to run your solution
-4. **Systems Engineering Documentation**: What you learned about ML systems
-
-### **Portfolio Artifacts Students Get**
- **Leaderboard ranking** across multiple categories
- **Technical optimization report** demonstrating systems engineering skills
- **Benchmark results** comparing their work to industry standards
- **Peer recognition** from competitive performance
-
---
-
-## **🔧 Technical Infrastructure Needed**
-
-### **Leaderboard System**
- Automated submission processing
- Standard evaluation environment
- Real-time ranking updates
- Historical performance tracking
-
-### **Benchmark Suite**
- Reference datasets for each category
- Standard hardware for testing
- Automated compliance checking
- Performance measurement tools
-
-### **Submission Portal**
- Code upload and validation
- Automatic testing pipeline
- Results processing and ranking
- Student dashboard with progress
-
---
-
-## **📈 Why This Beats Individual Projects**
-
-### **Individual Project Problems:**
- ❌ No motivation to optimize beyond "it works"
- ❌ Hard to compare student achievements
- ❌ No ongoing engagement after submission
- ❌ Limited portfolio impact
-
-### **AI Olympics Advantages:**
- ✅ **Natural optimization motivation**: Students want to rank higher
- ✅ **Clear performance comparison**: Leaderboard shows relative achievement
- ✅ **Ongoing engagement**: Leaderboard creates lasting community
- ✅ **Strong portfolio impact**: "I ranked #2 in Memory Efficiency" is compelling
-
-### **Systems Engineering Focus:**
- Forces students to care about **ALL** optimization dimensions
- Makes modules 14-19 essential for competitive performance
- Teaches that "getting it working" is only the beginning
- Demonstrates real-world ML engineering priorities
-
---
-
-## **🚀 Implementation Timeline**
-
-### **Phase 1: Core Infrastructure (4 weeks)**
- Build leaderboard system
- Create benchmark evaluation suite
- Set up automated testing pipeline
- Design submission portal
-
-### **Phase 2: Beta Testing (2 weeks)**
- Test with small group of students
- Refine scoring methodology
- Fix technical issues
- Gather feedback and iterate
-
-### **Phase 3: Full Launch (Ongoing)**
- Deploy for all TinyTorch students
- Monitor and maintain leaderboard
- Regular benchmark updates
- Community management and awards
-
---
-
-## **🎓 Educational Impact**
-
-### **Learning Outcomes**
-Students learn that ML engineering is about:
- **Systems performance**, not just algorithmic correctness
- **Trade-offs** between speed, memory, accuracy, and power
- **Optimization techniques** for real-world constraints
- **Benchmarking and measurement** for objective evaluation
- **Competition and collaboration** in technical communities
-
-### **Career Preparation**
-Students graduate with:
- **Demonstrable systems optimization skills**
- **Portfolio evidence of competitive performance**
- **Experience with ML engineering trade-offs**
- **Understanding of production ML constraints**
- **Community connections** with other systems engineers
-
---
-
-## **💡 Future Extensions**
-
-### **Multi-Semester Competitions**
- New benchmark challenges each semester
- Evolving leaderboards with increasing difficulty
- Alumni participation and mentorship
-
-### **Industry Integration**
- Company-sponsored benchmark challenges
- Internship opportunities for top performers
- Guest judging from ML systems engineers
-
-### **Research Integration**
- Novel optimization techniques become research contributions
- Student innovations feed back into TinyTorch framework
- Academic publications from exceptional submissions
-
---
-
-**🎯 CONCLUSION: AI Olympics transforms Module 20 from "individual project" to "competitive systems engineering challenge" that motivates optimization, builds community, and produces compelling portfolio artifacts.**
--- a/milestones/02_1969_xor_crisis/README.md
+++ b/milestones/02_1969_xor_crisis/README.md
--- a/milestones/02_1969_xor_crisis/xor_crisis.py
+++ b/milestones/02_1969_xor_crisis/xor_crisis.py
--- a/milestones/02_1969_xor_crisis/xor_solved.py
+++ b/milestones/02_1969_xor_crisis/xor_solved.py
--- a/milestones/03_1986_mlp_revival/README.md
+++ b/milestones/03_1986_mlp_revival/README.md
--- a/milestones/03_1986_mlp_revival/UPDATE_SUMMARY.md
+++ b/milestones/03_1986_mlp_revival/UPDATE_SUMMARY.md
--- a/milestones/03_1986_mlp_revival/data/digits_8x8.npz
+++ b/milestones/03_1986_mlp_revival/data/digits_8x8.npz
--- a/milestones/03_1986_mlp_revival/mlp_digits.py
+++ b/milestones/03_1986_mlp_revival/mlp_digits.py
--- a/milestones/03_1986_mlp_revival/mlp_mnist.py
+++ b/milestones/03_1986_mlp_revival/mlp_mnist.py
--- a/milestones/04_1998_cnn_revolution/README.md
+++ b/milestones/04_1998_cnn_revolution/README.md
--- a/milestones/04_1998_cnn_revolution/cnn_digits.py
+++ b/milestones/04_1998_cnn_revolution/cnn_digits.py
--- a/milestones/04_1998_cnn_revolution/lecun_cifar10.py
+++ b/milestones/04_1998_cnn_revolution/lecun_cifar10.py
--- a/milestones/05_2017_transformer_era/README.md
+++ b/milestones/05_2017_transformer_era/README.md
@@ -4,63 +4,20 @@

 ## 🎯 What You'll Build

-Three progressively impressive demos:
+A character-level transformer trained on Shakespeare's works - the classic "hello world" of language modeling!

-### Step 1: Quick Validation (5 minutes)
-**File**: `step1_quick_validation.py`  
-**Goal**: Verify transformer pipeline works
+### Shakespeare Text Generation
+**File**: `vaswani_shakespeare.py`  
+**Goal**: Build a transformer that generates Shakespeare-style text

 ```bash
-python step1_quick_validation.py
-```
-
-**What it does**:
- Trains on simple repeating text ("hello world")
- Proves modules 10-13 are connected correctly
- Quick sanity check before bigger demos
-
-**Success**: Generates "hello world" pattern
-
---
-
-### Step 2: TinyCoder (15 minutes) 🔥
-**File**: `step2_tinycoder.py`  
-**Goal**: Code completion like GitHub Copilot!
-
-```bash
-python step2_tinycoder.py
-```
-
-**What it does**:
- Trains on YOUR TinyTorch Python code
- Learns code patterns (def, class, self, etc.)
- Generates syntactically valid Python completions
-
-**Demo**:
-```python
-Input:  'def forward(self, x):'
-Output: 'def forward(self, x):\n    return self.layer(x)'
-
-Input:  'import '
-Output: 'import numpy as np'
-```
-
-**Epic moment**: "I built GitHub Copilot!"
-
---
-
-### Step 3: Shakespeare (15 minutes)
-**File**: `step3_shakespeare.py`  
-**Goal**: Traditional text generation demo
-
-```bash
-python step3_shakespeare.py
+python vaswani_shakespeare.py
 ```

 **What it does**:
 - Downloads Tiny Shakespeare dataset
- Trains character-level transformer
- Generates Shakespeare-style text
+- Trains character-level transformer (YOUR implementation!)
+- Generates coherent Shakespeare-style text

 **Demo**:
 ```
@@ -69,8 +26,6 @@ Output: 'To be or not to be, that is the question
         Whether tis nobler in the mind to suffer...'
 ```

-**Classic**: Traditional "hello world" for language models
-
 ---

 ## 🚀 Quick Start
@@ -82,34 +37,18 @@ Complete these TinyTorch modules:
 - ✅ Module 12: Attention
 - ✅ Module 13: Transformers

-### Run in Order
+### Run the Example

 ```bash
-# 1. Quick validation (5 min)
-python step1_quick_validation.py
-
-# 2. Code completion (15 min) - THE EPIC ONE
-python step2_tinycoder.py
-
-# 3. Shakespeare (15 min) - traditional demo
-python step3_shakespeare.py
+# Train transformer on Shakespeare (15-20 min)
+python vaswani_shakespeare.py
 ```

 ---

-## 📊 What Each Demo Teaches
-
-| Demo | Dataset | Tokenizer | Time | Epic Factor | What You Learn |
-|------|---------|-----------|------|-------------|----------------|
-| **Step 1** | Simple text | CharTokenizer | 5 min | ⭐⭐ | Pipeline works |
-| **Step 2** | TinyTorch code | BPETokenizer | 15 min | ⭐⭐⭐⭐⭐ | YOU built Copilot! |
-| **Step 3** | Shakespeare | CharTokenizer | 15 min | ⭐⭐⭐⭐ | Language modeling |
-
---
-
 ## 🎓 Learning Outcomes

-After completing these milestones, you'll understand:
+After completing this milestone, you'll understand:

 ### Technical Mastery
 - ✅ How tokenization bridges text and numbers
@@ -248,11 +187,12 @@ model = TinyGPT(

 You've succeeded when:

-**Step 1**: Model generates repeating pattern  
-**Step 2**: Code completions are syntactically valid  
-**Step 3**: Shakespeare text is coherent (even if not perfect)
+✅ Model trains without errors  
+✅ Loss decreases over training epochs  
+✅ Generated Shakespeare text is coherent (even if not perfect)  
+✅ You can generate text with custom prompts  

-**Don't expect perfection!** Production models train for months on massive data. Your demos prove you understand the architecture!
+**Don't expect perfection!** Production models train for months on massive data. Your demo proves you understand the architecture!

 ---

@@ -285,4 +225,4 @@ The transformer architecture you implemented powers:

 ---

-**Ready to generate some text?** Start with `step1_quick_validation.py`!
+**Ready to generate some text?** Run `python vaswani_shakespeare.py`!
--- a/milestones/05_2017_transformer_era/vaswani_shakespeare.py
+++ b/milestones/05_2017_transformer_era/vaswani_shakespeare.py
@@ -23,12 +23,12 @@ MODULES EXERCISED IN THIS EXAMPLE:
 Transformer Architecture (Bottom to Top Flow):

    ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
-    │                                                    Output Logits                                                          │
-    │                                             Vocabulary Predictions (1000)                                                 │
+    │                                                    Output Logits                                                         │
+    │                                             Vocabulary Predictions (1000)                                                │
    └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
                                                                  ▲
    ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
-    │                                                 Output Projection                                                         │
+    │                                                 Output Projection                                                        │
    │                                          Module 04: vectors → vocabulary                                                 │
    └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
                                                                  ▲
@@ -39,41 +39,41 @@ Transformer Architecture (Bottom to Top Flow):
                                                                  ▲
    ╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
    ║                                             Transformer Block × 4 (Repeat)                                               ║
-    ║ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐  ║
-    ║ │                                                   Layer Norm                                                       │  ║
-    ║ │                                            Module 14: Post-FFN normalization                                       │  ║
-    ║ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘  ║
+    ║ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐   ║
+    ║ │                                                   Layer Norm                                                       │   ║
+    ║ │                                            Module 14: Post-FFN normalization                                       │   ║
+    ║ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘   ║
    ║                                                           ▲                                                              ║
-    ║ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐  ║
-    ║ │                                            Feed Forward Network (FFN)                                              │  ║
-    ║ │                                   Module 04: Linear(128→512) → ReLU → Linear(512→128)                               │  ║
-    ║ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘  ║
+    ║ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐   ║
+    ║ │                                            Feed Forward Network (FFN)                                              │   ║
+    ║ │                                   Module 04: Linear(128→512) → ReLU → Linear(512→128)                              │   ║
+    ║ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘   ║
    ║                                                           ▲                                                              ║
-    ║ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐  ║
-    ║ │                                                   Layer Norm                                                       │  ║
-    ║ │                                          Module 14: Post-attention normalization                                   │  ║
-    ║ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘  ║
+    ║ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐   ║
+    ║ │                                                   Layer Norm                                                       │   ║
+    ║ │                                          Module 14: Post-attention normalization                                   │   ║
+    ║ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘   ║
    ║                                                           ▲                                                              ║
-    ║ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐  ║
-    ║ │                                           Multi-Head Self-Attention                                                │  ║
-    ║ │                                       Module 13: 8 heads × (Q·K^T/√d_k)·V                                          │  ║
-    ║ │                                  Each head: 16-dim attention on 128-dim embeddings                                 │  ║
-    ║ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘  ║
+    ║ ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐   ║
+    ║ │                                           Multi-Head Self-Attention                                                │   ║
+    ║ │                                       Module 13: 8 heads × (Q·K^T/√d_k)·V                                          │   ║
+    ║ │                                  Each head: 16-dim attention on 128-dim embeddings                                 │   ║
+    ║ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘   ║
    ╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
-                                                                  ▲
+                                                                ▲
    ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
-    │                                               Positional Encoding                                                         │
-    │                                    Module 12: Add position information (sin/cos)                                          │
+    │                                               Positional Encoding                                                        │
+    │                                    Module 12: Add position information (sin/cos)                                         │
    └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
-                                                                  ▲
+                                                                ▲
    ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
-    │                                                Token Embeddings                                                           │
+    │                                                Token Embeddings                                                          │
    │                                      Module 12: tokens → 128-dim vectors                                                 │
    └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
-                                                                  ▲
+                                                                ▲
    ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
-    │                                                  Input Tokens                                                             │
-    │                                          [token_1, token_2, ..., token_10]                                                │
+    │                                                  Input Tokens                                                            │
+    │                                          [token_1, token_2, ..., token_10]                                               │
    └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

 Key Insight: Attention allows each token to "look at" all other tokens
--- a/milestones/05_2017_transformer_era/step1_quick_validation.py
+++ b/milestones/05_2017_transformer_era/step1_quick_validation.py
@@ -1,288 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 1: Quick Validation - Transformer Pipeline Test
-====================================================
-
-GOAL: Verify transformer modules work end-to-end in 5 minutes
-DATASET: Simple repeating text (no download needed)
-TOKENIZER: CharTokenizer (no training needed)
-TIME: ~5 minutes
-
-This is the simplest possible test to prove:
-✅ Modules 10-13 are connected correctly
-✅ Training loop works
-✅ Generation works
-
-If this passes, the pipeline is functional!
-"""
-
-import numpy as np
-import sys
-import os
-
-# Add project root to path
-project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-sys.path.insert(0, project_root)
-
-from tinytorch.core.tensor import Tensor
-from tinytorch.text.tokenization import CharTokenizer
-from tinytorch.core.embeddings import Embedding, PositionalEncoding
-from tinytorch.core.attention import MultiHeadAttention
-from tinytorch.models.transformer import TransformerBlock, LayerNorm
-from tinytorch.core.layers import Linear
-from tinytorch.core.optimizers import Adam
-
-
-class TinyGPT:
-    """Minimal GPT for quick validation."""
-    
-    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_length):
-        self.vocab_size = vocab_size
-        self.embed_dim = embed_dim
-        
-        # Token + position embeddings
-        self.token_embedding = Embedding(vocab_size, embed_dim)
-        self.pos_encoding = PositionalEncoding(max_length, embed_dim)
-        
-        # Transformer blocks
-        self.blocks = []
-        for _ in range(num_layers):
-            block = TransformerBlock(embed_dim, num_heads, embed_dim * 4)
-            self.blocks.append(block)
-        
-        # Output projection
-        self.ln_f = LayerNorm(embed_dim)
-        self.head = Linear(embed_dim, vocab_size)
-    
-    def forward(self, idx):
-        """Forward pass through the model."""
-        B, T = idx.shape
-        
-        # Token + positional embeddings
-        tok_emb = self.token_embedding.forward(idx)  # (B, T, embed_dim)
-        x = self.pos_encoding.forward(tok_emb)  # (B, T, embed_dim) - includes positional info
-        
-        # Transformer blocks
-        for block in self.blocks:
-            x = block(x)
-        
-        # Output head
-        x = self.ln_f(x)
-        logits = self.head(x)  # (B, T, vocab_size)
-        
-        return logits
-    
-    def generate(self, idx, max_new_tokens, temperature=1.0):
-        """Generate new tokens autoregressively."""
-        for _ in range(max_new_tokens):
-            # Crop context if needed
-            idx_cond = idx if idx.shape[1] <= 128 else idx[:, -128:]
-            
-            # Get predictions
-            logits = self.forward(idx_cond)
-            
-            # Focus on last time step
-            logits = logits[:, -1, :] / temperature  # (B, vocab_size)
-            
-            # Greedy decoding: pick the most likely token (no sampling; temperature does not change the argmax)
-            next_idx = np.argmax(logits.data, axis=-1, keepdims=True)
-            
-            # Append to sequence
-            idx = Tensor(np.concatenate([idx.data, next_idx], axis=1))
-        
-        return idx
-    
-    def parameters(self):
-        """Get all trainable parameters."""
-        params = []
-        params.extend(self.token_embedding.parameters())
-        for block in self.blocks:
-            params.extend(block.parameters())
-        params.extend(self.ln_f.parameters())
-        params.extend(self.head.parameters())
-        return params
-
-
-def main():
-    print("="*70)
-    print("🚀 Step 1: Quick Transformer Validation")
-    print("="*70)
-    print()
-    
-    # ========================================
-    # 1. Prepare simple repeating text
-    # ========================================
-    print("📝 Step 1: Preparing data...")
-    text = "hello world! " * 200  # Simple repeating pattern
-    print(f"   Text length: {len(text)} characters")
-    print(f"   Sample: '{text[:50]}...'")
-    print()
-    
-    # ========================================
-    # 2. Tokenize (character-level)
-    # ========================================
-    print("🔤 Step 2: Tokenizing...")
-    tokenizer = CharTokenizer()
-    
-    # Build vocab from text
-    unique_chars = sorted(list(set(text)))
-    tokenizer.vocab = unique_chars
-    tokenizer.char_to_idx = {ch: i for i, ch in enumerate(unique_chars)}
-    tokenizer.idx_to_char = {i: ch for i, ch in enumerate(unique_chars)}
-    
-    # Encode text
-    data = tokenizer.encode(text)
-    vocab_size = len(tokenizer.vocab)
-    
-    print(f"   Vocabulary size: {vocab_size} unique characters")
-    print(f"   Tokens: {data[:20]}...")
-    print(f"   Vocab: {tokenizer.vocab}")
-    print()
-    
-    # ========================================
-    # 3. Create training batches
-    # ========================================
-    print("📦 Step 3: Creating batches...")
-    block_size = 32  # Context length
-    batch_size = 4
-    
-    def get_batch():
-        """Get a random batch of data."""
-        ix = np.random.randint(0, len(data) - block_size, size=batch_size)
-        x = np.array([data[i:i+block_size] for i in ix])
-        y = np.array([data[i+1:i+block_size+1] for i in ix])
-        return Tensor(x), Tensor(y)
-    
-    x_sample, y_sample = get_batch()
-    print(f"   Batch size: {batch_size}")
-    print(f"   Block size: {block_size}")
-    print(f"   Input shape: {x_sample.shape}")
-    print(f"   Target shape: {y_sample.shape}")
-    print()
-    
-    # ========================================
-    # 4. Initialize model
-    # ========================================
-    print("🤖 Step 4: Initializing TinyGPT...")
-    model = TinyGPT(
-        vocab_size=vocab_size,
-        embed_dim=64,      # Small for fast training
-        num_heads=4,
-        num_layers=2,      # Just 2 layers
-        max_length=block_size
-    )
-    
-    total_params = sum(p.data.size for p in model.parameters())
-    print(f"   Model parameters: {total_params:,}")
-    print(f"   Architecture: {len(model.blocks)} transformer blocks")
-    print()
-    
-    # ========================================
-    # 5. Train
-    # ========================================
-    print("🏋️  Step 5: Training (10 steps)...")
-    optimizer = Adam(model.parameters(), learning_rate=3e-4)
-    
-    for step in range(10):
-        # Get batch
-        xb, yb = get_batch()
-        
-        # Forward pass
-        logits = model.forward(xb)
-        
-        # Compute loss (MSE against one-hot targets; a simplified stand-in for cross-entropy)
-        B, T, C = logits.shape
-        logits_flat = logits.data.reshape(B*T, C)
-        targets_flat = yb.data.reshape(B*T)
-        
-        # One-hot encode targets
-        targets_one_hot = np.zeros((B*T, C))
-        for i, t in enumerate(targets_flat):
-            targets_one_hot[i, int(t)] = 1.0
-        
-        # MSE loss (simplified)
-        loss_value = np.mean((logits_flat - targets_one_hot) ** 2)
-        
-        # Backward (simplified - just for demo)
-        # In real training, this would compute gradients
-        
-        # Update (simplified)
-        # optimizer.step()
-        # optimizer.zero_grad()
-        
-        if step % 2 == 0:
-            print(f"   Step {step:2d}/10 | Loss: {loss_value:.4f}")
-    
-    print()
-    
-    # ========================================
-    # 6. Generate
-    # ========================================
-    print("✨ Step 6: Generating text...")
-    
-    # Start with "hello"
-    context = "hello"
-    context_tokens = tokenizer.encode(context)
-    idx = Tensor(np.array([context_tokens]))
-    
-    # Generate 20 new tokens
-    generated = model.generate(idx, max_new_tokens=20)
-    
-    # Decode
-    output = tokenizer.decode(generated.data[0].tolist())
-    
-    print(f"   Input: '{context}'")
-    print(f"   Generated: '{output}'")
-    print()
-    
-    # ========================================
-    # 7. Validation
-    # ========================================
-    print("="*70)
-    print("✅ Validation Results:")
-    print("="*70)
-    
-    checks = []
-    
-    # Check 1: Model initialized
-    checks.append(("Model initialization", total_params > 0))
-    
-    # Check 2: Forward pass works
-    try:
-        test_logits = model.forward(xb)
-        checks.append(("Forward pass", test_logits.shape == (batch_size, block_size, vocab_size)))
-    except Exception as e:
-        checks.append(("Forward pass", False))
-        print(f"   Error: {e}")
-    
-    # Check 3: Generation works
-    checks.append(("Text generation", len(output) > len(context)))
-    
-    # Check 4: Output is decodable
-    checks.append(("Output decodable", all(c in tokenizer.vocab for c in output)))
-    
-    # Print results
-    for check_name, passed in checks:
-        status = "✅" if passed else "❌"
-        print(f"{status} {check_name}")
-    
-    print()
-    
-    if all(passed for _, passed in checks):
-        print("🎉 SUCCESS! Transformer pipeline is working!")
-        print()
-        print("Next steps:")
-        print("  → Run step2_tinycoder.py for code completion demo")
-        print("  → Run step3_shakespeare.py for text generation demo")
-    else:
-        print("⚠️  Some checks failed. Debug modules 10-13.")
-    
-    print("="*70)
-
-
-if __name__ == "__main__":
-    main()
-
-
-
-
--- a/milestones/05_2017_transformer_era/step2_tinycoder.py
+++ b/milestones/05_2017_transformer_era/step2_tinycoder.py
@@ -1,338 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 2: TinyCoder - Code Autocompletion with Transformers
-==========================================================
-
-GOAL: Build GitHub Copilot using YOUR TinyTorch code
-DATASET: Your actual TinyTorch modules (already exists!)
-TOKENIZER: BPETokenizer (learns code patterns)
-TIME: ~15 minutes
-
-This demonstrates:
-✅ Transformer trained on real Python code
-✅ Generates syntactically valid completions
-✅ YOU built the tool you use daily!
-
-Epic moment: "IT'S COPILOT!"
-"""
-
-import numpy as np
-import sys
-import os
-import glob
-import re
-
-# Add project root to path
-project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-sys.path.insert(0, project_root)
-
-from tinytorch.core.tensor import Tensor
-from tinytorch.text.tokenization import BPETokenizer
-from tinytorch.core.embeddings import Embedding, PositionalEncoding
-from tinytorch.core.attention import MultiHeadAttention
-from tinytorch.models.transformer import TransformerBlock, LayerNorm
-from tinytorch.core.layers import Linear
-from tinytorch.core.optimizers import Adam
-
-
-class TinyCoder:
-    """Code completion transformer - like GitHub Copilot!"""
-    
-    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_length):
-        self.vocab_size = vocab_size
-        self.embed_dim = embed_dim
-        self.max_length = max_length
-        
-        # Token + position embeddings
-        self.token_embedding = Embedding(vocab_size, embed_dim)
-        self.pos_encoding = PositionalEncoding(max_length, embed_dim)
-        
-        # Transformer blocks
-        self.blocks = []
-        for _ in range(num_layers):
-            block = TransformerBlock(embed_dim, num_heads, embed_dim * 4)
-            self.blocks.append(block)
-        
-        # Output projection
-        self.ln_f = LayerNorm(embed_dim)
-        self.head = Linear(embed_dim, vocab_size)
-    
-    def forward(self, idx):
-        """Forward pass through the model."""
-        B, T = idx.shape
-        
-        # Token + positional embeddings
-        tok_emb = self.token_embedding.forward(idx)
-        x = self.pos_encoding.forward(tok_emb)
-        
-        # Transformer blocks
-        for block in self.blocks:
-            x = block(x)
-        
-        # Output head
-        x = self.ln_f(x)
-        logits = self.head(x)
-        
-        return logits
-    
-    def complete(self, tokenizer, prefix, max_new_tokens=20):
-        """
-        Complete code given a prefix.
-        
-        Args:
-            tokenizer: BPETokenizer instance
-            prefix: String prefix to complete
-            max_new_tokens: How many tokens to generate
-            
-        Returns:
-            Completed code string
-        """
-        # Encode prefix
-        tokens = tokenizer.encode(prefix)
-        idx = Tensor(np.array([tokens]))
-        
-        # Generate
-        for _ in range(max_new_tokens):
-            # Crop if too long
-            idx_cond = idx if idx.shape[1] <= self.max_length else idx[:, -self.max_length:]
-            
-            # Forward pass
-            logits = self.forward(idx_cond)
-            
-            # Get next token (greedy)
-            next_token = np.argmax(logits.data[0, -1, :])
-            
-            # Stop at the first whitespace-only token (newline, space, or tab) to keep completions short
-            if tokenizer.decode([next_token]).strip() == '':
-                break
-            
-            # Append
-            idx = Tensor(np.concatenate([idx.data, [[next_token]]], axis=1))
-        
-        # Decode
-        full_output = tokenizer.decode(idx.data[0].tolist())
-        
-        # Return only the new part
-        return full_output[len(prefix):]
-    
-    def parameters(self):
-        """Get all trainable parameters."""
-        params = []
-        params.extend(self.token_embedding.parameters())
-        for block in self.blocks:
-            params.extend(block.parameters())
-        params.extend(self.ln_f.parameters())
-        params.extend(self.head.parameters())
-        return params
-
-
-def load_tinytorch_code():
-    """Load all Python code from TinyTorch modules."""
-    print("📂 Loading TinyTorch source code...")
-    
-    # Find all Python module files
-    module_dir = os.path.join(project_root, "modules", "source")
-    python_files = []
-    
-    # Get .py files from numbered module directories
-    for module_num in range(1, 14):  # Modules 01-13
-        pattern = os.path.join(module_dir, f"{module_num:02d}_*", "*_dev.py")
-        files = glob.glob(pattern)
-        python_files.extend(files)
-    
-    print(f"   Found {len(python_files)} module files")
-    
-    # Read all code
-    all_code = []
-    total_lines = 0
-    
-    for file_path in python_files:
-        try:
-            with open(file_path, 'r', encoding='utf-8') as f:
-                code = f.read()
-                all_code.append(code)
-                lines = code.count('\n')
-                total_lines += lines
-                
-                module_name = os.path.basename(os.path.dirname(file_path))
-                print(f"   ✓ {module_name}: {lines:,} lines")
-        except Exception as e:
-            print(f"   ✗ Error reading {file_path}: {e}")
-    
-    # Combine all code
-    combined_code = ("\n\n# " + "="*50 + "\n\n").join(all_code)  # separator goes between every file
-    
-    print(f"\n   Total: {total_lines:,} lines of Python code")
-    print(f"   Characters: {len(combined_code):,}")
-    
-    return combined_code
-
-
-def main():
-    print("="*70)
-    print("🤖 TinyCoder: Building GitHub Copilot with Transformers")
-    print("="*70)
-    print()
-    print("This trains a transformer on YOUR TinyTorch code to generate")
-    print("code completions - the same technology behind GitHub Copilot!")
-    print()
-    
-    # ========================================
-    # 1. Load training data
-    # ========================================
-    code_corpus = load_tinytorch_code()
-    print()
-    
-    # ========================================
-    # 2. Train BPE tokenizer
-    # ========================================
-    print("🔤 Training BPE tokenizer on code...")
-    
-    vocab_size = 1000
-    tokenizer = BPETokenizer(vocab_size=vocab_size)
-    
-    # Train tokenizer to learn code patterns
-    print(f"   Learning {vocab_size} subword units from code...")
-    tokenizer.train(code_corpus)
-    
-    # Show some learned tokens
-    print(f"\n   Vocabulary size: {len(tokenizer.vocab)}")
-    print(f"   Sample tokens:")
-    
-    # Find interesting tokens (Python keywords, common patterns)
-    interesting = []
-    for token in list(tokenizer.vocab.keys())[:50]:
-        if any(keyword in token for keyword in ['def', 'class', 'import', 'self', 'return']):
-            interesting.append(token)
-    
-    for token in interesting[:10]:
-        print(f"      '{token}'")
-    
-    # Encode the corpus
-    print(f"\n   Tokenizing corpus...")
-    tokens = tokenizer.encode(code_corpus)
-    print(f"   Total tokens: {len(tokens):,}")
-    print()
-    
-    # ========================================
-    # 3. Prepare training data
-    # ========================================
-    print("📦 Preparing training batches...")
-    
-    block_size = 128  # Context length
-    batch_size = 4
-    
-    def get_batch():
-        """Get a random batch of code."""
-        ix = np.random.randint(0, len(tokens) - block_size, size=batch_size)
-        x = np.array([tokens[i:i+block_size] for i in ix])
-        y = np.array([tokens[i+1:i+block_size+1] for i in ix])
-        return Tensor(x), Tensor(y)
-    
-    print(f"   Block size: {block_size} tokens")
-    print(f"   Batch size: {batch_size} sequences")
-    print()
-    
-    # ========================================
-    # 4. Initialize model
-    # ========================================
-    print("🏗️  Building TinyCoder model...")
-    
-    model = TinyCoder(
-        vocab_size=vocab_size,
-        embed_dim=128,
-        num_heads=8,
-        num_layers=4,
-        max_length=block_size
-    )
-    
-    total_params = sum(p.data.size for p in model.parameters())
-    print(f"   Parameters: {total_params:,}")
-    print(f"   Layers: {len(model.blocks)} transformer blocks")
-    print(f"   Heads: 8 attention heads per block")
-    print()
-    
-    # ========================================
-    # 5. Train
-    # ========================================
-    print("🏋️  Training on YOUR code (20 steps)...")
-    print("   (In production, this would be 1000s of steps)")
-    print()
-    
-    optimizer = Adam(model.parameters(), learning_rate=3e-4)
-    
-    for step in range(20):
-        # Get batch
-        xb, yb = get_batch()
-        
-        # Forward
-        logits = model.forward(xb)
-        
-        # Loss (MSE on one-hot targets; forward-only demo, no gradient step is taken)
-        B, T, C = logits.shape
-        logits_flat = logits.data.reshape(B*T, C)
-        targets_flat = yb.data.reshape(B*T)
-        
-        # One-hot
-        targets_one_hot = np.zeros((B*T, C))
-        for i, t in enumerate(targets_flat):
-            if 0 <= int(t) < C:
-                targets_one_hot[i, int(t)] = 1.0
-        
-        loss_value = np.mean((logits_flat - targets_one_hot) ** 2)
-        
-        if step % 5 == 0:
-            print(f"   Step {step:3d}/20 | Loss: {loss_value:.4f}")
-    
-    print()
-    
-    # ========================================
-    # 6. Demo completions!
-    # ========================================
-    print("="*70)
-    print("✨ CODE COMPLETION DEMO")
-    print("="*70)
-    print()
-    
-    demos = [
-        "import ",
-        "def forward(self, x):",
-        "class Linear:",
-        "self.",
-        "return ",
-    ]
-    
-    for prompt in demos:
-        completion = model.complete(tokenizer, prompt, max_new_tokens=10)
-        print(f"Input:  '{prompt}'")
-        print(f"Output: '{prompt}{completion}'")
-        print()
-    
-    # ========================================
-    # 7. Success!
-    # ========================================
-    print("="*70)
-    print("🏆 SUCCESS! You Built GitHub Copilot!")
-    print("="*70)
-    print()
-    print("What you learned:")
-    print("  ✅ Transformers can learn code patterns")
-    print("  ✅ BPE tokenization captures syntax")
-    print("  ✅ Autoregressive generation produces valid code")
-    print("  ✅ This is THE SAME architecture as Copilot!")
-    print()
-    print("Production differences:")
-    print(f"  • Real Copilot: 12B+ parameters (you: {total_params:,})")
-    print("  • Real Copilot: Trained on billions of lines")
-    print("  • Real Copilot: GPU inference <50ms")
-    print("  • But the ARCHITECTURE is what YOU built!")
-    print()
-    print("="*70)
-
-
-if __name__ == "__main__":
-    main()
-
-
-
-
--- a/milestones/05_2017_transformer_era/step3_shakespeare.py
+++ b/milestones/05_2017_transformer_era/step3_shakespeare.py
@@ -1,349 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 3: TinyGPT - Shakespeare Text Generation
-=============================================
-
-GOAL: Traditional transformer demo - generate Shakespeare-style text
-DATASET: Tiny Shakespeare (1MB text file)
-TOKENIZER: CharTokenizer (character-level for simplicity)
-TIME: ~15 minutes
-
-This demonstrates:
-✅ Transformer learns language patterns
-✅ Generates coherent text in Shakespeare's style
-✅ Traditional "hello world" for language models
-
-Classic demo: "To be or not to be..."
-"""
-
-import numpy as np
-import sys
-import os
-import urllib.request
-
-# Add project root to path
-project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-sys.path.insert(0, project_root)
-
-from tinytorch.core.tensor import Tensor
-from tinytorch.text.tokenization import CharTokenizer
-from tinytorch.core.embeddings import Embedding, PositionalEncoding
-from tinytorch.core.attention import MultiHeadAttention
-from tinytorch.models.transformer import TransformerBlock, LayerNorm
-from tinytorch.core.layers import Linear
-from tinytorch.core.optimizers import Adam
-
-
-class TinyGPT:
-    """Shakespeare text generation transformer."""
-    
-    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_length):
-        self.vocab_size = vocab_size
-        self.embed_dim = embed_dim
-        self.max_length = max_length
-        
-        # Embeddings
-        self.token_embedding = Embedding(vocab_size, embed_dim)
-        self.pos_encoding = PositionalEncoding(max_length, embed_dim)
-        
-        # Transformer blocks
-        self.blocks = []
-        for _ in range(num_layers):
-            block = TransformerBlock(embed_dim, num_heads, embed_dim * 4)
-            self.blocks.append(block)
-        
-        # Output
-        self.ln_f = LayerNorm(embed_dim)
-        self.head = Linear(embed_dim, vocab_size)
-    
-    def forward(self, idx):
-        """Forward pass."""
-        B, T = idx.shape
-        
-        # Embeddings
-        tok_emb = self.token_embedding.forward(idx)
-        x = self.pos_encoding.forward(tok_emb)
-        
-        # Transformer blocks
-        for block in self.blocks:
-            x = block(x)
-        
-        # Output
-        x = self.ln_f(x)
-        logits = self.head(x)
-        
-        return logits
-    
-    def generate(self, tokenizer, start_text, max_new_tokens=100, temperature=0.8):
-        """
-        Generate text starting from start_text.
-        
-        Args:
-            tokenizer: CharTokenizer instance
-            start_text: String to start generation from
-            max_new_tokens: How many characters to generate
-            temperature: Sampling temperature (higher = more random)
-            
-        Returns:
-            Generated text string
-        """
-        # Encode start
-        tokens = tokenizer.encode(start_text)
-        idx = Tensor(np.array([tokens]))
-        
-        # Generate
-        for _ in range(max_new_tokens):
-            # Crop if too long
-            idx_cond = idx if idx.shape[1] <= self.max_length else idx[:, -self.max_length:]
-            
-            # Forward
-            logits = self.forward(idx_cond)
-            
-            # Last token predictions
-            logits_last = logits.data[0, -1, :] / temperature
-            
-            # Softmax
-            probs = np.exp(logits_last - np.max(logits_last))
-            probs = probs / np.sum(probs)
-            
-            # Sample (or greedy if temperature very low)
-            if temperature < 0.1:
-                next_token = np.argmax(probs)
-            else:
-                next_token = np.random.choice(len(probs), p=probs)
-            
-            # Append
-            idx = Tensor(np.concatenate([idx.data, [[next_token]]], axis=1))
-        
-        # Decode
-        return tokenizer.decode(idx.data[0].tolist())
-    
-    def parameters(self):
-        """Get all parameters."""
-        params = []
-        params.extend(self.token_embedding.parameters())
-        for block in self.blocks:
-            params.extend(block.parameters())
-        params.extend(self.ln_f.parameters())
-        params.extend(self.head.parameters())
-        return params
-
-
-def download_shakespeare():
-    """Download Tiny Shakespeare dataset."""
-    url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
-    data_dir = os.path.join(project_root, "milestones", "datasets")
-    os.makedirs(data_dir, exist_ok=True)
-    
-    file_path = os.path.join(data_dir, "shakespeare.txt")
-    
-    if os.path.exists(file_path):
-        print(f"   ✓ Dataset already exists at {file_path}")
-    else:
-        print(f"   Downloading from {url}...")
-        try:
-            urllib.request.urlretrieve(url, file_path)
-            print(f"   ✓ Downloaded to {file_path}")
-        except Exception as e:
-            print(f"   ✗ Download failed: {e}")
-            print(f"   Please manually download from: {url}")
-            print(f"   And save to: {file_path}")
-            return None
-    
-    # Read text
-    with open(file_path, 'r', encoding='utf-8') as f:
-        text = f.read()
-    
-    return text
-
-
-def main():
-    print("="*70)
-    print("📜 TinyGPT: Shakespeare Text Generation")
-    print("="*70)
-    print()
-    print("Train a transformer on Shakespeare's works to generate")
-    print("authentic-sounding Early Modern English!")
-    print()
-    
-    # ========================================
-    # 1. Download dataset
-    # ========================================
-    print("📥 Step 1: Loading Shakespeare dataset...")
-    text = download_shakespeare()
-    
-    if text is None:
-        print("Failed to load dataset. Exiting.")
-        return
-    
-    print(f"   Text length: {len(text):,} characters")
-    print(f"   Sample:")
-    print(f"   {text[:200]}...")
-    print()
-    
-    # ========================================
-    # 2. Tokenize
-    # ========================================
-    print("🔤 Step 2: Tokenizing (character-level)...")
-    
-    tokenizer = CharTokenizer()
-    
-    # Build vocab
-    unique_chars = sorted(list(set(text)))
-    tokenizer.vocab = unique_chars
-    tokenizer.char_to_idx = {ch: i for i, ch in enumerate(unique_chars)}
-    tokenizer.idx_to_char = {i: ch for i, ch in enumerate(unique_chars)}
-    
-    # Encode
-    data = tokenizer.encode(text)
-    vocab_size = len(tokenizer.vocab)
-    
-    print(f"   Vocabulary size: {vocab_size} unique characters")
-    print(f"   Total tokens: {len(data):,}")
-    print(f"   Characters: {tokenizer.vocab[:20]}...")
-    print()
-    
-    # ========================================
-    # 3. Split train/val
-    # ========================================
-    print("📊 Step 3: Preparing data splits...")
-    
-    n = len(data)
-    train_data = data[:int(n*0.9)]
-    val_data = data[int(n*0.9):]
-    
-    print(f"   Train: {len(train_data):,} tokens")
-    print(f"   Val:   {len(val_data):,} tokens")
-    print()
-    
-    # ========================================
-    # 4. Batching
-    # ========================================
-    block_size = 128
-    batch_size = 4
-    
-    def get_batch(split='train'):
-        """Get a batch of data."""
-        data_split = train_data if split == 'train' else val_data
-        ix = np.random.randint(0, len(data_split) - block_size, size=batch_size)
-        x = np.array([data_split[i:i+block_size] for i in ix])
-        y = np.array([data_split[i+1:i+block_size+1] for i in ix])
-        return Tensor(x), Tensor(y)
-    
-    # ========================================
-    # 5. Initialize model
-    # ========================================
-    print("🏗️  Step 4: Building TinyGPT...")
-    
-    model = TinyGPT(
-        vocab_size=vocab_size,
-        embed_dim=128,
-        num_heads=8,
-        num_layers=4,
-        max_length=block_size
-    )
-    
-    total_params = sum(p.data.size for p in model.parameters())
-    print(f"   Parameters: {total_params:,}")
-    print(f"   Architecture: {len(model.blocks)} transformer blocks")
-    print()
-    
-    # ========================================
-    # 6. Train
-    # ========================================
-    print("🏋️  Step 5: Training on Shakespeare (50 steps)...")
-    print("   (In production, this would be 5000+ steps)")
-    print()
-    
-    optimizer = Adam(model.parameters(), learning_rate=3e-4)
-    
-    for step in range(50):
-        # Get batch
-        xb, yb = get_batch('train')
-        
-        # Forward
-        logits = model.forward(xb)
-        
-        # Loss (MSE on one-hot targets; forward-only demo, no gradient step is taken)
-        B, T, C = logits.shape
-        logits_flat = logits.data.reshape(B*T, C)
-        targets_flat = yb.data.reshape(B*T)
-        
-        # One-hot
-        targets_one_hot = np.zeros((B*T, C))
-        for i, t in enumerate(targets_flat):
-            targets_one_hot[i, int(t)] = 1.0
-        
-        loss_value = np.mean((logits_flat - targets_one_hot) ** 2)
-        
-        # Validation loss every 10 steps
-        if step % 10 == 0:
-            xb_val, yb_val = get_batch('val')
-            logits_val = model.forward(xb_val)
-            
-            B_val, T_val, C_val = logits_val.shape
-            logits_val_flat = logits_val.data.reshape(B_val*T_val, C_val)
-            targets_val_flat = yb_val.data.reshape(B_val*T_val)
-            
-            targets_val_one_hot = np.zeros((B_val*T_val, C_val))
-            for i, t in enumerate(targets_val_flat):
-                targets_val_one_hot[i, int(t)] = 1.0
-            
-            val_loss = np.mean((logits_val_flat - targets_val_one_hot) ** 2)
-            
-            print(f"   Step {step:3d}/50 | Train Loss: {loss_value:.4f} | Val Loss: {val_loss:.4f}")
-    
-    print()
-    
-    # ========================================
-    # 7. Generate!
-    # ========================================
-    print("="*70)
-    print("✨ SHAKESPEARE GENERATION")
-    print("="*70)
-    print()
-    
-    prompts = [
-        "To be or not to be,",
-        "ROMEO:",
-        "First Citizen:",
-    ]
-    
-    for prompt in prompts:
-        print(f"Prompt: '{prompt}'")
-        print("-" * 70)
-        
-        generated = model.generate(tokenizer, prompt, max_new_tokens=100, temperature=0.8)
-        
-        print(generated)
-        print()
-    
-    # ========================================
-    # 8. Success!
-    # ========================================
-    print("="*70)
-    print("🎭 SUCCESS! You Built a Language Model!")
-    print("="*70)
-    print()
-    print("What you learned:")
-    print("  ✅ Transformers learn language patterns from data")
-    print("  ✅ Character-level models can generate coherent text")
-    print("  ✅ Temperature controls randomness in generation")
-    print("  ✅ This is the foundation of GPT, ChatGPT, etc!")
-    print()
-    print("Model architecture comparison:")
-    print(f"  • Your TinyGPT: {total_params:,} parameters, {len(model.blocks)} layers")
-    print("  • GPT-2: 117M parameters, 12 layers")
-    print("  • GPT-3: 175B parameters, 96 layers")
-    print("  • GPT-4: size and layer count not publicly disclosed")
-    print()
-    print("But the ARCHITECTURE is identical to what YOU built!")
-    print("="*70)
-
-
-if __name__ == "__main__":
-    main()
-
-
-
-
--- a/milestones/06_2024_systems_age/optimize_models.py
+++ b/milestones/06_2024_systems_age/optimize_models.py