Restructure .claude directory with comprehensive guidelines

- Created organized guidelines/ directory with focused documentation:
  - DESIGN_PHILOSOPHY.md: KISS principle and simplicity focus
  - MODULE_DEVELOPMENT.md: How to build modules with systems focus
  - TESTING_STANDARDS.md: Immediate testing patterns
  - PERFORMANCE_CLAIMS.md: Honest reporting based on CIFAR-10 lessons
  - AGENT_COORDINATION.md: How agents work together effectively
  - GIT_WORKFLOW.md: Moved from root, branching standards

- Added .claude/README.md as central navigation
- Updated CLAUDE.md to reference guideline files
- Created CLAUDE_SIMPLE.md as streamlined entry point

All learnings from recent work captured in appropriate guidelines
Vijay Janapa Reddi
2025-09-21 20:13:05 -04:00
parent 95c32b1ebe
commit 0d57736639
9 changed files with 1422 additions and 24 deletions

.claude/README.md Normal file

@@ -0,0 +1,138 @@
# TinyTorch .claude Directory Structure
This directory contains all guidelines, standards, and agent definitions for the TinyTorch project.
## 📁 Directory Structure
```
.claude/
├── README.md                     # This file
├── guidelines/                   # Development standards and principles
│   ├── DESIGN_PHILOSOPHY.md      # KISS principle and simplicity guidelines
│   ├── GIT_WORKFLOW.md           # Git branching and commit standards
│   ├── MODULE_DEVELOPMENT.md     # How to develop TinyTorch modules
│   ├── TESTING_STANDARDS.md      # Testing patterns and requirements
│   ├── PERFORMANCE_CLAIMS.md     # How to make honest performance claims
│   └── AGENT_COORDINATION.md     # How AI agents work together
├── agents/                       # AI agent definitions
│   ├── technical-program-manager.md
│   ├── education-architect.md
│   ├── module-developer.md
│   ├── package-manager.md
│   ├── quality-assurance.md
│   ├── documentation-publisher.md
│   ├── workflow-coordinator.md
│   ├── devops-engineer.md
│   └── tito-cli-developer.md
└── [legacy files to review]
```
## 🎯 Quick Start for New Development
1. **Read Core Principles First**
- `guidelines/DESIGN_PHILOSOPHY.md` - Understand KISS principle
- `guidelines/GIT_WORKFLOW.md` - Learn branching requirements
2. **For Module Development**
- `guidelines/MODULE_DEVELOPMENT.md` - Module structure and patterns
- `guidelines/TESTING_STANDARDS.md` - How to write tests
- `guidelines/PERFORMANCE_CLAIMS.md` - How to report results
3. **For Agent Coordination**
- `guidelines/AGENT_COORDINATION.md` - How agents work together
- Start with Technical Program Manager (TPM) for all requests
## 📋 Key Principles Summary
### 1. Keep It Simple, Stupid (KISS)
- One file, one purpose
- Clear over clever
- Verified over theoretical
- Direct over abstract
### 2. Git Workflow
- ALWAYS work on feature branches
- NEVER commit directly to main/dev
- Test before committing
- No automated attribution in commits
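The branching rules above can be sketched end to end. A minimal, hedged example (the branch name, file, and commit messages are illustrative; the throwaway repository exists only so the commands run standalone):

```shell
set -e
# Throwaway repo so the workflow is reproducible end to end
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name Dev
base=$(git symbolic-ref --short HEAD)   # main or master, per local config
echo base > notes.txt
git add notes.txt
git commit -qm "Initial commit"

# ALWAYS branch before changing anything; never commit to main/dev
git checkout -q -b feature/tensor-reshape
echo change >> notes.txt
git add notes.txt
git commit -qm "Add reshape operation"   # no automated attribution

# Merge back only after testing
git checkout -q "$base"
git merge -q --no-ff feature/tensor-reshape -m "Merge feature/tensor-reshape"
git branch -d feature/tensor-reshape
git log --oneline
```

`--no-ff` keeps an explicit merge commit, so the feature's history stays visible after the branch is deleted.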
### 3. Module Development
- Edit .py files only (never .ipynb)
- Test immediately after implementation
- Include systems analysis (memory, performance)
- Follow exact structure pattern
### 4. Testing Standards
- Test immediately, not at the end
- Simple assertions over complex frameworks
- Tests should educate, not just verify
- Always compare against baseline
### 5. Performance Claims
- Only claim what you've measured
- Include all relevant metrics
- Report failures honestly
- Reproducibility is key
### 6. Agent Coordination
- TPM is primary interface
- Sequential workflow with clear handoffs
- QA testing is MANDATORY
- Package integration is MANDATORY
## 🚀 Common Workflows
### Starting New Module Development
```bash
1. Create feature branch
2. Request TPM agent assistance
3. Follow MODULE_DEVELOPMENT.md structure
4. Test with TESTING_STANDARDS.md patterns
5. Verify performance per PERFORMANCE_CLAIMS.md
6. Merge following GIT_WORKFLOW.md
```
### Making Performance Claims
```bash
1. Run baseline measurements
2. Run actual measurements
3. Calculate real improvements
4. Document with all metrics
5. No unverified claims
```
### Working with Agents
```bash
1. Always start with TPM agent
2. Let TPM coordinate other agents
3. Wait for QA approval before proceeding
4. Wait for Package Manager integration
5. Only then commit
```
## 📝 Important Notes
- **Virtual Environment**: Always activate .venv before development
- **Honesty**: Report actual results, not aspirations
- **Simplicity**: When in doubt, choose the simpler option
- **Education First**: We're teaching, not impressing
## 🔗 Quick Links
- Main Instructions: `/CLAUDE.md`
- Module Source: `/modules/source/`
- Examples: `/examples/`
- Tests: `/tests/`
## 📌 Remember
> "If students can't understand it, we've failed."
Every decision should be filtered through:
1. Is it simple?
2. Is it honest?
3. Is it educational?
4. Is it verified?
If any answer is "no", reconsider.


@@ -0,0 +1,204 @@
# TinyTorch Agent Coordination Guidelines
## 🎯 Core Principle
**Agents work in sequence with clear handoffs, not in isolation.**
## 🤖 The Agent Team
### Primary Interface: Technical Program Manager (TPM)
The TPM is your SINGLE point of communication for all development.
```
User Request → TPM → Coordinates Agents → Reports Back
```
**The TPM knows when to invoke:**
- Education Architect - Learning design
- Module Developer - Implementation
- Package Manager - Integration
- Quality Assurance - Testing
- Documentation Publisher - Content
- Workflow Coordinator - Process
- DevOps Engineer - Infrastructure
- Tito CLI Developer - CLI features
## 📋 Standard Development Workflow
### The Sequential Pattern
**For EVERY module development:**
```
1. Planning (Workflow Coordinator + Education Architect)
2. Implementation (Module Developer)
3. Testing (Quality Assurance) ← MANDATORY
4. Integration (Package Manager) ← MANDATORY
5. Documentation (Documentation Publisher)
6. Review (Workflow Coordinator)
```
### Critical Handoff Points
**Module Developer → QA Agent**
```python
# Module Developer completes implementation
"Implementation complete. Ready for QA testing.
Files modified: 02_tensor_dev.py
Key changes: Added reshape operation with broadcasting"
# QA MUST test before proceeding
```
**QA Agent → Package Manager**
```python
# QA completes testing
"All tests passed.
- Module imports correctly
- All functions work as expected
- Performance benchmarks met
Ready for package integration"
# Package Manager MUST verify integration
```
## 🚫 Blocking Rules
### QA Agent Can Block Progress
**If tests fail, STOP everything:**
- No commits allowed
- No integration permitted
- Must fix and re-test
### Package Manager Can Block Release
**If integration fails:**
- Module doesn't export correctly
- Breaks other modules
- Package won't build
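These veto rules amount to a simple gate: nothing reaches a commit until QA and the Package Manager both sign off. A minimal sketch, with function and message strings that are illustrative rather than part of any TinyTorch tooling:

```python
def commit_allowed(qa_passed: bool, integration_ok: bool) -> tuple[bool, str]:
    """Model the two mandatory gates: QA first, then Package Manager."""
    if not qa_passed:
        # QA veto: failing tests block commits AND integration
        return False, "BLOCKED: QA tests failed - fix and re-test"
    if not integration_ok:
        # Package Manager veto: module must export and build cleanly
        return False, "BLOCKED: package integration failed"
    return True, "OK to commit"

print(commit_allowed(qa_passed=False, integration_ok=True))
print(commit_allowed(qa_passed=True, integration_ok=True))
```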
## 📝 Agent Communication Protocol
### Structured Handoffs
Every handoff must include:
1. **What was completed**
2. **What needs to be done next**
3. **Any issues found**
4. **Test results (if applicable)**
5. **Recommendations**
**Example:**
```
From: Module Developer
To: QA Agent
Completed:
- Implemented attention mechanism in 07_attention_dev.py
- Added scaled dot-product attention
- Included positional encoding
Needs Testing:
- Attention score computation
- Mask application
- Memory usage with large sequences
Known Issues:
- Performance degrades with sequences >1000 tokens
Recommendations:
- Focus testing on edge cases with padding
```
## 🔄 Parallel vs Sequential Work
### Can Work in Parallel
✅ Different modules by different developers
✅ Documentation while code is being tested
✅ Planning next modules while current ones build
### Must Be Sequential
❌ Implementation → Testing (MUST test after implementation)
❌ Testing → Integration (MUST pass tests first)
❌ Integration → Commit (MUST integrate successfully)
## 🎯 The Checkpoint Success Story
**How agents successfully implemented the 16-checkpoint system:**
1. **Education Architect** designed capability progression
2. **Workflow Coordinator** orchestrated implementation
3. **Module Developer** built checkpoint tests + CLI
4. **QA Agent** validated all 16 checkpoints work
5. **Package Manager** ensured integration with modules
6. **Documentation Publisher** updated all docs
**Result:** Complete working system with proper handoffs
## ⚠️ Common Coordination Failures
### Working in Isolation
❌ Module Developer implements without QA testing
❌ Documentation written before code works
❌ Integration attempted before tests pass
### Skipping Handoffs
❌ Direct commit without QA approval
❌ Missing Package Manager validation
❌ No Workflow Coordinator review
### Poor Communication
❌ "It's done" (no details)
❌ No test results provided
❌ Issues discovered but not reported
## 📋 Agent Checklist
### Before Module Developer Starts
- [ ] Education Architect defined learning objectives
- [ ] Workflow Coordinator approved plan
- [ ] Clear specifications provided
### Before QA Testing
- [ ] Module Developer completed ALL implementation
- [ ] Code follows standards
- [ ] Basic self-testing done
### Before Package Integration
- [ ] QA Agent ran comprehensive tests
- [ ] All tests PASSED
- [ ] Performance acceptable
### Before Commit
- [ ] Package Manager verified integration
- [ ] Documentation complete
- [ ] Workflow Coordinator approved
## 🔧 Conflict Resolution
**If agents disagree:**
1. **QA has veto on quality** - If tests fail, stop
2. **Education Architect owns learning objectives**
3. **Workflow Coordinator resolves other disputes**
4. **User has final override**
## 📌 Remember
> Agents amplify capabilities when coordinated, create chaos when isolated.
**Key Success Factors:**
- Clear handoffs between agents
- Mandatory testing and integration
- Structured communication
- Sequential workflow where needed
- Parallel work where possible


@@ -0,0 +1,212 @@
# TinyTorch Design Philosophy
## 🎯 Core Principle: Keep It Simple, Stupid (KISS)
**Simplicity is the soul of TinyTorch. We are building an educational framework where clarity beats cleverness every time.**
## 📚 Why Simplicity Matters
TinyTorch is for students learning ML systems engineering. If they can't understand it, we've failed our mission. Every design decision should prioritize:
1. **Readability** over performance
2. **Clarity** over cleverness
3. **Directness** over abstraction
4. **Honesty** over aspiration
## 🚀 KISS Guidelines
### Code Simplicity
**✅ DO:**
- Write code that reads like a textbook
- Use descriptive variable names (`gradient` not `g`)
- Implement one concept per file
- Show the direct path from input to output
- Keep functions short and focused
**❌ DON'T:**
- Use clever one-liners that require decoding
- Create unnecessary abstractions
- Optimize prematurely
- Hide complexity behind magic
**Example:**
```python
# ✅ GOOD: Clear and direct
def forward(self, x):
    h1 = self.relu(self.fc1(x))
    h2 = self.relu(self.fc2(h1))
    return self.fc3(h2)

# ❌ BAD: Clever but unclear (and needs functools.reduce)
def forward(self, x):
    return reduce(lambda h, l: self.relu(l(h)) if l != self.layers[-1] else l(h),
                  self.layers, x)
```
### File Organization
**✅ DO:**
- One purpose per file
- Clear, descriptive filenames
- Minimal file count
**❌ DON'T:**
- Create multiple versions of the same thing
- Split related code unnecessarily
- Create deep directory hierarchies
**Example:**
```
✅ GOOD:
examples/cifar10/
├── random_baseline.py # Shows untrained performance
├── train.py # Training script
└── README.md # Simple documentation
❌ BAD:
examples/cifar10/
├── train_basic.py
├── train_optimized.py
├── train_advanced.py
├── train_experimental.py
├── train_with_ui.py
└── ... (20 more variations)
```
### Documentation Simplicity
**✅ DO:**
- State what it does clearly
- Give one good example
- Report verified results only
- Keep README files short
**❌ DON'T:**
- Write novels in docstrings
- Promise theoretical performance
- Add complex diagrams for simple concepts
- Create documentation that's longer than the code
**Example:**
```python
# ✅ GOOD: Clear and concise
"""
Train a neural network on CIFAR-10 images.
Achieves 55% accuracy in 2 minutes.
"""
# ❌ BAD: Over-documented
"""
This advanced training framework implements state-of-the-art optimization
techniques including adaptive learning rate scheduling, progressive data
augmentation, and sophisticated regularization strategies to push the
boundaries of what's possible with MLPs on CIFAR-10, potentially achieving
60-70% accuracy with proper hyperparameter tuning...
[continues for 500 more words]
"""
```
### Performance Claims
**✅ DO:**
- Report what you actually measured
- Include training time
- Be honest about limitations
- Compare against clear baselines
**❌ DON'T:**
- Claim unverified performance
- Hide negative results
- Exaggerate improvements
- Make theoretical claims
**Example:**
```markdown
✅ GOOD:
- Random baseline: 10% (measured)
- Trained model: 55% (measured)
- Training time: 2 minutes
❌ BAD:
- Can achieve 60-70% with optimization (unverified)
- State-of-the-art MLP performance (vague)
- Approaches CNN-level accuracy (misleading)
```
## 🎓 Educational Simplicity
### Learning Progression
**✅ DO:**
- Build concepts incrementally
- Show before explaining
- Test immediately after implementing
- Keep examples minimal but complete
**❌ DON'T:**
- Jump to complex examples
- Hide important details
- Add unnecessary features
- Overwhelm with options
### Error Messages
**✅ DO:**
- Make errors educational
- Suggest fixes
- Show what went wrong clearly
**❌ DON'T:**
- Hide errors
- Use cryptic messages
- Stack trace without context
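One way to apply these rules, sketched with a hypothetical `dense_forward` helper (not part of TinyTorch), is to make shape errors state both the mismatch and the likely fix:

```python
import numpy as np

def dense_forward(x, weights):
    """Matrix multiply with an educational shape-mismatch error."""
    if x.shape[-1] != weights.shape[0]:
        raise ValueError(
            f"Shape mismatch: input has {x.shape[-1]} features but this layer "
            f"expects {weights.shape[0]}.\n"
            "Fix: make sure this layer's in_features matches the previous "
            "layer's out_features."
        )
    return x @ weights

# A mismatched input now fails with a message that teaches, not a bare trace
try:
    dense_forward(np.zeros((32, 8)), np.zeros((10, 5)))
except ValueError as err:
    print(err)
```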
## 🔍 Decision Framework
When making any design decision, ask:
1. **Can a student understand this in 30 seconds?**
- If no → simplify
2. **Is there a simpler way that still works?**
- If yes → use it
3. **Does this add essential value?**
- If no → remove it
4. **Would I want to debug this at 2 AM?**
- If no → rewrite it
## 📝 Examples of KISS in Action
### Recent CIFAR-10 Cleanup
**Before:** 20+ experimental files with complex optimizations
**After:** 2 files (random_baseline.py, train.py)
**Result:** Clearer story, same educational value
### Module Structure
**Before:** Complex inheritance hierarchies
**After:** Direct implementations students can trace
**Result:** Students understand what's happening
### Testing
**Before:** Complex test frameworks
**After:** Simple assertions after each implementation
**Result:** Immediate feedback and understanding
## 🚨 When Complexity is OK
Sometimes complexity is necessary, but it must be:
1. **Essential** to the learning objective
2. **Well-documented** with clear explanations
3. **Isolated** from simpler concepts
4. **Justified** by significant educational value
Example: Autograd is complex, but it's the core learning objective of that module.
## 📌 Remember
> "Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away." - Antoine de Saint-Exupéry
**Every line of code, every file, every feature should justify its existence. When in doubt, leave it out.**


@@ -0,0 +1,299 @@
# TinyTorch Module Development Standards
## 🎯 Core Principle
**Modules teach ML systems engineering through building, not just ML algorithms through reading.**
## 📁 File Structure
### One Module = One .py File
```
modules/source/XX_modulename/
├── modulename_dev.py # The ONLY file you edit
├── modulename_dev.ipynb # Auto-generated from .py (DO NOT EDIT)
└── README.md # Module overview
```
**Critical Rules:**
- ✅ ALWAYS edit `.py` files only
- ❌ NEVER edit `.ipynb` notebooks directly
- ✅ Use jupytext to sync .py → .ipynb
## 📚 Module Structure Pattern
Every module MUST follow this exact structure:
```python
# %% [markdown]
"""
# Module XX: [Name]
**Learning Objectives:**
- Build [component] from scratch
- Understand [systems concept]
- Analyze performance implications
"""
# %% [markdown]
"""
## Part 1: Mathematical Foundations
[Theory and complexity analysis]
"""
# %% [code]
# Implementation
# %% [markdown]
"""
### Testing [Component]
Let's verify our implementation works correctly.
"""
# %% [code]
# Immediate test
# %% [markdown]
"""
## Part 2: Systems Analysis
### Memory Profiling
Let's understand the memory implications.
"""
# %% [code]
# Memory profiling code
# %% [markdown]
"""
## Part 3: Production Context
In real ML systems like PyTorch...
"""
# ... continue pattern ...
# %% [code]
if __name__ == "__main__":
    run_all_tests()
# %% [markdown]
"""
## 🤔 ML Systems Thinking
[Interactive questions analyzing implementation]
"""
# %% [markdown]
"""
## 🎯 Module Summary
[What was learned - ALWAYS LAST]
"""
```
## 🧪 Implementation → Test Pattern
**MANDATORY**: Every implementation must be immediately followed by a test.
```python
# ✅ CORRECT Pattern:
# %% [markdown]
"""
## Building the Dense Layer
"""
# %% [code]
class Dense:
    def __init__(self, in_features, out_features):
        self.weights = np.random.randn(in_features, out_features) * 0.1
        self.bias = np.zeros(out_features)

    def forward(self, x):
        return x @ self.weights + self.bias
# %% [markdown]
"""
### Testing Dense Layer
Let's verify our dense layer handles shapes correctly.
"""
# %% [code]
def test_dense_layer():
    layer = Dense(10, 5)
    x = np.random.randn(32, 10)  # Batch of 32, 10 features
    output = layer.forward(x)
    assert output.shape == (32, 5), f"Expected (32, 5), got {output.shape}"
    print("✅ Dense layer forward pass works!")

test_dense_layer()
```
## 🔬 ML Systems Focus
### MANDATORY Systems Analysis Sections
Every module MUST include:
1. **Complexity Analysis**
```python
# %% [markdown]
"""
### Computational Complexity
- Matrix multiply: O(batch × in_features × out_features)
- Memory usage: O(in_features × out_features) for weights
- This becomes the bottleneck when...
"""
```
2. **Memory Profiling**
```python
# %% [code]
def profile_memory():
    import tracemalloc
    tracemalloc.start()
    layer = Dense(1000, 1000)
    x = np.random.randn(128, 1000)
    output = layer.forward(x)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
    print("This shows why large models need GPUs!")
```
3. **Production Context**
```python
# %% [markdown]
"""
### In Production Systems
PyTorch's nn.Linear does the same thing but with:
- GPU acceleration via CUDA kernels
- Automatic differentiation support
- Optimized BLAS operations
- Memory pooling for efficiency
"""
```
## 📝 NBGrader Integration
### Cell Metadata Structure
```python
# %% [code] {"nbgrader": {"grade": false, "locked": false, "solution": true, "grade_id": "dense_implementation"}}
### BEGIN SOLUTION
class Dense:
class Dense:
    # Full implementation for instructors
    ...
### END SOLUTION
### BEGIN HIDDEN TESTS
# Instructor-only tests
...
### END HIDDEN TESTS
```
### Critical NBGrader Rules
1. **Every cell needs unique grade_id**
2. **Scaffolding stays OUTSIDE solution blocks**
3. **Hidden tests validate student work**
4. **Points should reflect complexity**
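Putting these rules together, a complete autograded cell might look like the sketch below; the `grade_id`, point value, and `Dense` scaffold are illustrative:

```python
import numpy as np

# Scaffolding stays OUTSIDE the solution block
class Dense:
    def __init__(self, in_features, out_features):
        ### BEGIN SOLUTION
        self.weights = np.random.randn(in_features, out_features) * 0.1
        self.bias = np.zeros(out_features)
        ### END SOLUTION

# %% [code] {"nbgrader": {"grade": true, "locked": true, "points": 5, "grade_id": "test_dense_init"}}
# Visible check students can run themselves
layer = Dense(10, 5)
assert layer.weights.shape == (10, 5), "weights should be (in_features, out_features)"
### BEGIN HIDDEN TESTS
# Instructor-only: also verify the bias initialization
assert np.allclose(layer.bias, np.zeros(5)), "bias should start at zero"
### END HIDDEN TESTS
print("✅ Graded cell passes")
```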
## 🎓 Educational Patterns
### The "Build → Measure → Understand" Pattern
```python
import time
import numpy as np

# 1. BUILD
class LayerNorm:
    def forward(self, x):
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + 1e-5)

# 2. MEASURE
def measure_performance():
    layer = LayerNorm()
    x = np.random.randn(1000, 512)
    start = time.time()
    for _ in range(100):
        output = layer.forward(x)
    elapsed = time.time() - start
    print(f"Time per forward pass: {elapsed/100*1000:.2f}ms")
    print(f"Throughput: {100*1000/elapsed:.0f} tokens/sec")

# 3. UNDERSTAND
"""
With 512 dimensions, normalization adds ~2ms overhead.
This is why large models use fused kernels!
"""
```
### Progressive Complexity
Start simple, build up:
```python
# Step 1: Simplest possible version
def relu_v1(x):
    return np.maximum(0, x)

# Step 2: Add complexity (illustrative: a plain NumPy array won't accept a
# new attribute, so this assumes a Tensor-like wrapper around the output)
def relu_v2(x):
    # Handle gradients
    output = np.maximum(0, x)
    output.grad_fn = lambda grad: grad * (x > 0)
    return output

# Step 3: Production version
class ReLU:
    def forward(self, x):
        self.input = x  # Save for backward
        return np.maximum(0, x)

    def backward(self, grad):
        return grad * (self.input > 0)
```
## ⚠️ Common Pitfalls
1. **Too Much Theory**
- Students want to BUILD, not read
- Show through code, not exposition
2. **Missing Systems Analysis**
- Not just algorithms, but engineering
- Always discuss memory and performance
3. **Tests at the End**
- Loses educational flow
- Test immediately after implementation
4. **No Production Context**
- Students need to see real-world relevance
- Compare with PyTorch/TensorFlow
## 📌 Module Checklist
Before considering a module complete:
- [ ] All code in .py file (not notebook)
- [ ] Follows exact structure pattern
- [ ] Every implementation has immediate test
- [ ] Includes memory profiling
- [ ] Includes complexity analysis
- [ ] Shows production context
- [ ] NBGrader metadata correct
- [ ] ML systems thinking questions
- [ ] Summary is LAST section
- [ ] Tests run when module executed
## 🎯 Remember
> We're teaching ML systems engineering, not just ML algorithms.
Every module should help students understand:
- How to BUILD ML systems
- Why performance matters
- Where bottlenecks occur
- How production systems work


@@ -0,0 +1,245 @@
# TinyTorch Performance Claims Guidelines
## 🎯 Core Principle
**Only claim what you have measured and verified. Honesty builds trust.**
## ✅ Verified Performance Standards
### The Three-Step Verification
1. **Measure Baseline**
```python
# Random/untrained performance
random_model = create_untrained_model()
baseline_accuracy = evaluate(random_model, test_data)
print(f"Baseline: {baseline_accuracy:.1%}") # Measured: 10%
```
2. **Measure Actual Performance**
```python
# Trained model performance
trained_model = train_model(epochs=15)
actual_accuracy = evaluate(trained_model, test_data)
print(f"Actual: {actual_accuracy:.1%}") # Measured: 55%
```
3. **Calculate Real Improvement**
```python
improvement = actual_accuracy / baseline_accuracy
print(f"Improvement: {improvement:.1f}×") # Measured: 5.5×
```
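The three steps can be exercised end to end with stand-in models (a random guesser versus a perfect rule on a toy task; nothing here reflects real CIFAR-10 numbers):

```python
import random

def evaluate(model, test_data):
    """Fraction of correct predictions."""
    correct = sum(model(x) == y for x, y in test_data)
    return correct / len(test_data)

random.seed(0)
# Toy 10-class task: the true label is simply the input mod 10
test_data = [(i, i % 10) for i in range(1000)]

# Step 1: baseline (random guesser); Step 2: "trained" model (perfect rule)
baseline = evaluate(lambda x: random.randrange(10), test_data)
trained = evaluate(lambda x: x % 10, test_data)

# Step 3: real improvement, computed from measurements only
print(f"Baseline: {baseline:.1%}")
print(f"Actual: {trained:.1%}")
print(f"Improvement: {trained / baseline:.1f}×")
```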
### Reporting Requirements
**ALWAYS include:**
- Exact accuracy percentage
- Training time
- Hardware used
- Number of epochs
- Dataset size
**Example:**
```markdown
✅ GOOD:
- Accuracy: 55% on CIFAR-10 test set
- Training time: 2 minutes on M1 MacBook
- Epochs: 15
- Batch size: 64
❌ BAD:
- "State-of-the-art performance"
- "Can achieve 60-70% with optimization"
- "Approaches CNN-level accuracy"
```
## 📊 The CIFAR-10 Lesson
### What We Claimed vs Reality
**Initial Claims (unverified):**
- "60-70% accuracy achievable with optimization"
- "Advanced techniques push beyond baseline"
- "Sophisticated MLPs rival simple CNNs"
**Actual Results (verified):**
- Baseline: 51-55% consistently
- With optimization attempts: Still ~55%
- Deep networks: Too slow, no improvement
- **Honest conclusion: MLPs achieve 55% reliably**
### The Right Response
When results don't match expectations:
**CORRECT Approach:**
- Test thoroughly
- Report actual results
- Update documentation
- Explain limitations
**WRONG Approach:**
- Keep unverified claims
- Hide negative results
- Blame implementation
- Make excuses
## 🔬 Performance Testing Protocol
### Minimum Testing Requirements
```python
def verify_performance_claim():
    """
    Every performance claim must pass this verification.
    """
    results = []
    # Run multiple trials
    for trial in range(3):
        model = create_model()
        accuracy = train_and_evaluate(model)
        results.append(accuracy)
    mean_acc = np.mean(results)
    std_acc = np.std(results)
    # Report with confidence intervals
    print(f"Performance: {mean_acc:.1%} ± {std_acc:.1%}")
    # Only claim if consistent
    if std_acc > 0.02:  # std dev above 2 percentage points
        print("⚠️ High variance - need more testing")
        return False
    return True
```
### Time Complexity Reporting
```python
# ✅ GOOD: Measured complexity
def measure_scalability():
    sizes = [100, 1000, 10000]
    times = []
    for size in sizes:
        data = create_data(size)
        start = time.time()
        process(data)
        times.append(time.time() - start)
    # Analyze scaling
    print("Scaling behavior:")
    for size, elapsed in zip(sizes, times):
        print(f"  n={size}: {elapsed:.2f}s")
    # Determine complexity
    if times[2] / times[1] > 90:  # 10× data → ~100× time
        print("Complexity: O(n²)")

# ❌ BAD: Theoretical claims
def theoretical_complexity():
    print("Should be O(n log n)")  # Not measured
```
## 📝 Documentation Standards
### Performance Tables
```markdown
✅ GOOD Table:
| Model | Dataset | Accuracy | Time | Hardware |
|-------|---------|----------|------|----------|
| MLP-4-layer | CIFAR-10 | 55% | 2 min | M1 CPU |
| Random baseline | CIFAR-10 | 10% | 0 sec | N/A |
| MLP-4-layer | MNIST | 98% | 30 sec | M1 CPU |
❌ BAD Table:
| Model | Performance |
|-------|------------|
| Our MLP | State-of-the-art |
| With optimization | Up to 70% |
| Best case | Rivals CNNs |
```
### Comparison Claims
```markdown
✅ GOOD Comparisons:
- "5.5× better than random baseline (10% → 55%)"
- "Matches typical educational MLP benchmarks"
- "20% below simple CNN performance"
❌ BAD Comparisons:
- "Competitive with modern architectures"
- "Approaching state-of-the-art"
- "Best-in-class for educational frameworks"
```
## ⚠️ Red Flags to Avoid
### Weasel Words
- "Can achieve..." (but didn't)
- "Up to..." (theoretical maximum)
- "Potentially..." (unverified)
- "Should be able to..." (untested)
- "With proper tuning..." (hand-waving)
### Unverified Optimizations
- "With these 10 techniques..." (didn't implement)
- "Research shows..." (not our research)
- "In theory..." (not in practice)
- "Could reach..." (but didn't)
### Vague Metrics
- "Good performance"
- "Impressive results"
- "Significant improvement"
- "Fast training"
## 🎯 The Integrity Test
Before making any performance claim, ask:
1. **Did I measure this myself?**
- If no → Don't claim it
2. **Can someone reproduce this?**
- If no → Don't publish it
3. **Is this the typical case?**
- If no → Note it's exceptional
4. **Would I bet money on this?**
- If no → Reconsider the claim
## 📌 Remember
> "It's better to under-promise and over-deliver than the opposite."
**Trust is earned through:**
- Honest reporting
- Reproducible results
- Clear limitations
- Verified claims
**Trust is lost through:**
- Exaggerated claims
- Unverified results
- Hidden failures
- Theoretical promises
## 🏆 Good Examples from TinyTorch
### CIFAR-10 Cleanup
**Before:** "60-70% achievable with optimization"
**After:** "55% verified performance"
**Result:** Honest, trustworthy documentation
### XOR Network
**Claim:** "100% accuracy on XOR"
**Verified:** Yes, consistently achieves 100%
**Result:** Credible claim that builds trust


@@ -0,0 +1,228 @@
# TinyTorch Testing Standards
## 🎯 Core Testing Philosophy
**Test immediately, test simply, test educationally.**
Testing in TinyTorch serves two purposes:
1. **Verification**: Ensure the code works
2. **Education**: Help students understand what they built
## 📋 Testing Patterns
### The Immediate Testing Pattern
**MANDATORY**: Test immediately after each implementation, not at the end.
```python
# ✅ CORRECT: Implementation followed by immediate test
class Tensor:
    def __init__(self, data):
        self.data = data

# Test Tensor creation immediately
def test_tensor_creation():
    t = Tensor([1, 2, 3])
    assert t.data == [1, 2, 3], "Tensor should store data"
    print("✅ Tensor creation works")

test_tensor_creation()

# ❌ WRONG: All tests grouped at the end
# [100 lines of implementations]
# [Then all tests at the bottom]
```
### Simple Assertion Testing
**Use simple assertions, not complex frameworks.**
```python
# ✅ GOOD: Simple and clear
def test_forward_pass():
    model = SimpleMLP()
    x = Tensor(np.random.randn(32, 784))
    output = model.forward(x)
    assert output.shape == (32, 10), f"Expected (32, 10), got {output.shape}"
    print("✅ Forward pass shapes correct")

# ❌ BAD: Over-engineered
class TestMLPForwardPass(unittest.TestCase):
    def setUp(self):
        self.model = SimpleMLP()

    def test_forward_pass_shape_validation_with_mock_data(self):
        # ... 50 lines of test setup
        ...
```
### Educational Test Messages
**Tests should teach, not just verify.**
```python
# ✅ GOOD: Educational
def test_backpropagation():
    # Create simple network: 2 inputs → 2 hidden → 1 output
    net = TwoLayerNet(2, 2, 1)
    # Forward pass with XOR data
    x = Tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = Tensor([[0], [1], [1], [0]])
    output = net.forward(x)
    loss = mse_loss(output, y)
    print(f"Initial loss: {loss.data:.4f}")
    print("This high loss shows the network hasn't learned XOR yet")
    # Backward pass
    loss.backward()
    # Check gradients exist
    assert net.w1.grad is not None, "Gradients should be computed"
    print("✅ Backpropagation computed gradients")
    print("The network can now learn from its mistakes!")

# ❌ BAD: Just verification
def test_backprop():
    net = TwoLayerNet(2, 2, 1)
    # ... minimal test
    assert net.w1.grad is not None
    # No educational value
```
## 🧪 Performance Testing
### Baseline Comparisons
**Always test against a clear baseline.**
```python
def test_model_performance():
    # 1. Test random baseline
    random_model = create_random_network()
    random_acc = evaluate(random_model, test_data)
    print(f"Random network accuracy: {random_acc:.1%}")
    # 2. Test trained model
    trained_model = load_trained_model()
    trained_acc = evaluate(trained_model, test_data)
    print(f"Trained network accuracy: {trained_acc:.1%}")
    # 3. Show improvement
    improvement = trained_acc / random_acc
    print(f"Improvement: {improvement:.1f}× better than random")
    assert trained_acc > random_acc * 2, "Should be at least 2× better than random"
```
### Honest Performance Reporting
```python
# ✅ GOOD: Report actual measurements
def test_training_performance():
    start_time = time.time()
    accuracy = train_model(epochs=10)
    train_time = time.time() - start_time
    print(f"Achieved accuracy: {accuracy:.1%}")
    print(f"Training time: {train_time:.1f} seconds")
    print(f"Status: {'✅ PASS' if accuracy > 0.5 else '❌ FAIL'}")

# ❌ BAD: Theoretical claims
def test_training():
    # ... training code
    print("Can achieve 60-70% with proper tuning")  # Unverified claim
```
## 🔍 Test Organization
### Test Placement
```python
# Module structure with immediate tests
# module_name.py

# Part 1: Core implementation
class Tensor:
    ...

# Immediate test
test_tensor_creation()

# Part 2: Operations
def add(a, b):
    ...

# Immediate test
test_addition()

# Part 3: Advanced features
def backward():
    ...

# Immediate test
test_backward()

# At the end: Run all tests when executed directly
if __name__ == "__main__":
    print("Running all tests...")
    test_tensor_creation()
    test_addition()
    test_backward()
    print("✅ All tests passed!")
```
## ⚠️ Common Testing Mistakes
1. **Grouping all tests at the end**
- Loses educational flow
- Students don't see immediate verification
2. **Over-complicated test frameworks**
- Obscures what's being tested
- Adds unnecessary complexity
3. **Testing without teaching**
- Missing opportunity to reinforce concepts
- No educational value
4. **Unverified performance claims**
- Damages credibility
- Misleads students
## 📝 Test Documentation
```python
def test_attention_mechanism():
    """
    Test that attention correctly weighs different positions.

    This test demonstrates the key insight of attention:
    the model learns what to focus on.
    """
    # Create simple sequence
    sequence = Tensor([[1, 0, 0],   # Position 0: important
                       [0, 0, 0],   # Position 1: padding
                       [0, 0, 1]])  # Position 2: important
    attention_weights = compute_attention(sequence)
    # Check that important positions get more weight
    assert attention_weights[0] > attention_weights[1]
    assert attention_weights[2] > attention_weights[1]
    print("✅ Attention focuses on important positions")
    print(f"Weights: {attention_weights}")
    print("Notice how padding (position 1) gets less attention")
```
## 🎯 Remember
> Tests are teaching tools, not just verification tools.
Every test should help a student understand:
- What the code does
- Why it matters
- How to verify it works
- What success looks like


@@ -1,7 +1,20 @@
# Claude Code Instructions for TinyTorch
## **MANDATORY: Read Git Policies First**
**Before any development work, you MUST read and follow the Git Workflow Standards section below.**
## 📚 **MANDATORY: Read Guidelines First**
**All development standards are documented in the `.claude/` directory.**
### Required Reading Order:
1. `.claude/guidelines/DESIGN_PHILOSOPHY.md` - KISS principle and core values
2. `.claude/guidelines/GIT_WORKFLOW.md` - Git policies and branching standards
3. `.claude/guidelines/MODULE_DEVELOPMENT.md` - How to build modules
4. `.claude/guidelines/TESTING_STANDARDS.md` - Testing requirements
5. `.claude/guidelines/PERFORMANCE_CLAIMS.md` - Honest reporting standards
6. `.claude/guidelines/AGENT_COORDINATION.md` - How to work with AI agents
**Start with `.claude/README.md` for a complete overview.**
## ⚡ **CRITICAL: Core Policies**
**CRITICAL POLICIES - NO EXCEPTIONS:**
- ✅ Always use virtual environment (`.venv`)
@@ -15,28 +28,6 @@
---
## 💡 **CORE PRINCIPLE: Keep It Simple, Stupid (KISS)**
**Simplicity is a fundamental principle of TinyTorch. Always prefer simple, clear solutions over complex ones.**
**KISS Guidelines:**
- **One file, one purpose** - Don't create multiple versions doing the same thing
- **Clear over clever** - Code should be readable by students learning ML
- **Minimal dependencies** - Avoid unnecessary libraries or complex UI
- **Direct implementation** - Show the core concepts without abstraction layers
- **Honest performance** - Report what actually works, not theoretical possibilities
**Examples:**
- ✅ `random_baseline.py` and `train.py` - two files, clear story
- ❌ Multiple optimization scripts with unverified claims
- ✅ Simple console output showing progress
- ❌ Complex dashboards with ASCII plots that don't add educational value
- ✅ "Achieves 55% accuracy" (verified)
- ❌ "Can achieve 60-70% with optimization" (unverified)
**When in doubt, choose the simpler option. If students can't understand it, we've failed.**
---
## 🚨 **CRITICAL: Think First, Don't Just Agree**

CLAUDE_SIMPLE.md Normal file

@@ -0,0 +1,81 @@
# Claude Code Instructions for TinyTorch
## 📚 **START HERE: Read the Guidelines**
All development standards, principles, and workflows are documented in the `.claude/` directory.
### Quick Start
```bash
# First, read the overview
cat .claude/README.md
# Then read core guidelines in order:
cat .claude/guidelines/DESIGN_PHILOSOPHY.md # KISS principle
cat .claude/guidelines/GIT_WORKFLOW.md # Git standards
cat .claude/guidelines/MODULE_DEVELOPMENT.md # Building modules
cat .claude/guidelines/TESTING_STANDARDS.md # Testing patterns
```
## 🎯 Core Mission
**Build an educational ML framework where students learn ML systems engineering by implementing everything from scratch.**
Key principles:
- **KISS**: Keep It Simple, Stupid
- **Build to Learn**: Implementation teaches more than reading
- **Systems Focus**: Not just algorithms, but engineering
- **Honest Claims**: Only report verified performance
## ⚡ Critical Policies
1. **ALWAYS use virtual environment** (`.venv`)
2. **ALWAYS work on feature branches** (never main/dev directly)
3. **ALWAYS test before committing**
4. **NEVER add automated attribution** to commits
5. **NEVER edit .ipynb files directly** (edit .py only)
## 🤖 Working with AI Agents
**Always start with the Technical Program Manager (TPM)**:
- TPM coordinates all other agents
- Don't invoke agents directly
- Follow the workflow in `.claude/guidelines/AGENT_COORDINATION.md`
## 📁 Key Directories
```
.claude/guidelines/ # All development standards
.claude/agents/ # AI agent definitions
modules/source/ # Module implementations (.py files)
examples/ # Working examples (keep simple)
tests/ # Test suites
```
## 🚨 Think Critically
**Don't just agree with suggestions. Always:**
1. Evaluate if it makes pedagogical sense
2. Check if there's a simpler way
3. Verify it actually works
4. Consider student perspective
## 📋 Before Any Work
1. **Read guidelines**: Start with `.claude/README.md`
2. **Create branch**: Follow `.claude/guidelines/GIT_WORKFLOW.md`
3. **Activate venv**: `source .venv/bin/activate`
4. **Use TPM agent**: For coordinated development
## 🎓 Remember
> "If students can't understand it, we've failed."
Every decision should be:
- Simple
- Verified
- Educational
- Honest
---
**For detailed instructions on any topic, see the appropriate file in `.claude/guidelines/`**