Restructure .claude directory with comprehensive guidelines

- Created organized guidelines/ directory with focused documentation:
  - DESIGN_PHILOSOPHY.md: KISS principle and simplicity focus
  - MODULE_DEVELOPMENT.md: How to build modules with systems focus
  - TESTING_STANDARDS.md: Immediate testing patterns
  - PERFORMANCE_CLAIMS.md: Honest reporting based on CIFAR-10 lessons
  - AGENT_COORDINATION.md: How agents work together effectively
  - GIT_WORKFLOW.md: Moved from root, branching standards

- Added .claude/README.md as central navigation
- Updated CLAUDE.md to reference guideline files
- Created CLAUDE_SIMPLE.md as streamlined entry point

All learnings from recent work captured in appropriate guidelines
Vijay Janapa Reddi
2025-09-21 20:13:05 -04:00
parent 95c32b1ebe
commit 0d57736639
9 changed files with 1422 additions and 24 deletions

.claude/README.md Normal file

@@ -0,0 +1,138 @@
# TinyTorch .claude Directory Structure
This directory contains all guidelines, standards, and agent definitions for the TinyTorch project.
## 📁 Directory Structure
```
.claude/
├── README.md                     # This file
├── guidelines/                   # Development standards and principles
│   ├── DESIGN_PHILOSOPHY.md      # KISS principle and simplicity guidelines
│   ├── GIT_WORKFLOW.md           # Git branching and commit standards
│   ├── MODULE_DEVELOPMENT.md     # How to develop TinyTorch modules
│   ├── TESTING_STANDARDS.md      # Testing patterns and requirements
│   ├── PERFORMANCE_CLAIMS.md     # How to make honest performance claims
│   └── AGENT_COORDINATION.md     # How AI agents work together
├── agents/                       # AI agent definitions
│   ├── technical-program-manager.md
│   ├── education-architect.md
│   ├── module-developer.md
│   ├── package-manager.md
│   ├── quality-assurance.md
│   ├── documentation-publisher.md
│   ├── workflow-coordinator.md
│   ├── devops-engineer.md
│   └── tito-cli-developer.md
└── [legacy files to review]
```
## 🎯 Quick Start for New Development
1. **Read Core Principles First**
- `guidelines/DESIGN_PHILOSOPHY.md` - Understand KISS principle
- `guidelines/GIT_WORKFLOW.md` - Learn branching requirements
2. **For Module Development**
- `guidelines/MODULE_DEVELOPMENT.md` - Module structure and patterns
- `guidelines/TESTING_STANDARDS.md` - How to write tests
- `guidelines/PERFORMANCE_CLAIMS.md` - How to report results
3. **For Agent Coordination**
- `guidelines/AGENT_COORDINATION.md` - How agents work together
- Start with Technical Program Manager (TPM) for all requests
## 📋 Key Principles Summary
### 1. Keep It Simple, Stupid (KISS)
- One file, one purpose
- Clear over clever
- Verified over theoretical
- Direct over abstract
### 2. Git Workflow
- ALWAYS work on feature branches
- NEVER commit directly to main/dev
- Test before committing
- No automated attribution in commits
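The branching rules above can be sketched end to end. A minimal, hedged example (the branch name, file, and commit messages are illustrative; the throwaway repository exists only so the commands run standalone):

```shell
set -e
# Throwaway repo so the workflow is reproducible end to end
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name Dev
base=$(git symbolic-ref --short HEAD)   # main or master, per local config
echo base > notes.txt
git add notes.txt
git commit -qm "Initial commit"

# ALWAYS branch before changing anything; never commit to main/dev
git checkout -q -b feature/tensor-reshape
echo change >> notes.txt
git add notes.txt
git commit -qm "Add reshape operation"   # no automated attribution

# Merge back only after testing
git checkout -q "$base"
git merge -q --no-ff feature/tensor-reshape -m "Merge feature/tensor-reshape"
git branch -d feature/tensor-reshape
git log --oneline
```

`--no-ff` keeps an explicit merge commit, so the feature's history stays visible after the branch is deleted.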
### 3. Module Development
- Edit .py files only (never .ipynb)
- Test immediately after implementation
- Include systems analysis (memory, performance)
- Follow exact structure pattern
### 4. Testing Standards
- Test immediately, not at the end
- Simple assertions over complex frameworks
- Tests should educate, not just verify
- Always compare against baseline
### 5. Performance Claims
- Only claim what you've measured
- Include all relevant metrics
- Report failures honestly
- Reproducibility is key
### 6. Agent Coordination
- TPM is primary interface
- Sequential workflow with clear handoffs
- QA testing is MANDATORY
- Package integration is MANDATORY
## 🚀 Common Workflows
### Starting New Module Development
```bash
1. Create feature branch
2. Request TPM agent assistance
3. Follow MODULE_DEVELOPMENT.md structure
4. Test with TESTING_STANDARDS.md patterns
5. Verify performance per PERFORMANCE_CLAIMS.md
6. Merge following GIT_WORKFLOW.md
```
### Making Performance Claims
```bash
1. Run baseline measurements
2. Run actual measurements
3. Calculate real improvements
4. Document with all metrics
5. No unverified claims
```
### Working with Agents
```bash
1. Always start with TPM agent
2. Let TPM coordinate other agents
3. Wait for QA approval before proceeding
4. Wait for Package Manager integration
5. Only then commit
```
## 📝 Important Notes
- **Virtual Environment**: Always activate .venv before development
- **Honesty**: Report actual results, not aspirations
- **Simplicity**: When in doubt, choose the simpler option
- **Education First**: We're teaching, not impressing
## 🔗 Quick Links
- Main Instructions: `/CLAUDE.md`
- Module Source: `/modules/source/`
- Examples: `/examples/`
- Tests: `/tests/`
## 📌 Remember
> "If students can't understand it, we've failed."
Every decision should be filtered through:
1. Is it simple?
2. Is it honest?
3. Is it educational?
4. Is it verified?
If any answer is "no", reconsider.


@@ -0,0 +1,204 @@
# TinyTorch Agent Coordination Guidelines
## 🎯 Core Principle
**Agents work in sequence with clear handoffs, not in isolation.**
## 🤖 The Agent Team
### Primary Interface: Technical Program Manager (TPM)
The TPM is your SINGLE point of communication for all development.
```
User Request → TPM → Coordinates Agents → Reports Back
```
**The TPM knows when to invoke:**
- Education Architect - Learning design
- Module Developer - Implementation
- Package Manager - Integration
- Quality Assurance - Testing
- Documentation Publisher - Content
- Workflow Coordinator - Process
- DevOps Engineer - Infrastructure
- Tito CLI Developer - CLI features
## 📋 Standard Development Workflow
### The Sequential Pattern
**For EVERY module development:**
```
1. Planning (Workflow Coordinator + Education Architect)
2. Implementation (Module Developer)
3. Testing (Quality Assurance) ← MANDATORY
4. Integration (Package Manager) ← MANDATORY
5. Documentation (Documentation Publisher)
6. Review (Workflow Coordinator)
```
### Critical Handoff Points
**Module Developer → QA Agent**
```python
# Module Developer completes implementation
"Implementation complete. Ready for QA testing.
Files modified: 02_tensor_dev.py
Key changes: Added reshape operation with broadcasting"
# QA MUST test before proceeding
```
**QA Agent → Package Manager**
```python
# QA completes testing
"All tests passed.
- Module imports correctly
- All functions work as expected
- Performance benchmarks met
Ready for package integration"
# Package Manager MUST verify integration
```
## 🚫 Blocking Rules
### QA Agent Can Block Progress
**If tests fail, STOP everything:**
- No commits allowed
- No integration permitted
- Must fix and re-test
### Package Manager Can Block Release
**If integration fails:**
- Module doesn't export correctly
- Breaks other modules
- Package won't build
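These veto rules amount to a simple gate: nothing reaches a commit until QA and the Package Manager both sign off. A minimal sketch, with function and message strings that are illustrative rather than part of any TinyTorch tooling:

```python
def commit_allowed(qa_passed: bool, integration_ok: bool) -> tuple[bool, str]:
    """Model the two mandatory gates: QA first, then Package Manager."""
    if not qa_passed:
        # QA veto: failing tests block commits AND integration
        return False, "BLOCKED: QA tests failed - fix and re-test"
    if not integration_ok:
        # Package Manager veto: module must export and build cleanly
        return False, "BLOCKED: package integration failed"
    return True, "OK to commit"

print(commit_allowed(qa_passed=False, integration_ok=True))
print(commit_allowed(qa_passed=True, integration_ok=True))
```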
## 📝 Agent Communication Protocol
### Structured Handoffs
Every handoff must include:
1. **What was completed**
2. **What needs to be done next**
3. **Any issues found**
4. **Test results (if applicable)**
5. **Recommendations**
**Example:**
```
From: Module Developer
To: QA Agent
Completed:
- Implemented attention mechanism in 07_attention_dev.py
- Added scaled dot-product attention
- Included positional encoding
Needs Testing:
- Attention score computation
- Mask application
- Memory usage with large sequences
Known Issues:
- Performance degrades with sequences >1000 tokens
Recommendations:
- Focus testing on edge cases with padding
```
## 🔄 Parallel vs Sequential Work
### Can Work in Parallel
✅ Different modules by different developers
✅ Documentation while code is being tested
✅ Planning next modules while current ones build
### Must Be Sequential
❌ Implementation → Testing (MUST test after implementation)
❌ Testing → Integration (MUST pass tests first)
❌ Integration → Commit (MUST integrate successfully)
## 🎯 The Checkpoint Success Story
**How agents successfully implemented the 16-checkpoint system:**
1. **Education Architect** designed capability progression
2. **Workflow Coordinator** orchestrated implementation
3. **Module Developer** built checkpoint tests + CLI
4. **QA Agent** validated all 16 checkpoints work
5. **Package Manager** ensured integration with modules
6. **Documentation Publisher** updated all docs
**Result:** Complete working system with proper handoffs
## ⚠️ Common Coordination Failures
### Working in Isolation
❌ Module Developer implements without QA testing
❌ Documentation written before code works
❌ Integration attempted before tests pass
### Skipping Handoffs
❌ Direct commit without QA approval
❌ Missing Package Manager validation
❌ No Workflow Coordinator review
### Poor Communication
❌ "It's done" (no details)
❌ No test results provided
❌ Issues discovered but not reported
## 📋 Agent Checklist
### Before Module Developer Starts
- [ ] Education Architect defined learning objectives
- [ ] Workflow Coordinator approved plan
- [ ] Clear specifications provided
### Before QA Testing
- [ ] Module Developer completed ALL implementation
- [ ] Code follows standards
- [ ] Basic self-testing done
### Before Package Integration
- [ ] QA Agent ran comprehensive tests
- [ ] All tests PASSED
- [ ] Performance acceptable
### Before Commit
- [ ] Package Manager verified integration
- [ ] Documentation complete
- [ ] Workflow Coordinator approved
## 🔧 Conflict Resolution
**If agents disagree:**
1. **QA has veto on quality** - If tests fail, stop
2. **Education Architect owns learning objectives**
3. **Workflow Coordinator resolves other disputes**
4. **User has final override**
## 📌 Remember
> Agents amplify capabilities when coordinated, create chaos when isolated.
**Key Success Factors:**
- Clear handoffs between agents
- Mandatory testing and integration
- Structured communication
- Sequential workflow where needed
- Parallel work where possible


@@ -0,0 +1,212 @@
# TinyTorch Design Philosophy
## 🎯 Core Principle: Keep It Simple, Stupid (KISS)
**Simplicity is the soul of TinyTorch. We are building an educational framework where clarity beats cleverness every time.**
## 📚 Why Simplicity Matters
TinyTorch is for students learning ML systems engineering. If they can't understand it, we've failed our mission. Every design decision should prioritize:
1. **Readability** over performance
2. **Clarity** over cleverness
3. **Directness** over abstraction
4. **Honesty** over aspiration
## 🚀 KISS Guidelines
### Code Simplicity
**✅ DO:**
- Write code that reads like a textbook
- Use descriptive variable names (`gradient` not `g`)
- Implement one concept per file
- Show the direct path from input to output
- Keep functions short and focused
**❌ DON'T:**
- Use clever one-liners that require decoding
- Create unnecessary abstractions
- Optimize prematurely
- Hide complexity behind magic
**Example:**
```python
# ✅ GOOD: Clear and direct
def forward(self, x):
    h1 = self.relu(self.fc1(x))
    h2 = self.relu(self.fc2(h1))
    return self.fc3(h2)

# ❌ BAD: Clever but unclear (and needs functools.reduce)
def forward(self, x):
    return reduce(lambda h, l: self.relu(l(h)) if l != self.layers[-1] else l(h),
                  self.layers, x)
```
### File Organization
**✅ DO:**
- One purpose per file
- Clear, descriptive filenames
- Minimal file count
**❌ DON'T:**
- Create multiple versions of the same thing
- Split related code unnecessarily
- Create deep directory hierarchies
**Example:**
```
✅ GOOD:
examples/cifar10/
├── random_baseline.py # Shows untrained performance
├── train.py # Training script
└── README.md # Simple documentation
❌ BAD:
examples/cifar10/
├── train_basic.py
├── train_optimized.py
├── train_advanced.py
├── train_experimental.py
├── train_with_ui.py
└── ... (20 more variations)
```
### Documentation Simplicity
**✅ DO:**
- State what it does clearly
- Give one good example
- Report verified results only
- Keep README files short
**❌ DON'T:**
- Write novels in docstrings
- Promise theoretical performance
- Add complex diagrams for simple concepts
- Create documentation that's longer than the code
**Example:**
```python
# ✅ GOOD: Clear and concise
"""
Train a neural network on CIFAR-10 images.
Achieves 55% accuracy in 2 minutes.
"""
# ❌ BAD: Over-documented
"""
This advanced training framework implements state-of-the-art optimization
techniques including adaptive learning rate scheduling, progressive data
augmentation, and sophisticated regularization strategies to push the
boundaries of what's possible with MLPs on CIFAR-10, potentially achieving
60-70% accuracy with proper hyperparameter tuning...
[continues for 500 more words]
"""
```
### Performance Claims
**✅ DO:**
- Report what you actually measured
- Include training time
- Be honest about limitations
- Compare against clear baselines
**❌ DON'T:**
- Claim unverified performance
- Hide negative results
- Exaggerate improvements
- Make theoretical claims
**Example:**
```markdown
✅ GOOD:
- Random baseline: 10% (measured)
- Trained model: 55% (measured)
- Training time: 2 minutes
❌ BAD:
- Can achieve 60-70% with optimization (unverified)
- State-of-the-art MLP performance (vague)
- Approaches CNN-level accuracy (misleading)
```
## 🎓 Educational Simplicity
### Learning Progression
**✅ DO:**
- Build concepts incrementally
- Show before explaining
- Test immediately after implementing
- Keep examples minimal but complete
**❌ DON'T:**
- Jump to complex examples
- Hide important details
- Add unnecessary features
- Overwhelm with options
### Error Messages
**✅ DO:**
- Make errors educational
- Suggest fixes
- Show what went wrong clearly
**❌ DON'T:**
- Hide errors
- Use cryptic messages
- Stack trace without context
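One way to apply these rules, sketched with a hypothetical `dense_forward` helper (not part of TinyTorch), is to make shape errors state both the mismatch and the likely fix:

```python
import numpy as np

def dense_forward(x, weights):
    """Matrix multiply with an educational shape-mismatch error."""
    if x.shape[-1] != weights.shape[0]:
        raise ValueError(
            f"Shape mismatch: input has {x.shape[-1]} features but this layer "
            f"expects {weights.shape[0]}.\n"
            "Fix: make sure this layer's in_features matches the previous "
            "layer's out_features."
        )
    return x @ weights

# A mismatched input now fails with a message that teaches, not a bare trace
try:
    dense_forward(np.zeros((32, 8)), np.zeros((10, 5)))
except ValueError as err:
    print(err)
```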
## 🔍 Decision Framework
When making any design decision, ask:
1. **Can a student understand this in 30 seconds?**
- If no → simplify
2. **Is there a simpler way that still works?**
- If yes → use it
3. **Does this add essential value?**
- If no → remove it
4. **Would I want to debug this at 2 AM?**
- If no → rewrite it
## 📝 Examples of KISS in Action
### Recent CIFAR-10 Cleanup
**Before:** 20+ experimental files with complex optimizations
**After:** 2 files (random_baseline.py, train.py)
**Result:** Clearer story, same educational value
### Module Structure
**Before:** Complex inheritance hierarchies
**After:** Direct implementations students can trace
**Result:** Students understand what's happening
### Testing
**Before:** Complex test frameworks
**After:** Simple assertions after each implementation
**Result:** Immediate feedback and understanding
## 🚨 When Complexity is OK
Sometimes complexity is necessary, but it must be:
1. **Essential** to the learning objective
2. **Well-documented** with clear explanations
3. **Isolated** from simpler concepts
4. **Justified** by significant educational value
Example: Autograd is complex, but it's the core learning objective of that module.
## 📌 Remember
> "Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away." - Antoine de Saint-Exupéry
**Every line of code, every file, every feature should justify its existence. When in doubt, leave it out.**


@@ -0,0 +1,299 @@
# TinyTorch Module Development Standards
## 🎯 Core Principle
**Modules teach ML systems engineering through building, not just ML algorithms through reading.**
## 📁 File Structure
### One Module = One .py File
```
modules/source/XX_modulename/
├── modulename_dev.py # The ONLY file you edit
├── modulename_dev.ipynb # Auto-generated from .py (DO NOT EDIT)
└── README.md # Module overview
```
**Critical Rules:**
- ✅ ALWAYS edit `.py` files only
- ❌ NEVER edit `.ipynb` notebooks directly
- ✅ Use jupytext to sync .py → .ipynb
## 📚 Module Structure Pattern
Every module MUST follow this exact structure:
```python
# %% [markdown]
"""
# Module XX: [Name]
**Learning Objectives:**
- Build [component] from scratch
- Understand [systems concept]
- Analyze performance implications
"""
# %% [markdown]
"""
## Part 1: Mathematical Foundations
[Theory and complexity analysis]
"""
# %% [code]
# Implementation
# %% [markdown]
"""
### Testing [Component]
Let's verify our implementation works correctly.
"""
# %% [code]
# Immediate test
# %% [markdown]
"""
## Part 2: Systems Analysis
### Memory Profiling
Let's understand the memory implications.
"""
# %% [code]
# Memory profiling code
# %% [markdown]
"""
## Part 3: Production Context
In real ML systems like PyTorch...
"""
# ... continue pattern ...
# %% [code]
if __name__ == "__main__":
    run_all_tests()
# %% [markdown]
"""
## 🤔 ML Systems Thinking
[Interactive questions analyzing implementation]
"""
# %% [markdown]
"""
## 🎯 Module Summary
[What was learned - ALWAYS LAST]
"""
```
## 🧪 Implementation → Test Pattern
**MANDATORY**: Every implementation must be immediately followed by a test.
```python
# ✅ CORRECT Pattern:
# %% [markdown]
"""
## Building the Dense Layer
"""
# %% [code]
class Dense:
    def __init__(self, in_features, out_features):
        self.weights = np.random.randn(in_features, out_features) * 0.1
        self.bias = np.zeros(out_features)

    def forward(self, x):
        return x @ self.weights + self.bias
# %% [markdown]
"""
### Testing Dense Layer
Let's verify our dense layer handles shapes correctly.
"""
# %% [code]
def test_dense_layer():
    layer = Dense(10, 5)
    x = np.random.randn(32, 10)  # Batch of 32, 10 features
    output = layer.forward(x)
    assert output.shape == (32, 5), f"Expected (32, 5), got {output.shape}"
    print("✅ Dense layer forward pass works!")

test_dense_layer()
```
## 🔬 ML Systems Focus
### MANDATORY Systems Analysis Sections
Every module MUST include:
1. **Complexity Analysis**
```python
# %% [markdown]
"""
### Computational Complexity
- Matrix multiply: O(batch × in_features × out_features)
- Memory usage: O(in_features × out_features) for weights
- This becomes the bottleneck when...
"""
```
2. **Memory Profiling**
```python
# %% [code]
def profile_memory():
    import tracemalloc
    tracemalloc.start()
    layer = Dense(1000, 1000)
    x = np.random.randn(128, 1000)
    output = layer.forward(x)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
    print("This shows why large models need GPUs!")
```
3. **Production Context**
```python
# %% [markdown]
"""
### In Production Systems
PyTorch's nn.Linear does the same thing but with:
- GPU acceleration via CUDA kernels
- Automatic differentiation support
- Optimized BLAS operations
- Memory pooling for efficiency
"""
```
## 📝 NBGrader Integration
### Cell Metadata Structure
```python
# %% [code] {"nbgrader": {"grade": false, "locked": false, "solution": true, "grade_id": "dense_implementation"}}
### BEGIN SOLUTION
class Dense:
class Dense:
    # Full implementation for instructors
    ...
### END SOLUTION
### BEGIN HIDDEN TESTS
# Instructor-only tests
...
### END HIDDEN TESTS
```
### Critical NBGrader Rules
1. **Every cell needs unique grade_id**
2. **Scaffolding stays OUTSIDE solution blocks**
3. **Hidden tests validate student work**
4. **Points should reflect complexity**
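Putting these rules together, a complete autograded cell might look like the sketch below; the `grade_id`, point value, and `Dense` scaffold are illustrative:

```python
import numpy as np

# Scaffolding stays OUTSIDE the solution block
class Dense:
    def __init__(self, in_features, out_features):
        ### BEGIN SOLUTION
        self.weights = np.random.randn(in_features, out_features) * 0.1
        self.bias = np.zeros(out_features)
        ### END SOLUTION

# %% [code] {"nbgrader": {"grade": true, "locked": true, "points": 5, "grade_id": "test_dense_init"}}
# Visible check students can run themselves
layer = Dense(10, 5)
assert layer.weights.shape == (10, 5), "weights should be (in_features, out_features)"
### BEGIN HIDDEN TESTS
# Instructor-only: also verify the bias initialization
assert np.allclose(layer.bias, np.zeros(5)), "bias should start at zero"
### END HIDDEN TESTS
print("✅ Graded cell passes")
```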
## 🎓 Educational Patterns
### The "Build → Measure → Understand" Pattern
```python
import time
import numpy as np

# 1. BUILD
class LayerNorm:
    def forward(self, x):
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + 1e-5)

# 2. MEASURE
def measure_performance():
    layer = LayerNorm()
    x = np.random.randn(1000, 512)
    start = time.time()
    for _ in range(100):
        output = layer.forward(x)
    elapsed = time.time() - start
    print(f"Time per forward pass: {elapsed/100*1000:.2f}ms")
    print(f"Throughput: {100*1000/elapsed:.0f} tokens/sec")

# 3. UNDERSTAND
"""
With 512 dimensions, normalization adds ~2ms overhead.
This is why large models use fused kernels!
"""
```
### Progressive Complexity
Start simple, build up:
```python
# Step 1: Simplest possible version
def relu_v1(x):
    return np.maximum(0, x)

# Step 2: Add complexity (illustrative: a plain NumPy array won't accept a
# new attribute, so this assumes a Tensor-like wrapper around the output)
def relu_v2(x):
    # Handle gradients
    output = np.maximum(0, x)
    output.grad_fn = lambda grad: grad * (x > 0)
    return output

# Step 3: Production version
class ReLU:
    def forward(self, x):
        self.input = x  # Save for backward
        return np.maximum(0, x)

    def backward(self, grad):
        return grad * (self.input > 0)
```
## ⚠️ Common Pitfalls
1. **Too Much Theory**
- Students want to BUILD, not read
- Show through code, not exposition
2. **Missing Systems Analysis**
- Not just algorithms, but engineering
- Always discuss memory and performance
3. **Tests at the End**
- Loses educational flow
- Test immediately after implementation
4. **No Production Context**
- Students need to see real-world relevance
- Compare with PyTorch/TensorFlow
## 📌 Module Checklist
Before considering a module complete:
- [ ] All code in .py file (not notebook)
- [ ] Follows exact structure pattern
- [ ] Every implementation has immediate test
- [ ] Includes memory profiling
- [ ] Includes complexity analysis
- [ ] Shows production context
- [ ] NBGrader metadata correct
- [ ] ML systems thinking questions
- [ ] Summary is LAST section
- [ ] Tests run when module executed
## 🎯 Remember
> We're teaching ML systems engineering, not just ML algorithms.
Every module should help students understand:
- How to BUILD ML systems
- Why performance matters
- Where bottlenecks occur
- How production systems work


@@ -0,0 +1,245 @@
# TinyTorch Performance Claims Guidelines
## 🎯 Core Principle
**Only claim what you have measured and verified. Honesty builds trust.**
## ✅ Verified Performance Standards
### The Three-Step Verification
1. **Measure Baseline**
```python
# Random/untrained performance
random_model = create_untrained_model()
baseline_accuracy = evaluate(random_model, test_data)
print(f"Baseline: {baseline_accuracy:.1%}") # Measured: 10%
```
2. **Measure Actual Performance**
```python
# Trained model performance
trained_model = train_model(epochs=15)
actual_accuracy = evaluate(trained_model, test_data)
print(f"Actual: {actual_accuracy:.1%}") # Measured: 55%
```
3. **Calculate Real Improvement**
```python
improvement = actual_accuracy / baseline_accuracy
print(f"Improvement: {improvement:.1f}×") # Measured: 5.5×
```
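The three steps can be exercised end to end with stand-in models (a random guesser versus a perfect rule on a toy task; nothing here reflects real CIFAR-10 numbers):

```python
import random

def evaluate(model, test_data):
    """Fraction of correct predictions."""
    correct = sum(model(x) == y for x, y in test_data)
    return correct / len(test_data)

random.seed(0)
# Toy 10-class task: the true label is simply the input mod 10
test_data = [(i, i % 10) for i in range(1000)]

# Step 1: baseline (random guesser); Step 2: "trained" model (perfect rule)
baseline = evaluate(lambda x: random.randrange(10), test_data)
trained = evaluate(lambda x: x % 10, test_data)

# Step 3: real improvement, computed from measurements only
print(f"Baseline: {baseline:.1%}")
print(f"Actual: {trained:.1%}")
print(f"Improvement: {trained / baseline:.1f}×")
```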
### Reporting Requirements
**ALWAYS include:**
- Exact accuracy percentage
- Training time
- Hardware used
- Number of epochs
- Dataset size
**Example:**
```markdown
✅ GOOD:
- Accuracy: 55% on CIFAR-10 test set
- Training time: 2 minutes on M1 MacBook
- Epochs: 15
- Batch size: 64
❌ BAD:
- "State-of-the-art performance"
- "Can achieve 60-70% with optimization"
- "Approaches CNN-level accuracy"
```
## 📊 The CIFAR-10 Lesson
### What We Claimed vs Reality
**Initial Claims (unverified):**
- "60-70% accuracy achievable with optimization"
- "Advanced techniques push beyond baseline"
- "Sophisticated MLPs rival simple CNNs"
**Actual Results (verified):**
- Baseline: 51-55% consistently
- With optimization attempts: Still ~55%
- Deep networks: Too slow, no improvement
- **Honest conclusion: MLPs achieve 55% reliably**
### The Right Response
When results don't match expectations:
**CORRECT Approach:**
- Test thoroughly
- Report actual results
- Update documentation
- Explain limitations
**WRONG Approach:**
- Keep unverified claims
- Hide negative results
- Blame implementation
- Make excuses
## 🔬 Performance Testing Protocol
### Minimum Testing Requirements
```python
def verify_performance_claim():
    """
    Every performance claim must pass this verification.
    """
    results = []
    # Run multiple trials
    for trial in range(3):
        model = create_model()
        accuracy = train_and_evaluate(model)
        results.append(accuracy)
    mean_acc = np.mean(results)
    std_acc = np.std(results)
    # Report with confidence intervals
    print(f"Performance: {mean_acc:.1%} ± {std_acc:.1%}")
    # Only claim if consistent
    if std_acc > 0.02:  # std dev above 2 percentage points
        print("⚠️ High variance - need more testing")
        return False
    return True
```
### Time Complexity Reporting
```python
# ✅ GOOD: Measured complexity
def measure_scalability():
    sizes = [100, 1000, 10000]
    times = []
    for size in sizes:
        data = create_data(size)
        start = time.time()
        process(data)
        times.append(time.time() - start)
    # Analyze scaling
    print("Scaling behavior:")
    for size, elapsed in zip(sizes, times):
        print(f"  n={size}: {elapsed:.2f}s")
    # Determine complexity
    if times[2] / times[1] > 90:  # 10× data → ~100× time
        print("Complexity: O(n²)")

# ❌ BAD: Theoretical claims
def theoretical_complexity():
    print("Should be O(n log n)")  # Not measured
```
## 📝 Documentation Standards
### Performance Tables
```markdown
✅ GOOD Table:
| Model | Dataset | Accuracy | Time | Hardware |
|-------|---------|----------|------|----------|
| MLP-4-layer | CIFAR-10 | 55% | 2 min | M1 CPU |
| Random baseline | CIFAR-10 | 10% | 0 sec | N/A |
| MLP-4-layer | MNIST | 98% | 30 sec | M1 CPU |
❌ BAD Table:
| Model | Performance |
|-------|------------|
| Our MLP | State-of-the-art |
| With optimization | Up to 70% |
| Best case | Rivals CNNs |
```
### Comparison Claims
```markdown
✅ GOOD Comparisons:
- "5.5× better than random baseline (10% → 55%)"
- "Matches typical educational MLP benchmarks"
- "20% below simple CNN performance"
❌ BAD Comparisons:
- "Competitive with modern architectures"
- "Approaching state-of-the-art"
- "Best-in-class for educational frameworks"
```
## ⚠️ Red Flags to Avoid
### Weasel Words
- "Can achieve..." (but didn't)
- "Up to..." (theoretical maximum)
- "Potentially..." (unverified)
- "Should be able to..." (untested)
- "With proper tuning..." (hand-waving)
### Unverified Optimizations
- "With these 10 techniques..." (didn't implement)
- "Research shows..." (not our research)
- "In theory..." (not in practice)
- "Could reach..." (but didn't)
### Vague Metrics
- "Good performance"
- "Impressive results"
- "Significant improvement"
- "Fast training"
## 🎯 The Integrity Test
Before making any performance claim, ask:
1. **Did I measure this myself?**
- If no → Don't claim it
2. **Can someone reproduce this?**
- If no → Don't publish it
3. **Is this the typical case?**
- If no → Note it's exceptional
4. **Would I bet money on this?**
- If no → Reconsider the claim
## 📌 Remember
> "It's better to under-promise and over-deliver than the opposite."
**Trust is earned through:**
- Honest reporting
- Reproducible results
- Clear limitations
- Verified claims
**Trust is lost through:**
- Exaggerated claims
- Unverified results
- Hidden failures
- Theoretical promises
## 🏆 Good Examples from TinyTorch
### CIFAR-10 Cleanup
**Before:** "60-70% achievable with optimization"
**After:** "55% verified performance"
**Result:** Honest, trustworthy documentation
### XOR Network
**Claim:** "100% accuracy on XOR"
**Verified:** Yes, consistently achieves 100%
**Result:** Credible claim that builds trust


@@ -0,0 +1,228 @@
# TinyTorch Testing Standards
## 🎯 Core Testing Philosophy
**Test immediately, test simply, test educationally.**
Testing in TinyTorch serves two purposes:
1. **Verification**: Ensure the code works
2. **Education**: Help students understand what they built
## 📋 Testing Patterns
### The Immediate Testing Pattern
**MANDATORY**: Test immediately after each implementation, not at the end.
```python
# ✅ CORRECT: Implementation followed by immediate test
class Tensor:
    def __init__(self, data):
        self.data = data

# Test Tensor creation immediately
def test_tensor_creation():
    t = Tensor([1, 2, 3])
    assert t.data == [1, 2, 3], "Tensor should store data"
    print("✅ Tensor creation works")

test_tensor_creation()

# ❌ WRONG: All tests grouped at the end
# [100 lines of implementations]
# [Then all tests at the bottom]
```
### Simple Assertion Testing
**Use simple assertions, not complex frameworks.**
```python
# ✅ GOOD: Simple and clear
def test_forward_pass():
    model = SimpleMLP()
    x = Tensor(np.random.randn(32, 784))
    output = model.forward(x)
    assert output.shape == (32, 10), f"Expected (32, 10), got {output.shape}"
    print("✅ Forward pass shapes correct")

# ❌ BAD: Over-engineered
class TestMLPForwardPass(unittest.TestCase):
    def setUp(self):
        self.model = SimpleMLP()

    def test_forward_pass_shape_validation_with_mock_data(self):
        # ... 50 lines of test setup
        ...
```
### Educational Test Messages
**Tests should teach, not just verify.**
```python
# ✅ GOOD: Educational
def test_backpropagation():
    # Create simple network: 2 inputs → 2 hidden → 1 output
    net = TwoLayerNet(2, 2, 1)
    # Forward pass with XOR data
    x = Tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = Tensor([[0], [1], [1], [0]])
    output = net.forward(x)
    loss = mse_loss(output, y)
    print(f"Initial loss: {loss.data:.4f}")
    print("This high loss shows the network hasn't learned XOR yet")
    # Backward pass
    loss.backward()
    # Check gradients exist
    assert net.w1.grad is not None, "Gradients should be computed"
    print("✅ Backpropagation computed gradients")
    print("The network can now learn from its mistakes!")

# ❌ BAD: Just verification
def test_backprop():
    net = TwoLayerNet(2, 2, 1)
    # ... minimal test
    assert net.w1.grad is not None
    # No educational value
```
## 🧪 Performance Testing
### Baseline Comparisons
**Always test against a clear baseline.**
```python
def test_model_performance():
    # 1. Test random baseline
    random_model = create_random_network()
    random_acc = evaluate(random_model, test_data)
    print(f"Random network accuracy: {random_acc:.1%}")
    # 2. Test trained model
    trained_model = load_trained_model()
    trained_acc = evaluate(trained_model, test_data)
    print(f"Trained network accuracy: {trained_acc:.1%}")
    # 3. Show improvement
    improvement = trained_acc / random_acc
    print(f"Improvement: {improvement:.1f}× better than random")
    assert trained_acc > random_acc * 2, "Should be at least 2× better than random"
```
### Honest Performance Reporting
```python
# ✅ GOOD: Report actual measurements
def test_training_performance():
    start_time = time.time()
    accuracy = train_model(epochs=10)
    train_time = time.time() - start_time
    print(f"Achieved accuracy: {accuracy:.1%}")
    print(f"Training time: {train_time:.1f} seconds")
    print(f"Status: {'✅ PASS' if accuracy > 0.5 else '❌ FAIL'}")

# ❌ BAD: Theoretical claims
def test_training():
    # ... training code
    print("Can achieve 60-70% with proper tuning")  # Unverified claim
```
## 🔍 Test Organization
### Test Placement
```python
# Module structure with immediate tests
# module_name.py

# Part 1: Core implementation
class Tensor:
    ...

# Immediate test
test_tensor_creation()

# Part 2: Operations
def add(a, b):
    ...

# Immediate test
test_addition()

# Part 3: Advanced features
def backward():
    ...

# Immediate test
test_backward()

# At the end: Run all tests when executed directly
if __name__ == "__main__":
    print("Running all tests...")
    test_tensor_creation()
    test_addition()
    test_backward()
    print("✅ All tests passed!")
```
## ⚠️ Common Testing Mistakes
1. **Grouping all tests at the end**
- Loses educational flow
- Students don't see immediate verification
2. **Over-complicated test frameworks**
- Obscures what's being tested
- Adds unnecessary complexity
3. **Testing without teaching**
- Missing opportunity to reinforce concepts
- No educational value
4. **Unverified performance claims**
- Damages credibility
- Misleads students
## 📝 Test Documentation
```python
def test_attention_mechanism():
    """
    Test that attention correctly weighs different positions.

    This test demonstrates the key insight of attention:
    the model learns what to focus on.
    """
    # Create simple sequence
    sequence = Tensor([[1, 0, 0],   # Position 0: important
                       [0, 0, 0],   # Position 1: padding
                       [0, 0, 1]])  # Position 2: important
    attention_weights = compute_attention(sequence)
    # Check that important positions get more weight
    assert attention_weights[0] > attention_weights[1]
    assert attention_weights[2] > attention_weights[1]
    print("✅ Attention focuses on important positions")
    print(f"Weights: {attention_weights}")
    print("Notice how padding (position 1) gets less attention")
```
## 🎯 Remember
> Tests are teaching tools, not just verification tools.
Every test should help a student understand:
- What the code does
- Why it matters
- How to verify it works
- What success looks like


@@ -1,7 +1,20 @@
# Claude Code Instructions for TinyTorch
## **MANDATORY: Read Git Policies First**
**Before any development work, you MUST read and follow the Git Workflow Standards section below.**
## 📚 **MANDATORY: Read Guidelines First**
**All development standards are documented in the `.claude/` directory.**
### Required Reading Order:
1. `.claude/guidelines/DESIGN_PHILOSOPHY.md` - KISS principle and core values
2. `.claude/guidelines/GIT_WORKFLOW.md` - Git policies and branching standards
3. `.claude/guidelines/MODULE_DEVELOPMENT.md` - How to build modules
4. `.claude/guidelines/TESTING_STANDARDS.md` - Testing requirements
5. `.claude/guidelines/PERFORMANCE_CLAIMS.md` - Honest reporting standards
6. `.claude/guidelines/AGENT_COORDINATION.md` - How to work with AI agents
**Start with `.claude/README.md` for a complete overview.**
## ⚡ **CRITICAL: Core Policies**
**CRITICAL POLICIES - NO EXCEPTIONS:**
- ✅ Always use virtual environment (`.venv`)
@@ -15,28 +28,6 @@
---
## 💡 **CORE PRINCIPLE: Keep It Simple, Stupid (KISS)**
**Simplicity is a fundamental principle of TinyTorch. Always prefer simple, clear solutions over complex ones.**
**KISS Guidelines:**
- **One file, one purpose** - Don't create multiple versions doing the same thing
- **Clear over clever** - Code should be readable by students learning ML
- **Minimal dependencies** - Avoid unnecessary libraries or complex UI
- **Direct implementation** - Show the core concepts without abstraction layers
- **Honest performance** - Report what actually works, not theoretical possibilities
**Examples:**
- ✅ `random_baseline.py` and `train.py` - two files, clear story
- ❌ Multiple optimization scripts with unverified claims
- ✅ Simple console output showing progress
- ❌ Complex dashboards with ASCII plots that don't add educational value
- ✅ "Achieves 55% accuracy" (verified)
- ❌ "Can achieve 60-70% with optimization" (unverified)
**When in doubt, choose the simpler option. If students can't understand it, we've failed.**
---
## 🚨 **CRITICAL: Think First, Don't Just Agree**

CLAUDE_SIMPLE.md Normal file

@@ -0,0 +1,81 @@
# Claude Code Instructions for TinyTorch
## 📚 **START HERE: Read the Guidelines**
All development standards, principles, and workflows are documented in the `.claude/` directory.
### Quick Start
```bash
# First, read the overview
cat .claude/README.md
# Then read core guidelines in order:
cat .claude/guidelines/DESIGN_PHILOSOPHY.md # KISS principle
cat .claude/guidelines/GIT_WORKFLOW.md # Git standards
cat .claude/guidelines/MODULE_DEVELOPMENT.md # Building modules
cat .claude/guidelines/TESTING_STANDARDS.md # Testing patterns
```
## 🎯 Core Mission
**Build an educational ML framework where students learn ML systems engineering by implementing everything from scratch.**
Key principles:
- **KISS**: Keep It Simple, Stupid
- **Build to Learn**: Implementation teaches more than reading
- **Systems Focus**: Not just algorithms, but engineering
- **Honest Claims**: Only report verified performance
## ⚡ Critical Policies
1. **ALWAYS use virtual environment** (`.venv`)
2. **ALWAYS work on feature branches** (never main/dev directly)
3. **ALWAYS test before committing**
4. **NEVER add automated attribution** to commits
5. **NEVER edit .ipynb files directly** (edit .py only)
## 🤖 Working with AI Agents
**Always start with the Technical Program Manager (TPM)**:
- TPM coordinates all other agents
- Don't invoke agents directly
- Follow the workflow in `.claude/guidelines/AGENT_COORDINATION.md`
## 📁 Key Directories
```
.claude/guidelines/ # All development standards
.claude/agents/ # AI agent definitions
modules/source/ # Module implementations (.py files)
examples/ # Working examples (keep simple)
tests/ # Test suites
```
## 🚨 Think Critically
**Don't just agree with suggestions. Always:**
1. Evaluate if it makes pedagogical sense
2. Check if there's a simpler way
3. Verify it actually works
4. Consider student perspective
## 📋 Before Any Work
1. **Read guidelines**: Start with `.claude/README.md`
2. **Create branch**: Follow `.claude/guidelines/GIT_WORKFLOW.md`
3. **Activate venv**: `source .venv/bin/activate`
4. **Use TPM agent**: For coordinated development
## 🎓 Remember
> "If students can't understand it, we've failed."
Every decision should be:
- Simple
- Verified
- Educational
- Honest
---
**For detailed instructions on any topic, see the appropriate file in `.claude/guidelines/`**