TinyTorch Paper: Evidence Inventory
What We Can Prove vs. What We're Claiming
✅ STRONG EVIDENCE (Can Defend to Reviewers)
Technical Calculations (All Verified)
| Claim | Evidence | Status |
|---|---|---|
| Adam 2× optimizer state | momentum + variance = 2× model params | ✅ Mathematically verified |
| Adam 4× total training memory | weights + grads + momentum + variance | ✅ Mathematically verified |
| Conv2d 109× parameter efficiency | 896 params vs 98,336 params | ✅ Calculated and verified |
| MNIST: ~180 MB | 60,000 × 784 × 4 = 188 MB | ✅ Within rounding error |
| ImageNet: ~670 GB | 1.2M × 224×224×3 × 4 B = 722.5 GB ≈ 673 GiB | ✅ Matches (GB vs. GiB) |
| GPT-3 training: ~2.6 TB | 175B × 4 B × 4 = 2.8 TB ≈ 2.55 TiB | ✅ Matches (TB vs. TiB) |
| CIFAR conv: 241M ops | 128×32×28×28×3×5×5 = 240,844,800 | ✅ Verified |
Reviewer Defense: "All memory and complexity calculations are mathematically derived and verified against standard formulas."
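The table's arithmetic can be replayed in a few lines. Note the Conv2d/Dense shapes below (Conv2d(3, 32, 3×3) vs. a flattened 3072→32 dense layer) are assumptions reverse-engineered from the 896 and 98,336 figures, not configurations stated in the paper:

```python
FP32 = 4  # bytes per float32

# Adam: momentum + variance = 2x model params of optimizer state;
# weights + grads + momentum + variance = 4x total training memory.
adam_state_multiplier = 2
adam_total_multiplier = 1 + 1 + adam_state_multiplier
assert adam_total_multiplier == 4

# Conv2d vs. dense parameter counts (assumed shapes, see lead-in).
conv_params = 32 * (3 * 3 * 3 + 1)    # 32 filters, 3x3x3 kernel + bias
dense_params = 32 * 32 * 3 * 32 + 32  # 32x32x3 input flattened to 32 units + bias
print(conv_params, dense_params, dense_params // conv_params)  # 896 98336 109

# Dataset / model memory at fp32.
mnist = 60_000 * 784 * FP32                  # 188,160,000 B ~ 188 MB
imagenet = 1_200_000 * 224 * 224 * 3 * FP32  # ~722.5 GB (~673 GiB)
gpt3 = int(175e9) * FP32 * 4                 # ~2.8 TB (~2.55 TiB)

# CIFAR conv ops: batch x filters x out_h x out_w x in_ch x kh x kw.
cifar_ops = 128 * 32 * 28 * 28 * 3 * 5 * 5
print(cifar_ops)  # 240844800, ~241M
```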
Implementation Artifacts (All Exist)
| Claim | Evidence | Verification Command | Status |
|---|---|---|---|
| 20 modules implemented | 20 directories in modules/ | `ls -1 modules/ \| grep '^[0-9]' \| wc -l` | ✅ 20 found |
| NBGrader infrastructure | 283 solution cells | `grep -r "BEGIN SOLUTION" modules/ \| wc -l` | ✅ 283 found |
| Progressive disclosure code | Dormant features in Module 01 | `modules/01_tensor/tensor_dev.py:606-609` | ✅ Implemented |
| PyTorch-inspired package | nbdev export directives | `grep "default_exp" modules/*/*.py` | ✅ Found |
| TinyDigits dataset | Dataset directory exists | `ls datasets/tinydigits/` | ✅ Exists |
| TinyTalks dataset | Dataset directory exists | `ls datasets/tinytalks/` | ✅ Exists |
| Milestone templates | 6 milestone directories | `ls milestones/0*/` | ✅ 6 found |
Reviewer Defense: "All claimed infrastructure is publicly available and documented at github.com/harvard-edge/TinyTorch"
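The verification commands above can be sketched as one script. The mock directory tree here is a hypothetical stand-in, since running the checks for real requires a local clone of harvard-edge/TinyTorch (where the same commands should report 20 modules and 283 solution cells):

```shell
# Build a tiny mock of the repository layout described in the table.
demo=$(mktemp -d)
mkdir -p "$demo/modules/01_tensor" "$demo/modules/02_autograd"
printf '# BEGIN SOLUTION\npass\n# END SOLUTION\n' > "$demo/modules/01_tensor/tensor_dev.py"

# The table's counting commands, pointed at the mock tree.
ls -1 "$demo/modules/" | grep -c '^[0-9]'          # numbered module dirs (2 here, 20 in the repo)
grep -r "BEGIN SOLUTION" "$demo/modules/" | wc -l  # solution cells (1 here, 283 in the repo)

rm -rf "$demo"
```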
Learning Theory Grounding (Well-Cited)
| Claim | Evidence | Citation | Status |
|---|---|---|---|
| Cognitive load theory | Cited Sweller (1988) | Line 717, references.bib:51 | ✅ Peer-reviewed |
| Constructionism | Cited Papert (1980) | Line 393, references.bib:366 | ✅ Peer-reviewed |
| Cognitive apprenticeship | Cited Collins et al. (1989) | Line 395, references.bib:104 | ✅ Peer-reviewed |
| Productive failure | Cited Kapur (2008) | Line 397, references.bib:382 | ✅ Peer-reviewed |
| Threshold concepts | Cited Meyer & Land (2003) | Line 399, references.bib:397 | ✅ Peer-reviewed |
| Situated learning | Cited Lave & Wenger (1991) | Line 730, references.bib:92 | ✅ Peer-reviewed |
Reviewer Defense: "Pedagogical design grounded in established CS education research with peer-reviewed citations."
⚠️ WEAK EVIDENCE (Needs Hedging or Removal)
Workforce Statistics (Cannot Verify)
| Claim | Citation | Problem | Status |
|---|---|---|---|
| 3:1 supply/demand ratio | keller2025ai | Industry report, not peer-reviewed | ❌ Unverifiable |
| 150,000 practitioners worldwide | roberthalf2024talent | Specific number without source quote | ❌ Unverifiable |
| 78% job posting growth | roberthalf2024talent | No page number or quote provided | ❌ Unverifiable |
| 40-50% executives cite shortage | keller2025ai | Range suggests uncertainty | ❌ Unverifiable |
Reviewer Challenge: "These are industry marketing materials, not research. Can you cite peer-reviewed workforce studies?"
Recommendation: Remove specific numbers, keep general statement:
Industry surveys identify demand-supply imbalances for ML systems
engineers~\citep{roberthalf2024talent,keller2025ai}
Time Estimates (No Empirical Data)
| Claim | Evidence | Problem | Status |
|---|---|---|---|
| 60-80 hours curriculum | NONE | No student tracking data | ❌ Unsupported |
| 2-3 weeks bootcamp | NONE | Contradicts 60-80 hours (2-3 full-time weeks implies 80-120 hrs) | ❌ Inconsistent |
Reviewer Challenge: "What data supports these time estimates? How many students completed the curriculum?"
Recommendation: Add "estimated based on pilot testing"
Learning Outcomes (Design Goals, Not Proven Results)
| Claim | Evidence | Problem | Status |
|---|---|---|---|
| "Students transition from users to engineers" | Curriculum design | No pre/post assessment | ❌ Unproven outcome |
| "Makes tacit knowledge explicit" | Module structure | No knowledge transfer tests | ❌ Design goal |
| "Validates correctness through milestones" | Milestone templates exist | No student completion data | ❌ Overstated |
| "Reduces cognitive load" | Already hedged as hypothesis | Properly scoped | ✅ Acceptable hedging |
Reviewer Challenge: "How do you know students learn better with this approach? Where's the comparison data?"
Recommendation: Change to design goals rather than proven outcomes:
- "aims to transition students"
- "designed to make tacit knowledge explicit"
- "provides validation targets through milestones"
🔍 MISSING EVIDENCE (Should Collect for Future Paper)
Student Usage Data
- ❌ Number of students who completed curriculum
- ❌ Completion rate per module
- ❌ Drop-off points (which modules students abandon)
- ❌ Time per module (actual measurements)
- ❌ Background characteristics (ML experience, programming proficiency)
Milestone Achievement Data
- ❌ Percentage achieving target accuracies (95% MNIST, 75% CIFAR)
- ❌ Common implementation bugs (qualitative failure analysis)
- ❌ Debugging time per milestone
- ❌ Success rate: students who attempt vs. complete milestones
Learning Outcome Assessments
- ❌ Pre/post knowledge tests
- ❌ Transfer tasks (debugging PyTorch code with TinyTorch knowledge)
- ❌ Comparison with control group (traditional ML course students)
- ❌ Cognitive load measurements (dual-task, self-report scales)
- ❌ Six-month retention follow-up
Deployment Evidence
- ❌ Number of institutions using curriculum
- ❌ Student enrollment numbers
- ❌ TA/instructor feedback
- ❌ Integration model effectiveness (self-paced vs. institutional)
Timeline: Fall 2025 deployment can collect this data
📊 EVIDENCE STRENGTH BY CLAIM TYPE
Mathematical/Technical Claims: 95% Strong
- All calculations verified
- Code implementations exist
- Can reproduce all numbers
- Action: None needed, these are solid
Infrastructure Claims: 90% Strong
- Modules, datasets, NBGrader all exist
- Publicly available and verifiable
- Package structure documented
- Action: Verify dataset sizes, clarify test count
Learning Theory Claims: 85% Strong
- Well-cited peer-reviewed sources
- Design grounded in established research
- Properly hedged (progressive disclosure as "hypothesized")
- Action: Ensure consistent hedging throughout
Pedagogical Effectiveness Claims: 30% Strong
- Design exists and is well-documented
- No empirical validation of learning outcomes
- Time estimates unsubstantiated
- Milestone "validation" overstated
- Action: Hedge as design goals, not proven results
Workforce Motivation Claims: 20% Strong
- Based on industry reports, not research
- Cannot verify specific statistics
- May not be appropriate for academic paper
- Action: Remove specifics or verify sources
🎯 WHAT REVIEWERS WILL ACCEPT
Acceptable Claims (Evidence Exists)
- ✅ "We implemented a 20-module curriculum"
- ✅ "Progressive disclosure uses monkey-patching for runtime activation"
- ✅ "Adam requires 2× optimizer state (momentum + variance)"
- ✅ "Conv2d achieves 109× parameter efficiency over dense layers"
- ✅ "Design grounded in cognitive load theory~\citep{sweller1988}"
- ✅ "Curriculum provides historical milestone templates"
- ✅ "NBGrader infrastructure enables automated assessment"
Questionable Claims (Needs Hedging)
- ⚠️ "Students transition from users to engineers" → "aims to transition"
- ⚠️ "Validates correctness through milestones" → "provides validation targets"
- ⚠️ "60-80 hours completion time" → "estimated 60-80 hours"
- ⚠️ "Makes tacit knowledge explicit" → "designed to make explicit"
Unacceptable Claims (Remove or Verify)
- ❌ "3:1 supply/demand ratio" (cannot verify)
- ❌ "150,000 practitioners worldwide" (cannot verify)
- ❌ "78% job posting growth" (cannot verify)
- ❌ "Students recreate 70 years of ML history" (milestones are templates, not proven)
📝 RECOMMENDED EVIDENCE LANGUAGE
For Unverified Claims:
DON'T SAY:
- "X demonstrates that..."
- "This proves..."
- "Evidence shows..."
- "Validates that..."
DO SAY:
- "X is designed to..."
- "We hypothesize that..."
- "This approach aims to..."
- "Preliminary observations suggest..." (if you have pilot data)
For Future Work:
DON'T SAY:
- "Will be tested in Fall 2025"
DO SAY:
- "Empirical validation planned for Fall 2025 deployment"
- "Requires controlled studies comparing to traditional approaches"
- "Future work will measure..."
🔬 EVIDENCE QUALITY TIERS
Tier 1: Mathematical/Reproducible Evidence
- Anyone can verify these claims
- Examples: Conv2d 109×, Adam 4×, memory calculations
- Strength: Unassailable
Tier 2: Implemented Artifacts
- Reviewers can inspect code
- Examples: 20 modules, NBGrader cells, milestone templates
- Strength: Strong (publicly verifiable)
Tier 3: Cited Learning Theory
- Grounded in peer-reviewed research
- Examples: Cognitive load theory, constructionism
- Strength: Acceptable (design justification)
Tier 4: Design Claims
- Infrastructure exists but effectiveness unproven
- Examples: Integration models, progressive disclosure
- Strength: Acceptable if hedged as design goals
Tier 5: Learning Outcome Claims
- No empirical validation yet
- Examples: "Students learn better," "Reduces cognitive load"
- Strength: Weak (requires hedging or future work framing)
Tier 6: External Statistics
- Industry reports, not research
- Examples: Workforce numbers
- Strength: Very weak (verify or remove)
🎓 FINAL GUIDANCE
What This Paper CAN Claim:
- "We designed and implemented a complete 20-module ML systems curriculum"
- "The design is grounded in established learning theory (X, Y, Z)"
- "Progressive disclosure is a novel pedagogical pattern for ML education"
- "Systems-first integration differs from traditional algorithm-focused curricula"
- "All infrastructure is open-source and publicly available"
- "The curriculum provides historical milestone templates for validation"
What This Paper CANNOT (Yet) Claim:
- "Students learn better with this approach" (no comparison data)
- "Curriculum takes 60-80 hours" (no timing data)
- "Students successfully recreate ML history" (no completion data)
- "Progressive disclosure reduces cognitive load" (no measurements)
- "Specific workforce shortage statistics" (cannot verify sources)
Paper Positioning:
This is a design contribution with empirical validation planned, not a learning outcomes study with proven effectiveness.
Frame as:
- "We present a curriculum design..."
- "This approach is hypothesized to..."
- "Future work will empirically validate..."
NOT as:
- "We prove that..."
- "Results show that..."
- "Students demonstrate improved..."
📋 EVIDENCE COLLECTION PRIORITY
Before Submission (Critical):
- ✅ Verify or remove workforce statistics
- ✅ Hedge learning outcome claims
- ✅ Clarify milestone templates vs. validation
- ✅ Add "estimated" to time claims
For Fall 2025 (High Priority):
- ⏳ Student completion tracking
- ⏳ Time-per-module measurements
- ⏳ Milestone achievement rates
- ⏳ Pre/post knowledge assessments
For Future Research (Medium Priority):
- ⏳ Cognitive load experiments
- ⏳ Transfer task assessments
- ⏳ Comparison with control groups
- ⏳ Long-term retention studies
Bottom Line: You have strong evidence for what you built. You have weak evidence for how well it works. Frame accordingly.