[PR #1618] [MERGED] fix(tinytorch): milestone 3 xor convergence and reporting (#1613, #1614) #9224

Closed
opened 2026-05-03 01:29:22 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/harvard-edge/cs249r_book/pull/1618
Author: @profvjreddi
Created: 4/30/2026
Status: Merged
Merged: 4/30/2026
Merged by: @profvjreddi

Base: `dev` ← Head: `fix/milestone3-xor`


📝 Commits (4)

  • 1aaf779 fix(tinytorch): re-seed layers RNG so XOR milestone converges to 100%
  • 81d299d fix(tinytorch): gate XOR success messages on actual convergence
  • 9ab252f Merge dev to pick up codespell fix
  • 6183529 Merge branch 'dev' into fix/milestone3-xor

📊 Changes

1 file changed (+108 additions, -46 deletions)

View changed files

📝 tinytorch/milestones/02_1969_xor/02_xor_solved.py (+108 -46)

📄 Description

Summary

Fixes two related bugs in Milestone 3 ("MLP Revival, 1986"), specifically the XOR Solved script (`tinytorch/milestones/02_1969_xor/02_xor_solved.py`, which is part 1 of milestone 3 per `MILESTONE_SCRIPTS` in `tito/commands/milestone.py`).

#1614 - XOR stuck at 75%

The migration commit `d30257577c` ("refactor(tinytorch): migrate from legacy np.random to default_rng(7)") inadvertently broke the XOR convergence guarantee. The original line:

```python
np.random.seed(1986) # set global state - influenced layer init
```

was replaced with:

```python
rng = np.random.default_rng(7) # local var, never used - dead code
```

The active weight-init RNG is `tinytorch.core.layers.rng`, which the milestone never touches. With its module-load default (seed=7), the 4-unit hidden layer initializes into a dead-ReLU saddle point: training stalls at exactly 75% accuracy (one of the four XOR cases pinned at p≈0.5) regardless of how long it runs.

Fix: Re-seed `tinytorch.core.layers.rng` to 1986 (the year of the backprop paper) right before model creation. This restores deterministic 100% convergence in 500 epochs while preserving the original `hidden_size=4` pedagogy and all surrounding educational text (which references "4 hidden units", "13 parameters", "2→4", etc.).
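The bug-and-fix pattern can be sketched without tinytorch itself. The `layers` class below is a hypothetical stand-in for the `tinytorch.core.layers` module (which creates its own RNG at import time); the point is that a local `default_rng(7)` in the milestone script never reaches weight init, while re-seeding the module-level RNG does:

```python
import numpy as np

# Hypothetical stand-in for tinytorch.core.layers: the "module" owns an
# RNG created at load time, so a local default_rng(...) in a milestone
# script has no effect on weight initialization.
class layers:  # mimics a module namespace, not the real tinytorch API
    rng = np.random.default_rng(7)  # module-load default (seed=7)

    @staticmethod
    def init_weights(fan_in, fan_out):
        # Weight init always draws from the module-level RNG.
        return layers.rng.standard_normal((fan_in, fan_out)) * 0.5

# Buggy pattern: this local rng is dead code; init still uses seed 7.
rng = np.random.default_rng(7)
w_buggy = layers.init_weights(2, 4)

# Fix pattern: re-seed the RNG the layers actually consult, right
# before model creation (1986 = year of the backprop paper).
layers.rng = np.random.default_rng(1986)
w_fixed = layers.init_weights(2, 4)

# Re-seeding again reproduces the exact same init: determinism restored.
layers.rng = np.random.default_rng(1986)
assert np.array_equal(w_fixed, layers.init_weights(2, 4))
```

This is only an illustration of the seeding mechanics; the actual fix in `02_xor_solved.py` assigns to `tinytorch.core.layers.rng` directly.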

#1613 - false "XOR solved" advertisement

Even when training got stuck at 75%, the script unconditionally printed:

  • `✅ Training Complete - XOR Solved!`
  • A green `🎉 Success! You Ended the AI Winter!` panel reading `Final accuracy: 75.0% (Perfect XOR solution!)`

Fix: Gate both messages on a 0.95 convergence threshold. Below threshold, show a yellow "Training Did Not Converge" panel that explains the dead-ReLU saddle-point symptom, tells the student to re-run or use a larger hidden layer, and explicitly warns against moving on to Milestone 03 TinyDigits with a broken XOR result.
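A minimal sketch of that gate; `CONVERGENCE_THRESHOLD` and `report_xor_result` are illustrative names, not the actual identifiers in `02_xor_solved.py`:

```python
# Hedged sketch of the convergence gate: success banner only above the
# threshold, otherwise a warning with recovery instructions.
CONVERGENCE_THRESHOLD = 0.95

def report_xor_result(accuracy: float) -> str:
    """Return the post-training banner, gated on final accuracy."""
    if accuracy >= CONVERGENCE_THRESHOLD:
        return f"Training Complete - XOR Solved! Final accuracy: {accuracy:.1%}"
    return (
        f"Training Did Not Converge - only {accuracy:.1%} accuracy.\n"
        "Likely a dead-ReLU saddle point: re-run the milestone or try a\n"
        "larger hidden layer before moving on to the next milestone."
    )

print(report_xor_result(1.0))   # success path
print(report_xor_result(0.75))  # failure path: warning, no success banner
```

The real script renders rich panels rather than plain strings, but the branch structure is the same.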

Repro

Before:
```
Epoch 500/500 Loss: 0.3490 Accuracy: 75.0%
✅ Training Complete - XOR Solved! <- false
║ Final accuracy: 75.0% (Perfect XOR solution!) <- false
```

After (success path):
```
Epoch 500/500 Loss: 0.0053 Accuracy: 100.0%
✅ Training Complete - XOR Solved!
║ Final accuracy: 100.0% (Perfect XOR solution!)
```

After (failure path, simulated by forcing the bad RNG state):
```
⚠️ Training Complete - but only 75.0% accuracy.
The network did not converge (likely stuck in a saddle point).
Try re-running the milestone - random init can pin a 4-unit hidden
layer at 75% on XOR. See issue #1614.
╔══════════════════════ ⚠️ XOR Not Solved Yet ══════════════════════╗
║ Final accuracy: 75.0% (below the 95% convergence threshold) ║
║ ... ║
```

Files changed

  • `tinytorch/milestones/02_1969_xor/02_xor_solved.py` (only file)

Test plan

  • [x] `pytest tinytorch/tests/milestones/test_milestones_smoke.py -k xor` passes (4/4)
  • [x] Manual run produces 100.0% test accuracy and the success panel
  • [x] Simulated failure path produces the "did not converge" warning panel and recovery instructions
  • [ ] CI on dev once the PR is opened

Relates to #1613
Relates to #1614


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-03 01:29:22 -05:00
Reference: github-starred/cs249r_book#9224