📋 Pull Request Information

- Original PR: https://github.com/harvard-edge/cs249r_book/pull/1618
- Author: @profvjreddi
- Created: 4/30/2026
- Status: ✅ Merged
- Merged: 4/30/2026
- Merged by: @profvjreddi
- Base: dev ← Head: fix/milestone3-xor

📝 Commits (4)

- 1aaf779 fix(tinytorch): re-seed layers RNG so XOR milestone converges to 100%
- 81d299d fix(tinytorch): gate XOR success messages on actual convergence
- 9ab252f Merge dev to pick up codespell fix
- 6183529 Merge branch 'dev' into fix/milestone3-xor

📊 Changes

- 1 file changed (+108 additions, -46 deletions)
- `tinytorch/milestones/02_1969_xor/02_xor_solved.py` (+108 -46)

📄 Description
Summary
Fixes two related bugs in Milestone 3 ("MLP Revival, 1986"), specifically the XOR Solved script (`tinytorch/milestones/02_1969_xor/02_xor_solved.py`, which is part 1 of milestone 3 per `MILESTONE_SCRIPTS` in `tito/commands/milestone.py`).
#1614 - XOR stuck at 75%
The migration commit `d30257577c` ("refactor(tinytorch): migrate from legacy np.random to default_rng(7)") inadvertently broke the XOR convergence guarantee. The original line:
```python
np.random.seed(1986) # set global state - influenced layer init
```
was replaced with:
```python
rng = np.random.default_rng(7) # local var, never used - dead code
```
The active weight-init RNG is `tinytorch.core.layers.rng`, which the milestone never touches. With its module-load default (seed=7), the 4-unit hidden layer initializes into a dead-ReLU saddle point: training stalls at exactly 75% accuracy (one of the four XOR cases pinned at p≈0.5) regardless of how long it runs.
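For intuition, here is a tiny standalone illustration of the dead-ReLU failure mode (the weights below are hypothetical, chosen to force the symptom; they are not the milestone's actual init):
```python
import numpy as np

# All four XOR inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hypothetical "unlucky" init for one hidden unit: the pre-activation is
# negative on every input, so the ReLU outputs zero everywhere.
w = np.array([-0.5, -0.5])
b = -0.1

z = X @ w + b                       # pre-activations: all negative
a = np.maximum(z, 0.0)              # ReLU output: [0. 0. 0. 0.]
relu_grad = (z > 0).astype(float)   # ReLU derivative: [0. 0. 0. 0.]

# With zero output AND zero gradient on every training example, no
# update can ever revive this unit -- it is permanently "dead", and the
# remaining units of a 4-unit layer may lack the capacity to solve XOR.
print(a, relu_grad)
```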
Fix: Re-seed `tinytorch.core.layers.rng` to 1986 (the year of the backprop paper) right before model creation. This restores deterministic 100% convergence in 500 epochs while preserving the original `hidden_size=4` pedagogy and all surrounding educational text (which references "4 hidden units", "13 parameters", "2→4", etc.).
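A minimal sketch of the fix as described (the exact assignment site in `02_xor_solved.py` may differ; re-creating the generator is one assumed way to "re-seed" it):
```python
import numpy as np
import tinytorch.core.layers as layers

# Replace the module-level weight-init generator (default seed 7, which
# lands in the dead-ReLU saddle) with one seeded to 1986, the year of
# the backprop paper. This must happen *before* the model is built so
# the hidden layer draws its initial weights from the new generator.
layers.rng = np.random.default_rng(1986)
```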
#1613 - false "XOR solved" advertisement
Even when training got stuck at 75%, the script unconditionally printed the "✅ Training Complete - XOR Solved!" success banner (see the Repro section below).
Fix: Gate both messages on a 0.95 convergence threshold. Below threshold, show a yellow "Training Did Not Converge" panel that explains the dead-ReLU saddle-point symptom, tells the student to re-run or use a larger hidden layer, and explicitly warns against moving on to Milestone 03 TinyDigits with a broken XOR result.
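A rough sketch of the gating logic (the function and variable names here are illustrative placeholders, not the script's actual identifiers; the panel text is abbreviated from the Repro section below):
```python
CONVERGENCE_THRESHOLD = 0.95

def report_result(final_accuracy: float) -> None:
    """Print the success banner only if training actually converged."""
    if final_accuracy >= CONVERGENCE_THRESHOLD:
        print("✅ Training Complete - XOR Solved!")
        print(f"║ Final accuracy: {final_accuracy:.1%} (Perfect XOR solution!)")
    else:
        print(f"⚠️ Training Complete - but only {final_accuracy:.1%} accuracy.")
        print("The network did not converge (likely stuck in a saddle point).")
        print("Try re-running the milestone - random init can pin a 4-unit")
        print("hidden layer at 75% on XOR. See issue #1614.")
```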
Repro
Before:
```
Epoch 500/500 Loss: 0.3490 Accuracy: 75.0%
✅ Training Complete - XOR Solved! <- false
║ Final accuracy: 75.0% (Perfect XOR solution!) <- false
```
After (success path):
```
Epoch 500/500 Loss: 0.0053 Accuracy: 100.0%
✅ Training Complete - XOR Solved!
║ Final accuracy: 100.0% (Perfect XOR solution!)
```
After (failure path, simulated by forcing the bad RNG state):
```
⚠️ Training Complete - but only 75.0% accuracy.
The network did not converge (likely stuck in a saddle point).
Try re-running the milestone - random init can pin a 4-unit hidden
layer at 75% on XOR. See issue #1614.
╔══════════════════════ ⚠️ XOR Not Solved Yet ══════════════════════╗
║ Final accuracy: 75.0% (below the 95% convergence threshold) ║
║ ... ║
```
Files changed
- `tinytorch/milestones/02_1969_xor/02_xor_solved.py` (+108 -46)
Test plan
Relates to #1613
Relates to #1614