[GH-ISSUE #1342] backward class math audit — finite diff mismatch #4379

Closed
opened 2026-04-19 12:24:01 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @profvjreddi on GitHub (Apr 16, 2026).
Original GitHub issue: https://github.com/harvard-edge/cs249r_book/issues/1342

so the finite diff tests from #1336 flagged some value mismatches. did a code read of every Backward class that came up in those failures, and the math is actually correct, so this is narrower than it first looked.

audit results (reading math against textbook derivatives):

  • AddBackward, SubBackward, MulBackward, DivBackward, MatmulBackward: math correct
  • ReLUBackward, SigmoidBackward, GELUBackward: math correct (the textbook forms used for the check are sketched after this list)
  • _reduce_broadcast_grad helper: logic correct
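
for reference, here are the textbook forms the nonlinearity entries were read against. this is plain NumPy, not the tinytorch Backward classes themselves, and it assumes GELU uses the usual tanh approximation (a sketch, not the repo's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    # textbook: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def gelu(x):
    # tanh approximation of GELU (assumed here; the exact erf form differs slightly)
    c = np.sqrt(2.0 / np.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x**3)))

def d_gelu(x):
    # product + chain rule on the tanh-approximated GELU
    c = np.sqrt(2.0 / np.pi)
    u = c * (x + 0.044715 * x**3)
    t = np.tanh(u)
    du = c * (1.0 + 3.0 * 0.044715 * x**2)
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t**2) * du

# quick central-difference spot check of both closed forms
x = np.linspace(-3.0, 3.0, 7)
eps = 1e-4
for f, df in [(sigmoid, d_sigmoid), (gelu, d_gelu)]:
    num = (f(x + eps) - f(x - eps)) / (2.0 * eps)
    assert np.allclose(num, df(x), rtol=1e-3, atol=1e-6)
```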

so the "value mismatch" failures are most likely:

  1. tolerance too tight for legitimate numerical noise. RTOL=1e-3 + EPS=1e-4 may need to be loosened per-op, especially for GELU, which chains multiple nonlinearities
  2. something subtle in the test helper, like how `loss.backward(np.ones_like(loss.data))` propagates through `.sum()` for 0-d outputs (a minimal sketch of both points follows this list)
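
to make both points concrete, here's a minimal, self-contained sketch of a central-difference check against a `.sum()`-reduced scalar loss. this is plain NumPy with a toy loss, not the actual #1336 helper (whose internals aren't reproduced here): the analytical side is seeded the way a 0-d loss would be, i.e. with a gradient of 1.0, and the numerical side perturbs one input element at a time with EPS=1e-4 and compares under RTOL=1e-3:

```python
import numpy as np

EPS = 1e-4    # finite-difference step
RTOL = 1e-3   # relative tolerance in question
ATOL = 1e-6   # small absolute floor so near-zero gradients don't fail spuriously

def numerical_grad(f, x, eps=EPS):
    """Central differences of a scalar-valued f with respect to x, one element at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        plus = f(x)
        x[idx] = orig - eps
        minus = f(x)
        x[idx] = orig
        grad[idx] = (plus - minus) / (2.0 * eps)
    return grad

# toy example: loss = sigmoid(a * b).sum(), a 0-d scalar after the .sum()
rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4))
b = rng.standard_normal((3, 4))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
loss_fn = lambda a_: sigmoid(a_ * b).sum()

# analytical gradient: a seed of 1.0 flows back through .sum(), then the chain rule
s = sigmoid(a * b)
analytical = s * (1.0 - s) * b

numerical = numerical_grad(loss_fn, a)
assert np.allclose(analytical, numerical, rtol=RTOL, atol=ATOL)
```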

what needs to happen:

  1. run one failing test at a time with the analytical and numerical gradients printed side by side (see the harness sketch after this list)
  2. if they match within ~1e-2, tune rtol/atol for that op
  3. if they differ by more than that, look at the test helper or graph traversal, not the Backward class
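
something like this would do for steps 1 and 2. it's a hypothetical harness (the function name and the per-op tolerance values are illustrative, not from the codebase) that prints the analytical and numerical gradients side by side with per-element relative error and looks the rtol up per op, so GELU can be held to a looser bound than the elementwise arithmetic ops:

```python
import numpy as np

# illustrative per-op tolerances; the real values would fall out of step 1
PER_OP_RTOL = {
    "add": 1e-4, "sub": 1e-4, "mul": 1e-4, "div": 1e-4, "matmul": 1e-4,
    "relu": 1e-4, "sigmoid": 1e-3,
    "gelu": 1e-2,  # chained nonlinearities accumulate more finite-difference noise
}

def report(op_name, analytical, numerical, atol=1e-6):
    """Print analytical vs numerical gradients side by side with per-element relative error."""
    rtol = PER_OP_RTOL.get(op_name, 1e-3)
    a = np.asarray(analytical).ravel()
    n = np.asarray(numerical).ravel()
    rel = np.abs(a - n) / (np.abs(n) + atol)
    print(f"{op_name}: rtol={rtol}")
    print(f"{'analytical':>14} {'numerical':>14} {'rel err':>12}")
    for ai, ni, ri in zip(a, n, rel):
        flag = "  <-- over tolerance" if ri > rtol else ""
        print(f"{ai:14.6e} {ni:14.6e} {ri:12.3e}{flag}")
    return np.allclose(a, n, rtol=rtol, atol=atol)
```

if every row for an op lands within ~1e-2, bump that op's entry in the table (step 2); rows that disagree by more than that point at the test helper or graph traversal instead (step 3).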

note: tanh bypass was a separate real bug, fixed in #1343 and tracked in #1341.

GiteaMirror added the type: bug, area: tinytorch labels 2026-04-19 12:24:02 -05:00
Author
Owner

@profvjreddi commented on GitHub (Apr 16, 2026):

updating the scope above after a code review pass. every Backward class that came up here has correct math against the textbook derivatives. the real issue is tolerance tuning or something in the test setup, not the Backward classes. this should save anyone picking this up from chasing the wrong thing.

Author
Owner

@profvjreddi commented on GitHub (Apr 17, 2026):

Closing this out — the audit is complete and there's nothing remaining:

  1. **Backward class math**: All verified correct (Add, Sub, Mul, Div, Matmul, ReLU, Sigmoid, GELU, `_reduce_broadcast_grad`)
  2. **Tanh bug**: Fixed separately in #1343
  3. **Finite diff tests**: The tests from #1336 that flagged mismatches were on the `test/autograd-gradient-correctness` branch, which has been merged and deleted. The permanent test suite (`tests/06_autograd/`) uses analytical expected values with `np.allclose` defaults — no finite-difference checker exists in the current codebase, so there are no tolerance-sensitive tests left to tune (a sketch of that test shape follows this list).
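
For reference, the analytical-expected-value style looks roughly like this (a hypothetical example in plain NumPy, not a copy of anything in `tests/06_autograd/`, which exercises the actual Tensor API). Both sides are exact values, so `np.allclose` at its defaults (rtol=1e-5, atol=1e-8) is not a tolerance that ever needs tuning:

```python
import numpy as np

def test_relu_backward_analytical():
    # hypothetical illustration of the analytical-expected-value test shape
    x = np.array([-1.5, 0.0, 2.0, 3.5])
    upstream = np.ones_like(x)
    grad = upstream * (x > 0)                   # a common ReLU backward rule (subgradient 0 at x == 0)
    expected = np.array([0.0, 0.0, 1.0, 1.0])   # hand-derived expected values
    # np.allclose defaults are fine here because both sides are exact, not finite differences
    assert np.allclose(grad, expected)
```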

If a numerical gradient checker is added in the future, per-op tolerance tuning (especially for GELU) should be considered at that time.
