[PR #1504] [MERGED] fix(tinytorch): LayerNorm gamma/beta missing requires_grad=True #7371

Closed
opened 2026-04-24 17:27:14 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/harvard-edge/cs249r_book/pull/1504
Author: @Shashank-Tripathi-07
Created: 4/23/2026
Status: Merged
Merged: 4/24/2026
Merged by: @profvjreddi

Base: dev ← Head: fix/layernorm-requires-grad


📝 Commits (1)

  • c19109f fix(tinytorch): LayerNorm gamma/beta missing requires_grad=True

📊 Changes

2 files changed (+2 additions, -6 deletions)


📝 tinytorch/src/13_transformers/13_transformers.py (+2 -2)
📝 tinytorch/tests/13_transformers/test_transformer_gradient_flow.py (+0 -4)

📄 Description

What

LayerNorm.__init__ creates gamma and beta as plain Tensor(np.ones(n)) and Tensor(np.zeros(n)) -- no requires_grad flag. After enable_autograd() patches Tensor.__init__, the default is requires_grad=False.

_LayerNormBackward.apply() guards gradient computation on beta.requires_grad and gamma.requires_grad (lines 475, 482). Since both are False, grad_gamma and grad_beta are always None -- LayerNorm silently never learns its scale and shift parameters during training.
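A minimal sketch of that failure mode (illustrative NumPy only, not the actual _LayerNormBackward code; function and variable names here are assumptions):

import numpy as np

# Illustrative sketch -- mimics the guard described above: parameter gradients
# are computed only when the corresponding requires_grad flag is set.
def layernorm_param_grads(grad_output, x_hat, gamma_requires_grad, beta_requires_grad):
    grad_gamma = (grad_output * x_hat).sum(axis=0) if gamma_requires_grad else None
    grad_beta = grad_output.sum(axis=0) if beta_requires_grad else None
    return grad_gamma, grad_beta

g_out = np.ones((4, 8))            # fake upstream gradient
x_hat = np.random.randn(4, 8)      # fake normalized activations
print(layernorm_param_grads(g_out, x_hat, False, False))  # (None, None) -- the bug's symptom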

Why it was hidden

test_layernorm_gradient_flow() worked around the bug by manually setting param.requires_grad = True after construction:

for param in ln.parameters():
    param.requires_grad = True  # workaround masking the bug

As a result, the test passed while the shipped source code remained broken for any real training run that did not manually patch the parameters.

Fix

Pass requires_grad=True at construction in LayerNorm.__init__:

self.gamma = Tensor(np.ones(normalized_shape), requires_grad=True)
self.beta = Tensor(np.zeros(normalized_shape), requires_grad=True)

Also removes the manual workaround from test_layernorm_gradient_flow() so the test now validates the source code directly.

Verification

The backward math in _LayerNormBackward.apply() is correct -- the only thing missing was the flag that gates it. With requires_grad=True set at construction, grad_gamma and grad_beta are computed and populated on every backward pass.
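A quick end-to-end check in that spirit (hypothetical usage sketch; the exact TinyTorch import path and LayerNorm/Tensor call signatures may differ from what is shown):

import numpy as np
# Hypothetical smoke test -- adapt imports/signatures to the actual TinyTorch module.
# from tinytorch import Tensor, LayerNorm, enable_autograd
# enable_autograd()
ln = LayerNorm(8)                                    # gamma/beta now carry requires_grad=True
x = Tensor(np.random.randn(4, 8), requires_grad=True)
out = ln(x)                                          # or ln.forward(x), depending on the API
out.sum().backward()
assert ln.gamma.grad is not None                     # populated instead of silently None
assert ln.beta.grad is not None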


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-24 17:27:14 -05:00

Reference: github-starred/cs249r_book#7371