fix(layers): correct Xavier/Glorot initialization terminology

The formula sqrt(1/fan_in) is actually LeCun initialization (1998),
not Xavier/Glorot. True Xavier uses sqrt(2/(fan_in+fan_out)).

- Rename XAVIER_SCALE_FACTOR → INIT_SCALE_FACTOR
- Update all comments to say "LeCun-style initialization"
- Add note explaining difference between LeCun, Xavier, and He init
- Keep the simpler formula for pedagogical clarity

Fixes #1161
Author: Vijay Janapa Reddi
Date: 2026-02-05 20:11:50 -05:00
Commit: c1c8c11eec (parent 852bc5c2fc)
2 changed files with 29 additions and 24 deletions
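For context on the fix: the three schemes named in the commit message differ only in how they scale the initial weight variance. A minimal sketch (illustration only, not part of the diff; the layer dimensions are arbitrary) comparing them:

```python
import numpy as np

in_features, out_features = 784, 256

# LeCun (1998): sqrt(1/fan_in) -- what the code actually implements
lecun_scale = np.sqrt(1.0 / in_features)

# Xavier/Glorot (2010): sqrt(2/(fan_in + fan_out)) -- considers both dimensions
xavier_scale = np.sqrt(2.0 / (in_features + out_features))

# He (2015): sqrt(2/fan_in) -- compensates for ReLU zeroing half the activations
he_scale = np.sqrt(2.0 / in_features)

print(f"LeCun:  {lecun_scale:.4f}")   # ~0.0357
print(f"Xavier: {xavier_scale:.4f}")  # ~0.0439
print(f"He:     {he_scale:.4f}")      # ~0.0505
```

The three scales are close for typical layer shapes, which is why the simpler LeCun formula is a reasonable pedagogical stand-in; the commit fixes the naming, not the math.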


@@ -68,7 +68,9 @@ from tinytorch.core.tensor import Tensor
 from tinytorch.core.activations import ReLU, Sigmoid
 # Constants for weight initialization
-XAVIER_SCALE_FACTOR = 1.0 # Xavier/Glorot initialization uses sqrt(1/fan_in)
+# Note: True Xavier/Glorot uses sqrt(2/(fan_in+fan_out)), but we use the simpler
+# LeCun-style sqrt(1/fan_in) for pedagogical clarity. Both achieve stable gradients.
+INIT_SCALE_FACTOR = 1.0 # LeCun-style initialization: sqrt(1/fan_in)
 HE_SCALE_FACTOR = 2.0 # He initialization uses sqrt(2/fan_in) for ReLU
 # Constants for dropout
@@ -135,11 +137,14 @@ Input x (batch_size, in_features) @ Weight W (in_features, out_features) + B
 ### Weight Initialization
 Random initialization is crucial for breaking symmetry:
-- **Xavier/Glorot**: Scale by sqrt(1/fan_in) for stable gradients
-- **He**: Scale by sqrt(2/fan_in) for ReLU activation
+- **LeCun**: Scale by sqrt(1/fan_in) for stable gradients (simple, effective)
+- **Xavier/Glorot**: Scale by sqrt(2/(fan_in+fan_out)) considers both dimensions
+- **He**: Scale by sqrt(2/fan_in) optimized for ReLU activation
 - **Too small**: Gradients vanish, learning is slow
 - **Too large**: Gradients explode, training unstable
+We use LeCun-style initialization for simplicity—it works well in practice.
 ### Parameter Counting
 ```
 Linear(784, 256): 784 × 256 + 256 = 200,960 parameters
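The parameter count shown in the hunk above can be checked directly. A quick sketch (plain NumPy rather than the TinyTorch `Linear` class, for illustration only):

```python
import numpy as np

in_features, out_features = 784, 256

# Weight matrix (in_features, out_features) plus bias vector (out_features,)
weight = np.random.randn(in_features, out_features) * np.sqrt(1.0 / in_features)
bias = np.zeros(out_features)

n_params = weight.size + bias.size
print(n_params)  # 200960 = 784 * 256 + 256
```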
@@ -285,10 +290,10 @@ class Linear(Layer):
"""
Initialize linear layer with proper weight initialization.
TODO: Initialize weights and bias with Xavier initialization
TODO: Initialize weights and bias with proper scaling
APPROACH:
1. Create weight matrix (in_features, out_features) with Xavier scaling
1. Create weight matrix (in_features, out_features) with LeCun scaling
2. Create bias vector (out_features,) initialized to zeros if bias=True
3. Store as Tensor objects for use in forward pass
@@ -300,7 +305,7 @@ class Linear(Layer):
 (10,)
 HINTS:
-- Xavier init: scale = sqrt(1/in_features)
+- LeCun-style init: scale = sqrt(1/in_features)
 - Use np.random.randn() for normal distribution
 - bias=None when bias=False
 """
@@ -308,8 +313,8 @@ class Linear(Layer):
 self.in_features = in_features
 self.out_features = out_features
-# Xavier/Glorot initialization for stable gradients
-scale = np.sqrt(XAVIER_SCALE_FACTOR / in_features)
+# LeCun-style initialization for stable gradients
+scale = np.sqrt(INIT_SCALE_FACTOR / in_features)
 weight_data = np.random.randn(in_features, out_features) * scale
 self.weight = Tensor(weight_data)
@@ -400,7 +405,7 @@ This test validates our Linear layer implementation works correctly.
 **What we're testing**: Weight initialization, forward pass, parameter management
 **Why it matters**: Foundation for all neural network architectures
-**Expected**: Proper shapes, Xavier scaling, parameter counting
+**Expected**: Proper shapes, LeCun-style scaling, parameter counting
 """
 # %% nbgrader={"grade": true, "grade_id": "test-linear", "locked": true, "points": 15}
@@ -415,10 +420,10 @@ def test_unit_linear_layer():
 assert layer.weight.shape == (784, 256)
 assert layer.bias.shape == (256,)
-# Test Xavier initialization (weights should be reasonably scaled)
+# Test LeCun-style initialization (weights should be reasonably scaled)
 weight_std = np.std(layer.weight.data)
-expected_std = np.sqrt(XAVIER_SCALE_FACTOR / 784)
-assert 0.5 * expected_std < weight_std < 2.0 * expected_std, f"Weight std {weight_std} not close to Xavier {expected_std}"
+expected_std = np.sqrt(INIT_SCALE_FACTOR / 784)
+assert 0.5 * expected_std < weight_std < 2.0 * expected_std, f"Weight std {weight_std} not close to expected {expected_std}"
 # Test bias initialization (should be zeros)
 assert np.allclose(layer.bias.data, 0), "Bias should be initialized to zeros"
@@ -1152,8 +1157,8 @@ Answer these to deepen your understanding of layer operations and their systems
 ---
-### 3. Xavier Initialization Trade-offs
-**Question**: We initialize weights with scale = sqrt(1/in_features). For Linear(1000, 10), how does this compare to Linear(10, 1000)?
+### 3. Weight Initialization Trade-offs
+**Question**: We initialize weights with scale = sqrt(1/in_features) (LeCun-style). For Linear(1000, 10), how does this compare to Linear(10, 1000)?
 **Calculate**:
 - Linear(1000, 10): scale = sqrt(1/1000) = ___________
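For readers following the diff, the asymmetry this exercise probes can be computed directly (a worked illustration, not part of the changed files; the exercise blanks in the source are left for students):

```python
import numpy as np

# LeCun-style scale depends only on the input dimension (fan_in)
scale_wide_in = np.sqrt(1.0 / 1000)   # Linear(1000, 10): ~0.0316
scale_narrow_in = np.sqrt(1.0 / 10)   # Linear(10, 1000): ~0.3162

# The scales differ by 10x even though both layers hold 10,000 weights;
# fan_out never enters the formula (unlike Xavier, which uses both dims).
print(scale_wide_in, scale_narrow_in)
```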
@@ -1256,7 +1261,7 @@ if __name__ == "__main__":
 Congratulations! You've built the fundamental building blocks that make neural networks possible!
 ### Key Accomplishments
-- Built Linear layers with proper Xavier initialization and parameter management
+- Built Linear layers with proper weight initialization and parameter management
 - Created Dropout layers for regularization with training/inference mode handling
 - Demonstrated manual layer composition for building neural networks
 - Analyzed memory scaling and computational complexity of layer operations


@@ -331,7 +331,7 @@ By the end, your layers will support parameter management, proper initialization
 ```{tip} By completing this module, you will:
-- **Implement** Linear layers with Xavier initialization and proper parameter management for gradient-based training
+- **Implement** Linear layers with proper weight initialization and parameter management for gradient-based training
 - **Master** the mathematical operation `y = xW + b` and understand how parameter counts scale with layer dimensions
 - **Understand** memory usage patterns (parameter memory vs activation memory) and computational complexity of matrix operations
 - **Connect** your implementation to production PyTorch patterns, including `nn.Linear`, `nn.Dropout`, and parameter tracking
@@ -366,7 +366,7 @@ flowchart LR
 | Part | What You'll Implement | Key Concept |
 |------|----------------------|-------------|
 | 1 | `Layer` base class with `forward()`, `__call__()`, `parameters()` | Consistent interface for all layers |
-| 2 | `Linear` layer with Xavier initialization | Learned transformation `y = xW + b` |
+| 2 | `Linear` layer with proper initialization | Learned transformation `y = xW + b` |
 | 3 | `Dropout` with training/inference modes | Regularization through random masking |
 | 4 | `Sequential` container for layer composition | Chaining layers together |
@@ -498,7 +498,7 @@ The elegance is in the simplicity. Matrix multiplication handles all the feature
 How you initialize weights determines whether your network can learn at all. Initialize too small and gradients vanish, making learning impossibly slow. Initialize too large and gradients explode, making training unstable. The sweet spot ensures stable gradient flow through the network.
-Xavier (Glorot) initialization solves this by scaling weights based on the number of inputs. For a layer with `in_features` inputs, Xavier uses scale `sqrt(1/in_features)`. This keeps the variance of activations roughly constant as data flows through layers, preventing vanishing or exploding gradients.
+We use LeCun-style initialization, which scales weights by `sqrt(1/in_features)`. This keeps the variance of activations roughly constant as data flows through layers, preventing vanishing or exploding gradients. (True Xavier/Glorot uses `sqrt(2/(fan_in+fan_out))` which also considers output dimensions, but the simpler LeCun formula works well in practice.)
 Here's your initialization code:
@@ -508,8 +508,8 @@ def __init__(self, in_features, out_features, bias=True):
 self.in_features = in_features
 self.out_features = out_features
-# Xavier/Glorot initialization for stable gradients
-scale = np.sqrt(XAVIER_SCALE_FACTOR / in_features)
+# LeCun-style initialization for stable gradients
+scale = np.sqrt(INIT_SCALE_FACTOR / in_features)
 weight_data = np.random.randn(in_features, out_features) * scale
 self.weight = Tensor(weight_data)
@@ -726,10 +726,10 @@ def parameters(self):
 **Cause**: Weights initialized too large (exploding gradients) or too small (vanishing gradients).
-**Fix**: Use Xavier initialization with proper scale:
+**Fix**: Use proper initialization scaling:
 ```python
-scale = np.sqrt(1.0 / in_features) # Not just random()!
+scale = np.sqrt(1.0 / in_features) # LeCun-style, not just random()!
 weight_data = np.random.randn(in_features, out_features) * scale
 ```
@@ -742,7 +742,7 @@ Your TinyTorch layers and PyTorch's `nn.Linear` and `nn.Dropout` share the same
 | Feature | Your Implementation | PyTorch |
 |---------|---------------------|---------|
 | **Backend** | NumPy (Python) | C++/CUDA |
-| **Initialization** | Xavier manual | Multiple schemes (`init.xavier_uniform_`) |
+| **Initialization** | LeCun-style manual | Multiple schemes (`init.xavier_uniform_`, `init.kaiming_normal_`) |
 | **Parameter Management** | Manual `parameters()` list | `nn.Module` base class with auto-registration |
 | **Training Mode** | Manual `training` flag | `model.train()` / `model.eval()` state |
 | **Layer Types** | Linear, Dropout | 100+ layer types (Conv, LSTM, Attention, etc.) |
@@ -915,7 +915,7 @@ What happens if you initialize all weights to zero? To the same non-zero value?
 **Same non-zero value (e.g., all 1s)**: Same problem - symmetry. All neurons remain identical throughout training. You need randomness to break symmetry.
-**Xavier initialization**: Random values scaled by `sqrt(1/in_features)` break symmetry AND maintain stable gradient variance. This is why proper initialization is essential for learning.
+**Proper initialization**: Random values scaled by `sqrt(1/in_features)` break symmetry AND maintain stable gradient variance. This is why proper initialization is essential for learning.
 ```
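The symmetry argument in the hunk above can be demonstrated in a few lines. A sketch assuming a bare NumPy forward pass (not the TinyTorch classes):

```python
import numpy as np

x = np.random.randn(4, 8)  # a batch of 4 inputs with 8 features

# Constant init: every output neuron computes the identical function
w_const = np.ones((8, 16)) * 0.5
out_const = x @ w_const
# All 16 output columns are identical, so their gradients are identical too
# and the neurons can never differentiate during training
assert np.allclose(out_const, out_const[:, :1])

# Random scaled init breaks the symmetry
w_rand = np.random.randn(8, 16) * np.sqrt(1.0 / 8)
out_rand = x @ w_rand
assert not np.allclose(out_rand, out_rand[:, :1])
```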
**Q6: Batch Size vs Throughput**