Foundations --> Neural Network #29

Closed
opened 2025-11-02 00:01:51 -05:00 by GiteaMirror · 1 comment

Originally created by @gitgithan on GitHub (Oct 17, 2021).

  1. In the table at the top, the output from the second layer is shown as `NxH`; should it be `NxC`?

  2. `SyntaxError`: `plt.scatter(X[:, 0], X[:, 1], c=[colors[_y] for _y in y], edgecolors="k"', s=25)` — there is an extra single quote after `"k"` in the notebook.

  3. Is `def init_weights(self):` used anywhere? It seems to be defined but never applied, or does PyTorch implicitly apply it during some step? I was expecting `model.apply(init_weights)` somewhere.

  4. `The objective is to have weights that are able to produce outputs that follow a similar distribution across all neurons`
    Could this statement be clarified? What exactly is a "distribution across neurons", and what does "similar" mean? Which objects do we want to be similar? Is it that there is one distribution per layer of neurons, each neuron's single output value contributes to that layer's discrete distribution of outputs, and we compare similarity across layers? (That sounds wrong, though, because each layer has a different number of neurons; can discrete distributions with different numbers of items on the x-axis be compared?)

  5. Is there a missing minus sign in the term (with `1/y`) on the left side of `= a(y-1)` in the gradient derivation of `dJ/dW2y`?
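For reference, the `model.apply(init_weights)` pattern asked about in point 3 can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `MLP` class, its dimensions, and the choice of Xavier initialization here are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Hypothetical two-layer network standing in for the notebook's model."""
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

def init_weights(m):
    # Re-initialize every Linear layer; other module types are left untouched
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model = MLP(input_dim=2, hidden_dim=10, num_classes=3)
model.apply(init_weights)  # recursively applies init_weights to every submodule
```

Without the explicit `model.apply(...)` call, the layers keep PyTorch's per-layer-type default initialization, which is the behavior the answer below describes.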


@GokuMohandas commented on GitHub (Oct 18, 2021):

  1. Good catch!
  2. fixed
  3. Yes — I remember reading a while ago that PyTorch layers are initialized with different methods depending on the layer type. You can of course override this with any initialization method you want to use.
  4. I've edited it to say the following (I'll clarify more in the next version): "So far we have been initializing weights with small random values but this isn't optimal for convergence during training. The objective is to initialize the appropriate weights such that our activations (outputs of layers) don't vanish (too small) or explode (too large), as either of these situations will hinder convergence. We can do this by sampling the weights uniformly from a bound distribution (many that take into account the precise activation function used) such that all activations have unit variance."
  5. nice catch!
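The unit-variance idea in answer 4 can be checked numerically. The sketch below (a NumPy illustration, not code from the lesson; the layer sizes are arbitrary) draws weights from the Xavier/Glorot uniform bound `sqrt(6 / (fan_in + fan_out))` and shows that pre-activations computed from unit-variance inputs keep a variance near 1 instead of shrinking or blowing up.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, N = 512, 512, 10_000

# Unit-variance inputs, e.g. outputs of a previous well-initialized layer
x = rng.standard_normal((N, fan_in))

# Xavier/Glorot uniform: W ~ U(-b, b) with b = sqrt(6 / (fan_in + fan_out)),
# so Var(W) = b^2 / 3 = 2 / (fan_in + fan_out)
bound = np.sqrt(6 / (fan_in + fan_out))
W = rng.uniform(-bound, bound, size=(fan_in, fan_out))

# Var(z) ≈ fan_in * Var(W) * Var(x), which this bound keeps close to 1
z = x @ W
print(z.var())
```

With `fan_in == fan_out`, the variance works out to exactly `fan_in * 2/(fan_in + fan_out) = 1`, which is what "all activations have unit variance" refers to.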

Reference: github-starred/Made-With-ML#29