Mirror of https://github.com/MLSysBook/TinyTorch.git (synced 2026-04-28 06:49:18 -05:00)
feat: Update mathematical equations to use proper LaTeX formatting
- Updated autograd module: chain rule, partial derivatives, gradient rules
- Updated activations module: ReLU, sigmoid, tanh, softmax formulas
- Updated layers module: linear transformation, matrix multiplication
- Updated networks module: function composition formulas

All mathematical equations now use LaTeX formatting ($...$ and $$...$$) for better rendering in Jupyter notebooks and documentation.
@@ -305,25 +305,25 @@ This theorem guarantees that neural networks with nonlinear activations can lear
 ### The Four Fundamental Activation Functions
 
 #### **1. ReLU (Rectified Linear Unit)**
-- **Formula**: f(x) = max(0, x)
+- **Formula**: $f(x) = \max(0, x)$
 - **Use case**: Hidden layers in most networks
 - **Advantages**: Simple, fast, no vanishing gradients
 - **Disadvantages**: "Dead neurons" problem
 
 #### **2. Sigmoid**
-- **Formula**: f(x) = 1/(1 + e^(-x))
+- **Formula**: $f(x) = \frac{1}{1 + e^{-x}}$
 - **Use case**: Binary classification output
 - **Advantages**: Smooth, probabilistic interpretation
 - **Disadvantages**: Vanishing gradients, computationally expensive
 
 #### **3. Tanh (Hyperbolic Tangent)**
-- **Formula**: f(x) = (e^x - e^(-x))/(e^x + e^(-x))
+- **Formula**: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
 - **Use case**: Hidden layers (better than sigmoid)
 - **Advantages**: Zero-centered, stronger gradients than sigmoid
 - **Disadvantages**: Still suffers from vanishing gradients
 
 #### **4. Softmax**
-- **Formula**: f(x_i) = e^(x_i) / Σ(e^(x_j))
+- **Formula**: $f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$
 - **Use case**: Multi-class classification output
 - **Advantages**: Probabilistic, sums to 1
 - **Disadvantages**: Computationally expensive, can saturate
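As a quick illustration of the four formulas above, here is a minimal NumPy sketch (an editorial example, not the TinyTorch source; the function names are illustrative):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    return np.tanh(x)

def softmax(x):
    # Shift by the max for numerical stability, then normalize so outputs sum to 1
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), tanh(x), softmax(x), sep="\n")
```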
@@ -99,10 +99,8 @@ from tinytorch.core.activations import ReLU, Sigmoid # Nonlinearity
 ### Linear Algebra at the Heart of ML
 Neural networks are fundamentally about **linear transformations** followed by **nonlinear activations**:
 
-```
-Layer: y = Wx + b (linear transformation)
-Activation: z = σ(y) (nonlinear transformation)
-```
+$$\text{Layer: } y = Wx + b \text{ (linear transformation)}$$
+$$\text{Activation: } z = \sigma(y) \text{ (nonlinear transformation)}$$
 
 ### Matrix Multiplication: The Engine of Deep Learning
 Every forward pass in a neural network involves matrix multiplication:
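A tiny NumPy sketch of the layer-plus-activation pattern above (editorial illustration, not the TinyTorch implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weight matrix: 4 outputs, 3 inputs
b = np.zeros(4)               # bias vector
x = rng.normal(size=3)        # input vector

y = W @ x + b                 # linear transformation: y = Wx + b
z = np.maximum(0.0, y)        # nonlinear activation: z = sigma(y), here ReLU
print(y.shape, z.shape)       # (4,) (4,)
```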
@@ -138,11 +136,9 @@ Every framework optimizes matrix multiplication:
 ### What is Matrix Multiplication?
 Matrix multiplication is the **fundamental operation** that powers neural networks. When we multiply matrices A and B:
 
-```
-C = A @ B
-```
+$$C = A \times B$$
 
-Each element C[i,j] is the **dot product** of row i from A and column j from B.
+Each element $C_{i,j}$ is the **dot product** of row $i$ from A and column $j$ from B.
 
 ### The Mathematical Foundation: Linear Algebra in Neural Networks
 
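The dot-product definition can be checked directly in NumPy (an illustrative sketch, not project code):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = A @ B  # matrix product

# Each C[i, j] equals the dot product of row i of A with column j of B
for i in range(2):
    for j in range(2):
        assert np.isclose(C[i, j], np.dot(A[i, :], B[:, j]))
print(C)  # [[19. 22.] [43. 50.]]
```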
@@ -106,9 +106,7 @@ from tinytorch.core.tensor import Tensor # Foundation
 ### Function Composition at Scale
 Neural networks are fundamentally about **function composition**:
 
-```
-f(x) = f_n(f_{n-1}(...f_2(f_1(x))))
-```
+$$f(x) = f_n(f_{n-1}(\ldots f_2(f_1(x)) \ldots))$$
 
 Each layer is a function, and the network is the composition of all these functions.
 
@@ -155,15 +153,10 @@ Input → Layer1 → Layer2 → Layer3 → Output
 #### **Function Composition in Mathematics**
 In mathematics, function composition combines simple functions to create complex ones:
 
-```python
-# Mathematical composition: (f ∘ g)(x) = f(g(x))
-def compose(f, g):
-    return lambda x: f(g(x))
-
-# Neural network composition: h(x) = f_n(f_{n-1}(...f_2(f_1(x))))
-def network(layers):
-    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)
-```
+$$(f \circ g)(x) = f(g(x))$$
+
+Neural network composition:
+
+$$h(x) = f_n(f_{n-1}(\ldots f_2(f_1(x)) \ldots))$$
 
 #### **Why Composition is Powerful**
 1. **Modularity**: Each layer has a specific, well-defined purpose
 
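The Python snippet that this hunk removes relies on `reduce` without importing it; a self-contained, runnable version of the same idea might look like this (editorial sketch):

```python
from functools import reduce

def compose(f, g):
    # Mathematical composition: (f o g)(x) = f(g(x))
    return lambda x: f(g(x))

def network(layers):
    # h(x) = f_n(...f_2(f_1(x))...): apply the layers left to right
    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)

double = lambda x: 2 * x
increment = lambda x: x + 1

print(compose(double, increment)(3))               # double(increment(3)) = 8
print(network([increment, double, increment])(3))  # ((3 + 1) * 2) + 1 = 9
```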
@@ -131,35 +131,35 @@ output = add_result * sub_result = 5
 #### **Backward Pass: Computing Gradients**
 Traverse the graph from outputs to inputs, computing gradients using the chain rule:
 
-```python
-# Backward pass for f(x, y) = (x + y) * (x - y)
-# Starting from output gradient = 1
-∂output/∂multiply = 1
-∂output/∂add = ∂output/∂multiply * ∂multiply/∂add = 1 * sub_result = 1
-∂output/∂sub = ∂output/∂multiply * ∂multiply/∂sub = 1 * add_result = 5
-∂output/∂x = ∂output/∂add * ∂add/∂x + ∂output/∂sub * ∂sub/∂x = 1 * 1 + 5 * 1 = 6
-∂output/∂y = ∂output/∂add * ∂add/∂y + ∂output/∂sub * ∂sub/∂y = 1 * 1 + 5 * (-1) = -4
-```
+For $f(x, y) = (x + y) \cdot (x - y)$ with $x = 3, y = 2$:
+
+$$\frac{\partial \text{output}}{\partial \text{multiply}} = 1$$
+
+$$\frac{\partial \text{output}}{\partial \text{add}} = \frac{\partial \text{output}}{\partial \text{multiply}} \cdot \frac{\partial \text{multiply}}{\partial \text{add}} = 1 \cdot \text{sub\_result} = 1$$
+
+$$\frac{\partial \text{output}}{\partial \text{sub}} = \frac{\partial \text{output}}{\partial \text{multiply}} \cdot \frac{\partial \text{multiply}}{\partial \text{sub}} = 1 \cdot \text{add\_result} = 5$$
+
+$$\frac{\partial \text{output}}{\partial x} = \frac{\partial \text{output}}{\partial \text{add}} \cdot \frac{\partial \text{add}}{\partial x} + \frac{\partial \text{output}}{\partial \text{sub}} \cdot \frac{\partial \text{sub}}{\partial x} = 1 \cdot 1 + 5 \cdot 1 = 6$$
+
+$$\frac{\partial \text{output}}{\partial y} = \frac{\partial \text{output}}{\partial \text{add}} \cdot \frac{\partial \text{add}}{\partial y} + \frac{\partial \text{output}}{\partial \text{sub}} \cdot \frac{\partial \text{sub}}{\partial y} = 1 \cdot 1 + 5 \cdot (-1) = -4$$
 
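The same worked example can be reproduced numerically; this standalone sketch applies the chain rule by hand (editorial illustration, independent of TinyTorch's `Variable` class):

```python
# Manual backward pass for f(x, y) = (x + y) * (x - y) at x = 3, y = 2
x, y = 3.0, 2.0
add_result = x + y                 # 5
sub_result = x - y                 # 1
output = add_result * sub_result   # 5

# Seed the backward pass with d(output)/d(output) = 1
d_output = 1.0
d_add = d_output * sub_result       # 1
d_sub = d_output * add_result       # 5
d_x = d_add * 1.0 + d_sub * 1.0     # 6
d_y = d_add * 1.0 + d_sub * (-1.0)  # -4

# Check against the analytic gradient of f = x^2 - y^2
assert d_x == 2 * x and d_y == -2 * y
print(d_x, d_y)  # 6.0 -4.0
```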
 ### Mathematical Foundation: The Chain Rule
 
 #### **Single Variable Chain Rule**
-For composite functions: If z = f(g(x)), then:
-```
-dz/dx = (dz/df) * (df/dx)
-```
+For composite functions: If $z = f(g(x))$, then:
+
+$$\frac{dz}{dx} = \frac{dz}{df} \cdot \frac{df}{dx}$$
 
 #### **Multivariable Chain Rule**
-For functions of multiple variables: If z = f(x, y) where x = g(t) and y = h(t), then:
-```
-dz/dt = (∂z/∂x) * (dx/dt) + (∂z/∂y) * (dy/dt)
-```
+For functions of multiple variables: If $z = f(x, y)$ where $x = g(t)$ and $y = h(t)$, then:
+
+$$\frac{dz}{dt} = \frac{\partial z}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial z}{\partial y} \cdot \frac{dy}{dt}$$
 
 #### **Chain Rule in Computational Graphs**
 For any path from input to output through intermediate nodes:
-```
-∂output/∂input = ∏(∂node_{i+1}/∂node_i) for all nodes in the path
-```
+
+$$\frac{\partial \text{output}}{\partial \text{input}} = \prod_{i} \frac{\partial \text{node}_{i+1}}{\partial \text{node}_i}$$
 
 ### Automatic Differentiation Modes
 
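A small numerical check of the multivariable chain rule above (editorial sketch using only the standard library; the functions and value of `t` are arbitrary choices):

```python
import math

# z = f(x, y) = x * y, with x = g(t) = t**2 and y = h(t) = sin(t)
def z_of_t(t):
    return (t ** 2) * math.sin(t)

t = 1.3
x, y = t ** 2, math.sin(t)

# Chain rule: dz/dt = (dz/dx) * (dx/dt) + (dz/dy) * (dy/dt)
dz_dt = y * (2 * t) + x * math.cos(t)

# Compare with a central finite difference
eps = 1e-6
numeric = (z_of_t(t + eps) - z_of_t(t - eps)) / (2 * eps)
print(abs(dz_dt - numeric) < 1e-6)  # True
```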
@@ -472,10 +472,10 @@ Every differentiable operation follows the same pattern:
 3. **Return Variable**: With the result and grad_fn
 
 ### Mathematical Rules
-- **Addition**: `d(x + y)/dx = 1, d(x + y)/dy = 1`
-- **Multiplication**: `d(x * y)/dx = y, d(x * y)/dy = x`
-- **Subtraction**: `d(x - y)/dx = 1, d(x - y)/dy = -1`
-- **Division**: `d(x / y)/dx = 1/y, d(x / y)/dy = -x/y²`
+- **Addition**: $\frac{d(x + y)}{dx} = 1$, $\frac{d(x + y)}{dy} = 1$
+- **Multiplication**: $\frac{d(x \cdot y)}{dx} = y$, $\frac{d(x \cdot y)}{dy} = x$
+- **Subtraction**: $\frac{d(x - y)}{dx} = 1$, $\frac{d(x - y)}{dy} = -1$
+- **Division**: $\frac{d(x / y)}{dx} = \frac{1}{y}$, $\frac{d(x / y)}{dy} = -\frac{x}{y^2}$
 
 ### Implementation Strategy
 Each operation creates a closure that captures the input variables and implements the gradient computation rule.
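To make the closure idea concrete, here is a stripped-down sketch of how a multiply operation could capture its inputs in a `grad_fn` (an editorial approximation of the pattern, not the actual TinyTorch code):

```python
class Variable:
    # Minimal stand-in: holds a value, a gradient slot, and an optional grad_fn
    def __init__(self, data, grad_fn=None):
        self.data = data
        self.grad = 0.0
        self.grad_fn = grad_fn

def multiply(a, b):
    # Forward pass computes the value; the closure remembers a and b
    def grad_fn(upstream):
        # d(a*b)/da = b, d(a*b)/db = a (each scaled by the upstream gradient)
        a.grad += upstream * b.data
        b.grad += upstream * a.data
    return Variable(a.data * b.data, grad_fn)

x, y = Variable(3.0), Variable(2.0)
z = multiply(x, y)
z.grad_fn(1.0)          # seed the backward pass with gradient 1
print(x.grad, y.grad)   # 2.0 3.0
```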
@@ -680,7 +680,7 @@ def divide(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Va
 4. Return Variable with result and grad_fn
 
 MATHEMATICAL RULE:
-If z = x / y, then dz/dx = 1/y, dz/dy = -x/y²
+If z = x / y, then dz/dx = \frac{1}{y}, dz/dy = -\frac{x}{y^2}
 
 EXAMPLE:
 x = Variable(6.0), y = Variable(2.0)
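Plugging the docstring's example values into the division rule gives a quick sanity check (editorial sketch, not part of the diff):

```python
x, y = 6.0, 2.0

# Analytic rule: dz/dx = 1/y, dz/dy = -x/y**2
dz_dx = 1.0 / y       # 0.5
dz_dy = -x / y ** 2   # -1.5

# Numerical check with central differences
eps = 1e-6
num_dx = ((x + eps) / y - (x - eps) / y) / (2 * eps)
num_dy = (x / (y + eps) - x / (y - eps)) / (2 * eps)
print(dz_dx, dz_dy)                                            # 0.5 -1.5
print(abs(dz_dx - num_dx) < 1e-6, abs(dz_dy - num_dy) < 1e-6)  # True True
```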