1. The Forward Pass (Sensitivity)

Before we learn Backprop, let's remember sensitivity. If a machine multiplies by 3, nudging the input by 1 moves the output by 3.

Input 2 → [ * 3 ] → Output 6

Sensitivity = 3 (Constant)
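The "multiply by 3" machine can be checked numerically: nudge the input and see how far the output moves. A minimal sketch (the function name `machine` is just an illustration):

```python
def machine(x):
    return x * 3  # the "multiply by 3" machine

# Nudge the input by 1 and measure how far the output moves
h = 1.0
sensitivity = (machine(2 + h) - machine(2)) / h
print(sensitivity)  # 3.0 — constant, no matter where we nudge
```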

2. The Chain Rule

When machines are chained, their sensitivities multiply.

Input 1 → [*2] → 2 → [*5] → 10

Overall Sensitivity = 2 × 5 = 10
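A numerical check that chained sensitivities multiply, sketched with two toy machines (`double` and `times5` are illustrative names):

```python
def double(x):
    return x * 2

def times5(x):
    return x * 5

def chain(x):
    return times5(double(x))  # the two machines in series

# Tiny nudge: the combined sensitivity should be 2 * 5 = 10
h = 1e-6
x = 1.0
numeric = (chain(x + h) - chain(x)) / h
print(round(numeric, 3))  # 10.0
```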

3. Deriving Equations: The Power Rule

Why is the gradient of b² equal to 2*b? Because the slope gets steeper as b gets bigger.

Input b = 3 → Output b² = 9.00
Current Sensitivity (Slope): slope = 2b = 6.0
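The "slope gets steeper" claim can be verified with finite differences: at each point b, the measured slope of b² comes out close to 2b. A small sketch:

```python
def square(b):
    return b ** 2

# Measure the slope of b² at several points; it should track 2*b
h = 1e-6
for b in [1.0, 3.0, 5.0]:
    slope = (square(b + h) - square(b)) / h
    print(b, round(slope, 3))  # slope ≈ 2*b: 2.0, 6.0, 10.0
```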

4. Deriving Equations: The Product Rule

In a = x * y, why is the derivative with respect to x equal to y? Use the "Hourly Wage" analogy.

x = 5 hours, y = $10/hour
Total Pay = x * y = 50

1 more hour (+1 to x) → Pay increases by $10 (y).
$1 raise (+1 to y) → Pay increases by $5 (x).
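The hourly-wage analogy in code, using the same numbers (5 hours at $10/hour):

```python
hours, wage = 5.0, 10.0
pay = hours * wage  # 50.0

# One more hour: pay rises by the wage (∂pay/∂hours = wage)
print((hours + 1) * wage - pay)  # 10.0
# A $1 raise: pay rises by the hours worked (∂pay/∂wage = hours)
print(hours * (wage + 1) - pay)  # 5.0
```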

5. Putting it Together: The Full Chain Rule

Step 1: Local derivatives

Operation   | Local Derivative
a = x * y   | ∂a/∂x = y,  ∂a/∂y = x
b = a + z   | ∂b/∂a = 1,  ∂b/∂z = 1
c = b²      | ∂c/∂b = 2*b

Step 2: The Graph Structure

x, y → [×] → a
a, z → [+] → b
b → [()²] → c

Step 3: Backward Pass

1. Start at Output c
∂c/∂c = 1
2. Backprop to b
∂c/∂b = 1 * (2*b) = 2b
3. Backprop to a and z
∂c/∂a = (2b) * (1) = 2b
∂c/∂z = (2b) * (1) = 2b
4. Backprop to x and y
∂c/∂x = (2b) * (y) = 2by
∂c/∂y = (2b) * (x) = 2bx
Rule: Gradient = Sum of (Upstream Gradient × Local Derivative)
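The whole backward pass can be sketched as straight-line code. The input values (x=2, y=3, z=4) are just an example:

```python
# Forward pass: c = (x*y + z)²
x, y, z = 2.0, 3.0, 4.0
a = x * y    # 6.0
b = a + z    # 10.0
c = b ** 2   # 100.0

# Backward pass: each step is upstream gradient × local derivative
dc_dc = 1.0               # start at the output
dc_db = dc_dc * (2 * b)   # 2b = 20.0
dc_da = dc_db * 1.0       # addition passes gradient through: 20.0
dc_dz = dc_db * 1.0       # 20.0
dc_dx = dc_da * y         # 2by = 60.0
dc_dy = dc_da * x         # 2bx = 40.0
print(dc_dx, dc_dy, dc_dz)  # 60.0 40.0 20.0
```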

6. Important Functions and their derivatives

1. Sigmoid Function: σ(x)
Squashes any input into a probability between 0 and 1. Used in binary classification.
Formula: 1 / (1 + e⁻ˣ)
Derivative: d(σ)/dx = σ(x) · (1 - σ(x))
Key Intuition: The gradient is highest (0.25) at 0, and vanishes to 0 at extremes.
x: 0.0  |  σ(x): 0.50  | Grad: 0.250
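A quick sketch of the sigmoid and its derivative, showing the peak gradient at 0 and the vanishing gradient far from it:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # σ(x) · (1 - σ(x))

print(sigmoid(0.0), sigmoid_grad(0.0))  # 0.5 0.25 — peak gradient at 0
print(sigmoid_grad(10.0))               # tiny — gradient has nearly vanished
```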
2. ReLU (Rectified Linear Unit)
Standard for hidden layers. Keeps positive values, kills negatives.
Formula: max(0, x)
d(ReLU)/dx = 1 if x>0, else 0
It's a switch. If active, gradient flows perfectly (1). If inactive, gradient dies (0).
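The switch behavior in code:

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

print(relu(3.0), relu_grad(3.0))    # 3.0 1.0 — switch on: gradient flows
print(relu(-2.0), relu_grad(-2.0))  # 0.0 0.0 — switch off: gradient dies
```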
3. Logarithm
Used in Loss functions (like Cross-Entropy) to penalize errors.
Formula: ln(x)
d(log x)/dx = 1/x
Gradient gets huge for small x, forcing the model to learn quickly when wrong.
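A sketch of how the 1/x gradient explodes as x approaches 0, which is what punishes confident wrong predictions in cross-entropy:

```python
def log_grad(x):
    return 1.0 / x  # d(ln x)/dx

# The closer x gets to 0, the larger the gradient
for x in [1.0, 0.1, 0.01]:
    print(x, log_grad(x))  # gradients: 1.0, 10.0, 100.0
```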
4. Vector Math
The dot product of a vector with itself. Represents squared magnitude (L2 Norm).
Formula: xᵀx
d(xᵀx)/dx = 2x
This is just the vector version of d(x²)/dx = 2x.
5. Scalar Mult
Simple linear scaling (Weights).
Formula: a · x
d(ax)/dx = a
Constant multiplier rule.
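Both rules above can be checked numerically. A sketch with plain Python lists (no library assumed): perturbing one coordinate of x at a time recovers the 2x gradient of xᵀx, and the scalar rule falls out the same way.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

x = [1.0, 2.0, 3.0]
h = 1e-6

# d(xᵀx)/dx = 2x: bump one coordinate at a time
grads = []
for i in range(len(x)):
    bumped = x[:]
    bumped[i] += h
    grads.append((dot(bumped, bumped) - dot(x, x)) / h)
print([round(g, 3) for g in grads])  # [2.0, 4.0, 6.0] = 2x

# d(a*x)/dx = a: the constant multiplier passes straight through
a = 7.0
scalar_slope = (a * (2.0 + h) - a * 2.0) / h
print(round(scalar_slope, 3))  # 7.0
```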