1. The Forward Pass (Sensitivity)

Before we learn Backprop, let's remember sensitivity. If a machine multiplies by 3, nudging the input by 1 moves the output by 3.

Input 2 → [ * 3 ] → Output 6

Sensitivity = 3 (Constant)
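The "multiply by 3" machine can be checked numerically: nudge the input and see how far the output moves. A minimal sketch (the function name `machine` is just an illustration):

```python
def machine(x):
    return x * 3  # the "multiply by 3" machine

# Nudge the input by 1 and measure how far the output moves
h = 1.0
sensitivity = (machine(2 + h) - machine(2)) / h
print(sensitivity)  # 3.0 — constant, no matter where we nudge
```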

2. The Chain Rule

When machines are chained, their sensitivities multiply.

Input 1 → [*2] → 2 → [*5] → 10

Overall Sensitivity = 2 × 5 = 10
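A numerical check that chained sensitivities multiply, sketched with two toy machines (`double` and `times5` are illustrative names):

```python
def double(x):
    return x * 2

def times5(x):
    return x * 5

def chain(x):
    return times5(double(x))  # the two machines in series

# Tiny nudge: the combined sensitivity should be 2 * 5 = 10
h = 1e-6
x = 1.0
numeric = (chain(x + h) - chain(x)) / h
print(round(numeric, 3))  # 10.0
```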

3. Deriving Equations: The Power Rule

Why is the gradient of b² equal to 2*b? Because the slope gets steeper as b gets bigger.

Input b = 3 → Output b² = 9.00
Current Sensitivity (Slope): slope = 2b = 6.0
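The "slope gets steeper" claim can be verified with finite differences: at each point b, the measured slope of b² comes out close to 2b. A small sketch:

```python
def square(b):
    return b ** 2

# Measure the slope of b² at several points; it should track 2*b
h = 1e-6
for b in [1.0, 3.0, 5.0]:
    slope = (square(b + h) - square(b)) / h
    print(b, round(slope, 3))  # slope ≈ 2*b: 2.0, 6.0, 10.0
```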

4. Deriving Equations: The Product Rule

In a = x * y, why is the derivative with respect to x equal to y? Use the "Hourly Wage" analogy.

x = 5 hours, y = $10/hour
Total Pay = x * y = 50

1 more hour (+1 to x) → Pay increases by $10 (y).
$1 raise (+1 to y) → Pay increases by $5 (x).
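The hourly-wage analogy in code, using the same numbers (5 hours at $10/hour):

```python
hours, wage = 5.0, 10.0
pay = hours * wage  # 50.0

# One more hour: pay rises by the wage (∂pay/∂hours = wage)
print((hours + 1) * wage - pay)  # 10.0
# A $1 raise: pay rises by the hours worked (∂pay/∂wage = hours)
print(hours * (wage + 1) - pay)  # 5.0
```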

5. Putting it Together: The Full Chain Rule

Step 1: Local derivatives

Operation   | Local Derivative
a = x * y   | ∂a/∂x = y,  ∂a/∂y = x
b = a + z   | ∂b/∂a = 1,  ∂b/∂z = 1
c = b²      | ∂c/∂b = 2*b

Step 2: The Graph Structure

x, y → [×] → a
a, z → [+] → b
b → [()²] → c

Step 3: Backward Pass

1. Start at Output c
∂c/∂c = 1
2. Backprop to b
∂c/∂b = 1 * (2*b) = 2b
3. Backprop to a and z
∂c/∂a = (2b) * (1) = 2b
∂c/∂z = (2b) * (1) = 2b
4. Backprop to x and y
∂c/∂x = (2b) * (y) = 2by
∂c/∂y = (2b) * (x) = 2bx
Rule: Gradient = Sum of (Upstream Gradient × Local Derivative)
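The whole backward pass can be sketched as straight-line code. The input values (x=2, y=3, z=4) are just an example:

```python
# Forward pass: c = (x*y + z)²
x, y, z = 2.0, 3.0, 4.0
a = x * y    # 6.0
b = a + z    # 10.0
c = b ** 2   # 100.0

# Backward pass: each step is upstream gradient × local derivative
dc_dc = 1.0               # start at the output
dc_db = dc_dc * (2 * b)   # 2b = 20.0
dc_da = dc_db * 1.0       # addition passes gradient through: 20.0
dc_dz = dc_db * 1.0       # 20.0
dc_dx = dc_da * y         # 2by = 60.0
dc_dy = dc_da * x         # 2bx = 40.0
print(dc_dx, dc_dy, dc_dz)  # 60.0 40.0 20.0
```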

6. Important Functions and their derivatives

1. Sigmoid Function: σ(x)
Squashes any input into a probability between 0 and 1. Used in binary classification.
Formula: 1 / (1 + e⁻ˣ)
Derivative: d(σ)/dx = σ(x) · (1 - σ(x))
Key Intuition: The gradient is highest (0.25) at 0, and vanishes to 0 at extremes.
x: 0.0  |  σ(x): 0.50  | Grad: 0.250
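A quick sketch of the sigmoid and its derivative, showing the peak gradient at 0 and the vanishing gradient far from it:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # σ(x) · (1 - σ(x))

print(sigmoid(0.0), sigmoid_grad(0.0))  # 0.5 0.25 — peak gradient at 0
print(sigmoid_grad(10.0))               # tiny — gradient has nearly vanished
```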
2. ReLU (Rectified Linear Unit)
Standard for hidden layers. Keeps positive values, kills negatives.
Formula: max(0, x)
d(ReLU)/dx = 1 if x>0, else 0
It's a switch. If active, gradient flows perfectly (1). If inactive, gradient dies (0).
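The switch behavior in code:

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

print(relu(3.0), relu_grad(3.0))    # 3.0 1.0 — switch on: gradient flows
print(relu(-2.0), relu_grad(-2.0))  # 0.0 0.0 — switch off: gradient dies
```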
3. Logarithm
Used in Loss functions (like Cross-Entropy) to penalize errors.
Formula: ln(x)
d(log x)/dx = 1/x
Gradient gets huge for small x, forcing the model to learn quickly when wrong.
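A sketch of how the 1/x gradient explodes as x approaches 0, which is what punishes confident wrong predictions in cross-entropy:

```python
def log_grad(x):
    return 1.0 / x  # d(ln x)/dx

# The closer x gets to 0, the larger the gradient
for x in [1.0, 0.1, 0.01]:
    print(x, log_grad(x))  # gradients: 1.0, 10.0, 100.0
```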
4. Vector Math
The dot product of a vector with itself. Represents squared magnitude (L2 Norm).
Formula: xᵀx
d(xᵀx)/dx = 2x
This is just the vector version of d(x²)/dx = 2x.
5. Scalar Mult
Simple linear scaling (Weights).
Formula: a · x
d(ax)/dx = a
Constant multiplier rule.
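Both rules above can be checked numerically. A sketch with plain Python lists (no library assumed): perturbing one coordinate of x at a time recovers the 2x gradient of xᵀx, and the scalar rule falls out the same way.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

x = [1.0, 2.0, 3.0]
h = 1e-6

# d(xᵀx)/dx = 2x: bump one coordinate at a time
grads = []
for i in range(len(x)):
    bumped = x[:]
    bumped[i] += h
    grads.append((dot(bumped, bumped) - dot(x, x)) / h)
print([round(g, 3) for g in grads])  # [2.0, 4.0, 6.0] = 2x

# d(a*x)/dx = a: the constant multiplier passes straight through
a = 7.0
scalar_slope = (a * (2.0 + h) - a * 2.0) / h
print(round(scalar_slope, 3))  # 7.0
```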