Visualization of Gradient Descent

Initialize Trainable Parameters

We initialize w to a random value within constraints and set b = 0. Next, we calculate the loss to see how far off our predictions are.
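A minimal sketch of this initialization step, assuming NumPy; the ±1 constraint range for w is an illustrative assumption, since the visualization does not state its actual bounds:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# w is drawn uniformly from an assumed [-1, 1] constraint range;
# b starts at exactly zero, as in the text.
w = rng.uniform(-1.0, 1.0)
b = 0.0
```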

Training Parameters

Batch Processing: batch gradient descent uses all 50 training samples for each update.

Loss Calculation (Using BGD)

For each training example, the loss table records: Input (x), True Output (y), Predicted (ŷ), Error (ŷ - y), and Squared Error.

MSE = (Σ Squared Error) / n = ? / 50 = ?
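The mean-squared-error computation above can be sketched as a small NumPy helper (the function name is my own, not from the visualization):

```python
import numpy as np

def mse(y_true, y_pred):
    # MSE = (1/n) * Σ(y - ŷ)², averaged over the whole batch
    errors = y_pred - y_true
    return np.mean(errors ** 2)

# Example: predictions [1, 3] against targets [1, 2] give errors [0, 1],
# squared errors [0, 1], so the mean is 0.5.
```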

∂Loss/∂ŷ

Starting from the loss,

Loss = (1/n) × Σ(y - ŷ)²

differentiating with respect to ŷ gives

∂Loss/∂ŷ = (2/n) × Σ(ŷ - y)

Plug in Numbers

(Using the sum of the errors (ŷ - y) from the loss calculation step)

∂Loss/∂ŷ = (2/50) × ? = ?
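A sketch of the batch-aggregated ∂Loss/∂ŷ as code (helper name assumed):

```python
import numpy as np

def dloss_dyhat(y_true, y_pred):
    # ∂Loss/∂ŷ aggregated over the batch: (2/n) * Σ(ŷ - y)
    n = len(y_true)
    return (2.0 / n) * np.sum(y_pred - y_true)
```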

∂Loss/∂w

Chain Rule for w

∂Loss/∂w = (∂Loss/∂ŷ) × (∂ŷ/∂w)

Find ∂ŷ/∂w

Since ŷ = w×x + b:

∂ŷ/∂w = x

Combine

∂Loss/∂w = (2/n) × (ŷ - y) × x

Sum over all Training Examples

∂Loss/∂w = (2/n) × Σᵢ₌₁ⁿ (ŷᵢ - yᵢ) × xᵢ

Plug in Numbers

∂Loss/∂w = (2/50) × ? = ?
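The summed gradient for w can be sketched directly from the formula above (the helper name is assumed):

```python
import numpy as np

def dloss_dw(x, y_true, y_pred):
    # ∂Loss/∂w = (2/n) * Σ (ŷᵢ - yᵢ) * xᵢ
    n = len(x)
    return (2.0 / n) * np.sum((y_pred - y_true) * x)
```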

∂Loss/∂b

Chain Rule for b

∂Loss/∂b = (∂Loss/∂ŷ) × (∂ŷ/∂b)

Find ∂ŷ/∂b

Since ŷ = w×x + b:

∂ŷ/∂b = 1

Combine

∂Loss/∂b = (2/n) × (ŷ - y)

Sum over all Training Examples

∂Loss/∂b = (2/n) × Σᵢ₌₁ⁿ (ŷᵢ - yᵢ)

Plug in Numbers

∂Loss/∂b = (2/50) × ? = ?
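And the matching sketch for the bias gradient, which is the same sum without the xᵢ factor (helper name assumed):

```python
import numpy as np

def dloss_db(y_true, y_pred):
    # ∂Loss/∂b = (2/n) * Σ (ŷᵢ - yᵢ)
    n = len(y_true)
    return (2.0 / n) * np.sum(y_pred - y_true)
```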
[Plot: Data Space, showing 50 training points, the target line y = x + 5, and the current model line ŷ = 0.50x + 0.00]
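Putting the pieces together, here is a sketch of the descent loop the visualization animates. The learning rate, iteration count, x-range, and noise-free data are all assumptions; only the 50 samples, the target y = x + 5, and the starting model ŷ = 0.50x + 0.00 come from the plot.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 50 training points on the target line y = x + 5
# (noise-free and x in [0, 10]: both assumptions).
x = rng.uniform(0.0, 10.0, size=50)
y = x + 5.0

w, b = 0.5, 0.0   # matches the model line shown in the plot
lr = 0.01         # learning rate (assumed; not given in the visualization)

for _ in range(10_000):
    y_hat = w * x + b                              # forward pass: ŷ = w×x + b
    grad_w = (2.0 / len(x)) * np.sum((y_hat - y) * x)  # ∂Loss/∂w
    grad_b = (2.0 / len(x)) * np.sum(y_hat - y)        # ∂Loss/∂b
    w -= lr * grad_w                               # step against the gradient
    b -= lr * grad_b

# w and b move toward the target values 1 and 5.
```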