Gradient Descent
Gradient descent is an iterative optimization algorithm used to find the minimum of a cost function (or, more generally, of any differentiable function).
For illustration purposes, the cost function is taken to be half of the mean squared error (MSE), where $i$ indexes the training examples and $j$ indexes the features.
\[J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^2\] For a linear hypothesis $\hat{y}_i=\theta^T x^{(i)}$, the partial derivative with respect to $\theta_j$ is \[\frac{\partial J(\theta)}{\partial\theta_j}=\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i-y_i)\cdot x_j^{(i)}\] When differentiating, the exponent 2 cancels with the $\frac{1}{2}$ factor, which is exactly why the $\frac{1}{2}$ is included: it keeps the derivative clean.
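As a concrete reference, here is a minimal NumPy sketch of the cost and gradient above, assuming a linear hypothesis $\hat{y}=X\theta$; the arrays `X` (shape m×n) and `y` (shape m) and the function names are illustrative, not from the original.

```python
import numpy as np

def cost(theta, X, y):
    """Half-MSE cost J(theta) for a linear hypothesis y_hat = X @ theta."""
    m = len(y)
    residuals = X @ theta - y            # (y_hat_i - y_i) for every example
    return (residuals @ residuals) / (2 * m)

def gradient(theta, X, y):
    """Vector of partial derivatives of J(theta) with respect to each theta_j."""
    m = len(y)
    residuals = X @ theta - y
    return (X.T @ residuals) / m         # the 1/2 has already cancelled the exponent 2
```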
Example: minimize $f(x)=x^2-4x+1$ with learning rate $\alpha=0.1$, starting at $x_0=9$. Each step applies the update $x_{k+1}=x_k-\alpha\,\frac{df}{dx}(x_k)$.
\(\frac{df(x)}{dx}=2x-4\) \(x_1=9-0.1\cdot(2\cdot9-4)=7.6\) \(x_2=7.6-0.1\cdot(2\cdot7.6-4)=6.48\) \(x_3=6.48-0.1\cdot(2\cdot6.48-4)=5.584\), moving toward the minimum at $x=2$.
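The same three updates can be checked in a few lines of Python; this is just the update rule applied in a loop, with nothing assumed beyond the example above.

```python
# Reproduce the worked example: f(x) = x^2 - 4x + 1, alpha = 0.1, x_0 = 9.
def f_prime(x):
    return 2 * x - 4

x, alpha = 9.0, 0.1
for step in range(1, 4):
    x = x - alpha * f_prime(x)
    print(f"x_{step} = {x:.4f}")   # 7.6000, 6.4800, 5.5840, approaching the minimum at x = 2
```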
Three commonly used variants of gradient descent:
- batch gradient descent: parameters are updated once per pass over the dataset, after the gradient has been aggregated across all examples. \(\theta_j:=\theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)\)
- stochastic gradient descent (SGD): parameters are updated for each individual example. \(\theta_j:=\theta_j-\alpha(\hat{y}_i-y_i)x_j^{(i)}\)
- mini-batch gradient descent: a compromise between batch gradient descent and SGD; parameters are updated after each small batch of examples (see the sketch after this list).
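To make the three update schedules concrete, here is a minimal NumPy sketch in which a single `batch_size` parameter selects the variant; the function names, the default hyperparameters, and the arrays `X` and `y` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def grad(theta, X, y):
    """Gradient of the half-MSE cost over whichever examples are passed in."""
    return X.T @ (X @ theta - y) / len(y)

def train(X, y, alpha=0.01, epochs=100, batch_size=None):
    """batch_size=None -> batch GD, 1 -> SGD, any other k -> mini-batch of size k."""
    m, n = X.shape
    theta = np.zeros(n)
    size = m if batch_size is None else batch_size
    for _ in range(epochs):
        order = np.random.permutation(m)          # shuffle the examples each epoch
        for start in range(0, m, size):
            idx = order[start:start + size]
            theta -= alpha * grad(theta, X[idx], y[idx])
    return theta
```

With this sketch, `train(X, y)` performs batch gradient descent, `train(X, y, batch_size=1)` performs SGD, and an intermediate value such as `batch_size=32` gives mini-batch gradient descent.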