Gradient Descent

What does Gradient Descent do?

It minimizes a cost function by following its derivatives (gradients), proceeding in epochs (an epoch is one complete pass over the entire training set, during which each parameter is updated).
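As a rough illustration, here is a minimal batch gradient descent loop for linear regression; the synthetic data, mean-squared-error cost, and hyperparameter values are assumptions chosen only for demonstration.

```python
import numpy as np

# Synthetic linear-regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)          # parameters to learn
learning_rate = 0.1
epochs = 50              # one epoch = one full pass over the training set

for epoch in range(epochs):
    # Gradient of the mean squared error cost with respect to w.
    grad = (2.0 / len(X)) * X.T @ (X @ w - y)
    w -= learning_rate * grad   # update every parameter once per epoch

print(w)  # should approach true_w
```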

Adaptive methods

Adaptive gradient descent methods automatically adjust the learning rate as training progresses, rather than keeping it fixed.
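One common form of adaptation is to scale the step size per parameter by the accumulated squared gradients (an AdaGrad-style rule; this particular scheme is an illustrative assumption, not the only adaptive method, and RMSprop and Adam below refine the same idea).

```python
import numpy as np

def adagrad_step(w, grad, accum, base_lr=0.01, eps=1e-8):
    """AdaGrad-style update: each parameter gets its own effective learning rate."""
    accum = accum + grad ** 2                      # accumulate squared gradients
    effective_lr = base_lr / (np.sqrt(accum) + eps)
    return w - effective_lr * grad, accum

# Toy usage on f(w) = ||w||^2, whose gradient is 2w (assumed example).
w = np.array([5.0, -3.0])
accum = np.zeros_like(w)
for _ in range(1000):
    w, accum = adagrad_step(w, grad=2 * w, accum=accum)
print(w)  # parameters with larger past gradients take smaller steps
```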

Gradient descent in neural networks

Variants of Gradient Descent

Variants Outline:

Batch gradient descent

Stochastic gradient descent (SGD)

Batch gradient descent vs. Stochastic gradient descent

Mini-batch gradient descent

Exponentially weighted averages

(Exponentially weighted moving averages)

Momentum

v(t) = γ · v(t−1) + learning rate · gradient
weight(t) = weight(t−1) − v(t)

where v(t) is the "velocity" vector at time t, γ is the momentum coefficient (typically between 0.9 and 0.99), learning rate is the step size, gradient is the gradient of the loss function with respect to the weights, and weight(t) is the weight vector at time t.
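A direct translation of this update into NumPy might look like the following sketch; the quadratic toy loss and the hyperparameter values are assumptions used only to show the mechanics.

```python
import numpy as np

def momentum_update(weight, velocity, gradient, learning_rate=0.01, gamma=0.9):
    """One momentum step: v(t) = γ·v(t−1) + lr·gradient; weight(t) = weight(t−1) − v(t)."""
    velocity = gamma * velocity + learning_rate * gradient
    weight = weight - velocity
    return weight, velocity

# Toy usage on f(w) = ||w||^2, whose gradient is 2w (assumed example).
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = momentum_update(w, v, gradient=2 * w)
print(w)  # converges toward the minimum at [0, 0]
```

Because the velocity accumulates past gradients, steps along consistently downhill directions grow while oscillating components partially cancel, which is why momentum often speeds up plain gradient descent.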

RMSprop (Root Mean Square Prop)

Adam (Adaptive Moment Estimation)

Learning rate decay