Gradient Descent

What does Gradient Descent do?

It minimizes a cost function using its derivatives (the gradient), proceeding in epochs (an epoch is one full pass over the entire training set, used to update each parameter).
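
A minimal sketch of one such procedure, full-batch gradient descent on a mean-squared-error cost for linear regression (the model, cost, and variable names are illustrative assumptions, not from the source):

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    """Minimize the MSE cost with full-batch gradient descent.
    Each epoch uses the entire training set once to update the parameters."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        error = X @ w + b - y
        # Gradients of the MSE cost with respect to w and b
        grad_w = (2.0 / n_samples) * (X.T @ error)
        grad_b = (2.0 / n_samples) * error.sum()
        # Step in the direction of the negative gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```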

Gradient Descent in Neural Networks

Types of Gradient Descent

Quote

From The Hundred-Page Machine Learning Book:

  • Minibatch stochastic gradient descent (minibatch SGD) is a version of the algorithm that speeds up the computation by approximating the gradient using smaller batches (subsets) of the training data.
  • SGD itself has various “upgrades”.
    • Adagrad is a version of SGD that scales α for each parameter according to the history of gradients. As a result, α is reduced for very large gradients and vice-versa.
    • Momentum is a method that helps accelerate SGD by orienting the gradient descent in the relevant direction and reducing oscillations.
    • In neural network training, variants of SGD such as RMSprop and Adam, are very frequently used.

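A minimal sketch of minibatch SGD as described in the quote above: the gradient is approximated on a small random subset of the training data at each step (the MSE cost and helper names are assumptions for illustration):

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10):
    """Approximate the full gradient with small random batches (minibatch SGD)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(n_samples)           # shuffle once per epoch
        for start in range(0, n_samples, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = Xb @ w + b - yb
            # Gradient estimated on the minibatch only
            grad_w = (2.0 / len(batch)) * (Xb.T @ error)
            grad_b = (2.0 / len(batch)) * error.sum()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```
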
Outline of Types

Batch gradient descent vs. Stochastic gradient descent

Mini-batch Gradient Descent

Exponentially weighted averages

(Exponentially weighted moving averages)
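
As a short sketch, assuming the standard recurrence with smoothing factor β (which the source does not spell out), an exponentially weighted moving average of a series x is:

```python
def ewma(xs, beta=0.9):
    """Exponentially weighted moving average: v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    v = 0.0
    out = []
    for x in xs:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return out
```

Bias correction (dividing v_t by 1 - beta**t) is often applied to the first few values; Adam below uses the same idea.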

Momentum

v(t) = γ · v(t-1) + learning rate · gradient
weight(t) = weight(t-1) - v(t)

where v(t) is the "velocity" vector at time t, γ is the momentum coefficient (typically between 0.9 and 0.99), learning rate is the step size, gradient is the gradient of the loss function with respect to the weights, and weight(t) is the weight vector at time t.
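
A direct sketch of the update above (the function and parameter names are assumptions):

```python
def momentum_step(weight, velocity, gradient, lr=0.01, gamma=0.9):
    """One momentum update:
    v(t) = gamma * v(t-1) + lr * gradient
    weight(t) = weight(t-1) - v(t)"""
    velocity = gamma * velocity + lr * gradient
    weight = weight - velocity
    return weight, velocity
```

Because the velocity accumulates past gradients, consecutive steps in the same direction reinforce each other while oscillating components partially cancel.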

RMSprop (Root Mean Square prop)
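
RMSprop scales the step by a running average of squared gradients; a minimal sketch, with typical default hyperparameter values assumed:

```python
import numpy as np

def rmsprop_step(weight, sq_avg, gradient, lr=0.001, beta=0.9, eps=1e-8):
    """RMSprop: keep an exponentially weighted average of squared gradients
    and divide the step by its square root."""
    sq_avg = beta * sq_avg + (1 - beta) * gradient ** 2
    weight = weight - lr * gradient / (np.sqrt(sq_avg) + eps)
    return weight, sq_avg
```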

Adam (Adaptive Moment Estimation)
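
Adam combines a momentum-style first moment with an RMSprop-style second moment, plus bias correction; a minimal sketch using typical default hyperparameters (assumed):

```python
import numpy as np

def adam_step(weight, m, v, gradient, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: bias-corrected exponentially weighted averages of the
    gradient (m) and the squared gradient (v); t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    weight = weight - lr * m_hat / (np.sqrt(v_hat) + eps)
    return weight, m, v
```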

Learning rate decay
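
One common schedule reduces the learning rate as epochs progress; the exact formula varies, and this 1/t-style schedule is an assumption for illustration:

```python
def decayed_learning_rate(initial_lr, decay_rate, epoch):
    """Shrink the learning rate over time: lr = lr0 / (1 + decay_rate * epoch)."""
    return initial_lr / (1 + decay_rate * epoch)
```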