Optimizers explained for training Neural Networks

October 5, 2020

Overview

Training a Deep Learning model (or any machine learning model, in fact) is all about bringing the model's predictions (its outputs) close to the real outputs (the ground truth) for a given set of input-output pairs. Once the model's predictions are close to the real outputs, our job is done.

To measure how close the model's predictions are to the real outputs, we need a mathematical measure known as a loss function. The loss function tells you how well your model is doing at a given step of training.


Though the loss function helps us understand the quality of our model, it doesn't solve the complete problem. We also need an algorithm that updates the coefficients (weights) of the model. This optimization algorithm makes sure that the loss value (on the training data) decreases at each training step, so the model learns from the input-output pairs of the training data.

In this article, we will discuss some common optimization techniques (Optimizers) used in training neural networks (Deep Learning models).


This article will cover the following topics:

  1. Gradient Descent Algorithm
  2. Stochastic Gradient Descent
  3. Momentum
  4. Adagrad
  5. RMSProp
  6. Adadelta
  7. Adam
  8. Summary

Gradient Descent Algorithm

Gradient Descent is a popular algorithm for finding a local minimum of a differentiable function. If the function is convex, that local minimum is also the global minimum.

While training any machine learning (or Deep Learning) model, our objective is to reach the minimum value of the loss. So our optimization algorithm tries to find a local/global minimum of the loss function with respect to the given training data.

The basic intuition behind gradient descent is this: at each point in time, it finds the best local direction to move in and takes a small step in that direction so that the overall loss value decreases. To find the best direction at a given point, it calculates the gradient of the loss function at that point.


The gradient of a differentiable function f at a given point p is the vector of partial derivatives of f evaluated at p; it points in the direction of steepest increase of the function.

To get a lower value of the loss function at each succeeding step, we must therefore move from the current point p in the direction of the negative gradient.

Here is how gradient descent updates the coefficients (weights) of the model using the gradient of the loss function calculated at a given step:

# calculate the gradient of the loss at the current weights
gradient = calculate_gradient_(args)
# step towards the negative gradient
weight_new = weight + learning_rate * (-gradient)
# which simplifies to the usual update rule
weight_new = weight - learning_rate * gradient

Note: calculate_gradient_(args) is a placeholder for a function that calculates the current gradient.

learning_rate decides how fast (i.e., the size of the jump) you move in the locally best direction hinted at by the gradient!

Convergence to the global minimum is guaranteed (with a suitably small learning rate) when the loss function is convex, because for a convex function every local minimum is also a global minimum.
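As a minimal runnable sketch (using a toy quadratic loss f(w) = (w - 3)^2 that stands in for a real loss function), here is gradient descent walking down to the minimum:

# toy convex loss f(w) = (w - 3)^2, whose gradient is df/dw = 2 * (w - 3)
learning_rate = 0.1
w = 0.0  # arbitrary starting point
for step in range(100):
    gradient = 2 * (w - 3)            # gradient of the loss at the current w
    w = w - learning_rate * gradient  # step towards the negative gradient
print(w)  # approximately 3.0, the global minimum of this convex loss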


Stochastic Gradient Descent

Although gradient descent converges reliably, there is a practical problem with this optimization method: it calculates the gradient over the full dataset at every step. For most real-life datasets this is not feasible, both because of their sheer size (the number of examples and their dimensionality) and because of hardware constraints on computing over the full data at once.

To overcome this issue, stochastic gradient descent (also known as 'SGD') comes into the picture. The stochastic gradient descent algorithm works in a similar way, but instead of calculating the gradient over the full dataset, it calculates the gradient on a single training example at a time and updates the parameters (weights) accordingly.

#SGD signature in tf.keras
tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', **kwargs
)

This makes it computationally cheap and extremely fast per update. But this speed comes at a cost: noisier convergence, since individual noisy examples can mislead the update direction.

Mini-batch Stochastic Gradient Descent

To make things better, the mini-batch stochastic gradient descent algorithm calculates the gradient over a mini-batch (a small subset of the training data) instead of a single example. This modification keeps the algorithm computationally feasible, and because the gradient is averaged over several examples, it also makes convergence smoother.
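Here is a small illustrative sketch of mini-batch SGD fitting a toy linear model y = 2x + 1 with NumPy; the synthetic data, batch size, and learning rate are all made up for the example:

import numpy as np

# synthetic data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(1000, 1))
y = 2 * x + 1 + 0.1 * rng.normal(size=(1000, 1))

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(x))                # shuffle the data every epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        error = (w * xb + b) - yb                # predictions minus targets
        grad_w = 2 * np.mean(error * xb)         # gradient of the MSE loss w.r.t. w
        grad_b = 2 * np.mean(error)              # gradient of the MSE loss w.r.t. b
        w -= learning_rate * grad_w              # mini-batch update
        b -= learning_rate * grad_b

print(w, b)  # close to the true values 2 and 1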

SGD (as an optimization algorithm) has become an essential part of machine learning. With small learning rates and a few other modifications, mini-batch SGD converges almost surely to the global minimum when the loss function is convex or pseudo-convex, and to a local minimum otherwise.

SGD is a popular algorithm for optimizing many machine learning models such as Linear Regression, Logistic Regression, and Support Vector Machines (SVMs). Combined with backpropagation, it is also the backbone of training Artificial Neural Networks (Deep Learning models).


Momentum

Consider a situation where the loss surface is not ideal (not strictly convex) and you are at a point where the local gradient points in a misleading direction (based purely on local derivatives). In this scenario, the model might struggle to find the correct direction towards the global minimum.

The same problem can also be caused by a noisy mini-batch: a batch whose gradient pushes the parameters towards poor updates, resulting in slow convergence (or oscillations during convergence).

Momentum is a term borrowed from physics and applied to SGD to counter this issue. The momentum variant of the SGD optimizer remembers the gradients from past training batches and does not rely only on the most recently calculated gradient when updating the parameters. Instead, it blends in the past gradient values to make sure we keep moving in a consistent direction of convergence.

In this way, the momentum variant of the SGD optimizer achieves faster convergence with fewer oscillations. Here is how it updates the parameters (weights) of the model:

# 'velocity' accumulates past gradients, discounted by the 'momentum' coefficient
gradient = calculate_gradient_(args)
velocity_new = momentum * velocity - learning_rate * gradient
weight_new = weight + velocity_new
# which expands to
weight_new = weight + momentum * velocity - learning_rate * gradient
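In tf.keras, this velocity-based update is switched on through the momentum argument of the SGD optimizer shown earlier; the values below are illustrative, not recommendations:

import tensorflow as tf

# SGD with momentum: the optimizer maintains the velocity term internally
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=False)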

Adagrad Optimizer

The plain SGD optimizer works with a single global learning-rate (a user-defined initial value). That means it updates every parameter with the same learning-rate at each step, which can make convergence quite slow.

Adagrad, short for Adaptive Gradient, is a modification of the SGD algorithm. On top of plain SGD, it provides parameter-specific learning rates (whereas the SGD method applies one global learning_rate to all parameters of the model throughout training).

# signature tf.keras
tf.keras.optimizers.Adagrad(
    learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-07,
    name='Adagrad', **kwargs
)

Instead of using one global learning-rate for every weight, the Adagrad algorithm keeps a separate learning rate for each parameter. A parameter's learning-rate is decided based on how much it has been updated so far: parameters that receive frequent, large updates get smaller learning rates, while parameters updated less frequently keep a slightly larger learning rate.

In mathematical terms, the Adagrad algorithm keeps track of the sum of squared gradients for each parameter and divides the learning-rate by the square root of that sum before each new update to the parameter. In this way, parameters that received large updates in the past are updated more slowly in future steps.

Why the squared sum? Because some gradients are positive and some are negative, squaring them prevents the contributions from cancelling out.

# Here is the algorithm
gradient = calculate_gradient_(args)
# keep a running sum of squared gradients (per parameter)
sq_sum_grad = sq_sum_grad + gradient * gradient
# divide the learning rate by the square root of that sum
# (epsilon is a tiny constant that avoids division by zero)
weight_new = weight - (learning_rate / (sq_root(sq_sum_grad) + epsilon)) * gradient

This algorithm is called ‘adaptive gradient’ because it adapts the learning-rates for each parameter based on its past updates.
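As a usage sketch, any of these optimizers can be passed straight to model.compile in tf.keras; the tiny two-layer model below is a hypothetical example, not one discussed in this article:

import tensorflow as tf

# hypothetical toy model, just to show where the optimizer plugs in
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1)
])

# Adagrad with an illustrative learning rate; model.fit() would then train as usual
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01), loss='mse')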


RMSProp Optimizer

The RMSProp (Root Mean Square Propagation) algorithm is again based on stochastic gradient descent (SGD). It is very similar to Adagrad in that it also works with adaptive learning-rates for the parameters.

The main difference is that instead of accumulating the plain squared sum of gradients (as in the Adagrad method), it keeps a discounted (exponentially weighted) average of the squared gradients and uses that to scale down the learning-rate for each parameter.

# signature of RMSProp in tf.keras
tf.keras.optimizers.RMSprop(
    learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-07, centered=False,
    name='RMSprop', **kwargs
)

Here, rho is that discounting factor. Here is how the algorithm works:

# Here is the algorithm
gradient = calculate_gradient_(args)
# keep a discounted (exponentially weighted) average of squared gradients
dis_sq_grad = rho * dis_sq_grad + (1 - rho) * gradient * gradient
# scale the learning rate by the root of that average (epsilon avoids division by zero)
weight_new = weight - (learning_rate / (sq_root(dis_sq_grad) + epsilon)) * gradient

The discounted average is the better option here because the plain squared sum keeps growing over time and therefore penalizes the learning-rate more and more, which makes learning very slow as training progresses.
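To make this contrast concrete, here is a tiny illustrative script that assumes an (artificial) constant gradient of 1.0: Adagrad's accumulated sum grows without bound, so its effective step keeps shrinking, while RMSProp's discounted average settles near 1.0, so its effective step stays close to the base learning rate.

learning_rate, rho = 0.1, 0.9
sq_sum_grad, dis_sq_grad = 0.0, 0.0
gradient = 1.0  # artificially constant gradient, just for illustration

for step in range(1, 101):
    sq_sum_grad += gradient * gradient                                 # Adagrad: unbounded sum
    dis_sq_grad = rho * dis_sq_grad + (1 - rho) * gradient * gradient  # RMSProp: bounded average
    if step in (1, 10, 100):
        print(step,
              round(learning_rate / sq_sum_grad ** 0.5, 4),   # keeps shrinking towards 0
              round(learning_rate / dis_sq_grad ** 0.5, 4))   # settles around 0.1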

RMSProp is a very popular optimizer and converges very quickly in practice.


Adadelta Optimizer

Adadelta is again based on the stochastic gradient descent algorithm and is an improved version of the adaptive gradient (Adagrad) algorithm.

Instead of adapting learning-rates based on all past gradients (as Adagrad does), it uses a decaying average over a restricted window of recent gradients to scale the learning-rate of each parameter. This saves the algorithm from a continual decay of learning-rates throughout training, so the model keeps learning even after many updates/iterations.

Adadelta performs well even when your model takes a very long time to train. Another advantage of the Adadelta algorithm is that you don't even have to provide an initial learning rate (though you can supply your own), since the method adapts its step sizes on its own.
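For reference, the tf.keras signature (as documented for TensorFlow 2.x at the time of writing) looks like this; note that it still exposes a default learning_rate even though the original Adadelta formulation does not require one:

# signature of Adadelta in tf.keras
tf.keras.optimizers.Adadelta(
    learning_rate=0.001, rho=0.95, epsilon=1e-07, name='Adadelta', **kwargs
)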


Adam Optimizer

Similarly, Adam (Adaptive Moment Estimation) is another stochastic-gradient-descent-based optimization algorithm. Just like Adagrad, RMSProp, and Adadelta, it works with adaptive learning-rates for the parameters (weights), and it combines the benefits of both Adagrad and RMSProp.

tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
    name='Adam', **kwargs
)

Instead of tracking only the second moment of the past gradients (the discounted average of squared gradients, as RMSProp does), Adam also maintains the first moment (a decaying average of the gradients themselves, similar to momentum) and uses both to update the weights (parameters) of the model.

It maintains two hyperparameters, beta_1 and beta_2, as the decay rates for the first and second moments respectively. You usually don't have to set these values explicitly, as the defaults tend to work well.
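In the same pseudocode style used above, here is a sketch of Adam's update rule; m and v start at zero, and t is the current step index used for bias correction:

gradient = calculate_gradient_(args)
# 1st moment: decaying average of past gradients (like momentum)
m = beta_1 * m + (1 - beta_1) * gradient
# 2nd moment: decaying average of past squared gradients (like RMSProp)
v = beta_2 * v + (1 - beta_2) * gradient * gradient
# bias correction, since m and v are initialized at zero
m_hat = m / (1 - beta_1 ** t)
v_hat = v / (1 - beta_2 ** t)
# update the weights using both corrected moments
weight_new = weight - learning_rate * m_hat / (sq_root(v_hat) + epsilon)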

Adam has been proved to be very efficient and is widely accepted across different domains/problems/architectures.

Adam is “computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters” (Kingma et al., 2014).

Summary

In this article, we went through multiple optimization algorithms used for training artificial neural network-based models.

We discussed how each of the algorithms provides additional benefits over vanilla stochastic gradient descent algorithm.

All of these optimizers work well in practice; they differ mostly in training time, and their overall average results do not differ much. But if you had to choose a single all-time favourite, Adam is widely considered the most effective optimizer of them all.

Here are a few good things about Adam:

  1. Easy to understand and implement
  2. Lesser memory requirements
  3. Fewer computation requirements
  4. Works well even for big/large-data problems.
  5. Hyperparameters don’t require much tuning as defaults work well.

With this, I would like to end the post here. Thanks for reading!

Do let me know your thoughts/feedback by commenting below. See you in the next article.



References

  1. TensorFlow Documentation: tf.keras Optimizers (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)
  2. Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980
  3. Wikipedia: Gradient Descent
  4. Wikipedia: Stochastic Gradient Descent