Deep Learning - ANN - Artificial Neural Network - Optimizer Tutorial

Optimizers are algorithms or methods used to change the attributes of your neural network, such as the weights and the learning rate, in order to reduce the loss. Optimizers help to get results faster.

A basic type of optimizer technique is:

  • Gradient Descent – Batch, Stochastic, and Mini-Batch

We already have optimizer techniques like gradient descent, so why do we need more optimizer techniques? The reasons are:

1] It is hard to find the right learning rate.

2] One solution is a learning rate schedule, but the problem with a schedule is that it must be defined before training begins, which means it cannot adapt to what actually happens during training.

3] If there are two weight parameters w1 and w2, we cannot define a different learning rate for each weight; the same learning rate applies to all weights.

4] The problem of multiple local minima – there is a chance of getting stuck in a local minimum rather than reaching the global minimum.

5] The problem of saddle points.

 

This is why we need more optimizer techniques, which are given below:

  • Momentum
  • Adagrad
  • NAG
  • RMSProp
  • Adam

Before studying these, we need to learn about EWMA – the exponentially weighted moving average.

 

 

EWMA – exponentially weighted moving average

EWMA is a technique used to find hidden trends in time-series data.

The moving average is designed such that older observations are given lower weights. The weights fall exponentially as the data point gets older – hence the name exponentially weighted.

It is used in time-series forecasting, financial forecasting, signal processing, and deep learning optimizers.

Formula -

\(V_t = \beta V_{t-1} + (1-\beta)\theta_t\)

 

\(V_t\) – exponentially weighted moving average at a particular time t

\(\beta\) (beta) – a constant between 0 and 1. If \(\beta\) = 0.5, then

days = \(\frac{1}{1-\beta}\) = \(\frac{1}{1-0.5}\) = 2 days.

In this case, EWMA acts approximately as a normal average of the previous 2 days.

\(V_{t-1}\) – exponentially weighted moving average at time t-1

\(\theta_t\) – the data value at the current time instance

Index    Temperature (\(\theta\))
D1       25
D2       13
D3       17
D4       31
D5       43

Let \(\beta\) = 0.9 and initialize \(V_0\) = 0 (alternatively, \(V_0\) can be set equal to the first observation \(\theta_0\)). Treating the first value, 25, as \(\theta_0\):

\(V_1 = 0.9 \times V_0 + 0.1 \times 13 = 1.3\)

\(V_2 = 0.9 \times 1.3 + 0.1 \times 17 = 2.87\)
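A minimal Python sketch of this running calculation, using the temperature values from the table above (variable names are illustrative):

# Exponentially weighted moving average (EWMA) on the table values
beta = 0.9
temps = [25, 13, 17, 31, 43]        # theta_0 ... theta_4

v = 0.0                             # V_0 initialized to 0
ewma = []
for theta in temps[1:]:             # start from theta_1, matching the worked example
    v = beta * v + (1 - beta) * theta
    ewma.append(round(v, 4))

print(ewma)                         # [1.3, 2.87, 5.683, 9.4147]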

Mathematical Intuition -

\(V_t = \beta V_{t-1} + (1-\beta)\theta_t\)

\(V_0 = 0\)

\(V_1 = (1-\beta)\theta_1\)

\(V_2 = \beta V_1 + (1-\beta)\theta_2 = \beta(1-\beta)\theta_1 + (1-\beta)\theta_2\)

\(V_3 = \beta V_2 + (1-\beta)\theta_3 = \beta^2(1-\beta)\theta_1 + \beta(1-\beta)\theta_2 + (1-\beta)\theta_3\)

\(V_4 = \beta V_3 + (1-\beta)\theta_4 = \beta^3(1-\beta)\theta_1 + \beta^2(1-\beta)\theta_2 + \beta(1-\beta)\theta_3 + (1-\beta)\theta_4 = (1-\beta)\left[\beta^3\theta_1 + \beta^2\theta_2 + \beta\theta_3 + \theta_4\right]\)

So \(\theta_1\) is multiplied by \(\beta^3\), \(\theta_2\) by \(\beta^2\), and \(\theta_3\) by \(\beta\). Since \(\beta\) lies between 0 and 1,

\(\beta^3 < \beta^2 < \beta\)

which means older observations (\(\theta_1\), then \(\theta_2\), then \(\theta_3\)) are given lower and lower weights.
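Unrolling the recursion for a general time step t gives the closed form

\(V_t = (1-\beta)\sum_{i=1}^{t}\beta^{t-i}\theta_i\)

which makes the exponentially decaying weights explicit.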


 

For different values of beta: the higher the beta, the higher the effective number of days \(\frac{1}{1-\beta}\), so many more old/past points are taken into account, each given less weight compared to new data points.


 

Convex Vs Non-Convex Optimization

SGD with Momentum

The problems faced in non-convex optimization are:

1] High curvature

2] Consistent gradients – the slope changes very slowly

3] Noisy gradients – which can lead to local minima

 

Momentum helps to solve the above problems.

SGD with momentum is a method that helps accelerate the gradient vectors in the right direction based on previous gradients, thus leading to faster convergence.

Momentum with SGD is faster than normal SGD and batch gradient descent.

Real-life example: suppose we have to reach a restaurant, but we do not know the way. If four people on the road all give us the same directions to the restaurant, then we are 100% confident that we are heading in the right direction, and we increase our speed.

If all four of them give different directions, then obviously we are 0% confident about reaching the restaurant, and we decrease our speed.

How does momentum optimization work?

In plain SGD, the update is \(W_{t+1} = W_t - \eta\bigtriangledown W_t\).

With momentum, a velocity term V is introduced:

\(W_{t+1} = W_t - V_t\)

\(V_t = \beta V_{t-1} + \eta\bigtriangledown W_t\)  → the history of velocity is used as the past velocity, and this is what provides the acceleration.

\(0 < \beta < 1\)

When \(\beta\) is 0, momentum acts as normal SGD, i.e. \(V_t = \eta\bigtriangledown W_t\).
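A minimal NumPy sketch of this update rule on a toy quadratic loss (the loss function, learning rate, and beta below are illustrative choices, not from the tutorial):

import numpy as np

def grad(w):                         # toy loss L(w) = w^2, so the gradient is 2w
    return 2 * w

eta, beta = 0.1, 0.9                 # learning rate and momentum coefficient
w, v = np.array([5.0]), np.array([0.0])

for step in range(200):
    v = beta * v + eta * grad(w)     # V_t = beta * V_(t-1) + eta * grad(W_t)
    w = w - v                        # W_(t+1) = W_t - V_t

print(w)                             # converges toward the minimum at w = 0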

 

 

NAG – Nesterov Accelerated Gradient

Momentum may be a good method, but if the momentum is too high the algorithm may overshoot the minimum and keep moving past it. So, to resolve this issue, the NAG algorithm was developed.

NAG is another way of implementing momentum, but with one improvement that reduces the oscillation.

That improvement is -

In momentum, the jump to the next step depends simultaneously on both the history of velocity and the gradient at the current point.

In NAG, the jump to the next step first follows the history of velocity (momentum) until it reaches a look-ahead point; at that look-ahead point it calculates the gradient, and on the basis of that gradient it makes the next jump.

\(W_{la} = W_t - \beta V_{t-1}\)

\(V_t = \beta V_{t-1} + \eta\bigtriangledown W_{la}\)

\(V_t = (W_t - W_{la}) + \eta\bigtriangledown W_{la}\)

\(W_{t+1} = W_t - V_t\)
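A minimal NumPy sketch of the look-ahead update on the same toy quadratic loss (illustrative values, not from the tutorial):

import numpy as np

def grad(w):                              # toy loss L(w) = w^2, gradient 2w
    return 2 * w

eta, beta = 0.1, 0.9
w, v = np.array([5.0]), np.array([0.0])

for step in range(200):
    w_la = w - beta * v                   # look-ahead point W_la = W_t - beta * V_(t-1)
    v = beta * v + eta * grad(w_la)       # gradient is taken at the look-ahead point
    w = w - v                             # W_(t+1) = W_t - V_t

print(w)                                  # converges toward the minimum at w = 0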

Disadvantage – because of the reduction in oscillation, gradient descent with NAG can get stuck in a local minimum.

 

Keras Implementation-

tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.0, nesterov=False, name="SGD", **kwargs
)

For SGD – momentum = 0.0 and nesterov = False

For momentum – momentum = any constant value (e.g. 0.9), nesterov = False

For Nesterov – momentum = any constant value (e.g. 0.9), nesterov = True
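For example, compiling a Keras model with these three variants could look like this (the small model below is just a placeholder):

import tensorflow as tf

# Placeholder model – any Keras model would work here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

sgd = tf.keras.optimizers.SGD(learning_rate=0.01)                               # plain SGD
momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)            # SGD with momentum
nag = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  # Nesterov accelerated gradient

model.compile(optimizer=nag, loss="mse")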

 

 

 

AdaGrad Optimization

Adaptive Gradient or Adagrad is used when we have sparse data.

Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller the updates become.

If the gradient/slope is big, then the effective learning rate becomes small, so the resulting update stays moderate.

If the gradient/slope is small, then the effective learning rate becomes big, so the resulting update is boosted.

In short, in Adagrad for different parameters, we set different learning rates.

Formula-

\(W_{t+1} = W_t - \frac{\eta\bigtriangledown W_t}{\sqrt{V_t +\epsilon}}\)

\(V_t = V_{t-1} + {(\bigtriangledown W_t)}^2\)

where \(V_t\) is the sum of the squared past gradients.

 

\(\epsilon\) (epsilon) – a small number added so that the denominator does not become zero.

\(V_t\) – current magnitude = past magnitude + gradient²
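A minimal NumPy sketch of the AdaGrad update on the same toy quadratic loss (illustrative values, not from the tutorial):

import numpy as np

def grad(w):                               # toy loss L(w) = w^2, gradient 2w
    return 2 * w

eta, eps = 0.5, 1e-8
w, v = np.array([5.0]), np.array([0.0])

for step in range(200):
    g = grad(w)
    v = v + g ** 2                         # V_t = V_(t-1) + (grad)^2 keeps growing
    w = w - eta * g / np.sqrt(v + eps)     # per-parameter scaled update

print(w)                                   # moves toward 0, but more and more slowly as v grows

In Keras, this optimizer is available as tf.keras.optimizers.Adagrad(learning_rate=0.001).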

When to use Adagrad?

1] When the input features are scaled differently

2] When the features are sparse (contain many 0 values)

Sparse features create an elongated bowl problem.

In an elongated bowl, the slope changes along one axis and remains almost constant along the other axis.

 

Disadvantage-

This is why AdaGrad is not generally used in neural networks –

It does not converge fully to the minimum; it only reaches near the minimum after a very large number of epochs.

The reason is that as the number of epochs increases, the \(V_t\) value also keeps increasing and the effective learning rate keeps decreasing, due to which the updates become so small that it does not converge to the global minimum.

 

 

RMSProp

Root Mean Square Propagation – this optimization technique is an improvement over AdaGrad.

The RMSprop optimizer is similar to the gradient descent algorithm with momentum. The RMSprop optimizer restricts the oscillations in the vertical direction. Therefore, we can increase our learning rate and our algorithm could take larger steps in the horizontal direction converging faster.

It solves the problem of AdaGrad, i.e. AdaGrad gets near the global minimum but does not actually converge to it.

Formula-

\(V_t = \beta V_{t-1} + (1 - \beta)(\bigtriangledown W_t)^2\)

\(W_{t+1} = W_t - \frac{\eta\bigtriangledown W_t}{\sqrt{V_t +\epsilon}}\)

where \(\beta\) = 0.95

 

In this, we are giving more weightage to recent points than old values.
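A minimal NumPy sketch of the RMSProp update on the same toy quadratic loss (illustrative values, not from the tutorial):

import numpy as np

def grad(w):                               # toy loss L(w) = w^2, gradient 2w
    return 2 * w

eta, beta, eps = 0.05, 0.95, 1e-8
w, v = np.array([5.0]), np.array([0.0])

for step in range(300):
    g = grad(w)
    v = beta * v + (1 - beta) * g ** 2     # exponentially weighted average of squared gradients
    w = w - eta * g / np.sqrt(v + eps)     # scaled update

print(w)                                   # ends up within roughly eta of the minimum at w = 0

In Keras, this optimizer is available as tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9).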

No Disadvantage

 

 

Adam Optimizer

Adaptive Moment Estimation – it is the most popular optimization technique.

Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models. Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.

Formula-

\(W_{t+1} = W_t - \frac{\eta\times M_t}{\sqrt{V_t +\epsilon}}\)

where

\(M_t = \beta_1 M_{t-1} + (1 - \beta_1)\bigtriangledown W_t\)  → momentum (first moment)

\(V_t = \beta_2 V_{t-1} + (1 - \beta_2)(\bigtriangledown W_t)^2\)  → Adagrad/RMSProp-style scaling (second moment)
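A minimal NumPy sketch of these updates on the same toy quadratic loss (illustrative values; it also includes the bias-correction step that is part of the standard Adam algorithm but is omitted from the simplified formulas above):

import numpy as np

def grad(w):                                   # toy loss L(w) = w^2, gradient 2w
    return 2 * w

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w = np.array([5.0])
m, v = np.array([0.0]), np.array([0.0])

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment (RMSProp-style scaling)
    m_hat = m / (1 - beta1 ** t)               # bias correction (standard Adam)
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)                                       # ends up close to the minimum at w = 0

In Keras, this optimizer is available as tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999).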
