Deep Learning - ANN - Artificial Neural Network - Optimizer Tutorial
Optimizers are algorithms or methods used to change the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss. Optimizers also help to get results faster.
The basic optimizer technique is -
- Gradient Descent – Batch, Stochastic and Mini-Batch
We already have optimizer techniques like gradient descent, so why do we need more optimizer techniques? The reasons are -
1] It is hard to find the right learning rate.
2] One solution is a learning rate schedule, but the problem with a schedule is that it must be defined before training, which means it cannot adapt to what actually happens during training.
3] If there are two weight parameters w1 and w2, we cannot define a different learning rate for each weight - a restriction.
4] Problem of multiple local minima - there is a chance of getting stuck in a local minimum rather than reaching the global minimum.
5] Problem of saddle point
This is why we need more optimizer techniques, which are given below -
- Momentum
- Adagrad
- NAG
- RMSProp
- Adam
Before that, we need to learn about EWMA - the exponentially weighted moving average.
EWMA – exponentially weighted moving average
EWMA is a technique used to find hidden trends in time-series data.
The moving average is designed such that older observations are given lower weights. The weights fall exponentially as the data point gets older – hence the name is exponentially weighted.
It is used in time series forecasting, financial forecasting, signal processing, and deep learning optimizers.
Formula -
\(V_t = \beta V_{t-1} + (1-\beta)\theta_t\)
\(V_t\) – exponentially weighted moving average at a particular time t
ß(beta) – constant between 0 and 1. If beta = 0.5, then
days = \(\frac{1}{1-\beta}\) = \(\frac{1}{1-0.5}\) = 2 days.
In this case, EWMA will act as a normal average of the previous 2 days.
\(V_{t-1}\) – exponentially weighted moving average at time t-1
\(\theta_t\) – data value at the current time instance t
| index | temp (\(\theta\)) |
| --- | --- |
| D1 | 25 |
| D2 | 13 |
| D3 | 17 |
| D4 | 31 |
| D5 | 43 |
Let \(\beta\) = 0.9 and index the readings from 0, so \(\theta_0\) = 25 (D1), \(\theta_1\) = 13 (D2), \(\theta_2\) = 17 (D3), and so on.
Initialize \(V_0\) = 0 (alternatively, \(V_0\) can be set to \(\theta_0\)).
\(V_1\) = 0.9 × 0 + 0.1 × 13 = 1.3
\(V_2\) = 0.9 × 1.3 + 0.1 × 17 = 2.87
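As a quick sanity check, here is a minimal Python sketch (assuming the same indexing and \(V_0\) = 0 as above; the variable names are just for illustration):

temps = [25, 13, 17, 31, 43]        # D1..D5 from the table above
beta = 0.9

v = 0.0                             # V0 = 0
ewma = []
for theta in temps[1:]:             # start from theta_1 = 13, as in the hand calculation
    v = beta * v + (1 - beta) * theta
    ewma.append(round(v, 2))

print(ewma[:2])                     # [1.3, 2.87] - matches V1 and V2 above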
Mathematical Intuition -
\(V_t = \beta V_{t-1} + (1-\beta)\theta_t\)
\(V_0 = 0\)
\(V_1 = (1-\beta)\theta_1\)
\(V_2 = \beta V_1 + (1-\beta)\theta_2 = \beta(1-\beta)\theta_1 + (1-\beta)\theta_2\)
\(V_3 = \beta V_2 + (1-\beta)\theta_3 = \beta^2(1-\beta)\theta_1 + \beta(1-\beta)\theta_2 + (1-\beta)\theta_3\)
\(V_4 = \beta V_3 + (1-\beta)\theta_4 = \beta^3(1-\beta)\theta_1 + \beta^2(1-\beta)\theta_2 + \beta(1-\beta)\theta_3 + (1-\beta)\theta_4 = (1-\beta)[\beta^3\theta_1 + \beta^2\theta_2 + \beta\theta_3 + \theta_4]\)
\(\beta^3\) multiplies \(\theta_1\), \(\beta^2\) multiplies \(\theta_2\), and \(\beta\) multiplies \(\theta_3\), and since \(\beta\) lies between 0 and 1,
\(\beta^3 < \beta^2 < \beta\)
so the older observations \(\theta_1\), \(\theta_2\), \(\theta_3\) are given progressively lower weights.
Exponentially Weighted Moving Average
For different values of beta: the higher the beta, the larger the effective number of days \(\frac{1}{1-\beta}\), so more past points contribute to the average, with old points given less weight than new data points.
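As an illustration (values chosen arbitrarily), the weight given to a point k steps in the past is \((1-\beta)\beta^k\); printing these weights for two betas shows the effect:

# weight given to a data point k steps in the past is (1 - beta) * beta**k
for beta in (0.5, 0.9):
    days = round(1 / (1 - beta))                                  # approximate averaging window
    weights = [round((1 - beta) * beta ** k, 3) for k in range(5)]
    print(beta, days, weights)
# beta = 0.5 -> ~2 days,  weights [0.5, 0.25, 0.125, 0.062, 0.031]
# beta = 0.9 -> ~10 days, weights [0.1, 0.09, 0.081, 0.073, 0.066]

A higher beta spreads the weight over many more past points, each individually smaller than with a low beta.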
Convex Vs Non-Convex Optimization
SGD with Momentum
The problems faced in non-convex optimization are -
1] High curvature
2] Consistent gradients – the slope changes very slowly
3] Noisy gradients – local minima
Momentum helps to solve the above problems.
SGD with momentum is a method that helps accelerate the gradient vectors in the right direction based on previous gradients, thus leading to faster convergence.
Momentum with SGD is faster than normal SGD and batch gradient descent.
Real-life example - suppose we have to reach a restaurant but we do not know the way, and on the road four people give us the same directions to the restaurant; then we are 100% confident that we are heading in the right direction, and we increase our speed.
If all four of them give different directions, then obviously we are 0% confident about reaching the restaurant, and we decrease our speed.
How does momentum optimization work?
Without momentum, the plain SGD update is \(W_{t+1} = W_t - \eta \bigtriangledown W_t\)
V → Velocity
\(W_{t+1} = W_t - V_t\)
\(V_t = \beta V_{t-1} + \eta \bigtriangledown W_t\) → the history of velocity is used as the past velocity; this is what provides the acceleration.
\(0 < \beta < 1\)
When \(\beta\) is 0, momentum acts as normal SGD, i.e. \(V_t = \eta \bigtriangledown W_t\).
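A minimal sketch of this update rule in plain Python on a toy one-dimensional loss (the loss, learning rate, and number of steps are illustrative, not from the tutorial):

def grad(w):                       # gradient of the toy loss L(w) = w**2
    return 2 * w

w, v = 5.0, 0.0                    # initial weight and velocity
eta, beta = 0.1, 0.9               # learning rate and momentum coefficient

for _ in range(200):
    v = beta * v + eta * grad(w)   # Vt = beta * Vt-1 + eta * grad(Wt)
    w = w - v                      # Wt+1 = Wt - Vt

print(round(w, 4))                 # ends up close to the minimum at w = 0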
NAG – Nesterov Accelerated Gradient
Momentum is a good method, but if the momentum is too high the algorithm may overshoot the minima and keep moving past them. To resolve this issue, the NAG algorithm was developed.
NAG implements momentum, but with one improvement to reduce the oscillation.
That improvement is -
In momentum, the jump to the next step depends simultaneously on both the history of velocity and the gradient at the current point.
In NAG, the jump is split in two: the weight first moves using only the history of velocity (momentum) to reach a look-ahead point; the gradient is then calculated at that look-ahead point, and on the basis of that gradient the next jump is made.
\(W_{la} = W_t - \beta V_{t-1}\)
\(V_t = \beta V_{t-1} + \eta \bigtriangledown W_{la}\)
\(V_t = (W_t - W_{la}) + \eta \bigtriangledown W_{la}\)
\(W_{t+1} = W_t - V_t\)
Disadvantage – because the oscillation is reduced, gradient descent can get stuck in a local minimum.
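A minimal sketch of the look-ahead update on the same toy loss as the momentum example (again, illustrative values only):

def grad(w):                          # gradient of the toy loss L(w) = w**2
    return 2 * w

w, v = 5.0, 0.0
eta, beta = 0.1, 0.9

for _ in range(200):
    w_la = w - beta * v               # look-ahead point: Wla = Wt - beta * Vt-1
    v = beta * v + eta * grad(w_la)   # gradient is evaluated at the look-ahead point
    w = w - v                         # Wt+1 = Wt - Vt

print(round(w, 4))                    # converges toward w = 0 with less oscillation than plain momentum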
Keras Implementation-
tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.0, nesterov=False, name="SGD", **kwargs
)
For plain SGD – momentum=0.0 and nesterov=False
For momentum – momentum=any constant value (e.g. 0.9), nesterov=False
For Nesterov – momentum=any constant value (e.g. 0.9), nesterov=True
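For example, a minimal usage sketch with the Nesterov variant (the model here is just a placeholder):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),        # 4 input features, chosen arbitrarily for the example
    tf.keras.layers.Dense(1),
])
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=opt, loss="mse")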
AdaGrad Optimization
Adaptive Gradient or Adagrad is used when we have sparse data.
Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller its updates become.
If the accumulated gradient/slope for a parameter is large, its effective learning rate becomes small, so its update is small.
If the accumulated gradient/slope for a parameter is small, its effective learning rate stays large, so its update is relatively larger.
In short, in Adagrad for different parameters, we set different learning rates.
Formula-
Wt+1 = Wt - \(\frac{\eta\bigtriangledown W_t}{\sqrt{V_t +\epsilon}}\)
\(V_t = V_{t-1} + (\bigtriangledown W_t)^2\)
where \(V_t\) is the running sum of squared past gradients
\(\epsilon\) - a small number so that the denominator does not become zero
\(V_t\) - current magnitude = past magnitude + gradient\(^2\)
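A minimal NumPy sketch of this parameter-wise update on a toy two-parameter loss (the loss and all values are illustrative), showing that each weight gets its own effective learning rate:

import numpy as np

def grad(w):                            # gradient of the toy loss L(w) = w1**2 + 100 * w2**2
    return np.array([2 * w[0], 200 * w[1]])

w = np.array([5.0, 5.0])
v = np.zeros(2)                         # per-parameter running sum of squared gradients
eta, eps = 1.0, 1e-8

for _ in range(100):
    g = grad(w)
    v = v + g ** 2                      # Vt = Vt-1 + grad**2
    w = w - eta * g / np.sqrt(v + eps)  # each weight gets its own effective learning rate eta / sqrt(Vt)

print(w.round(4))                       # both weights shrink at a similar rate despite very different curvatures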
When to use Adagrad?
1] When input features scale differently
2] When the features are sparse (contain many 0 values)
Sparse features create an elongated bowl problem.
In an elongated bowl, the slope changes along one axis and remains almost constant along the other.
Disadvantage-
This is why AdaGrad is not commonly used in neural networks -
It does not converge fully to the minimum; it only gets near the minimum even after a large number of epochs.
The reason is that as the number of epochs increases, the \(V_t\) value keeps growing and the effective learning rate keeps decreasing, due to which the updates become so small that it never converges to the global minimum.
RMSProp
Root mean square propagation - this optimization technique is an improvement over AdaGrad.
The RMSprop optimizer is similar to the gradient descent algorithm with momentum. The RMSprop optimizer restricts the oscillations in the vertical direction. Therefore, we can increase our learning rate and our algorithm could take larger steps in the horizontal direction converging faster.
It solves the problem of AdaGrad, i.e. AdaGrad gets near the global minimum but never fully converges to it.
Formula-
\(V_t = \beta V_{t-1} + (1-\beta)(\bigtriangledown W_t)^2\)
Wt+1 = Wt - \(\frac{\eta\bigtriangledown W_t}{\sqrt{V_t +\epsilon}}\)
where ß = 0.95
In this, we are giving more weightage to recent points than old values.
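A minimal sketch (same toy setup as the AdaGrad example above; values are illustrative) where the only change is the exponentially weighted accumulator:

import numpy as np

def grad(w):                                # gradient of the toy loss L(w) = w1**2 + 100 * w2**2
    return np.array([2 * w[0], 200 * w[1]])

w = np.array([5.0, 5.0])
v = np.zeros(2)
eta, beta, eps = 0.01, 0.95, 1e-8

for _ in range(1000):
    g = grad(w)
    v = beta * v + (1 - beta) * g ** 2      # EWMA of squared gradients instead of an ever-growing sum
    w = w - eta * g / np.sqrt(v + eps)

print(w.round(4))                           # both weights end up near the minimum at (0, 0)

Because \(V_t\) no longer grows without bound, the effective learning rate does not shrink to zero, so the updates keep making progress toward the minimum.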
In practice, RMSProp has no major disadvantage.
Adam Optimizer
Adaptive Moment Estimation - It is the most popular optimization technique
Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models. Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.
Formula-
Wt+1 = Wt - \(\frac{\eta\times M_t}{\sqrt{V_t +\epsilon}}\)
where
\(M_t = \beta_1 M_{t-1} + (1-\beta_1)\bigtriangledown W_t\) → momentum
\(V_t = \beta_2 V_{t-1} + (1-\beta_2)(\bigtriangledown W_t)^2\) → adaptive learning rate (AdaGrad/RMSProp)
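A minimal NumPy sketch of these updates on the same toy loss as before (note: the standard Adam algorithm additionally applies bias correction to \(M_t\) and \(V_t\), which is omitted here to mirror the formulas above; all values are illustrative):

import numpy as np

def grad(w):                                   # gradient of the toy loss L(w) = w1**2 + 100 * w2**2
    return np.array([2 * w[0], 200 * w[1]])

w = np.array([5.0, 5.0])
m = np.zeros(2)                                # Mt - momentum term
v = np.zeros(2)                                # Vt - adaptive learning-rate term
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for _ in range(2000):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g            # Mt = beta1 * Mt-1 + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * g ** 2       # Vt = beta2 * Vt-1 + (1 - beta2) * grad**2
    w = w - eta * m / np.sqrt(v + eps)         # Wt+1 = Wt - eta * Mt / sqrt(Vt + eps)

print(w.round(4))                              # both weights end up close to the minimum at (0, 0)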