Deep Learning - ANN - Artificial Neural Network - Vanishing & Exploding Gradient Problem Tutorial
Vanishing Gradient Problem
In machine learning, the vanishing gradient problem occurs when training artificial neural networks with gradient-based learning methods and backpropagation.
In such methods, during each iteration of training each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.
This generally occurs in the following cases:
- When a vanishingly small value such as 0.1 is multiplied by itself several times, the result becomes far smaller than 0.1 itself,
i.e. 0.1 x 0.1 x 0.1 x 0.1 = 0.0001
\(W_{new} = W_{old} - \eta \frac{\partial L}{\partial W} \)
Let's assume W is \(W^1_{11}\). Then, by the chain rule,
\(\frac{\partial L}{\partial W^1_{11}} =\frac{\partial L}{\partial Y'}\times \frac{\partial Y'}{\partial Z}\times \frac{\partial Z}{\partial O_{11}}\times\frac{\partial O_{11}}{\partial W^1_{11}}\)
Suppose each of the factors \(\frac{\partial L}{\partial Y'}, \frac{\partial Y'}{\partial Z},\frac{\partial Z}{\partial O_{11}},\frac{\partial O_{11}}{\partial W^1_{11}}\) lies between 0 and 0.5; then their product \(\frac{\partial L}{\partial W^1_{11}}\) will be far smaller still. Because of this, backpropagation barely updates the weight and the model stops training further, i.e. \(W_{new} \approx W_{old}\) (a small numeric sketch follows this list).
- When there are multiple layers in a Deep Neural Network.
- When a saturating activation function such as Sigmoid or Tanh is used, because their derivatives are small (at most 0.25 for sigmoid) and approach 0 once the activation saturates
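To make the multiplication effect concrete, here is a minimal NumPy sketch. The number of layers (10) and the sampled derivative range are assumptions chosen purely for illustration; 0.25 is the maximum value of the sigmoid derivative.

```python
import numpy as np

# Illustrative only: 10 layers and local derivatives in (0.01, 0.25] are
# assumptions; 0.25 is the maximum of the sigmoid derivative.
rng = np.random.default_rng(0)
local_derivatives = rng.uniform(0.01, 0.25, size=10)

# Chain rule: the gradient is the product of the local derivatives.
gradient = np.prod(local_derivatives)
print("local derivatives:", np.round(local_derivatives, 3))
print("resulting gradient:", gradient)  # vanishingly small, so W_new ~ W_old
```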
How to recognize it?
1] Watch the loss - if the loss stops changing across epochs, the vanishing gradient problem may be occurring.
2] Plot the weights against the epoch number - if they stay essentially constant (no changes), the vanishing gradient problem may be occurring.
Vanishing Gradient Problem - Practical
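Below is a minimal Keras sketch of the practical. The synthetic data, the depth of eight sigmoid layers, and the epoch count are assumptions; the point is only to show the symptom described above - the first layer's weights barely move from epoch to epoch.

```python
import numpy as np
from tensorflow import keras

# Synthetic binary-classification data (assumed purely for illustration).
rng = np.random.default_rng(42)
X = rng.random((1000, 2)).astype("float32")
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")

# A deliberately deep, all-sigmoid network - the setup that tends to
# suffer from vanishing gradients.
model = keras.Sequential(
    [keras.Input(shape=(2,))]
    + [keras.layers.Dense(10, activation="sigmoid") for _ in range(8)]
    + [keras.layers.Dense(1, activation="sigmoid")]
)
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

class WeightLogger(keras.callbacks.Callback):
    """Prints the mean absolute weight of the first hidden layer each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        w = self.model.layers[0].get_weights()[0]
        print(f"epoch {epoch}: loss={logs['loss']:.4f}, "
              f"mean |W| of first layer={np.abs(w).mean():.6f}")

model.fit(X, y, epochs=10, batch_size=32, verbose=0, callbacks=[WeightLogger()])
```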
How to handle the Vanishing Gradient Problem (a combined Keras sketch follows this list)-
1] Reduce model complexity by reducing the number of hidden layers
2] Use the ReLU activation function: since ReLU(z) = max(0, z), its derivative is either 0 or 1, so backpropagated gradients are not repeatedly shrunk the way they are with sigmoid/tanh
3] Proper weight initialization using Glorot (Xavier) initialization
4] Batch Normalization layers
5] Residual connections - the special building block used in ResNet
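A hedged Keras sketch combining several of these remedies - ReLU activations with He initialization, plus Batch Normalization. The input size and layer widths are assumptions; note that ReLU pairs naturally with He initialization, while Glorot/Xavier (the Keras default, "glorot_uniform") is the usual choice for sigmoid/tanh layers.

```python
from tensorflow import keras

# Hedged sketch: the input size (20) and layer widths (64) are assumptions.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),  # normalizes activations between layers
    keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```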
Exploding Gradient Problem-
The exploding gradient is the inverse of the vanishing gradient and occurs when there are large error gradients, resulting in very large updates to neural network model weights during training. As a result, the model is unstable and incapable of learning from your training data.
This problem is most commonly seen in recurrent neural networks (RNNs).
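Gradient clipping (listed again under the remedies below) is the usual defence. A minimal sketch assuming a Keras optimizer; the threshold values are arbitrary assumptions:

```python
from tensorflow import keras

# Hedged sketch: clipnorm/clipvalue are standard Keras optimizer arguments;
# the threshold values (1.0 and 0.5) are arbitrary assumptions.
clipped_by_norm = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
clipped_by_value = keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)

# model.compile(optimizer=clipped_by_norm, loss="binary_crossentropy")
```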
How to improve the performance of neural networks? – using Weight initialization, Batch Normalisation, Activation Function, and Optimizer
1] Fine Tuning Neural Network Hyperparameters
No. of Hidden Layers, Neurons per Layer, Learning Rate, Optimizer, Batch Size, Activation Function, Epochs
2] By Solving Problems:-
- Vanishing / Exploding Gradient
- Not Enough Data
- Slow Training
- Overfitting
1] Fine Tuning Neural Network Hyperparameters
- No. of Hidden Layers
Rather than using 1 hidden layer with lots of neurons, it is better to use more hidden layers with fewer neurons each, i.e. a deep neural network.
Why should we use more hidden layers?
- The initial hidden layers capture primitive features such as edges, and the later hidden layers capture more advanced features such as eyes in the case of a human face. Therefore, more hidden layers help the network capture advanced/complex patterns.
- Deeper networks also enable transfer learning: the initial hidden layers can be reused in other ANNs. For example, the primitive-feature layers learned on human faces can be transferred to a monkey-face model too. This reduces training cost, which cannot be achieved with a low number of hidden layers.
How many hidden layers are required?
There is no fixed number, but it should be sufficient.
Whenever overfitting starts, you should stop adding hidden layers.
- Neurons Per Layer
How many neurons per layer are required?
For the input layer, it depends on the number of input features.
For the output layer, it depends on the number of classes for classification problems; for regression, it is 1.
For hidden layers, you can follow any pattern, but the number of neurons should be sufficient.
- Learning Rate
Use learning-rate warm-up / scheduling, i.e. keep the learning rate small for the first few epochs and increase it gradually as training warms up (a scheduler sketch follows this list).
- Optimizer
- Batch Size
With larger batch sizes, warm up the learning rate - i.e. the learning rate will vary at the start of training
- Activation Function
- Epochs
Use early stopping, which halts training when there is no further improvement (see the callback sketch below).
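As a combined sketch for learning-rate warm-up and early stopping, assuming Keras callbacks; the 5-epoch warm-up, base rate of 0.01, and patience of 5 are assumptions, not tuned values.

```python
from tensorflow import keras

# Hedged sketch: warm-up length, base rate, and patience are assumptions.
def warmup_schedule(epoch, lr):
    base_lr = 0.01
    warmup_epochs = 5
    if epoch < warmup_epochs:
        # Increase the learning rate gradually over the first few epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

callbacks = [
    keras.callbacks.LearningRateScheduler(warmup_schedule),
    # Stop training once the validation loss stops improving.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
]

# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=callbacks)
```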
2] By Solving Problems:-
- Vanishing / Exploding Gradient
We can solve this problem using proper weight initialization, a different activation function, batch normalization, or gradient clipping
- Not Enough Data
We can solve this problem using transfer learning or unsupervised pretraining
- Slow Training
We can solve this problem using a better optimizer (e.g. Adam) or a learning-rate scheduler
- Overfitting
We can solve this problem using L1 and L2 regularization, dropout, early stopping, or by reducing model complexity / increasing data (a Dropout + L2 sketch follows this list)
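For the overfitting remedies in particular, a minimal Keras sketch with Dropout and L2 regularization; the dropout rate, L2 factor, input size, and layer widths are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import regularizers

# Hedged sketch: dropout rate (0.3), L2 factor (1e-4), input size (20),
# and layer widths (64) are illustrative assumptions.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=regularizers.l2(1e-4)),
    keras.layers.Dropout(0.3),  # randomly zeroes 30% of activations each step
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=regularizers.l2(1e-4)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```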