Deep Learning - ANN - Artificial Neural Network - Vanishing & Exploding Gradient Problem Tutorial
Vanishing Gradient Problem
In machine learning, the vanishing gradient problem occurs when training artificial neural networks with gradient-based learning methods and backpropagation.
In such methods, during each iteration of training each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.
This generally occurs in the following cases:
- When a vanishingly small value such as 0.1 is multiplied by itself several times, the result becomes far smaller than 0.1 itself,
i.e. 0.1 x 0.1 x 0.1 x 0.1 = 0.0001
\(W_{new} = W_{old} - \eta \frac{\partial L}{\partial W} \)
Let's assume W is \(W^1_{11}\). Then, by the chain rule,
\(\frac{\partial L}{\partial W^1_{11}} =\frac{\partial L}{\partial Y'}\times \frac{\partial Y'}{\partial Z}\times \frac{\partial Z}{\partial O_{11}}\times\frac{\partial O_{11}}{\partial W^1_{11}}\)
Suppose each of the factors \(\frac{\partial L}{\partial Y'}, \frac{\partial Y'}{\partial Z},\frac{\partial Z}{\partial O_{11}},\frac{\partial O_{11}}{\partial W^1_{11}}\) lies between 0 and 0.5; then their product \(\frac{\partial L}{\partial W^1_{11}}\) will be far smaller still. Because of this, backpropagation barely updates the weight and the model stops training further, i.e. \(W_{new} \approx W_{old}\) (a small numeric sketch follows this list).
- When there are multiple layers in a Deep Neural Network.
- When a saturating activation function such as Sigmoid or Tanh is used, because their derivatives are small (at most 0.25 for sigmoid) and approach 0 once the activation saturates
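To make the multiplication effect concrete, here is a minimal NumPy sketch. The number of layers (10) and the sampled derivative range are assumptions chosen purely for illustration; 0.25 is the maximum value of the sigmoid derivative.

```python
import numpy as np

# Illustrative only: 10 layers and local derivatives in (0.01, 0.25] are
# assumptions; 0.25 is the maximum of the sigmoid derivative.
rng = np.random.default_rng(0)
local_derivatives = rng.uniform(0.01, 0.25, size=10)

# Chain rule: the gradient is the product of the local derivatives.
gradient = np.prod(local_derivatives)
print("local derivatives:", np.round(local_derivatives, 3))
print("resulting gradient:", gradient)  # vanishingly small, so W_new ~ W_old
```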
How to recognize it?
1] Watch the loss - if the loss stops changing across epochs, the vanishing gradient problem may be occurring.
2] Plot the weights against the epoch number - if they stay essentially constant (no changes), the vanishing gradient problem may be occurring.
Vanishing Gradient Problem - Practical
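Below is a minimal Keras sketch of the practical. The synthetic data, the depth of eight sigmoid layers, and the epoch count are assumptions; the point is only to show the symptom described above - the first layer's weights barely move from epoch to epoch.

```python
import numpy as np
from tensorflow import keras

# Synthetic binary-classification data (assumed purely for illustration).
rng = np.random.default_rng(42)
X = rng.random((1000, 2)).astype("float32")
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")

# A deliberately deep, all-sigmoid network - the setup that tends to
# suffer from vanishing gradients.
model = keras.Sequential(
    [keras.Input(shape=(2,))]
    + [keras.layers.Dense(10, activation="sigmoid") for _ in range(8)]
    + [keras.layers.Dense(1, activation="sigmoid")]
)
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

class WeightLogger(keras.callbacks.Callback):
    """Prints the mean absolute weight of the first hidden layer each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        w = self.model.layers[0].get_weights()[0]
        print(f"epoch {epoch}: loss={logs['loss']:.4f}, "
              f"mean |W| of first layer={np.abs(w).mean():.6f}")

model.fit(X, y, epochs=10, batch_size=32, verbose=0, callbacks=[WeightLogger()])
```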
How to handle the Vanishing Gradient Problem (a combined Keras sketch follows this list)-
1] Reduce model complexity by reducing the number of hidden layers
2] Use the ReLU activation function: since ReLU(z) = max(0, z), its derivative is either 0 or 1, so backpropagated gradients are not repeatedly shrunk the way they are with sigmoid/tanh
3] Proper weight initialization using Glorot (Xavier) initialization
4] Batch Normalization layers
5] Residual connections - the special building block used in ResNet
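A hedged Keras sketch combining several of these remedies - ReLU activations with He initialization, plus Batch Normalization. The input size and layer widths are assumptions; note that ReLU pairs naturally with He initialization, while Glorot/Xavier (the Keras default, "glorot_uniform") is the usual choice for sigmoid/tanh layers.

```python
from tensorflow import keras

# Hedged sketch: the input size (20) and layer widths (64) are assumptions.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),  # normalizes activations between layers
    keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```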
Exploding Gradient Problem-
The exploding gradient is the inverse of the vanishing gradient and occurs when there are large error gradients, resulting in very large updates to neural network model weights during training. As a result, the model is unstable and incapable of learning from your training data.
This problem is most commonly seen in recurrent neural networks (RNNs).
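Gradient clipping (listed again under the remedies below) is the usual defence. A minimal sketch assuming a Keras optimizer; the threshold values are arbitrary assumptions:

```python
from tensorflow import keras

# Hedged sketch: clipnorm/clipvalue are standard Keras optimizer arguments;
# the threshold values (1.0 and 0.5) are arbitrary assumptions.
clipped_by_norm = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
clipped_by_value = keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)

# model.compile(optimizer=clipped_by_norm, loss="binary_crossentropy")
```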
How to improve the performance of neural networks? – using Weight initialization, Batch Normalisation, Activation Function, and Optimizer
1] Fine Tuning Neural Network Hyperparameters
No. of Hidden Layers, Neurons per Layer, Learning Rate, Optimizer, Batch Size, Activation Function, Epochs
2] By Solving Problems:-
- Vanishing / Exploding Gradient
- Not Enough Data
- Slow Training
- Overfitting
1] Fine Tuning Neural Network Hyperparameters
- No. of Hidden Layers
Rather than using 1 hidden layer with lots of neurons, it is better to use more hidden layers with fewer neurons each, i.e. a deep neural network.
Why should we use more hidden layers?
- The initial hidden layers capture primitive features such as edges, and the later hidden layers capture more advanced features such as eyes in the case of a human face. Therefore, more hidden layers help the network capture advanced/complex patterns.
- Deeper networks also enable transfer learning: the initial hidden layers can be reused in other ANNs. For example, the primitive-feature layers learned on human faces can be transferred to a monkey-face model too. This reduces training cost, which cannot be achieved with a low number of hidden layers.
How many hidden layers are required?
There is no fixed number, but it should be sufficient.
Whenever overfitting starts, you should stop adding hidden layers.
- Neurons Per Layer
How many neurons per layer are required?
For the input layer, it depends on the number of input features.
For the output layer, it depends on the number of classes for classification problems; for regression, it is 1.
For hidden layers, you can follow any pattern, but the number of neurons should be sufficient.
- Learning Rate
Use learning-rate warm-up / scheduling, i.e. keep the learning rate small for the first few epochs and increase it gradually as training warms up (a scheduler sketch follows this list).
- Optimizer
- Batch Size
With larger batch sizes, warm up the learning rate - i.e. the learning rate will vary at the start of training
- Activation Function
- Epochs
Use early stopping, which halts training when there is no further improvement (see the callback sketch below).
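As a combined sketch for learning-rate warm-up and early stopping, assuming Keras callbacks; the 5-epoch warm-up, base rate of 0.01, and patience of 5 are assumptions, not tuned values.

```python
from tensorflow import keras

# Hedged sketch: warm-up length, base rate, and patience are assumptions.
def warmup_schedule(epoch, lr):
    base_lr = 0.01
    warmup_epochs = 5
    if epoch < warmup_epochs:
        # Increase the learning rate gradually over the first few epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

callbacks = [
    keras.callbacks.LearningRateScheduler(warmup_schedule),
    # Stop training once the validation loss stops improving.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
]

# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=callbacks)
```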
2] By Solving Problems:-
- Vanishing / Exploding Gradient
We can solve this problem using proper weight initialization, a different activation function, batch normalization, or gradient clipping
- Not Enough Data
We can solve this problem using transfer learning or unsupervised pretraining
- Slow Training
We can solve this problem using a better optimizer (e.g. Adam) or a learning-rate scheduler
- Overfitting
We can solve this problem using L1 and L2 regularization, dropout, early stopping, or by reducing model complexity / increasing data (a Dropout + L2 sketch follows this list)
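For the overfitting remedies in particular, a minimal Keras sketch with Dropout and L2 regularization; the dropout rate, L2 factor, input size, and layer widths are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import regularizers

# Hedged sketch: dropout rate (0.3), L2 factor (1e-4), input size (20),
# and layer widths (64) are illustrative assumptions.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=regularizers.l2(1e-4)),
    keras.layers.Dropout(0.3),  # randomly zeroes 30% of activations each step
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=regularizers.l2(1e-4)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```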