Deep Learning - ANN - Artificial Neural Network - Vanishing & Exploding Gradient Problem Tutorial

Vanishing Gradient Problem

In machine learning, the vanishing gradient problem occurs when training artificial neural networks with gradient-based learning methods and backpropagation.

In such methods, during each iteration of training, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to that weight. The problem is that in some cases the gradient becomes vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from training further.

This generally occurs in the following cases:

  • When many vanishingly small values such as 0.1 are multiplied together, the result is far smaller than any individual factor,

i.e. 0.1 x 0.1 x 0.1 x 0.1 = 0.0001

\(W_{new} = W_{old} - \eta \frac{\delta L}{\delta W} \)

Let's take W to be \(W^1_{11}\). Then,

\(\frac{\delta L}{\delta W^1_{11}} =\frac{\delta L}{\delta Y'}\times \frac{\delta Y'}{\delta Z}\times \frac{\delta Z}{\delta O_{11}}\times\frac{\delta O_{11}}{\delta W^1_{11}}\)

Suppose each of the factors \(\frac{\delta L}{\delta Y'}, \frac{\delta Y'}{\delta Z},\frac{\delta Z}{\delta O_{11}},\frac{\delta O_{11}}{\delta W^1_{11}}\) lies between 0 and 0.5; then their product \(\frac{\delta L}{\delta W^1_{11}}\) will be much smaller still. As a result, backpropagation barely updates the weight and the model stops training further, i.e. \(W_{new} \approx W_{old}\).

  • When there are many layers in a Deep Neural Network.
  • When a saturating activation function such as Sigmoid or Tanh is used, whose derivative is small over most of its range.
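The compounding effect of small derivatives across many sigmoid layers can be sketched with NumPy. This is a toy illustration (the layer count and random pre-activations are made up): the sigmoid's derivative never exceeds 0.25, so chaining it through 10 layers shrinks the gradient dramatically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # maximum value is 0.25, at z = 0

np.random.seed(0)
grad = 1.0
for layer in range(10):        # 10 stacked sigmoid layers
    z = np.random.randn()      # pre-activation at this layer
    grad *= sigmoid_deriv(z)   # chain rule: multiply the local derivative

print(grad)  # a vanishingly small number
```

Each factor is below 0.25, so after 10 layers the gradient is below 0.25^10 ≈ 1e-6, which is why the early layers barely learn.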

How to recognize it?

1] Watch the loss - if it stops changing across epochs, the Vanishing Gradient Problem may be occurring.

2] Plot the weights against epochs - if they stay constant (no changes), the Vanishing Gradient Problem may be occurring.
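The second check above can be sketched in plain Python. The function name and tolerance are mine, and the weight traces are invented; the idea is simply to test whether a weight's per-epoch snapshots have effectively stopped moving.

```python
def weights_stalled(weight_history, tol=1e-8):
    """Given per-epoch snapshots of a weight (floats),
    return True if the weight has effectively stopped changing."""
    changes = [abs(b - a) for a, b in zip(weight_history, weight_history[1:])]
    return all(c < tol for c in changes)

# A weight that barely moves -> possible vanishing gradient
stalled = weights_stalled([0.5, 0.5 + 1e-10, 0.5 + 2e-10])
# A weight that keeps updating -> training looks healthy
healthy = weights_stalled([0.5, 0.48, 0.44])
print(stalled, healthy)  # True False
```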


How to handle the Vanishing Gradient Problem-

1] Reduce model complexity by reducing the number of hidden layers.

2] Use the ReLU activation function: since ReLU(z) = max(0, z), its derivative is either 0 or 1, so it does not shrink gradients the way Sigmoid does.

3] Use proper weight initialization, such as Glorot (Xavier) initialization.

4] Use Batch Normalization layers.

5] Use Residual Networks (ResNet) - a special skip-connection building block for deep networks.
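Point 2 above hinges on the derivatives involved: the derivative of ReLU is exactly 0 or 1, while the sigmoid's derivative never exceeds 0.25. A quick check with a few toy pre-activation values:

```python
import math

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_deriv(z):
    # derivative of max(0, z): 0 for negative inputs, 1 for positive
    return 1.0 if z > 0 else 0.0

zs = [-2.0, -0.5, 0.3, 2.0]
relu_grads = [relu_deriv(z) for z in zs]
sig_grads = [sigmoid_deriv(z) for z in zs]
print(relu_grads)      # each value is exactly 0 or 1
print(max(sig_grads))  # never exceeds 0.25
```

Because ReLU passes a gradient factor of 1 for every active neuron, deep chains of ReLU layers do not systematically shrink the gradient.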

 

Exploding Gradient Problem-

The exploding gradient problem is the inverse of the vanishing gradient problem: it occurs when error gradients are large, resulting in very large updates to the neural network's weights during training. As a result, the model becomes unstable and incapable of learning from the training data.

The exploding gradient problem is the opposite of the vanishing gradient problem and is most commonly seen in RNNs.
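A standard remedy for exploding gradients is gradient clipping: rescale the gradient vector whenever its norm exceeds a threshold. A minimal NumPy sketch (the threshold and gradient values are arbitrary):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm does not exceed max_norm,
    preserving its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploded = np.array([30.0, 40.0])          # norm = 50, far too large
clipped = clip_by_norm(exploded, max_norm=1.0)
print(clipped, np.linalg.norm(clipped))    # same direction, norm 1.0
```

In Keras this idea is exposed via the optimizer's `clipnorm`/`clipvalue` arguments, so you rarely need to implement it by hand.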

 

How to improve the performance of neural networks? - using weight initialization, batch normalization, activation functions, and optimizers

1] Fine Tuning Neural Network Hyperparameters

Number of hidden layers, neurons per layer, learning rate, optimizer, batch size, activation function, epochs

2] By Solving Problems:-

  • Vanishing / Exploding Gradient
  • Not Enough Data
  • Slow Training
  • Overfitting

 

 

1] Fine Tuning Neural Network Hyperparameters

  • Number of Hidden Layers

Rather than using 1 hidden layer with lots of neurons, it is better to use more hidden layers with fewer neurons each, i.e. a deep neural network.

Why should we use more hidden layers?

  • The initial hidden layers capture primitive features such as edges, while the later hidden layers capture more advanced features such as eyes in the case of a human face. Therefore, more hidden layers let the network capture advanced/complex structure in the data.
  • Deeper networks also enable transfer learning, where the initial hidden layers can be reused in other ANNs. For example, the primitive-feature layers trained on human faces can be transferred to a monkey-face model too. This reduces training cost, which cannot be achieved with a small number of hidden layers.

How many hidden layers are required?

There is no fixed number, but it should be sufficient for the task.

Once overfitting starts, you should stop adding hidden layers.

  • Neurons Per Layer

How many neurons per layer are required?

For the input layer, it depends on the number of input features.

For the output layer, it depends on the number of classes for classification problems. For regression, it is 1.

For the hidden layers, you can follow any pattern, but the number of neurons should be sufficient.
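As a concrete sketch of these sizing rules (the feature count, class count, and hidden widths here are made up): a network for 20 input features and a 3-class classification problem fixes the input width at 20 and the output width at 3, while the hidden widths are free choices.

```python
n_features, n_classes = 20, 3   # fixed by the data and the task
hidden_sizes = [64, 32]         # free choice - just needs to be "sufficient"

layer_sizes = [n_features] + hidden_sizes + [n_classes]
# Weight-matrix shapes implied by the sizing rules above
shapes = [(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(shapes)  # [(20, 64), (64, 32), (32, 3)]
```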

 

  • Learning Rate

By warming up / scheduling the learning rate, i.e. keeping the learning rate small in the early epochs and increasing it as training progresses.

  • Optimizer
  • Batch Size

When using large batch sizes, warm up the learning rate - i.e. let the learning rate vary over the early epochs.

  • Activation Function
  • Epochs

Use early stopping, which halts training when there is no further improvement.
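Early stopping, as described above, can be sketched in plain Python: track the best validation loss and stop once it fails to improve for `patience` epochs. The function name and loss values are invented for illustration (in Keras this is the `EarlyStopping` callback).

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return the epoch index at which training stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss        # improvement: reset the counter
            wait = 0
        else:
            wait += 1          # no improvement this epoch
            if wait >= patience:
                return epoch   # stop early
    return len(val_losses) - 1

# Loss improves, then stalls for two epochs -> training stops at epoch 4
stop_epoch = train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.63], patience=2)
print(stop_epoch)  # 4
```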

 

2] By Solving Problems:-

  • Vanishing / Exploding Gradient

We can solve this problem using proper weight initialization, changing the activation function, using batch normalization, and using gradient clipping.

  • Not Enough Data

We can solve this problem using transfer learning or unsupervised pretraining.

  • Slow Training

We can solve this problem using a better optimizer (e.g. Adam) or a learning rate scheduler.
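A learning-rate schedule with warm-up, as mentioned under "Learning Rate" above, can be sketched as a plain function. The base rate, warm-up length, and decay factor here are arbitrary illustrative choices:

```python
def scheduled_lr(epoch, base_lr=0.01, warmup_epochs=5):
    """Linear warm-up, then exponential decay (one illustrative schedule)."""
    if epoch < warmup_epochs:
        # warm-up: grow linearly up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # afterwards: decay by 5% per epoch past the warm-up
    return base_lr * (0.95 ** (epoch - warmup_epochs))

rates = [scheduled_lr(e) for e in range(8)]
print(rates[:5])  # rising during warm-up
print(rates[5:])  # decaying afterwards
```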

  • Overfitting

We can solve this problem using L1 and L2 regularization, dropout, early stopping, reducing model complexity, or increasing the amount of data.
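Two of these remedies are easy to sketch in NumPy: an L2 penalty added to the training loss, and a dropout mask applied to activations at training time. The penalty strength, dropout rate, and values below are illustrative:

```python
import numpy as np

def l2_penalty(weights, lam=0.01):
    """L2 regularization term added to the training loss."""
    return lam * np.sum(weights ** 2)

def dropout(activations, rate=0.5, rng=None):
    """Randomly zero a fraction `rate` of activations (training time),
    scaling the survivors so the expected value is unchanged."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

w = np.array([1.0, -2.0, 3.0])
print(l2_penalty(w))                 # 0.01 * (1 + 4 + 9) = 0.14
a = dropout(np.ones(1000), rate=0.5)
print(a.mean())                      # close to 1.0 in expectation
```

The inverted scaling (dividing by `1 - rate`) is what Keras's `Dropout` layer does, so no rescaling is needed at inference time.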
