Deep Learning - RNN - Recurrent Neural Network - Problems with RNN Tutorial
RNNs work well with sequential data such as text and time series.
Even so, the plain RNN is rarely used in practice because of two major problems:
- Problem of Long-Term Dependency
- Unstable/Stagnated Training
The root cause of both problems is an unstable gradient.
Problem 1: Long-Term Dependency
In an RNN the data is sequential, so the current output depends on the previous inputs. As the length of the sequence grows, however, the network finds it harder and harder to remember the very old inputs, and the long-term dependency is lost.
Example 1 - The iPhone is the top-selling mobile.
Here, the word 'mobile' depends on the word 'iPhone'. Because the sentence is short, this is a short-term dependency.
Example 2 - The iPhone is a superb mobile. I purchased it last year. But I found it very difficult to understand its features.
Here, the word 'features' depends on the word 'iPhone'. Over such a long sentence the RNN fails to carry that information forward, and this is known as the long-term dependency problem.
The root of this behaviour is the vanishing gradient problem.
For a sequence of 100 steps, the latest state (the 100th) struggles to remember the earliest ones (e.g. the 2nd), because during backpropagation through time the gradient of the loss with respect to the recurrent weights \(W_h\) is a sum of contributions, one per timestep:
\(\frac{\partial L}{\partial W_{h}} = \frac{\partial L}{\partial Y'}\frac{\partial Y'}{\partial O_{100}}\frac{\partial O_{100}}{\partial W_{h}} + \frac{\partial L}{\partial Y'}\frac{\partial Y'}{\partial O_{100}}\frac{\partial O_{100}}{\partial O_{99}}\frac{\partial O_{99}}{\partial W_{h}} + \cdots + \frac{\partial L}{\partial Y'}\frac{\partial Y'}{\partial O_{100}}\frac{\partial O_{100}}{\partial O_{99}}\cdots\frac{\partial O_{2}}{\partial O_{1}}\frac{\partial O_{1}}{\partial W_{h}}\)
The last term, which carries the very first timestep's contribution, is a product of roughly 99 factors \(\frac{\partial O_{t}}{\partial O_{t-1}}\). With tanh or sigmoid activations each factor is typically smaller than 1, so the product shrinks toward zero and the early inputs stop influencing the weight update.
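To see why a product of so many factors vanishes, here is a minimal numerical sketch (an assumed scalar RNN with a hypothetical recurrent weight w_h = 0.5 and tanh activation, not code from this tutorial):

```python
# Minimal sketch: each chain-rule factor dO_t/dO_{t-1} = w_h * (1 - tanh(pre_t)^2)
# is at most |w_h|, so for |w_h| < 1 the product over ~100 steps collapses to ~0.
import numpy as np

w_h = 0.5            # assumed recurrent weight (|w_h| < 1)
o = 0.1              # initial hidden state O_1
grad_product = 1.0   # running product of dO_t / dO_{t-1}

for t in range(2, 101):
    pre = w_h * o                       # pre-activation (input term omitted for brevity)
    o = np.tanh(pre)                    # hidden state O_t
    grad_product *= w_h * (1 - o ** 2)  # chain-rule factor dO_t / dO_{t-1}

print(grad_product)  # on the order of 1e-30: the 2nd step's contribution has vanished
```

With 99 such factors the first timesteps contribute essentially nothing to the weight update, which is exactly the long-term dependency failure described above.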
Solutions (a brief code sketch of 1], 2], and 4] follows this list) -
1] Use a different activation function such as ReLU or Leaky ReLU
2] Better weight initialization
3] Skip RNNs
4] Use the LSTM architecture
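The sketch below shows, in PyTorch, what options 1], 2], and 4] can look like; the layer sizes and batch shapes are arbitrary placeholders, not values from this tutorial:

```python
import torch
import torch.nn as nn

input_size, hidden_size = 16, 32   # placeholder sizes

# 1] Different activation: nn.RNN supports 'tanh' (default) or 'relu'.
relu_rnn = nn.RNN(input_size, hidden_size, nonlinearity='relu', batch_first=True)

# 2] Better weight initialization: e.g. orthogonal recurrent weights, which keep
#    the repeated Jacobian products in the gradient better conditioned.
nn.init.orthogonal_(relu_rnn.weight_hh_l0)

# 4] LSTM architecture: the gated cell state gives gradients a more direct path
#    back through time.
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

x = torch.randn(8, 100, input_size)        # 8 sequences of 100 steps each
out_rnn, h_n = relu_rnn(x)
out_lstm, (h_n, c_n) = lstm(x)
print(out_rnn.shape, out_lstm.shape)       # torch.Size([8, 100, 32]) for both
```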
Problem 2: Unstable Training
What if the long-term terms in the gradient become so large that they dominate the short-term ones? In that case the gradient update blows up (the exploding gradient problem), the weights grow toward infinity, and the model does not get trained.
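A minimal numerical sketch of this failure (again an assumed scalar RNN, here with a hypothetical recurrent weight w_h = 1.5 and a ReLU activation, whose derivative is 1 on positive inputs):

```python
# With |w_h| > 1 and ReLU (derivative 1 for positive pre-activations), each
# chain-rule factor dO_t/dO_{t-1} equals w_h, so the product over ~99 steps explodes.
w_h = 1.5            # assumed recurrent weight (|w_h| > 1)
grad_product = 1.0   # running product of dO_t / dO_{t-1}

for t in range(2, 101):
    grad_product *= w_h

print(grad_product)  # ~2.7e17: the resulting weight update is astronomically large
```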
Solutions (a brief code sketch of 1] and 2] follows this list) -
1] Gradient Clipping
2] Use a controlled learning rate
3] Use the LSTM architecture
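A rough PyTorch sketch of 1] and 2] for a single training step; the model, data, clip threshold (1.0), and learning rate (1e-3) are arbitrary placeholder choices:

```python
import torch
import torch.nn as nn

model = nn.RNN(16, 32, batch_first=True)          # placeholder RNN
# 2] Controlled learning rate: a small step size keeps any single update bounded.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 100, 16)                       # placeholder batch
target = torch.randn(8, 100, 32)

out, _ = model(x)
loss = nn.functional.mse_loss(out, target)

optimizer.zero_grad()
loss.backward()
# 1] Gradient clipping: rescale the whole gradient vector whenever its norm
#    exceeds the threshold, so an exploding gradient cannot wreck the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```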