Machine Learning - Supervised Learning - Regularization Tutorial
- Regularization
Regularization is a technique to prevent the model from overfitting by adding extra information (a penalty term) to its loss function.
The main regularization techniques are given below:
- Ridge Regression (L2) [Regression]
Hyperparameter – alpha, which plays the role of lambda in the cost function
Ridge regression is a type of linear regression in which a small amount of bias, a penalty proportional to the sum of the squares of the coefficients, is added to the loss function so that overfitting in the model is reduced and we get better long-term predictions.
Ridge regression is a regularization technique used to reduce the complexity of the model. It is also called L2 regularization.
In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the ridge regression penalty. It is calculated by multiplying lambda by the squared weight of each individual feature and summing the results.
The equation for the cost function in ridge regression will be:
cost function = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} m_j^2
In the above equation, the penalty term \lambda \sum_{j} m_j^2 regularizes the coefficients of the model; by shrinking the amplitude of the coefficients, ridge regression decreases the complexity of the model.
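As a minimal sketch of this formula (plain NumPy; all values and names below are made up for illustration), the ridge cost can be computed like this:

```python
import numpy as np

def ridge_cost(y_true, y_pred, coefs, lam):
    """Sum of squared errors plus the L2 (ridge) penalty."""
    sse = np.sum((y_true - y_pred) ** 2)  # sum_i (Y_i - Y_hat_i)^2
    penalty = lam * np.sum(coefs ** 2)    # lambda * sum_j m_j^2
    return sse + penalty

# Made-up values for illustration; note the intercept is not penalized
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.8, 5.3, 6.9])
coefs = np.array([1.9, 0.4])
print(ridge_cost(y_true, y_pred, coefs, lam=1.0))  # 0.14 + 3.77 ≈ 3.91
```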
For example:
For an overfitted model, the test loss of plain linear regression, Loss LN, is greater than the test loss of ridge regression, Loss LR.
Hence, we select the ridge (LR) model over the plain linear regression (LN) model.
On the training data, an overfitted linear regression line can drive the loss all the way to zero. To reduce overfitting, we add the penalty term, which increases the bias but reduces the variance, adjusting the bias-variance trade-off.
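A small scikit-learn sketch of this comparison (the dataset, polynomial degree, and alpha value are all made up for illustration): a high-degree polynomial fit with plain linear regression drives the training loss toward zero but generalizes poorly, while ridge trades a little training loss for a lower test loss.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Made-up noisy quadratic data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 30)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=1.0, size=30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Degree-10 polynomial: plain linear regression overfits (train loss near 0)
lin = make_pipeline(PolynomialFeatures(degree=10), LinearRegression()).fit(X_tr, y_tr)
rid = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0)).fit(X_tr, y_tr)

for name, model in [("Linear (LN)", lin), ("Ridge  (LR)", rid)]:
    print(name,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
          "test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 3))
```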
5 Key Points – Ridge Regression
- How do the coefficients get affected?
As lambda increases, the coefficients shrink toward 0 but never become exactly 0.
- Higher-value coefficients are impacted more by increasing lambda.
When lambda increases, a coefficient with a high value shrinks more than a coefficient with a low value.
- Bias-variance trade-off
Bias and variance depend on the lambda value:
Lambda decreases - Bias decreases, Variance increases (model tends to overfit)
Lambda increases - Bias increases, Variance decreases (model tends to underfit)
- Impact on the loss function
As lambda increases, the penalty term dominates the loss function, and its minimum shifts toward coefficients of 0.
- Why is it called ridge?
In the constrained view, the L2 penalty restricts the coefficients to a circular region, and the solution lies on the circle's perimeter, i.e. on a ridge; hence the name.
Practical Tips
Apply ridge regression only when the number of input features is greater than or equal to 2.
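The first two key points can be seen directly with scikit-learn's Ridge (the toy data and alpha values below are made up for this sketch): as alpha grows, every coefficient shrinks toward 0 without reaching it, and the largest coefficients shrink the most.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Made-up toy data with 5 features
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=42)

# As alpha (lambda) grows, every coefficient shrinks toward 0 (never exactly 0),
# and the largest coefficients shrink the most
for alpha in [0.01, 1, 100, 10000]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:<7}", np.round(coefs, 2))
```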
- Lasso Regression (L1) [Regression]
Hyperparameter – alpha, which plays the role of lambda in the cost function
Lasso regression is another regularization technique that reduces overfitting and the complexity of the model. LASSO stands for Least Absolute Shrinkage and Selection Operator.
It is similar to ridge regression, except that the penalty term contains the absolute values of the coefficients instead of their squares.
Since it penalizes absolute values, it can shrink a coefficient exactly to 0, whereas ridge regression can only shrink it close to 0.
Because of this, we can perform feature selection with lasso regression; hence, lasso has an advantage over ridge when feature selection matters.
It is also called L1 regularization. The equation for the cost function of lasso regression will be:
cost function = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} |m_j|
In this technique, the coefficients of some features are shrunk exactly to zero, so those features are completely neglected in model evaluation.
Hence, lasso regression helps us reduce overfitting in the model and also performs feature selection.
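A minimal scikit-learn sketch of this (the toy data and alpha are made up): with a suitable alpha, Lasso drives the coefficients of uninformative features exactly to 0, performing feature selection.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Made-up toy data: 10 features, only 3 actually informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)
print(np.round(model.coef_, 2))  # several coefficients come out exactly 0.0
print("selected features:", np.flatnonzero(model.coef_))
```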
Why does lasso regression create sparsity?
In ridge regression, increasing lambda shrinks the coefficients toward 0, but they never become exactly 0. In lasso, a coefficient can shrink all the way to 0.
In ridge, lambda sits in the denominator of the closed-form coefficient formula, so it can only scale the coefficient down; it can never make the numerator, and hence the coefficient, equal to zero.
In lasso, lambda is subtracted in the numerator of the coefficient formula, so a large enough lambda drives the numerator, and hence the coefficient, exactly to zero.
This means that variables are removed from the model, hence the sparsity.
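This can be checked with the one-feature, no-intercept closed forms, where the solutions are easy to write down (a sketch with made-up data): the ridge solution is m = Σxy / (Σx² + λ), with λ in the denominator, while the lasso solution soft-thresholds the numerator, m = sign(Σxy) · max(|Σxy| − λ/2, 0) / Σx².

```python
import numpy as np

# One feature, no intercept, made-up data; cost = sum (y - m*x)^2 + penalty
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + np.array([0.1, -0.2, 0.15, -0.05])
sxy, sxx = np.sum(x * y), np.sum(x * x)

for lam in [0, 40, 120, 200]:
    m_ridge = sxy / (sxx + lam)                                # lambda in the denominator
    m_lasso = np.sign(sxy) * max(abs(sxy) - lam / 2, 0) / sxx  # lambda in the numerator
    print(f"lambda={lam:<4} ridge m={m_ridge:.3f}  lasso m={m_lasso:.3f}")
```

Running this shows the ridge coefficient only approaching 0 as lambda grows, while the lasso coefficient hits exactly 0 once lambda is large enough.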
- Elastic Net [Regression]
Hyperparameters – alpha, which plays the role of lambda, and l1_ratio, which decides the weighting of lasso versus ridge
Elastic net linear regression uses the penalties from both the lasso and ridge techniques to regularize regression models. It combines the two methods, compensating for their individual shortcomings, to improve the regularization of statistical models.
The elastic net method performs variable selection and regularization simultaneously.
The elastic net technique is most appropriate when the number of features is greater than the number of samples.
Grouping correlated variables and variable selection are the key roles of the elastic net technique.
l1_ratio decides the weighting of lasso and ridge. If l1_ratio = 0.9, the penalty is 90% lasso (L1) and 10% ridge (L2). alpha (lambda) = a + b, where a is the L1 penalty strength and b is the L2 penalty strength, so l1_ratio = a / (a + b).
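A minimal scikit-learn sketch (the toy data, alpha, and l1_ratio values are made up for illustration); note that scikit-learn's ElasticNet exposes exactly these two hyperparameters:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Made-up toy data with correlated features (low effective rank)
X, y = make_regression(n_samples=100, n_features=8, n_informative=4,
                       effective_rank=3, noise=5.0, random_state=1)

# alpha scales the overall penalty; l1_ratio=0.9 makes it mostly L1 (lasso)
# with a small L2 (ridge) part
model = ElasticNet(alpha=0.5, l1_ratio=0.9).fit(X, y)
print(np.round(model.coef_, 2))
```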
When to use elastic net?
- When you are unsure about whether to use lasso or ridge
- If the input columns have multicollinearity, then elastic net is a good fit.