Machine Learning - Ensemble Learning - Random Forest Classifier Tutorial
A ‘random forest’ is a supervised machine learning algorithm that is generally used for classification problems. It is a bagging-based ensemble learning technique that combines multiple decision tree classifiers to solve a complex problem and to improve the performance of the model.
In a classification problem, the random forest takes the majority vote of the trees as the final decision; in a regression problem, it uses the mean (average) of the trees' predictions.
Random Forest -> Bagging -> Bootstrap aggregation of multiple decision trees
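As a minimal sketch of this chain, the snippet below bags plain decision trees by hand and then fits scikit-learn's RandomForestClassifier, which adds random feature selection at each split on top of bagging. The synthetic dataset and all hyperparameters are illustrative assumptions, not part of the original tutorial.

```python
# A minimal sketch: bagging = bootstrap aggregation of decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bootstrap aggregation (bagging) of decision trees, assembled by hand.
# (In scikit-learn < 1.2 the parameter is named base_estimator instead.)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=100, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)

# A random forest does the same, plus a random subset of features per split.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("bagged trees accuracy:", bagging.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```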
Advantages of Random Forest
- It takes less training time as compared to many other models, since the individual trees can be trained independently (in parallel).
- It is capable of handling large datasets with high dimensionality.
- It enhances the accuracy of the model and reduces overfitting compared to a single decision tree.
Disadvantages of Random Forest
- Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks, since an average of piecewise-constant trees cannot extrapolate beyond the range of the training data.
How does Random Forest perform so well? The bias-variance trade-off
Answer – A fully grown decision tree has low bias but high variance: it fits the training data closely, and its predictions change a lot from one training sample to another. A random forest keeps the low bias of deep trees but reduces the variance by averaging many trees that are decorrelated through bootstrapped samples and random feature selection.
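A small experiment can make this concrete: under cross-validation, a single deep tree typically shows a larger spread of scores across folds than a forest. The dataset and settings below are illustrative assumptions.

```python
# Compare the fold-to-fold spread (variance) of a single deep tree vs a forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=10)

# The forest usually keeps the mean high while shrinking the std (variance).
print(f"tree:   mean={tree_scores.mean():.3f}  std={tree_scores.std():.3f}")
print(f"forest: mean={forest_scores.mean():.3f}  std={forest_scores.std():.3f}")
```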
Feature Importance using Random Forest and Decision Tree
Feature importance is a key concept in machine learning that refers to the relative importance of each feature in the training data. In other words, it tells us which features are most predictive of the target variable. Feature importance can be calculated in a number of ways, but most methods rely on a score that measures how often a feature is used in the model and how much it contributes to the overall predictions.
Formula-
ni_j = (N_t / N) * impurity - (N_t_r / N) * right_impurity - (N_t_l / N) * left_impurity
fi_k = (sum of ni_j over nodes j that split on feature k) / (sum of ni_j over all nodes)
(A Python sketch of both formulas follows the symbol definitions below.)
ni_j – node importance of node j
N_t – No. of rows in the current node
N – No. of rows in the total dataset
Impurity – gini impurity of the current node
N_t_r – No. of rows in the right subnode
right_impurity – gini impurity of the right subnode
N_t_l – No. of rows in the left subnode
left_impurity – gini impurity of the left subnode
fi_k – feature importance of feature k (k is the feature index)
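As referenced above, here is a sketch of both formulas computed from the internals of a fitted scikit-learn tree. The toy dataset is an assumption for illustration; the tree_ arrays used (n_node_samples, impurity, children_left, children_right, feature) are real scikit-learn attributes, and for unweighted data the result should match the library's feature_importances_.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data, assumed purely for illustration.
X = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
y = [0, 1, 1, 1, 0]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = clf.tree_

N = t.n_node_samples[0]              # No. of rows in the total dataset
ni = np.zeros(t.node_count)          # ni_j for every node j
for j in range(t.node_count):
    left, right = t.children_left[j], t.children_right[j]
    if left == -1:                   # leaf: no split, so ni_j = 0
        continue
    ni[j] = (t.n_node_samples[j] / N) * t.impurity[j] \
        - (t.n_node_samples[right] / N) * t.impurity[right] \
        - (t.n_node_samples[left] / N) * t.impurity[left]

# fi_k: sum ni_j over nodes that split on feature k, then normalize.
fi = np.zeros(clf.n_features_in_)
for j in range(t.node_count):
    if t.children_left[j] != -1:
        fi[t.feature[j]] += ni[j]
fi /= fi.sum()

print("manual :", fi)
print("sklearn:", clf.feature_importances_)  # should agree up to rounding
```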
Feature importance calculation in a Decision Tree-
Worked example: a decision tree trained on a dataset of 5 rows. The root node (gini impurity 0.48) splits on feature 1 into a pure left leaf with 1 row and a right subnode with 4 rows (gini impurity 0.375); the right subnode splits on feature 0 into two pure leaves.
node importance of the 1st node (the root, which splits on feature 1)
ni – node importance = ?
N_t – No. of rows in the current node = 5
N – No. of rows in the total dataset = 5
Impurity – gini impurity of the current node = 0.48
N_t_r – No. of rows in the right subnode = 4
right_impurity – gini impurity of the right subnode = 0.375
N_t_l – No. of rows in the left subnode = 1 (a pure leaf)
left_impurity – gini impurity of the left subnode = 0
ni = (5/5)*0.48 - (4/5)*0.375 - (1/5)*0 = 0.48 - 0.3 - 0 = 0.18
node importance of the 2nd node (the right subnode, which splits on feature 0)
ni – node importance = ?
N_t – No. of rows in the current node = 4
N – No. of rows in the total dataset = 5
Impurity – gini impurity of the current node = 0.375
right_impurity – gini impurity of the right subnode = 0 (a pure leaf)
left_impurity – gini impurity of the left subnode = 0 (a pure leaf)
Both subnodes are pure leaves, so their impurity terms contribute 0 regardless of how the 4 rows split between them.
ni = (4/5)*0.375 - 0 - 0 = 0.3
feature importance of the 0th feature = 0.3 / (0.3 + 0.18) = 0.625
feature importance of the 1st feature = 0.18 / (0.3 + 0.18) = 0.375
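The arithmetic above can be re-checked in a few lines of Python:

```python
# Re-check of the hand calculation.
ni_1 = (5/5) * 0.48 - (4/5) * 0.375 - (1/5) * 0   # root node -> 0.18
ni_2 = (4/5) * 0.375 - 0 - 0                      # 2nd node  -> 0.30
print(ni_2 / (ni_1 + ni_2))                       # feature 0 -> 0.625
print(ni_1 / (ni_1 + ni_2))                       # feature 1 -> 0.375
```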
In Python, use the feature_importances_ attribute of a fitted scikit-learn model to check feature importance-
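For example, on a fitted decision tree (the iris dataset here is an assumption, used only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# One importance score per feature; the scores sum to 1.
for name, imp in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```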
In a random forest there are multiple decision trees, so the forest's feature importance is calculated by averaging the feature importances of all the decision trees-
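A quick check of this in scikit-learn (iris again as an assumed example): averaging the per-tree importances by hand should agree with the forest's own feature_importances_, up to floating-point error.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Average the importances of the individual trees in the forest.
manual = np.mean([tree.feature_importances_ for tree in rf.estimators_], axis=0)

print(np.allclose(manual, rf.feature_importances_))  # expected: True
print(rf.feature_importances_)
```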