Machine Learning - Ensemble Learning - Random Forest Classifier Tutorial
A ‘random forest’ is a supervised machine learning algorithm that is generally used for classification problems. It is a bagging-based ensemble learning technique that combines multiple decision tree classifiers to solve a complex problem and to improve the performance of the model.
In a classification problem, the random forest takes the majority vote of the trees as the final decision; in a regression problem, it uses the mean (average) of the trees' predictions.
Random Forest -> Bagging -> Bootstrap aggregation of multiple decision trees
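As a minimal sketch of this chain, the snippet below bags plain decision trees by hand and then fits scikit-learn's RandomForestClassifier, which adds random feature selection at each split on top of bagging. The synthetic dataset and all hyperparameters are illustrative assumptions, not part of the original tutorial.

```python
# A minimal sketch: bagging = bootstrap aggregation of decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bootstrap aggregation (bagging) of decision trees, assembled by hand.
# (In scikit-learn < 1.2 the parameter is named base_estimator instead.)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=100, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)

# A random forest does the same, plus a random subset of features per split.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("bagged trees accuracy:", bagging.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```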
Advantages of Random Forest
- It takes less training time as compared to many other models, since the individual trees can be trained independently (in parallel).
- It is capable of handling large datasets with high dimensionality.
- It enhances the accuracy of the model and reduces overfitting compared to a single decision tree.
Disadvantages of Random Forest
- Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks, since an average of piecewise-constant trees cannot extrapolate beyond the range of the training data.
How does Random Forest perform so well? The bias-variance trade-off
Answer – A fully grown decision tree has low bias but high variance: it fits the training data closely, and its predictions change a lot from one training sample to another. A random forest keeps the low bias of deep trees but reduces the variance by averaging many trees that are decorrelated through bootstrapped samples and random feature selection.
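A small experiment can make this concrete: under cross-validation, a single deep tree typically shows a larger spread of scores across folds than a forest. The dataset and settings below are illustrative assumptions.

```python
# Compare the fold-to-fold spread (variance) of a single deep tree vs a forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=10)

# The forest usually keeps the mean high while shrinking the std (variance).
print(f"tree:   mean={tree_scores.mean():.3f}  std={tree_scores.std():.3f}")
print(f"forest: mean={forest_scores.mean():.3f}  std={forest_scores.std():.3f}")
```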
Feature Importance using Random Forest and Decision Tree
Feature importance is a key concept in machine learning that refers to the relative importance of each feature in the training data. In other words, it tells us which features are most predictive of the target variable. Feature importance can be calculated in a number of ways, but most methods rely on a score that measures how often a feature is used in the model and how much it contributes to the overall predictions.
Formula-
ni_j = (N_t / N) * impurity - (N_t_r / N) * right_impurity - (N_t_l / N) * left_impurity
fi_k = (sum of ni_j over nodes j that split on feature k) / (sum of ni_j over all nodes)
(A Python sketch of both formulas follows the symbol definitions below.)
ni_j – node importance of node j
N_t – No. of rows in the current node
N – No. of rows in the total dataset
Impurity – gini impurity of the current node
N_t_r – No. of rows in the right subnode
right_impurity – gini impurity of the right subnode
N_t_l – No. of rows in the left subnode
left_impurity – gini impurity of the left subnode
fi_k – feature importance of feature k (k is the feature index)
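As referenced above, here is a sketch of both formulas computed from the internals of a fitted scikit-learn tree. The toy dataset is an assumption for illustration; the tree_ arrays used (n_node_samples, impurity, children_left, children_right, feature) are real scikit-learn attributes, and for unweighted data the result should match the library's feature_importances_.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data, assumed purely for illustration.
X = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
y = [0, 1, 1, 1, 0]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = clf.tree_

N = t.n_node_samples[0]              # No. of rows in the total dataset
ni = np.zeros(t.node_count)          # ni_j for every node j
for j in range(t.node_count):
    left, right = t.children_left[j], t.children_right[j]
    if left == -1:                   # leaf: no split, so ni_j = 0
        continue
    ni[j] = (t.n_node_samples[j] / N) * t.impurity[j] \
        - (t.n_node_samples[right] / N) * t.impurity[right] \
        - (t.n_node_samples[left] / N) * t.impurity[left]

# fi_k: sum ni_j over nodes that split on feature k, then normalize.
fi = np.zeros(clf.n_features_in_)
for j in range(t.node_count):
    if t.children_left[j] != -1:
        fi[t.feature[j]] += ni[j]
fi /= fi.sum()

print("manual :", fi)
print("sklearn:", clf.feature_importances_)  # should agree up to rounding
```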
Feature importance calculation in a Decision Tree-
Worked example: a decision tree trained on a dataset of 5 rows. The root node (gini impurity 0.48) splits on feature 1 into a pure left leaf with 1 row and a right subnode with 4 rows (gini impurity 0.375); the right subnode splits on feature 0 into two pure leaves.
node importance of the 1st node (the root, which splits on feature 1)
ni – node importance = ?
N_t – No. of rows in the current node = 5
N – No. of rows in the total dataset = 5
Impurity – gini impurity of the current node = 0.48
N_t_r – No. of rows in the right subnode = 4
right_impurity – gini impurity of the right subnode = 0.375
N_t_l – No. of rows in the left subnode = 1 (a pure leaf)
left_impurity – gini impurity of the left subnode = 0
ni = (5/5)*0.48 - (4/5)*0.375 - (1/5)*0 = 0.48 - 0.3 - 0 = 0.18
node importance of the 2nd node (the right subnode, which splits on feature 0)
ni – node importance = ?
N_t – No. of rows in the current node = 4
N – No. of rows in the total dataset = 5
Impurity – gini impurity of the current node = 0.375
right_impurity – gini impurity of the right subnode = 0 (a pure leaf)
left_impurity – gini impurity of the left subnode = 0 (a pure leaf)
Both subnodes are pure leaves, so their impurity terms contribute 0 regardless of how the 4 rows split between them.
ni = (4/5)*0.375 - 0 - 0 = 0.3
feature importance of the 0th feature = 0.3 / (0.3 + 0.18) = 0.625
feature importance of the 1st feature = 0.18 / (0.3 + 0.18) = 0.375
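The arithmetic above can be re-checked in a few lines of Python:

```python
# Re-check of the hand calculation.
ni_1 = (5/5) * 0.48 - (4/5) * 0.375 - (1/5) * 0   # root node -> 0.18
ni_2 = (4/5) * 0.375 - 0 - 0                      # 2nd node  -> 0.30
print(ni_2 / (ni_1 + ni_2))                       # feature 0 -> 0.625
print(ni_1 / (ni_1 + ni_2))                       # feature 1 -> 0.375
```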
In Python, use the feature_importances_ attribute of a fitted scikit-learn model to check feature importance-
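For example, on a fitted decision tree (the iris dataset here is an assumption, used only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# One importance score per feature; the scores sum to 1.
for name, imp in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```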
In a random forest there are multiple decision trees, so the forest's feature importance is calculated by averaging the feature importances of all the decision trees-
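A quick check of this in scikit-learn (iris again as an assumed example): averaging the per-tree importances by hand should agree with the forest's own feature_importances_, up to floating-point error.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Average the importances of the individual trees in the forest.
manual = np.mean([tree.feature_importances_ for tree in rf.estimators_], axis=0)

print(np.allclose(manual, rf.feature_importances_))  # expected: True
print(rf.feature_importances_)
```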