Machine Learning - Machine Learning Development Life Cycle - Exploratory Data Analysis (EDA) Tutorial
Exploratory Data Analysis (EDA)
Univariate Analysis – Single Variable Analysis
import seaborn as sns
The examples in this section use the Titanic dataset, loaded into a pandas DataFrame named df.
1] Categorical Data
- CountPlot
sns.countplot(x=df['Survived'])
or, using pandas alone:
df['Survived'].value_counts().plot(kind='bar')
- PieChart
Shows the same information as the count plot, but as percentages of the total.
df['Survived'].value_counts().plot(kind='pie', autopct='%.2f')
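The numbers behind both plots can be checked directly with pandas. This is a minimal sketch that assumes a small made-up Survived column (hypothetical values, not the real Titanic data):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic 'Survived' column (0 = died, 1 = survived).
df = pd.DataFrame({"Survived": [0, 1, 0, 0, 1, 0, 1, 0]})

# Absolute count per category: these are the bar heights in the count plot.
counts = df["Survived"].value_counts()

# Share of each category in percent: these are the labels autopct draws on the pie.
percentages = df["Survived"].value_counts(normalize=True) * 100

print(counts)
print(percentages.round(2))
```

Comparing the two printed Series is a quick way to confirm that the bar chart and the pie chart describe the same split.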
2] Numerical Data
import matplotlib.pyplot as plt
- Histogram
plt.hist(df['Age'].dropna(), bins=10)
- Distplot
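An improvement on the histogram: in addition to the bars, it draws an estimate of the probability density function. The smooth line is the kernel density estimate (KDE). sns.distplot has been deprecated since seaborn 0.11; sns.histplot with kde=True is the axes-level replacement.
sns.histplot(x=df['Age'], kde=True, stat='density')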
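To see what the KDE line represents, it can help to compute one by hand. The sketch below is an illustration only, not seaborn's implementation: it assumes a Gaussian kernel, a bandwidth from Scott's rule of thumb, and a hypothetical sample of ages. The helper name gaussian_kde_1d is made up for this example.

```python
import numpy as np

# Hypothetical ages; any 1-D numeric sample works.
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0, 14.0, 4.0])


def gaussian_kde_1d(sample, grid, bandwidth):
    """Evaluate a Gaussian KDE: the average of one Gaussian bump per observation."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


# Scott's rule of thumb for 1-D data (an assumption; other bandwidth rules exist).
bandwidth = ages.std(ddof=1) * len(ages) ** (-1 / 5)

# Evaluate the density on a grid that extends a few bandwidths past the data.
grid = np.linspace(ages.min() - 4 * bandwidth, ages.max() + 4 * bandwidth, 2001)
density = gaussian_kde_1d(ages, grid, bandwidth)

# A probability density integrates to 1; approximate the integral with a Riemann sum.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth makes the curve follow individual observations more closely; a larger one smooths them out.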
- Boxplot
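It summarizes the distribution through its quartiles. The box spans the interquartile range (IQR, from the first quartile Q1 to the third quartile Q3) with a line at the median; the whiskers extend to the most extreme points within 1.5 × IQR of the box, and anything beyond them is drawn as an outlier.
sns.boxplot(x=df['Age'])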
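The quantities a box plot draws can be reproduced with NumPy. This is a minimal sketch on a hypothetical sample, using the common 1.5 × IQR rule for the whiskers:

```python
import numpy as np

# Hypothetical ages, with one unusually large value.
ages = np.array([2, 4, 14, 22, 26, 27, 35, 35, 38, 80], dtype=float)

# Quartiles: the box edges (q1, q3) and the line inside the box (median).
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside them are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences.
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Note that the whiskers are not always the minimum and maximum: here the largest value falls outside the upper fence, so it is reported as an outlier instead.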
Bivariate & Multivariate Analysis – Two or more Variable Analysis
1] Scatterplot (Numerical-Numerical Data)
sns.scatterplot(x=tips['total_bill'], y=tips['tip'], hue=tips['sex'], style=tips['smoker'], size=tips['size'])
2] Bar Plot (Numerical – Categorical)
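sns.barplot(x=titanic['Pclass'], y=titanic['Fare'])
Each bar shows the mean Fare for that Pclass. The black line is the 95% confidence interval around that mean, which seaborn estimates by bootstrapping.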
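The confidence interval on each bar can be approximated by hand with a percentile bootstrap. The sketch below assumes a hypothetical set of fares for a single class, a 95% level, and 1000 resamples; it illustrates the idea and is not seaborn's own code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical fares for one passenger class.
fares = np.array([7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08])

# Resample with replacement many times and record the mean of each resample.
n_boot = 1000
boot_means = np.array([
    rng.choice(fares, size=len(fares), replace=True).mean()
    for _ in range(n_boot)
])

# The 2.5th and 97.5th percentiles of those means bound a 95% interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
mean_fare = fares.mean()

print(mean_fare, ci_low, ci_high)
```

The bar height corresponds to mean_fare and the black line runs from ci_low to ci_high. With few observations per category the interval becomes wide, which is a useful warning sign when reading the plot.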
3] BoxPlot (Numerical – Categorical)
sns.boxplot(x=titanic['Sex'], y=titanic['Age'])
4] Distplot (Numerical – Categorical)
5] HeatMap (Categorical – Categorical)
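A heatmap of two categorical columns is usually drawn from a contingency table. This is a minimal sketch, assuming hypothetical Pclass and Survived values and using pd.crosstab to build the table:

```python
import pandas as pd
import seaborn as sns

# Hypothetical data with Titanic-style columns.
titanic = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Contingency table: rows are classes, columns are outcomes, cells are counts.
table = pd.crosstab(titanic["Pclass"], titanic["Survived"])

# Colour encodes the count; annot=True also writes the number in each cell.
ax = sns.heatmap(table, annot=True, fmt="d", cmap="Blues")
print(table)
```

Darker cells mark the more frequent combinations of the two categories.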
6] ClusterMap (Categorical – Categorical)
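A cluster map is a heatmap whose rows and columns are reordered by hierarchical clustering, so similar categories end up next to each other. The sketch below assumes hypothetical Pclass and Embarked values and requires SciPy to be installed alongside seaborn.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data with Titanic-style columns.
titanic = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Embarked": ["C", "S", "C", "S", "Q", "S", "S", "Q", "S", "C"],
})

# Contingency table of counts for the two categorical columns.
table = pd.crosstab(titanic["Pclass"], titanic["Embarked"])

# clustermap clusters rows and columns and draws dendrograms next to the heatmap.
grid = sns.clustermap(table, annot=True)

# Order in which the original rows appear after clustering.
print(grid.dendrogram_row.reordered_ind)
```

Unlike sns.heatmap, sns.clustermap creates its own figure, so it cannot be drawn into an existing axes.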
7] PairPlot
sns.pairplot(iris, hue='species')
8] LinePlot (Numerical – Numerical)
Used mostly for time-series data, where the x-axis is time.