Statistics for AIML - Regression Metrics - Outlier Tutorial
What is an outlier?
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst to decide what will be considered abnormal.
Common Causes of Outliers
- Data entry errors (human errors)
- Measurement errors (instrument errors)
- Experimental errors (data extraction or experiment planning/executing errors)
- Intentional (dummy outliers made to test detection methods)
- Data processing errors (data manipulation or data set unintended mutations)
- Sampling errors (extracting or mixing data from wrong or various sources)
- Natural (not an error, novelties in data)
Common Methods of Detecting Outliers
1. Sort the data and look for extreme values
2. Plotting – Boxplot, Scatterplot
3. IQR Method
4. Z-Score Method
Why do we need to treat outliers?
Outliers can drastically distort the results of our analysis and statistical modeling. A single extreme value can pull the mean far away from the bulk of the data and inflate the variance.
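To make this impact concrete, here is a minimal sketch that compares the mean and median of the temperature readings from the exercise in this tutorial, with and without the -350.0 reading. The variable names are illustrative only, and dropping the value is just one possible treatment, used here for comparison.

```python
from statistics import mean, median

# Temperature readings (in degrees Celsius) from the exercise in this tutorial.
readings = [26.0, 15.0, 20.5, 31.0, -350.0, 31.0, 30.5]

# The same readings with the suspicious -350.0 value dropped (for comparison only).
cleaned = [x for x in readings if x != -350.0]

mean_with = mean(readings)
mean_without = mean(cleaned)
median_with = median(readings)
median_without = median(cleaned)

print(f"mean:   {mean_with:.2f} with the extreme value, {mean_without:.2f} without")
print(f"median: {median_with:.2f} with the extreme value, {median_without:.2f} without")
```

The mean moves from about 25.67 to -28 when the extreme value is included, while the median only moves from 28.25 to 26.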
IQR Method
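A data value is considered to be an outlier if
Data Value < Q1 - 1.5(IQR)
OR
Data Value > Q3 + 1.5(IQR)
where Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1 is the interquartile range.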
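The rule can be sketched as two small Python functions. The names iqr_fences and is_outlier, the parameter k, and the quartile values in the usage lines are hypothetical and chosen only for illustration. The quartiles are assumed to be computed beforehand.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if value lies below the lower cutoff or above the upper cutoff."""
    lower, upper = iqr_fences(q1, q3, k)
    return value < lower or value > upper


# Hypothetical quartiles, not taken from the tutorial: Q1 = 10, Q3 = 20.
print(iqr_fences(10, 20))      # cutoffs for these made-up quartiles
print(is_outlier(40, 10, 20))  # 40 is above the upper cutoff
print(is_outlier(12, 10, 20))  # 12 is inside the cutoffs
```

Keeping the multiplier as a parameter (k) makes it easy to see that 1.5 is a convention of the method rather than something derived from the data.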
Q. Can you identify the outliers in the dataset below using the IQR method?
26.0 ℃, 15.0 ℃, 20.5 ℃, 31.0 ℃, -350.0 ℃, 31.0 ℃, 30.5 ℃
Arranging in ascending order: -350, 15, 20.5, 26, 30.5, 31, 31
minimum = -350, maximum = 31
median (Q2) = 26
Q1 = 15 (median of the lower half: -350, 15, 20.5)
Q3 = 31 (median of the upper half: 30.5, 31, 31)
IQR = Q3 - Q1 = 31 - 15 = 16
Q1 - 1.5(IQR) = 15 - 1.5(16) = 15 - 24 = -9
Q3 + 1.5(IQR) = 31 + 1.5(16) = 31 + 24 = 55
-350 is an outlier because it lies below the lower limit of -9. All other values fall within the range (-9, 55).
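The calculation can be checked with a short Python sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, leaving the middle value out of both halves when the count is odd, which matches the numbers in this exercise. The function name quartiles_median_of_halves is hypothetical.

```python
from statistics import median


def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3), with Q1 and Q3 as medians of the lower and upper halves.

    For an odd number of values, the middle value is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


temps = [26.0, 15.0, 20.5, 31.0, -350.0, 31.0, 30.5]

q1, q2, q3 = quartiles_median_of_halves(temps)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [t for t in temps if t < lower_limit or t > upper_limit]

print("Q1, Q2, Q3:", q1, q2, q3)
print("limits:", lower_limit, upper_limit)
print("outliers:", outliers)
```

Library quantile functions may use a different interpolation convention and can return slightly different quartiles for the same data, so the limits they produce may not match a hand calculation exactly.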