Machine Learning - Introduction
Missing Value Imputation
Handling Categorical Features
PCA - Principal Component Analysis
It is an unsupervised learning technique.
Its main aim is to reduce the curse of dimensionality, i.e. to avoid the heavy computation required on high-dimensional data.
It transforms higher-dimensional data into lower-dimensional data while keeping the essence of the data, which also helps with visualization.
Benefits of PCA:-
Faster execution of algorithms
Visualization (e.g. reducing 10-D data to 2-D)
Suppose a dataset has the columns (No. of Rooms, No. of Grocery Shops, target column: Price). Here No. of Rooms is more important than No. of Grocery Shops for predicting the price. Hence, using feature selection, we will keep only the No. of Rooms column.
If you don't have domain knowledge of the project, plot both columns on a graph, check the variance, and select the column with the higher spread (projection).
Feature selection will not work when both columns are equally important (e.g. No. of Rooms and No. of Bathrooms) and have the same variance. In such a case, we need to use feature extraction.
In feature extraction, when both columns are equally important (e.g. No. of Rooms and No. of Bathrooms) with the same variance, PCA combines both columns into a single new column, e.g. total flat size.
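The feature-extraction idea above can be sketched with plain NumPy: centre the data, take the covariance matrix's top eigenvector, and project onto it. The two columns and their values are made up for illustration (rooms and bathrooms).

```python
import numpy as np

# Two equally important, correlated features (e.g. rooms and bathrooms).
# These values are illustrative, not from a real dataset.
X = np.array([[2, 1], [3, 2], [4, 2], [5, 3], [6, 4]], dtype=float)

# 1. Centre the data around the column means.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition.
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# 3. Project onto the top principal component (last column from eigh).
pc1 = eigvecs[:, -1]
X_reduced = X_centered @ pc1            # one combined "size-like" column

print(X_reduced.shape)                  # (5,)
print(eigvals[-1] / eigvals.sum())      # fraction of total variance retained
```

In practice the same steps are done with `sklearn.decomposition.PCA(n_components=1)`; the manual version just makes the "combine two correlated columns into one" idea explicit.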
Range, IQR, Variance, and Standard Deviation are the methods used to understand the distribution of data.
Range - The Range is a measure of variability. It is calculated by subtracting the lowest value from the highest value. A wide range indicates high variability, and a small range indicates low variability in the distribution.
Range = Highest_value – Lowest_value
Interquartile Range (IQR)
IQR is the range between the third and first quartiles. IQR is preferred over the range because, unlike the range, it is not influenced by outliers. IQR measures variability by splitting a data set into four equal parts (quartiles).
Formula To Find Outliers
[Q1 – 1.5 * IQR, Q3 + 1.5 * IQR]
If a value does not fall within the above range, it is considered an outlier.
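The range, IQR, and the outlier rule above can be checked in a few lines of NumPy. The data here is hypothetical, with one obvious outlier (100):

```python
import numpy as np

# Hypothetical data with one obvious outlier (100).
data = np.array([10, 12, 13, 14, 15, 16, 18, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Outlier fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print("Range:", data.max() - data.min())  # 90
print("IQR:", iqr)
print("Outliers:", outliers)              # [100]
```

Note that the range (90) is dominated by the single outlier, while the IQR is not, which is exactly why IQR is preferred.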
The variance is a measure of variability. It is calculated by taking the average of the squared deviations from the mean. Variance tells you the degree of spread in your data set: the more spread out the data, the larger the variance is in relation to the mean.
Population vs Sample variance
When you have collected data from every member of the population
The population variance formula looks like this:

σ² = Σ (Xᵢ − μ)² / N

σ² = population variance
Σ = summation over all N values
Xᵢ = each value
μ = population mean
N = number of values in the population
When you have collected data from a sample
The sample variance formula looks like this:

s² = Σ (Xᵢ − x̄)² / (n − 1)

s² = sample variance
Σ = summation over all n values
Xᵢ = each value
x̄ = sample mean
n = number of values in the sample
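The only difference between the two formulas is the divisor (N versus n − 1), which NumPy exposes through the `ddof` parameter. A small sketch with made-up numbers:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

# Population variance: divide by N (ddof=0, NumPy's default).
pop_var = np.var(data, ddof=0)

# Sample variance: divide by n - 1 (Bessel's correction, ddof=1).
sample_var = np.var(data, ddof=1)

print(pop_var)     # 4.0
print(sample_var)  # ~4.571 (32 / 7)
```

Note that the sample variance is always slightly larger, which is the point of the n − 1 correction discussed next.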
Why do we use n − 1 in the sample standard deviation and variance formulas instead of n?
The simple answer: if we divided by n, the sample standard deviation and sample variance would both contain a bias (that's the statistics way of saying a systematic error) that consistently underestimates variability. In other words, the sample variance would tend to be lower than the real variance of the population. Bessel's correction (i.e. subtracting 1 from the sample size in the denominator) corrects this bias. Therefore, to get a result that accurately estimates the population value, we use n − 1 instead of n.
Variance Proportional to Spread
Variance gives added weight to outliers (values far from the mean), because squaring those large deviations can skew the measure: 10 squared is 100, but 100 squared is 10,000. To overcome this drawback of variance, the standard deviation came into the picture.
The standard deviation is derived from variance and tells you, on average, how far each value lies from the mean. It’s the square root of variance.
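The square-root relationship can be verified directly (same made-up data as above):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

variance = np.var(data)  # 4.0 (in squared units)
std_dev = np.std(data)   # 2.0 (same units as the data)

print(np.isclose(std_dev, np.sqrt(variance)))  # True
```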
Variance Vs Standard Deviation
Both measures reflect variability in a distribution, but their units differ:
Standard deviation is expressed in the same units as the original values (e.g., meters).
Variance is expressed in much larger, squared units (e.g., meters squared).
Why MAD (Mean Absolute Deviation) is not used instead of variance?
The mean absolute deviation of a dataset is the average distance between each data point and the mean. It gives us an idea about the variability in a dataset.
MAD = Σ |Xᵢ − X̄| / n

Xᵢ = each value from the population
X̄ = the population mean
n = size of the population

Here | | gives the absolute value, meaning all negative deviations (distances) are made positive.
MAD uses the absolute value, which is not differentiable at 0, so it does not work well in optimization; variance, on the other hand, is differentiable everywhere.
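Computing MAD side by side with the standard deviation (same made-up data as before) shows how the squaring in variance weights large deviations more heavily than the absolute value does:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
mean = data.mean()  # 5.0

# Mean Absolute Deviation: average absolute distance from the mean.
mad = np.abs(data - mean).mean()

print(mad)           # 1.5
print(np.std(data))  # 2.0 -- larger, because squaring emphasizes big deviations
```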
Correlation between two variables is a normalized version of covariance. The correlation coefficient always ranges between -1 and 1. It is also known as Pearson's correlation coefficient.
A negative value means the variables are inversely proportional to each other, with the strength given by the correlation coefficient value.
A positive value means they are directly proportional to each other, i.e. they vary in the same direction, with the strength given by the correlation coefficient value.
If the correlation coefficient is 0, there is no linear relationship between the variables.
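The three cases can be illustrated with `np.corrcoef` on toy data (the arrays below are made up so that the relationships are exactly linear):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y_pos = 2 * x + 1   # perfectly increasing with x
y_neg = -3 * x      # perfectly decreasing with x

# np.corrcoef returns a 2x2 matrix; [0, 1] is the coefficient between inputs.
print(np.corrcoef(x, y_pos)[0, 1])  # 1.0  (perfect positive correlation)
print(np.corrcoef(x, y_neg)[0, 1])  # -1.0 (perfect negative correlation)
```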
Covariance and Covariance matrix
Covariance is a measure of the relationship between two random variables and of the extent to which they change together. It ranges between -∞ and +∞. The covariance of two variables x and y is written cov(x, y).
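A quick sketch of the covariance matrix with `np.cov` (toy values; by default NumPy computes the sample covariance, dividing by n − 1):

```python
import numpy as np

x = np.array([2, 4, 6, 8], dtype=float)
y = np.array([1, 3, 5, 7], dtype=float)

# np.cov returns the 2x2 covariance matrix:
# [[var(x),     cov(x, y)],
#  [cov(x, y),  var(y)  ]]
cov_matrix = np.cov(x, y)

print(cov_matrix)
print(cov_matrix[0, 1])  # cov(x, y)
```

The diagonal entries are each variable's variance, and the off-diagonal entries are the covariance; this matrix is exactly what PCA eigendecomposes in the earlier section.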