Machine Learning - Unsupervised Learning - K-Means Clustering Tutorial
K-Means clustering is the most commonly used unsupervised learning algorithm for finding groups in unlabeled data. Here K represents the number of groups, or clusters, and the process of creating these groups is known as clustering.
The algorithm repeats its assignment and update steps until it converges, i.e. until the clusters stop changing.
Steps to achieve clustering-
step 1] - Decide the number of clusters you want to identify in your data. This is the K in K-Means clustering.
step 2] - Randomly initialize K points, called centroids, anywhere in the dataset (one per cluster).
step 3] - Assign each point to the cluster of its nearest centroid.
step 4] - Recompute each centroid as the mean/average of the points in the cluster formed in step 3; the centroid usually moves slightly from its earlier position.
If any centroid moved from its earlier position, repeat steps 3 and 4 until the algorithm converges (i.e. the centroids stop moving).
step 5] - Once the centroids stop moving, the final clusters and cluster centroids are formed.
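The steps above can be sketched as a minimal K-Means implementation. This is an illustrative sketch (assuming NumPy is available; the function name and the choice of initializing centroids from random data points are my own):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # step 2] randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # step 3] assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4] recompute each centroid as the mean of its cluster
        # (if a cluster ends up empty, keep its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # step 5] stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Because X is indexed only by rows, the same code works for data of any dimension, which is one of the advantages noted later in these notes.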
How to choose the number of clusters?
To choose the number of clusters, we can use-
1] Visualization-
If we have domain knowledge and a proper understanding of the given data, we can make a more informed decision.
Some clusters are visible to the naked eye.
2] Elbow method
The elbow method is a widely used way to find the number of clusters in a dataset. It consists of running K-Means clustering on the dataset for a range of values of K.
It is done by plotting a graph with the number of clusters on the x-axis and the WSS (within-cluster sum of squared distances), also called inertia, on the y-axis.
We run the K-Means algorithm multiple times in a loop with an increasing number of clusters (say from 1 to 10), and then plot WSS against the number of clusters.
For one cluster (one centroid), WSS is calculated by totaling the squared Euclidean distance between each point and the centroid.
Example: for 1 centroid and 50 data points, WSS = d₁² + d₂² + d₃² + … + d₅₀², where dᵢ is the distance from point i to the centroid.
And for 50 centroids and 50 data points, WSS = 0, because every centroid is itself a data point, so all distances become zero.
This means that as the number of clusters increases, WSS decreases.
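The two WSS extremes above can be checked numerically. A small sketch (assuming NumPy; the `wss` helper and the toy 4-point dataset are illustrative):

```python
import numpy as np

def wss(X, centroids, labels):
    """Within-cluster sum of squared distances (inertia)."""
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

# Four points at the corners of a 2x2 square
X = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])

# One centroid: WSS totals the squared distance from every point to it
c1 = X.mean(axis=0, keepdims=True)   # centroid at (1, 1)
labels1 = np.zeros(len(X), dtype=int)
print(wss(X, c1, labels1))           # 8.0 -> each point is at squared distance 2

# As many centroids as points: every distance is zero, so WSS = 0
labels_n = np.arange(len(X))
print(wss(X, X, labels_n))           # 0.0
```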
After a certain point the curve flattens, forming an elbow shape; that elbow point can be taken as the number of clusters. In the graph below we can determine the number of clusters as 3.
But sometimes we don't get a clear elbow point on the plot; in such cases it is very hard to finalize the number of clusters.
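The elbow loop described above can be sketched as follows (this assumes scikit-learn is installed; the toy three-cluster dataset and variable names are illustrative, and `inertia_` is scikit-learn's name for WSS):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset with 3 well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.2, (30, 2)) for loc in (0, 4, 8)])

# Run K-Means for k = 1..10 and record the WSS for each k
ks = range(1, 11)
wss_values = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss_values.append(km.inertia_)   # inertia_ = within-cluster sum of squares

# Plot ks vs wss_values (e.g. with matplotlib) and look for the elbow;
# on this data WSS drops sharply up to k = 3 and flattens afterwards.
```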
See video number 103
Advantages-
- One of the simplest algorithms to understand, and computationally efficient
- Works in any number of dimensions with the same piece of code
- Gives better results when there is less overlap between clusters
Disadvantages-
- The number of clusters needs to be defined by the user
- Doesn't work well when clusters overlap
- Unable to handle noisy data and outliers
- Fails on non-linearly separable datasets (clusters that are not roughly spherical)
Compare K-Means and KNN algorithms-

K-means | KNN
K-Means is an unsupervised learning algorithm used for clustering. | KNN (K-Nearest Neighbors) is a supervised learning algorithm used for classification and regression.
K is the number of clusters to form. | K is the number of nearest neighbors used to make a prediction.
Works on unlabeled data. | Requires labeled training data.
Other unsupervised learning techniques-
- Hierarchical Clustering
- PCA