Table of Contents
- What is Clustering?
- Different Clustering Methods
- What is K-means Clustering?
- Video Tutorial
1 What is Clustering?
- Clustering is one of the main domains of machine learning, alongside regression and classification. In clustering, the data is organized into groups so that similar data points end up in the same group.
- For example, if your data contains squares and circles, you would cluster all the squares together and all the circles together.
2 Different Clustering Methods
Clustering is an unsupervised learning technique: no target variable is given, and the algorithm looks for the clusters present in the data on its own. One approach to clustering is k-means clustering, and another highly popular approach is hierarchical clustering. Both approaches work well in many cases. In this blog, we talk about k-means clustering.
3 What is K-means Clustering?
The name k-means clustering is made up of two parts, k and means. k is the number of clusters you want to group the data into, and the mean is the centroid of a cluster, that is, the mean of the points assigned to it. In the image, we can see two clusters, shown in green and red.
The point in the middle of the red area and the point in the middle of the green area are the centroids of the two clusters. k-means assigns each point to its nearest centroid, so it needs a distance measure. Two common measures are the Manhattan distance and the Euclidean distance; standard k-means uses the Euclidean distance.
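To make the centroid and the two distance measures concrete, here is a minimal sketch using NumPy. The points are made-up values chosen only for illustration; they do not come from the image or from any dataset used in this post.

```python
import numpy as np

# Hypothetical 2-D points that belong to one cluster (made-up values).
cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

# The centroid is the mean of the cluster's points, taken per feature.
centroid = cluster.mean(axis=0)

# A hypothetical new point whose distance to the centroid we measure.
point = np.array([0.0, 6.0])

# Euclidean distance: the straight-line distance between the two points.
euclidean = np.sqrt(np.sum((point - centroid) ** 2))

# Manhattan distance: the sum of absolute differences along each feature.
manhattan = np.sum(np.abs(point - centroid))

print("centroid:", centroid)
print("euclidean:", euclidean)
print("manhattan:", manhattan)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, which is why the choice of measure can change which centroid counts as the nearest one.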
Now let’s take an example.
Here we use the Iris dataset to implement this model.
- In supervised learning, we usually take y as the target variable: the training data tells the model that an instance with specific values leads to a specific output, or target.
- That practice does not apply to clustering, since it is not a supervised learning algorithm. Machine learning has three major categories: supervised, unsupervised, and reinforcement learning. Clustering falls under unsupervised learning, since no target is given. So we take only the x values when loading the data from the dataset, and no y values.
- In this data we already have an idea of how many clusters to expect. But if you are given a completely new dataset, how would you work out the optimal number of clusters for it?
- That can be done using the elbow method.
- The elbow method plots the within-cluster sum of squares (WCSS) against the number of clusters. As the number of clusters increases, the curve first falls sharply and then flattens out, forming a bend that looks like an elbow. We select the point at the bend, where the slope is no longer steep.
- In the plot shown above, the curve falls very steeply for the clusters from ‘0’ to ‘2’, so those values cannot be taken. From ‘2’ the slope changes, and at ‘3’ the curve settles into a gentle decline. So the optimal number of clusters here is ‘3’, and we pass ‘3’ as the number of clusters. Now we fit the model on x and predict the cluster of each point.
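The steps in the list above can be sketched as follows. This is a minimal sketch, assuming scikit-learn and matplotlib are installed; the range of 1 to 10 clusters and the `n_init` and `random_state` settings are assumptions made here for reproducibility, not values given in this post.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load only the features (x); the species labels (y) are deliberately ignored
# because clustering is unsupervised.
x = load_iris().data

# Fit k-means for k = 1..10 (an assumed range) and record the WCSS each time.
# scikit-learn exposes the WCSS of a fitted model as `inertia_`.
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(x)
    wcss.append(model.inertia_)

# Plot WCSS against the number of clusters and look for the elbow.
plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```

Reading the elbow is a judgment call: the aim is the smallest number of clusters after which adding more gives only a small drop in WCSS.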
We predict y_means by passing x into the fit_predict function. It is always better to inspect the model by plotting it, so we plot the clusters using the plot function.
Calling plt.legend shows the label for each colour in the plot.
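A minimal sketch of this last step is shown below, again assuming scikit-learn and matplotlib. The colours, the choice to plot only the first two features (sepal length and sepal width), and the `n_init` and `random_state` settings are assumptions made for illustration; the original plot may have used different ones.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Features only, no target variable.
x = load_iris().data

# Fit k-means with 3 clusters and get the cluster index of every row of x.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y_means = kmeans.fit_predict(x)

# Plot each cluster in its own colour, using the first two features.
colours = ["red", "green", "blue"]
for cluster_id, colour in enumerate(colours):
    members = x[y_means == cluster_id]
    plt.scatter(members[:, 0], members[:, 1], c=colour,
                label=f"Cluster {cluster_id + 1}")

# Mark the centroid of each cluster.
centres = kmeans.cluster_centers_
plt.scatter(centres[:, 0], centres[:, 1], c="black", marker="x", s=100,
            label="Centroids")

plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.legend()  # shows which colour belongs to which label
plt.show()
```

The cluster numbers are arbitrary labels assigned by the algorithm, so they will not necessarily line up with the order of the Iris species.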
So, this is all about K-means Clustering.