
Victor Alando

K-means Clustering Using the Elbow Method.

Introduction

Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled dataset. It can be described as "a way of grouping the data points into different clusters, each consisting of similar data points. Objects with possible similarities remain in one group and have few or no similarities with objects in other groups."

Let's understand the clustering technique with the real-world example of a shopping mall. When customers visit a mall, we can observe that items with similar usage are grouped together: t-shirts are in one section and trousers in another, and in the produce section apples, bananas, mangoes, etc. are grouped separately so that customers can easily find what they need. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.

Python Implementation of the K-means Clustering Algorithm

Prerequisites

  • What is the K-means clustering algorithm?
  • How does the k-means algorithm work?
  • How to find and choose the value of k (the number of clusters) in k-means clustering.
  • Data preprocessing.
  • Standardization and feature scaling.
  • Fitting the training and Data Transformation.
  • Training the K-means Algorithm on the Training Dataset.
  • Make Predictions.
  • Inspect the coordinates of the 5 centroids
  • Finding the Optimal (k) number of clusters using the Elbow Method.
  • Visualizing the Clusters
  • Summary Findings

What is the K-means Clustering Algorithm?

K-means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process: if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training labels.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats the process until it can no longer improve the clusters. The value of k must be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

  • Determines the best positions for the k center points, or centroids, by an iterative process.
  • Assigns each data point to its closest k-center. The data points near a particular k-center form a cluster.

Hence each cluster contains data points with some commonalities and is kept apart from the other clusters.
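As a minimal sketch of this idea, here is the algorithm applied with scikit-learn's `KMeans` (the 2-D points below are made up for illustration):

```python
# Minimal sketch: k-means on made-up 2-D points, with k fixed in advance.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)    # cluster index assigned to each point
centers = kmeans.cluster_centers_  # the two learned centroids
```

Each point receives the index of its nearest centroid, and the centroids themselves minimize the within-cluster sum of squared distances.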

Consider the below diagram that explains the working of the K-means Clustering Algorithm.

[Image: diagram of the K-means clustering workflow]

How does the k-means algorithm work?

The working of the k-means algorithm can be explained in the steps below:

  • Select the number K to decide the number of clusters.

  • Select K random points as centroids (they can be points from the input dataset).

  • Assign each data point to its closest centroid; this forms the predefined K clusters.

  • Calculate the variance and place a new centroid for each cluster.

  • Repeat the third step: reassign each data point to the new closest centroid of each cluster.

  • If any reassignment occurs, go to step 4; otherwise, finish.

  • The model is ready.
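The steps above can be sketched from scratch in plain NumPy (a simplified illustration, not a production implementation; the function and variable names are my own):

```python
# From-scratch sketch of the k-means loop described in the steps above.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop once no reassignment moves the centroids.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.0], [8.2, 7.9]])
labels, centroids = kmeans(X, k=2)
```

Note that this sketch ignores edge cases a real implementation handles, such as a cluster losing all of its points mid-iteration.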

Let's now understand the above steps with the help of visual plots. Suppose we have two variables, M1 and M2.

The x-y axis scatter plot of these two variables is given below:

[Image: scatter plot of M1 vs. M2]

  • Let's take the number k of clusters, i.e. K=2, to identify the dataset and put it into different clusters. This means we will try to group the dataset into two different clusters.

  • We need to choose some random K points or centroids to form the clusters. These points can be points from the dataset or any other points. So here we are selecting the two points below as the K points, which are not part of our dataset.

Consider the below visual plots:

[Image: scatter plot with the two random initial centroids]

  • Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics we have studied for calculating the distance between two points. Then we will draw a median line between the two centroids. See the visual plot below:

[Image: median line drawn between the two centroids]

From the above scatter visualization, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the orange centroid. Let's color them blue and orange for clearer visualization.

[Image: points colored blue and orange by nearest centroid]

As we need to find the closest clusters, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of each cluster's points and place the new centroids there, as shown in the image below.

[Image: new centroids placed at each cluster's center of gravity]
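The "center of gravity" here is simply the mean of the cluster's points along each axis. For example (with made-up values):

```python
import numpy as np

# Points currently assigned to one cluster (made-up values).
cluster_points = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.0]])

# The new centroid is the cluster's center of gravity: the per-axis mean.
new_centroid = cluster_points.mean(axis=0)
print(new_centroid)  # [2. 2.]
```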

Next, we will reassign each data point to the new centroid. For this, we will repeat the same process of finding a median line. The median line will look like the one below.

[Image: new median line between the updated centroids]

From the above image, we can see that one orange point is on the left side of the line, and two blue points are to the right of the line. So these three points will be assigned to new centroids.

[Image: the three points reassigned to their new centroids]

As reassignment has taken place, we again go to step 4: finding new centroids or K-points.

  • We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the image below:

[Image: updated centroids after recomputing the centers of gravity]

Having obtained the new centroids, we again draw the median line and reassign the data points, giving the following image:

[Image: new median line with the data points reassigned]

We can see in the above image that there are no dissimilar data points on either side of the line, which means our model has converged. Consider the image below.

[Image: converged clusters with no further reassignments]

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the image below:

[Image: the two final clusters]
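The walkthrough above assumed K=2 up front. The elbow method named in the title chooses K by computing the within-cluster sum of squares (scikit-learn's `inertia_`) for a range of K values and looking for the "elbow" where the curve stops dropping sharply. A minimal sketch, with made-up blob data for illustration:

```python
# Elbow method sketch: within-cluster sum of squares (inertia) for k = 1..6.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated blobs, so the elbow should appear around k = 3.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always shrinks as k grows; the "elbow" is where the drop
# slows sharply. Here the big drops are from k=1 to k=3, then it flattens.
```

Plotting `inertias` against `range(1, 7)` gives the familiar elbow curve, with the bend at the natural number of clusters.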
