Introduction:
Clustering is one of the most important techniques of unsupervised machine learning.
But before we look at what clustering is, let’s start with the basics.
A cluster is a set of data objects that are similar to one another within the same group but distinct from the objects in other clusters. In short, the data objects in one cluster are more similar to each other than to the data objects in another cluster.
Fig 1- image of 4 clusters on a 3D scale
source: http://onwunalu.com/data/data-clustering/
Clustering is the process of assigning objects to homogeneous groups known as clusters while ensuring that objects in different groups are dissimilar. It is an unsupervised activity because it seeks to uncover the hidden structure of the data without relying on labelled examples.
A set of characteristics known as features describes each object. The first stage in dividing objects into clusters is to determine the distance between them. The choice of an appropriate distance metric is critical to the effectiveness of the clustering process.
Four commonly used metrics for measuring the similarity between data points are:
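- Euclidean distance: It is the straight-line distance between two points, i.e. the length of the shortest path between them.
The formula for Euclidean Distance, for two points x and y with n dimensions, is:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$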
- Manhattan distance: It is the sum of absolute differences between points across all the dimensions.
source: 9 Distance Measures in Data Science | Towards Data Science
The formula for Manhattan Distance, for two points x and y with n dimensions, is:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
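- Minkowski distance: It is the generalised form of Euclidean and Manhattan Distance, controlled by an order parameter p (p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance).
source: 9 Distance Measures in Data Science | Towards Data Science
The formula for Minkowski Distance, for two points x and y with n dimensions, is:
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$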
- Hamming distance: It compares the similarity of two strings of equal length. The Hamming Distance between two equal-length strings is the number of positions where the corresponding characters differ.
source: 9 Distance Measures in Data Science | Towards Data Science
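To make the four metrics concrete, here is a minimal sketch in plain Python. The function names (`euclidean`, `manhattan`, `minkowski`, `hamming`) and the sample inputs are illustrative choices, not part of any particular library.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences across all dimensions."""
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    """Generalised distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))


# Illustrative inputs (assumed for this example only).
p1, p2 = (0, 0), (3, 4)
print(euclidean(p1, p2))              # 5.0
print(manhattan(p1, p2))              # 7
print(minkowski(p1, p2, 2))           # 5.0, same as Euclidean
print(hamming("karolin", "kathrin"))  # 3
```

Writing Minkowski as its own function shows why it is called the generalised form: changing only `p` reproduces the other two numeric metrics.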
Now that we know the different metrics, let’s dive into the different types of clustering.
Different types of Clustering Algorithms:
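Hard clustering and Soft clustering are the two primary categories of clustering algorithms: hard clustering assigns each data point to exactly one cluster, while soft clustering gives each point a degree of membership in several clusters. Beyond this distinction, clustering techniques are commonly grouped by approach. These are: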
- Partitioning Clustering: It is a kind of clustering in which data is divided into non-hierarchical groupings. It is often referred to as the centroid-based technique. The K-Means Clustering technique is the most prominent example of partitioning clustering.
The dataset is separated into K groups, where K is the pre-defined number of clusters. The cluster centres are chosen in such a way that the distance between the data points of a cluster and that cluster's own centroid is as small as possible, so each point ends up closer to its own centroid than to the centroid of any other cluster.
- Density-based Clustering: The density-based clustering technique connects high-density areas into clusters, so clusters of arbitrary shape can form as long as the dense regions are connected. Sparser regions of the data space separate the dense areas from one another.
If the dataset has varying densities or a large number of dimensions, these algorithms may struggle to cluster the data points.
- Distribution Model-based Clustering: This method divides data based on the probability that a data point belongs to a specific distribution. The grouping is accomplished by assuming some distributions, most commonly the Gaussian Distribution. The Expectation-Maximisation Clustering algorithm is an example of this type.
source: https://upload.wikimedia.org/wikipedia/commons/d/d8/EM-Gaussian-data.svg
- Hierarchical Clustering: It can be used instead of partitioning clustering because there is no need to specify the number of clusters in advance. In this technique the dataset is organised into a tree-like structure known as a dendrogram. By cutting the tree at the appropriate level, any desired number of clusters can be obtained.
source: 1*VvOVxdBb74IOxxF2RmthCQ.png (740×405) (medium.com)
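As a concrete illustration of the partitioning approach described above, here is a minimal K-Means sketch in plain Python. It is only one simple way to write the algorithm. The function name `k_means`, the helpers, the parameters `max_iter` and `seed`, and the toy data are all assumptions made for this example. It picks the initial centroids at random from the data and keeps a centroid unchanged if its cluster becomes empty.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance (the square root is not needed to compare)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Group points into k clusters; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points drawn from the data.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Toy data (assumed for illustration): two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

On well-separated data like this, the two groups should be recovered regardless of the starting centroids. On real data the result can depend on the initialisation, which is why practical implementations usually run several restarts.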
Applications:
- Recommender Systems
- Market and Customer Segmentation
- Social Network Analysis
- Search Result Clustering
- Biomedical Engineering: detection of cancer cells, etc.