Cluster Analysis or Clustering is used to form groups of similar records based on several dimensions (measurements). This technique is more popularly known in Business World as Segmentation, as one of the main motivation is to identify inherent segments in the business. This is one of the Unsupervised Machine Learning methods, as we do not have an Outcome or Response. We merely seek natural groupings of records based on multiple dimensions.
This technique has been applied in diverse areas, for very interesting applications, ranging from Marketing, Finance, Biology, Economics, Engineering etc. As an example, competitive strategy can be based on the groups of industry peers on multiple parameters like Profit, Revenue, Debt-Equity Ratio, Growth Rate, Employee Productivity and so on. The "Industry Leaders" would be high on most of them, while "Most Efficient" would be high on Profit, Productivity etc. Accordingly we can identify other segments too.
A widely used Segmentation technique in Travel, Hospitality and Retail Industry is called RFM Analysis. The customer segments are identified based on three dimensions simultaneously - Recency (R) of last transaction, Frequency (F) of transactions and average Monetary Value (M) of each transaction.
The key is the simultaneous use of multiple dimensions, which necessitates an Advanced technique. Please download the Example Sheet (link at the top of page), where we discuss Cluster Analysis for the data on 136 Countries.
If we have to find the groups of Countries on a single measurement (say Area), we could use a Bar Chart and arrange the countries in an ascending order. The picture on bottom right shows the grouping. Depending on the level of details needed, we can then assign them to 3, 4 or more groups by defining some criterion on area. The example shows identifications of three segments.
We can extend the idea to two measurements (say Area and GDP). Now a X-Y plot can serve the purpose. Once again, based on some criteria, we can identify segment of Countries (picture on below right). USA is in a class of its own, owing to its huge Area as well as GDP. The other segment is made of China, Russia and Canada, while rest belong to third segment.
However this idea has a limit and beyond a certain number of measurements, we may be limited by these Visual Plots and our judgements to identify the natural groupings. In that case, an automated algorithm is needed to group similar records (based on multiple dimensions).
As used in several other techniques, concept of Distance between two records is used to find similar records. The distance calculation uses several approaches, Euclidean Distance (used in KNN) being the most prominent. The following measures are used in practice.
Euclidean distance: most widely used, with normalization (like Z values)
D = 1 – (correlation coefficient)^2; more correlated measures have less distance
Manhattan distance: similar to Euclidean but uses absolute values rather than squared differences
There are two broad categories of techniques for Cluster Analysis.
Agglomerative - Starts with each record treated as a distinct cluster and then most similar records merged together in a hierarchical fashion, with only one cluster (all records) left at the final step.
Divisive - It works in a reverse fashion, with one cluster comprising of all records, and then it is divided into smaller clusters in a hierarchical fashion, till all records are put in separate clusters.
Agglomerative Approach is more popular, and used in practice.
These techniques are computationally intensive and not suitable for large data sets
Natural grouping is easily found through an inverted Tree like diagram, called "Dendrogram"
Tends to be unstable with reordering of data or addition of new records
It is sensitive to outliers
Makes a single pass, and even non optimal solution persists
It is also called K-Means Clustering. Unlike Hierarchical, a pre-defined number of cluster is input to the algorithm. Then records are inserted into appropriate groups (clusters) in step-by-step process.
K-means is computationally efficient and should be used for large data sets.
Choice of K needs some iterations, which could be based on user judgement or certain calculations provided by the algorithm. Usually k is chosen corresponding to highest ratio of between dispersion (between clusters) and total dispersion (between all records).
In brief these techniques use the following approaches.
Hierarchical: This is a purely data driven and does not need specification of number of clusters upfront. The natural groupings of records is clearly visible and user has the flexibility to see clustering at any level.
We start with n clusters corresponding to n records. Then we merge two of the closest records together forming n-1 cluster at the first step. We repeat this step merging newly formed clusters or records till we have only one cluster comprising all records.
But it can be too demanding in terms of computational time and resources and not advisable for large data sets. Also the results may change with addition or deletion of records. It needs number of computations of the order of N^2 where N is number of records.
Non-hierarchical: The records are inserted into k clusters and the within cluster dispersion is minimized. The dispersion is measured by distance of each records from the cluster centroid. This in effect becomes an Integer programming exercise. However in case of large no. of variables, it can be very time consuming and so a heuristic based method is used and this is called K-means cluster.
To begin with records are put arbitrarily in k clusters. Then the distance of each record from all cluster centroids are calculated. A record is assigned to different cluster than earlier assigned if distance is lower. Hence each step acts as a correction and doing this iteratively, we reach a stage where all records are correctly assigned to correct cluster. K-means needs number of computations of the order of N only.
Let us discuss Hierarchical Clustering in detail by means of the example in downloadable Example Sheet. For simplicity, we would focus on only 25 countries, however the idea can be easily extended to all countries or bigger data sets. We would use Area, Population, Roadways and GDP as four dimensions for Clustering. Please note how these dimensions have been Normalized as N_Area etc. Normalization is done to annul the effect of relative magnitude of these measurements. In absence of Normalization, Area would overwhelm the similarity calculations. Please follow Step 1 in "Hierarchical" worksheet. The minimum distance in the 25*25 Distance Matrix corresponds to (Bahamas, Belize) pair. Hence they are clubbed together as one cluster and would be treated as one records now on. We would use the Centroid (average of pair) as the measurements for this newly formed cluster. Step 2 does the same calculation, however with 24 records (as 2 records have been merged). In Step 2, Brunei is found to be most similar (least distance) with Cluster 1 and (Bahamas, Belize and Brunei) are merged to form a single Cluster at Step 2. In Step 3, a new Cluster is formed by merging Armenia and Albania. Please note that it is an entirely new cluster and the cluster formed till Step 2 is still changed. We have 22 records at this stage. In Step 4, Baharin is found to be most similar to the already formed Cluster 1 (at the end of Step 2), and merged with that. This process keeps on till we reach a stage where all records are treated as one cluster. The sequential process is shown (partially for 4 levels) in the following picture.
Let us discuss K-Means Clustering with same dataset from our Example Sheet. Once again we would consider only 25 countries. We would select k=5 and perform the steps (see the K-Mean Worksheet). In step 1, all records are arbitrarily assigned to one cluster. Hence each cluster has 5 countries to start with. In step 2, we calculate distance of each Country with all the clusters (Centroid is used for Cluster). If distance of a Country is lesser with any other Cluster than the Cluster it is assigned to, then it is corrected. Please see how, in step 2, majority of Countries are assigned to Cluster 5. This is expected as most of these countries are small and bulk of them go to one cluster. In step 2 correction, Cluster 2 does not get any Country, however we retain it with its existing Centroid values. Step 3 again sees some redistribution and 2 countries are reassigned to Cluster 2. Please observe how similar countries start coming to one Cluster like Australia, Brazil and Canada are three big countries with relatively high values on all dimensions and they form Cluster 3. This sequence of operation converge after some steps, and we may not see any changes. The converged step is taken as the final result. The whole process (partially), with a fictitious final step is shown in the picture below. Please complete few more steps to check for convergence, or run the same data in Self-Service Tool to check for final result.
Hierarchical Clustering needed calculation of N*N Distance Matrix. Although N keeps decreasing as more and more clusters are formed, still Distance calculations are needed for all levels, all the way up to formation of Single Cluster at top. This requires massive amount of computation for large data sets (large N). Hence this technique is not recommended for large data sets. In addition, no correction is done, as the computation progresses to higher levels. Hence the earlier comment, that Single Pass is made and non-optimal solution may exist.
In comparison, K-mean Clustering needs computation of the order of N*K, where K is a small number corresponding to predefined number of Clusters. Also numerous steps, unlike so many levels in Hierarchical, may not be needed as the result may converge quite fast. Each step being a correction step, results are expected to be much more stable, not so much impacted by outliers.
Our discussion was based on the centroid of each cluster, which we used to calculate distance between two clusters. We found the centroid by averaging each measurements, used that for distance calculations. However there are other approaches too for distance calculations, each with its own merits.
Minimum Distance: In this case, distance calculation is based on closest pair of records. When we formed cluster with Bahamas and Belize, then we would use values of either Bahamas or Belize, depending on which is nearer to a record (or record within other cluster), for distance calculation.
Maximum Distance: In this case, the values of farthest apart records are used for distance calculations.
Average Distance: It is more computationally tedious but applicable to all the scenarios. Here distance between all pair of records are calculated, and an average is taken.
Minimum and Maximum Distance calculations provide advantage in specialized cases like spread of epidemics, water flow, earthquake prone areas etc. In majority of cases Centroid Based or Average Distance Based distances are good enough.
Sometimes we may not have absolute measurements like distances between cities. These are pairwise similarity. Another example is some similarity measures between each pair of documents. In such cases, we can not use K-means clustering. A technique called Spectral Clustering, based on Graph Theory is used for such cases.