What is Cluster Analysis and Hierarchical Clustering | Realcode4you
Cluster Analysis
What is it and why is it important?
Cluster analysis is a set of exploratory techniques (i.e. unsupervised) that searches multivariate data for a structure of “natural” groupings.
Grouping can be achieved through various measures of distances, which are then used to define similarities (or dissimilarities) among the identified groups.
These groupings can provide informal assessments of dimensionality, highlight multivariate outliers and suggest possible hypotheses concerning associations among the observations.
In cluster analysis, grouping of samples or individuals is typically the focus; however, grouping of variables can also be of value.
The number of groups (clusters) is arbitrary and there is no "standard" approach to defining them.
In most practical applications, the choice relies on the researcher's expertise in the subject at hand and the purpose of the investigation.
We can also assess the groupings visually with dendrograms and heatmaps.
There are also measures that can be used to assess the overall cluster structure.
The choice of closeness (distance) measure will also have an impact on the resulting clusters.
Distance Matrix
Public utility data of 22 U.S. companies (taken from Johnson and Wichern, 1992)

Euclidean distance matrix on the standardised data

Visualisation of distance matrix through shading of classes.
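As a quick illustration, the sketch below standardises a data matrix and builds a Euclidean distance matrix with SciPy. The 22 x 8 random matrix is only a stand-in for the utilities data, which is not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for the utilities data: 22 companies (rows) x 8 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(22, 8))

# Standardise each variable to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise Euclidean distances, expanded into the full 22 x 22 distance matrix.
D = squareform(pdist(X_std, metric="euclidean"))
print(D.shape)                 # (22, 22)
print(np.round(D[:3, :3], 2))  # a corner of the distance matrix
```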
Hierarchical Clustering
Hierarchical clustering (HC) aims to create a cluster hierarchy, which is commonly displayed in a tree diagram called a dendrogram.
A dendrogram illustrates where clustering occurred at each step of the process.
HC can be performed on both the individuals (samples) and variables.
There are two main types of HC:
- Agglomerative HC (also known as Agglomerative Nesting; AGNES)
- Divisive HC (also known as Divisive Analysis; DIANA)

Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering (AHC) takes a bottom-up approach in grouping the objects (individuals or variables).
AHC Procedure:
1) Each object begins as its own cluster, i.e. N objects = N clusters.
2) Create the distance/similarity matrix among the N clusters.
3) The most similar objects are grouped or merged according to the distance/similarity matrix to form a new cluster.
4) Create a new distance/similarity matrix based on the new cluster, and the remaining clusters.
5) Repeat steps 3) and 4) until all clusters are fused into one.
AHC makes clustering decisions based on local patterns without accounting for global distribution.
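To make the procedure concrete, here is a deliberately naive Python sketch of steps 1)-5) using single linkage (minimum distance) between clusters. In practice you would call a library routine such as scipy.cluster.hierarchy.linkage rather than this slow explicit loop; the data here are random and only illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def agnes_single_linkage(X):
    """Naive illustration of AHC steps 1)-5) with single linkage."""
    D = squareform(pdist(X))                 # step 2: distance matrix among the objects
    clusters = [[i] for i in range(len(X))]  # step 1: every object is its own cluster
    merges = []
    while len(clusters) > 1:
        # step 3: find the pair of clusters with the smallest single-linkage distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # step 4: merge that pair into a new cluster; distances for the next pass are
        # re-derived from D, so no explicit new matrix needs to be stored here
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        merges.append((d, merged))
    return merges  # step 5: the sequence of fusions, i.e. the dendrogram structure

X = np.random.default_rng(1).normal(size=(6, 2))
for height, members in agnes_single_linkage(X):
    print(round(height, 3), members)
```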
Divisive Hierarchical Clustering
Divisive hierarchical clustering (DHC) works in the opposite direction: it takes a top-down approach and divides the objects (individuals or variables) into successively smaller subgroups or clusters.
DHC Procedure:
1) Start with a single cluster consisting of all N objects.
2) The single cluster is divided into two smaller clusters in such a way that the two resulting clusters are as dissimilar as possible according to some distance/similarity measure.
3) Each of these two smaller clusters is then further divided into dissimilar groups.
4) The process repeats itself until each object is its own cluster.
DHC makes clustering decisions based on global information.
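There is no DIANA routine in SciPy or scikit-learn, so the sketch below hand-codes a single divisive split using the usual "splinter group" heuristic: seed the new cluster with the object that is on average most dissimilar to the rest, then move over objects that are on balance closer to the splinter group. The function name diana_split and the toy data are illustrative assumptions; a full DIANA run would apply this split recursively, typically to the cluster with the largest diameter at each stage.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diana_split(D, members):
    """One divisive split of the objects in `members`, given a full dissimilarity matrix D."""
    members = list(members)
    # Seed the splinter group with the object that has the largest average dissimilarity.
    avg = [np.mean([D[i, j] for j in members if j != i]) for i in members]
    seed = members[int(np.argmax(avg))]
    splinter, rest = [seed], [i for i in members if i != seed]

    moved = True
    while moved and len(rest) > 1:
        moved = False
        # For each remaining object, compare its average dissimilarity to the two groups.
        gains = [np.mean([D[i, j] for j in rest if j != i]) - np.mean([D[i, j] for j in splinter])
                 for i in rest]
        k = int(np.argmax(gains))
        if gains[k] > 0:              # closer (on average) to the splinter group: move it
            splinter.append(rest.pop(k))
            moved = True
    return splinter, rest

X = np.random.default_rng(2).normal(size=(8, 2))
D = squareform(pdist(X))
print(diana_split(D, range(len(X))))
```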

Linkage Methods
The next question is: How do we measure similarity/dissimilarity between clusters of observations?
This is important in the process of merging or dividing clusters.

There are a variety of linkage methods that we can use to do just this.
The most commonly used linkage methods are:
1) Single Linkage (nearest neighbour or minimum distance)
- Can handle non-elliptical clusters well if the gap between clusters is large
- Sensitive to noise and outliers

2) Complete Linkage (furthest neighbour or maximum distance)
- Less sensitive to noise and outliers
- Tends to break up larger clusters
- Biased toward globular clusters

3) Average Linkage (average distance)
- A compromise between Single and Complete Linkage
- Less sensitive to noise and outliers
- Biased toward globular clusters

4) Centroid Linkage (distance between cluster centroids)

5) Ward's Method
- Similar to Average Linkage
- Based on the sum of squared distances instead
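The sketch below compares these linkage methods on standardised toy data using scipy.cluster.hierarchy; the random 22 x 8 matrix again stands in for the utilities data, and cutting the tree into four clusters is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

# Random stand-in for the standardised utilities data.
X = np.random.default_rng(3).normal(size=(22, 8))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d = pdist(X_std)  # condensed Euclidean distance vector

# 'single', 'complete', 'average', 'centroid' and 'ward' correspond to the methods above.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(d, method=method)
    labels = fcluster(Z, t=4, criterion="maxclust")  # cut the tree into (up to) 4 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes under each linkage

# Dendrogram for one of the methods.
dendrogram(linkage(d, method="average"))
plt.show()
```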
Public Utility Data Revisited

Evaluation of Structures
There are two measures that can be used to evaluate clustering structures.
1) Agglomerative Coefficient (for AHC)
- The average of 1 - m(i) over all observations, where m(i) is the dissimilarity at which observation i first merges with another cluster, divided by the dissimilarity of the merger in the final step of the algorithm.
- Measures the amount of clustering structure found, with values closer to one suggesting a stronger clustering structure.
- The divisive coefficient (for DHC) works in a similar way.
2) Cophenetic Correlation Coefficient
- The correlation between the input dissimilarity matrix and the cophenetic dissimilarity matrix derived from the dendrogram.
- Measures how well the dendrogram depicts the original data structure.
- A value of 0.75 or above is generally considered good.
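Both measures can be obtained from a SciPy linkage matrix, as sketched below. The cophenetic correlation comes straight from scipy.cluster.hierarchy.cophenet; the agglomerative coefficient is not provided by SciPy (R's cluster::agnes reports it), so it is computed by hand here from the definition above, on illustrative random data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.default_rng(4).normal(size=(22, 8))
d = pdist((X - X.mean(axis=0)) / X.std(axis=0))
Z = linkage(d, method="average")

# Cophenetic correlation: agreement between input distances and dendrogram distances.
coph_corr, _ = cophenet(Z, d)
print("cophenetic correlation:", round(coph_corr, 3))

# Agglomerative coefficient: mean of 1 - m(i), where m(i) is the height of the first
# merger involving observation i divided by the height of the final merger.
n = Z.shape[0] + 1
final_height = Z[-1, 2]
m = np.empty(n)
for i in range(n):
    first_row = next(r for r in range(Z.shape[0]) if i in Z[r, :2])
    m[i] = Z[first_row, 2] / final_height
print("agglomerative coefficient:", round(float(np.mean(1 - m)), 3))
```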
Selection of Final Clusters
In most exploratory applications, the number of clusters k is unknown.
In many cases, an experienced researcher in the area is able to make an informed choice regarding the number of clusters.
However, in the age of automation, it is necessary to have a method that will allow us to make a choice.
Three common methods for determining the optimal number of clusters k are:
- Elbow method
- Average silhouette method
- Gap statistic
Elbow Method
This method seeks to minimise the within-cluster sum of squares (WSS), i.e., within-cluster variability.
For each candidate k (k = 1, 2, …), the WSS is calculated and then plotted.
The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.
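A sketch of the elbow method for a hierarchical (Ward) clustering is given below; the wss helper and the toy data are illustrative assumptions, and the same idea works with any clustering routine that returns labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(5).normal(size=(22, 8))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
Z = linkage(pdist(X_std), method="ward")

def wss(data, labels):
    """Total within-cluster sum of squares for a given labelling."""
    return sum(((data[labels == c] - data[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

ks = range(1, 11)
scores = [wss(X_std, fcluster(Z, t=k, criterion="maxclust")) for k in ks]

plt.plot(list(ks), scores, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()  # look for the bend (knee) in this curve
```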

Average Silhouette Method
This method measures the quality of the clustering by determining how well each object lies within its cluster.
A high silhouette width indicates good clustering.
The average silhouette method computes the average silhouette of observations for different values of k.
The optimal number of clusters is the one that maximises the average silhouette width.
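Below is a short sketch using sklearn.metrics.silhouette_score on cuts of a Ward dendrogram; again the random data only stands in for a real data set.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

X = np.random.default_rng(6).normal(size=(22, 8))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
Z = linkage(pdist(X_std), method="ward")

# Average silhouette width for k = 2..9; the largest value suggests the number of clusters.
for k in range(2, 10):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, round(silhouette_score(X_std, labels), 3))
```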

Gap Statistic Method
The gap statistic (GS) method compares the total within-cluster variation for different values of k with its expected value under a null reference distribution, i.e. data with no obvious clustering, such as points drawn uniformly over the range of the observed data.
The optimal value of k is the one that maximises the gap statistic.
A more robust approach is to choose the smallest value of k such that the gap statistic is within one standard error of the gap at k+1.
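A simplified gap-statistic sketch is given below, broadly following the original recipe of Tibshirani, Walther and Hastie with uniform reference sets and k-means as the clustering routine; the function gap_statistic and all parameter values are illustrative assumptions rather than a reference implementation. The last lines apply the one-standard-error rule described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=8, n_refs=20, seed=0):
    """Gap statistic via uniform reference sets (simplified, for illustration only)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s_k = [], []
    for k in range(1, k_max + 1):
        log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
        # Expected log(W_k) under a uniform null reference distribution.
        ref_logs = []
        for _ in range(n_refs):
            ref = rng.uniform(lo, hi, size=X.shape)
            ref_logs.append(np.log(KMeans(n_clusters=k, n_init=10,
                                          random_state=seed).fit(ref).inertia_))
        gaps.append(np.mean(ref_logs) - log_wk)
        s_k.append(np.std(ref_logs) * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(s_k)

X = np.random.default_rng(7).normal(size=(60, 2))
gaps, s_k = gap_statistic(X)
# One-standard-error rule: smallest k with gap(k) >= gap(k+1) - s(k+1).
ks = np.arange(1, len(gaps) + 1)
candidates = [k for k in ks[:-1] if gaps[k - 1] >= gaps[k] - s_k[k]]
print("gap values:", np.round(gaps, 3))
print("chosen k:", candidates[0] if candidates else ks[-1])
```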
