Hierarchical and k-Means Clustering
Hierarchical Clustering
Hierarchical clustering starts by treating each observation as a separate cluster. Then it repeatedly executes the following two steps:
- Identify the two clusters that are closest together
- Merge the two most similar clusters
Any of the five distance measures discussed earlier can be used to calculate the distance between a pair of clusters in hierarchical clustering.
Hierarchical clustering can be represented by a 2-D diagram called a dendrogram, which has the form of an inverted tree.
Horizontal axis: observations
Vertical axis: distance
The lengths of the branches (the vertical lines) indicate the distance between the two clusters being merged. The diagram is read from bottom to top by sweeping a horizontal line across the entire tree.
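The merging steps and the dendrogram can be sketched with SciPy, assuming that library (plus NumPy and matplotlib) is available; the toy data, the "average" linkage choice, and the cut height of 3 are illustrative assumptions, not part of the notes above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Two loose groups of 2-D points standing in for real observations.
X = np.vstack([rng.normal(0, 1, size=(10, 2)),
               rng.normal(5, 1, size=(10, 2))])

# Each observation starts as its own cluster; linkage() repeatedly merges
# the two closest clusters. The method argument selects the between-cluster
# distance measure (e.g. "single", "complete", "average", "centroid", "ward").
Z = linkage(X, method="average")

# Dendrogram: observations on the horizontal axis, merge distance on the
# vertical axis; branch lengths show how far apart the merged clusters were.
dendrogram(Z)
plt.show()

# Sweeping a horizontal line at a chosen height cuts the tree into flat clusters.
labels = fcluster(Z, t=3, criterion="distance")
print(labels)
```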
K-means Clustering
The overall idea is to assign objects to the nearest cluster, where distance is measured from the object to the centroid of the cluster. The value of k indicates the number of clusters.
K-means clustering is also scalable, meaning that it can deal with very large datasets.
Step 1: Select ‘k’ observations as the centroids of the initial clusters. This selection is somewhat arbitrary; the procedure works better if the initial centroids are as far apart as possible.
Step 2: The rest of the observations are assigned to the closest centroid.
Step 3: Once the assignment is completed, the cluster averages are recalculated. The centroid positions can change as a result of the recalculation, leading to a reassignment of the observations.
Step 4: Steps 2 and 3 are repeated until the centroids no longer change (a sketch of these steps appears below).
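A minimal from-scratch sketch of Steps 1 through 4 using NumPy; the function name, the toy data, and the purely random initialization are illustrative assumptions, not a production implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k observations as the initial centroids (here purely at
    # random; spreading them far apart usually works better).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every observation to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each cluster average; an empty cluster keeps
        # its old centroid (a simplification for this sketch).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(6, 1, size=(20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```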
Pros:
Simple procedure
Cons:
Final clusters depend on the initial choices. Hence, cluster analysis software gives you the option of running the procedure multiple times from randomly chosen seeds (see the example below).
Both hierarchical clustering and K-means are procedures that find approximate solutions to the problem of maximizing the similarity of the objects within each cluster. The maximization must respect the constraints that there are k clusters and that every object belongs to one cluster and only one cluster.
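One common way to write this objective (the notation here is an assumption, not taken from the notes above) is as minimizing the within-cluster sum of squared distances, which amounts to maximizing within-cluster similarity:

$$
\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i,
$$

subject to the clusters $C_1,\dots,C_k$ partitioning the data, so that every object belongs to exactly one of the k clusters.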
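As an example of running the procedure from multiple seeds, scikit-learn's KMeans (assuming that library is available) exposes this through its n_init parameter; the data here are an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(30, 2)),
               rng.normal(6, 1, size=(30, 2))])

# n_init=10 reruns the whole procedure with 10 different centroid
# initializations and keeps the run with the lowest inertia, i.e. the
# within-cluster sum of squares written above.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)      # objective value of the best of the 10 runs
print(km.labels_[:5])   # cluster assignment of the first few observations
```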