Preparing Data and Measuring Dissimilarities
There are three concepts that are critical to performing a valid cluster analysis:
1. Data should be in the correct form, taking into consideration what each variable represents. The two most common data types are:
- Numerical
  - Continuous: quantities that can take any value within a range, such as time
  - Integer: counts, such as number of purchases or number of dependents
- Categorical
  - Ordinal: an ordinal variable implies some sort of ranking. For example, a customer satisfaction rating may be stated as high, medium, or low. It can be transformed into a numerical variable by setting high to 3, medium to 2, and low to 1.
  - Nominal: nominal variables, on the other hand, represent choices. These choices do not imply any particular order, so they cannot be transformed into a single numerical variable. The transformation requires binary variables.
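The two categorical transformations described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the records, the `region` variable, and the helper names `encode_ordinal` and `encode_nominal` are hypothetical, and creating one binary variable per category is an assumption about how the binary transformation is carried out.

```python
# Hypothetical records; variable names and values are illustrative only.
customers = [
    {"satisfaction": "high", "region": "north"},
    {"satisfaction": "low", "region": "south"},
    {"satisfaction": "medium", "region": "east"},
]

# Ordinal: the ranking is preserved by a single numerical variable.
SATISFACTION_RANK = {"low": 1, "medium": 2, "high": 3}


def encode_ordinal(value, ranking):
    """Map an ordinal label to its rank."""
    return ranking[value]


def encode_nominal(value, categories):
    """Return one binary (0/1) variable per category (an assumed scheme)."""
    return {f"is_{c}": int(value == c) for c in categories}


# Collect the distinct nominal categories in a stable order.
regions = sorted({c["region"] for c in customers})

encoded = []
for c in customers:
    row = {"satisfaction": encode_ordinal(c["satisfaction"], SATISFACTION_RANK)}
    row.update(encode_nominal(c["region"], regions))
    encoded.append(row)

for row in encoded:
    print(row)
```

Each encoded row has exactly one region indicator set to 1, so no artificial order is imposed on the regions, while the satisfaction rating keeps its ranking.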
Normalization takes care of differences in scale by transforming each original value to its standard value. The operation consists of subtracting the mean and dividing by the standard deviation.
The last two columns of the table show the normalized values. For instance, the normalized age of Ann is -0.4948. It is obtained by subtracting the average age of the group (42.20 years) from Ann's age (35 years) and dividing the result by the standard deviation (14.55 years): (35 - 42.20) / 14.55 = -0.4948. The normalized value means that Ann's age is 0.4948 standard deviations below the mean.
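As a quick check of the arithmetic, the calculation for Ann can be reproduced in Python. The first part uses only the figures quoted in the text. The second part is a sketch on hypothetical ages, and it assumes the sample standard deviation (`statistics.stdev`), since the text does not say whether the sample or population formula was used.

```python
import statistics


def standardize(value, mean, std_dev):
    """Subtract the mean and divide by the standard deviation."""
    return (value - mean) / std_dev


# Figures quoted in the text: Ann's age, group mean, group standard deviation.
ann_z = standardize(35, 42.20, 14.55)
print(round(ann_z, 4))  # -0.4948


def standardize_all(values):
    """Normalize a whole column; uses the sample standard deviation (assumption)."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)
    return [(v - mean) / std_dev for v in values]


# Hypothetical ages, not the ones from the table.
hypothetical_ages = [25, 35, 45, 55, 65]
z_values = standardize_all(hypothetical_ages)
print([round(z, 3) for z in z_values])
```

After normalization the column has mean 0 and standard deviation 1, whatever its original units were.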
Why should we normalize our data?
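- Normalized values allow us to identify the outliers in our dataset, because they show how many standard deviations each value lies from the mean.
- They eliminate the bias toward variables with relatively large original values, which would otherwise dominate the distance calculation.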
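Both reasons can be illustrated with a short sketch. All numbers here are hypothetical, the cutoff of 2 standard deviations is an arbitrary illustrative choice rather than a rule from the text, and the sample standard deviation is assumed.

```python
import statistics

# Hypothetical incomes (in thousands); one value is deliberately extreme.
incomes = [30, 32, 35, 31, 33, 34, 90]

mean = statistics.mean(incomes)
std_dev = statistics.stdev(incomes)  # sample standard deviation (assumption)
z_scores = [(x - mean) / std_dev for x in incomes]

# Flag values more than THRESHOLD standard deviations from the mean.
THRESHOLD = 2.0  # illustrative cutoff, not taken from the text
outliers = [x for x, z in zip(incomes, z_scores) if abs(z) > THRESHOLD]
print(outliers)  # [90]

# Scale bias: with raw values, income dominates the squared distance.
person_a = {"age": 30, "income": 50_000}  # hypothetical
person_b = {"age": 60, "income": 51_000}  # hypothetical
age_part = (person_a["age"] - person_b["age"]) ** 2
income_part = (person_a["income"] - person_b["income"]) ** 2
income_share = income_part / (age_part + income_part)
print(round(income_share, 4))  # 0.9991
```

In the second part, a 30-year age gap contributes less than 0.1% of the raw squared distance, even though it is arguably the larger difference between the two people.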
2. A proper metric should be established to be able to measure the distance between every pair of observations.
The Euclidean distance is the most commonly used measure of dissimilarity between two observations: the smaller the distance, the more similar the observations are. With two variables, it is equivalent to the straight-line distance between two points in a two-dimensional space.
Continuing the previous example, we compute the distance between each pair of persons in the dataset by using the normalized age and income values.
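The pairwise computation can be sketched as follows. The labels `P1` to `P3` and their normalized (age, income) values are hypothetical placeholders, not the values from the table, and `euclidean` is an illustrative helper name.

```python
import math
from itertools import combinations

# Hypothetical normalized (age, income) pairs; not the values from the table.
normalized = {
    "P1": (-0.5, -0.3),
    "P2": (1.2, 1.5),
    "P3": (-0.2, 0.1),
}


def euclidean(p, q):
    """Square root of the sum of squared differences across variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# One distance per unordered pair of observations.
distances = {
    (i, j): euclidean(normalized[i], normalized[j])
    for i, j in combinations(normalized, 2)
}

for (i, j), d in distances.items():
    print(f"{i}-{j}: {d:.4f}")
```

With these placeholder values, P1 is much closer to P3 than to P2, which is the kind of comparison the distance table supports.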
Further, we can create a scatter plot.
Observation: David is at least three times closer to Ann, and therefore more similar to her, than he is to Clara, since David is closer to Ann than to Clara in both age and income.
3. We must decide how the distance between clusters is going to be measured.
There are five distance measures between clusters:
- Single linkage: the minimum distance between objects that are not in the same cluster.
- Complete linkage: the maximum distance between objects that are not in the same cluster.
- Average linkage: the average of all pairwise distances between objects in one cluster and objects in the other.
- Average group linkage: the distance between the centre of one cluster and the centre of the other.
- Ward's method: a sum of squares criterion. The sum of squares refers to the squared distance from each observation to the centroid of the cluster to which it is assigned, added up over all observations. The two clusters whose merger increases this sum the least are treated as the closest.
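To make the five measures concrete, here is a minimal sketch that evaluates each of them on two small hypothetical clusters. The function names and the points are illustrative assumptions. For Ward's method the sketch reports the increase in the within-cluster sum of squares caused by merging the two clusters, which is one common way of stating the criterion. `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import product


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))


def sum_of_squares(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) ** 2 for p in cluster)


def single_linkage(a, b):
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    return max(math.dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def average_group_linkage(a, b):
    return math.dist(centroid(a), centroid(b))


def ward_increase(a, b):
    """Increase in the total sum of squares if a and b were merged."""
    return sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)


# Two hypothetical clusters of two-dimensional observations.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

print("single:       ", round(single_linkage(cluster_a, cluster_b), 4))
print("complete:     ", round(complete_linkage(cluster_a, cluster_b), 4))
print("average:      ", round(average_linkage(cluster_a, cluster_b), 4))
print("average group:", round(average_group_linkage(cluster_a, cluster_b), 4))
print("Ward increase:", round(ward_increase(cluster_a, cluster_b), 4))
```

The single, average, and complete values always satisfy single ≤ average ≤ complete, since they are the minimum, mean, and maximum of the same set of pairwise distances.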