Data Reduction and Unsupervised Learning
Dataset: a table where the variables, which are also called features or attributes, are in the columns and the observations are in the rows. This means that all the data values are in the body of the table.
Dimension Reduction (process of reducing the number of variables):
Why do we need to reduce the number of variables?
- There is often redundancy among the variables in a dataset; several variables may measure the same underlying construct.
- Because of this redundancy, it is possible to reduce the number of dimensions without losing critical information.
Example: HR department of a company creates an instrument to measure job satisfaction
Aim of study: HR manager wants to predict an employee’s intention to quit
Questions asked to be rated (one means that they strongly disagree with the statement and seven means that they strongly agree):
- My supervisor treats me with consideration.
- My supervisor consults me concerning important decisions that affect my work.
- My supervisor gives me recognition when I do a good job.
- My supervisor gives me the support I need to do my job well.
- My pay is fair.
- My pay is appropriate, given the amount of responsibility that comes with my job.
- My pay is comparable to the pay earned by other employees whose jobs are similar to mine.
- Items one to four are measuring a single construct that could be labelled “satisfaction with supervision”.
- Items five to seven are measuring a different construct that could be labelled “satisfaction with pay”.
In this example, a principal component analysis (PCA) would identify two components and transform the original seven item scores into two scores, one for each component.
The employee with ID 102274 seems to be more satisfied with supervision than with pay.
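As a minimal sketch of this transformation, the snippet below reduces seven hypothetical item scores to two component scores with scikit-learn's PCA. The response values are made up for illustration, and in practice the two components need not line up exactly with the two labelled constructs.

    # Sketch: reducing seven survey items to two component scores.
    # The response data is hypothetical; a real analysis would load
    # the actual survey table.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Rows = employees, columns = the seven items rated 1-7
    responses = np.array([
        [6, 7, 6, 7, 2, 3, 2],   # happy with supervisor, unhappy with pay
        [2, 1, 2, 2, 6, 7, 6],   # the opposite pattern
        [7, 6, 7, 6, 6, 6, 7],   # satisfied with both
        [1, 2, 1, 1, 2, 1, 2],   # satisfied with neither
    ])

    # Standardize so every item contributes on the same scale
    scaled = StandardScaler().fit_transform(responses)

    # Keep two components, one per underlying construct
    pca = PCA(n_components=2)
    scores = pca.fit_transform(scaled)

    print(scores)                          # one pair of scores per employee
    print(pca.explained_variance_ratio_)   # variance captured by each component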
Data Reduction:
Clustering falls under data reduction.
It can take a large number of observations and reduce them into a small number of identifiable groups. Each of these groups can be interpreted more easily and is represented by a centroid.
The above scatter plot shows four clusters for the scores in the job satisfaction survey.
The stars represent the centroid of each cluster and can be used to describe all the observations in the group.
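As a minimal sketch of this idea, the snippet below groups hypothetical pairs of component scores into four clusters with scikit-learn's KMeans, which exposes the centroids directly. The scores and the choice of four clusters are illustrative, mirroring the scatter plot described above.

    # Sketch: grouping employees by their two component scores.
    import numpy as np
    from sklearn.cluster import KMeans

    scores = np.array([
        [ 1.5,  1.2], [ 1.3,  1.4],   # satisfied with both
        [-1.4,  1.1], [-1.2,  1.3],   # satisfied with pay only
        [ 1.2, -1.3], [ 1.4, -1.1],   # satisfied with supervision only
        [-1.3, -1.2], [-1.5, -1.4],   # satisfied with neither
    ])

    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scores)

    print(kmeans.labels_)            # group assignment for each employee
    print(kmeans.cluster_centers_)   # the centroids (the "stars" in the plot)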
Unsupervised Learning:
In classification, the objective is to find a set of rules that can be applied to a new observation in order to assign this new observation to a group. The methods for classification develop rules by discovering patterns in historical data.
The critical feature of this historical data is that the classification of the observations is known, and it is used to learn how to classify future observations. Because this piece of information is available, the process is known as supervised learning.
For example, in the table below, we can see ten of the answers to the job satisfaction survey, along with whether or not the employee quit the company. A prediction model built on this data falls into the category of supervised learning, because the outcome the model is trying to predict is known in the historical data.
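As a minimal sketch, assuming the seven item scores are the features and the known quit/stay outcome is the label, the snippet below fits a simple classifier. The data and the choice of a decision tree are illustrative, not the model used in the example.

    # Sketch: learning to predict quitting from labelled historical data.
    # A small hypothetical sample stands in for the ten rows of the table.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.array([              # seven item scores per employee
        [6, 7, 6, 7, 2, 3, 2],
        [2, 1, 2, 2, 6, 7, 6],
        [7, 6, 7, 6, 6, 6, 7],
        [1, 2, 1, 1, 2, 1, 2],
    ])
    y = np.array([1, 1, 0, 1])  # known outcome: 1 = quit, 0 = stayed

    model = DecisionTreeClassifier(random_state=0).fit(X, y)

    # The learned rules can now classify a new, unlabelled employee
    new_employee = np.array([[5, 5, 6, 6, 3, 2, 3]])
    print(model.predict(new_employee))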
In unsupervised learning, the observations in the historical data are not labelled. Thus, we don't know whether an observation belongs to one group or another, and we don't know how many different groups there are. Discovering the number of groups is therefore one of the main outcomes of the analysis.
For example, Information Resources Incorporated conducted a cluster analysis of survey data to establish that the market of natural and organic products consisted of seven distinct segments, a number that was not known prior to the completion of the analysis.
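A common way to discover the number of groups is to fit the clustering for several candidate values of k and compare a quality measure. The sketch below uses simulated data and the silhouette score; picking the k with the highest score is one heuristic among several.

    # Sketch: choosing the number of clusters when it is not known in advance.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    # Simulated survey scores with some hidden group structure
    data = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
                      for c in ([0, 0], [3, 0], [0, 3])])

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        print(k, round(silhouette_score(data, labels), 3))
    # The k with the highest silhouette score is a reasonable estimate
    # of the number of distinct groups in the data.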
Cluster analysis can also be applied to historical data that is already labelled, with the purpose of finding new labels.
For example, in one study, cluster analysis was used to categorize mutual funds based on their financial characteristics instead of their investment objectives. The historical data for the study consisted of 904 different funds that fund managers had classified into seven categories according to their investment objectives. However, a cluster analysis determined that there were only three distinct fund categories. The reduction in the number of categories offers significant benefits to investors seeking to diversify their portfolios: the study determined that the consolidated categories were more informative about performance and risk than the original seven categories created by the fund managers.
In terms of the data used, the analysts initially considered 28 financial variables related to risk and return. However, after applying principal component analysis, they found that 16 of the 28 variables were able to explain 98% of the variation in the dataset. Therefore, they used only those 16 variables for the clustering, which, as already mentioned, resulted in three fund categories. This example shows that dimensionality reduction and data reduction complement each other; it is common practice to apply dimensionality reduction techniques such as PCA before clustering.
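As a sketch of that pipeline, the snippet below applies PCA and then k-means. The fund data is simulated, and retaining components by a 98% explained-variance cutoff is an assumption standing in for the study's selection of 16 variables.

    # Sketch: PCA before k-means, echoing the mutual fund study.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    funds = rng.normal(size=(904, 28))   # 904 funds x 28 variables (simulated)

    scaled = StandardScaler().fit_transform(funds)

    # Keep enough components to explain 98% of the variation
    # (with purely random data most components are retained;
    # real financial data has far more structure)
    pca = PCA(n_components=0.98)
    reduced = pca.fit_transform(scaled)
    print(reduced.shape)                  # components actually retained

    # Cluster in the reduced space; k=3 matches the study's finding
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
    print(np.bincount(labels))            # fund count per category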