Excel Analysis of Motion Picture Industry Data
Problem: Predict how popular different movies are going to be
Movie data compiled for 2001-2005 à
Methods to examine categorical data:
Frequency tables and cumulative distribution
Pie charts
^^ Distribution of films across studios
Bar/Column charts
Caveats about bar and pie charts:
The charts are appropriate only if observations fall into a single category (for example, a particular film has to be produced by a single production house. If two production houses are collaborating, not possible to include it here)
Pie charts should add to 100%
Visual representations focus on a single categorical variable (for example, a single pie chart can only show the data for studio and NOT studio + ratings combined); can be generalized to analyse combinations
Note: The above 3 are ways to summarize a single categorical variable. What if we want to see the relationships that exist among two or more variables? à Crosstabs
Contingency tables (cross tabs)
e.g. Studio and genre / Studio and rating
The above depicts the raw count of the data.
The above data can also be formatted to show what fraction each cell is of the total:
Side-by-side and segmented bar charts
Discussed in further section
Movie data compiled for 2001-2005 à
- Blue boxes à Categorical variables
- Red box à Quantitative variables
Methods to examine categorical data:
Frequency tables and cumulative distribution
Pie charts
^^ Distribution of films across studios
Bar/Column charts
Caveats about bar and pie charts:
The charts are appropriate only if observations fall into a single category (for example, a particular film has to be produced by a single production house. If two production houses are collaborating, not possible to include it here)
Pie charts should add to 100%
Visual representations focus on a single categorical variable (for example, a single pie chart can only show the data for studio and NOT studio + ratings combined); can be generalized to analyse combinations
Note: The above 3 are ways to summarize a single categorical variable. What if we want to see the relationships that exist among two or more variables? à Crosstabs
Contingency tables (cross tabs)
e.g. Studio and genre / Studio and rating
The above depicts the raw count of the data.
- There are 351 movies released by the 4 studios in the data
- Of the 351, 23 were rated G (marginal distribution)
- 20 G-rated movies were released by Buena Vista (individual cell)
The above data can also be formatted to show what fraction each cell is of the total:
Side-by-side and segmented bar charts
Discussed in further section