Problem:

​ Predict how popular different movies are going to beMovie data compiled for 2001-2005 à Blue boxes à Categorical variablesRed box à Quantitative variablesThe variable type will determine the kind of analysis applied to that variable Methods to examine categorical data:Frequency tables and cumulative distribution Pie charts ^^ Distribution of films across studiosBar/Column charts Caveats about bar and pie charts:The charts are appropriate only if observations fall into a single category (for example, a particular film has to be produced by a single production house. If two production houses are collaborating, not possible to include it here)Pie charts should add to 100%Visual representations focus on a single categorical variable (for example, a single pie chart can only show the data for studio and NOT studio + ratings combined); can be generalized to analyse combinations Note: The above 3 are ways to summarize a single categorical variable. What if we want to see the relationships that exist among two or more variables? à Crosstabs Contingency tables (cross tabs)e.g. Studio and genre / Studio and rating The above depicts the raw count of the data.There are 351 movies released by the 4 studios in the dataOf the 351, 23 were rated G (marginal distribution)20 G-rated movies were released by Buena Vista (individual cell) The above data can also be formatted to show what fraction each cell is of the total: Side-by-side and segmented bar chartsDiscussed in further section