IMDb Movie Rating Predictions

Full EDA & Multiple Linear Regression

Introduction

This is a Multiple Linear Regression Analysis with the R programming language that attempts to create a linear regression model to predict IMDb Ratings for movies using data from Rotten Tomatoes and The Academy for Motion Picture Arts & Sciences.

Load packages

library(statsr)
library(tidyverse)
library(GGally)
library(ggcorrplot)

Load data

We are using a dataset from 2016 with 651 rows. Its codebook can be viewed here.

movies <- read_csv("moviedata.csv")

Research Question

Can we predict IMDb Movie Ratings using Rotten Tomatoes Ratings, movie genre, runtime, MPAA rating, theater debut year, Oscar awards/nominations, whether the film appear in BoxOfficeMojo’s Top 200 list?


Exploratory Data Analysis

We begin by verifying the information from our codebook by viewing the structure of the data then separating the variables by type.

str(movies)
## spc_tbl_ [651 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : chr [1:651] "Feature Film" "Feature Film" "Feature Film" "Feature Film" ...
##  $ genre           : chr [1:651] "Drama" "Drama" "Comedy" "Drama" ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : chr [1:651] "R" "PG-13" "R" "PG" ...
##  $ studio          : chr [1:651] "Indomina Media Inc." "Warner Bros. Pictures" "Sony Pictures Classics" "Columbia Pictures" ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : num [1:651] 899 12285 22381 35096 2386 ...
##  $ critics_rating  : chr [1:651] "Rotten" "Certified Fresh" "Certified Fresh" "Certified Fresh" ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : chr [1:651] "Upright" "Upright" "Upright" "Upright" ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : chr [1:651] "no" "no" "no" "no" ...
##  $ best_pic_win    : chr [1:651] "no" "no" "no" "no" ...
##  $ best_actor_win  : chr [1:651] "no" "no" "no" "yes" ...
##  $ best_actress_win: chr [1:651] "no" "no" "no" "no" ...
##  $ best_dir_win    : chr [1:651] "no" "no" "no" "yes" ...
##  $ top200_box      : chr [1:651] "no" "no" "no" "no" ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   title = col_character(),
##   ..   title_type = col_character(),
##   ..   genre = col_character(),
##   ..   runtime = col_double(),
##   ..   mpaa_rating = col_character(),
##   ..   studio = col_character(),
##   ..   thtr_rel_year = col_double(),
##   ..   thtr_rel_month = col_double(),
##   ..   thtr_rel_day = col_double(),
##   ..   dvd_rel_year = col_double(),
##   ..   dvd_rel_month = col_double(),
##   ..   dvd_rel_day = col_double(),
##   ..   imdb_rating = col_double(),
##   ..   imdb_num_votes = col_double(),
##   ..   critics_rating = col_character(),
##   ..   critics_score = col_double(),
##   ..   audience_rating = col_character(),
##   ..   audience_score = col_double(),
##   ..   best_pic_nom = col_character(),
##   ..   best_pic_win = col_character(),
##   ..   best_actor_win = col_character(),
##   ..   best_actress_win = col_character(),
##   ..   best_dir_win = col_character(),
##   ..   top200_box = col_character(),
##   ..   director = col_character(),
##   ..   actor1 = col_character(),
##   ..   actor2 = col_character(),
##   ..   actor3 = col_character(),
##   ..   actor4 = col_character(),
##   ..   actor5 = col_character(),
##   ..   imdb_url = col_character(),
##   ..   rt_url = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Character (no categories): title director actor1 actor2 actor3 actor4 actor5 imdb_url rt_url

Categorical (factor): title_type (nominal) genre (nominal) mpaa_rating (nominal) studio (nominal) thtr_rel_year (nominal) thtr_rel_month (nominal) thtr_rel_day (nominal) dvd_rel_year (nominal) dvd_rel_month (nominal) dvd_rel_day (nominal) critics_rating (ordinal) audience_rating (ordinal)

Binary: best_pic_nom best_pic_win best_actor_win best_actress best_dir_win top200_box

Numeric: runtime imdb_rating imdb_num_votes critics_score audience_score


There are a number of changes we need to make to the data types to not only match the codebook, but to make the data more applicable for the model to be created.

Character (no categories) Variables

We start by addressing the character variables that represent no categories. We know that title, imdb_url, and rt_url should be unique to every row so we can get rid of them first.

Next, if we want any chance at including director or actor into our model, we need to see how many unique values there are first.

# Select all actor columns, unlist df to vector, count distinct actors
movies %>% select(actor1, actor2, actor3, actor4, actor5) %>% unlist() %>% n_distinct()
## [1] 2363

With 2,363 different actors in the data, we aren’t even going to attempt to classify actors, and choose to exclude them.

# Number of distinct directors
n_distinct(movies$director)
## [1] 533

And with 533 different directors in the data we are going to exclude this variable.

For the initial changes, we create a new data frame to leave the original dataset unchanged.

# New df without actors, directors, & URLs
movies_df <- select(movies, -actor1, -actor2, -actor3, -actor4, -actor5, -director, -imdb_url, -rt_url)

Categorical Variables

Nominal

The first variable of this type to examine is studio to see if the number of unique studios will succumb it to the same fate of our Character Variables above, or if we can reduce the number of factor dimensions to something more appropriate for the model.

n_distinct(movies_df$studio)
## [1] 212

With 212 unique studios in the data we have the choice of leaving studio as a string that we can not include in the model, or we can attempt to bin the studios together to reduce the dimensionality that will be added to the model from incorporating it as a nominal factor.

Let’s look at the top 30.

# Top 30 studios
movies_df %>% count(studio) %>% arrange(desc(n)) %>% .[1:30,]
## # A tibble: 30 × 2
##    studio                                       n
##    <chr>                                    <int>
##  1 Paramount Pictures                          37
##  2 Warner Bros. Pictures                       30
##  3 Sony Pictures Home Entertainment            27
##  4 Universal Pictures                          23
##  5 Warner Home Video                           19
##  6 20th Century Fox                            18
##  7 Miramax Films                               18
##  8 MGM                                         16
##  9 Twentieth Century Fox Home Entertainment    14
## 10 IFC Films                                   13
## # ℹ 20 more rows

First we’ll try to identify subsidiaries to all of the studio parent organizations.

# Find string pattern in studio then show distinct
movies_df %>% 
  filter(grepl("paramount", studio, ignore.case = TRUE)) %>% distinct(studio) 
## # A tibble: 6 × 1
##   studio              
##   <chr>               
## 1 Paramount Home Video
## 2 Paramount Pictures  
## 3 Paramount Classics  
## 4 Paramount           
## 5 Paramount Vantage   
## 6 Paramount Studios
movies_df %>% 
  filter(grepl("universal", studio, ignore.case = TRUE)) %>% distinct(studio) 
## # A tibble: 4 × 1
##   studio                  
##   <chr>                   
## 1 Universal Pictures      
## 2 MCA Universal Home Video
## 3 Universal Studios       
## 4 Universal
movies_df %>% 
  filter(grepl("warner", studio, ignore.case = TRUE)) %>% distinct(studio) 
## # A tibble: 8 × 1
##   studio                     
##   <chr>                      
## 1 Warner Bros. Pictures      
## 2 Warner Home Video          
## 3 WARNER BROTHERS PICTURES   
## 4 Warner Bros.               
## 5 Warner Bros Pictures       
## 6 Warner Independent Pictures
## 7 Warner Independent         
## 8 Warners Bros. Pictures
movies_df %>% 
  filter(grepl("sony", studio, ignore.case = TRUE)) %>% distinct(studio) 
## # A tibble: 7 × 1
##   studio                          
##   <chr>                           
## 1 Sony Pictures Classics          
## 2 Sony Pictures Home Entertainment
## 3 Sony Pictures Entertainment     
## 4 Sony Pictures                   
## 5 Sony Pictures/Columbia          
## 6 Sony Pictures/Screen Gems       
## 7 Sony Entertainment
movies_df %>% 
  filter(grepl("fox", studio, ignore.case = TRUE)) %>% distinct(studio) 
## # A tibble: 8 × 1
##   studio                                  
##   <chr>                                   
## 1 20th Century Fox                        
## 2 20th Century Fox Film Corporat          
## 3 Twentieth Century Fox Home Entertainment
## 4 Fox Searchlight Pictures                
## 5 Fox Atomic                              
## 6 20th Century Fox Film Corporation       
## 7 Fox Searchlight                         
## 8 Fox
movies_df %>% 
  filter(grepl("mgm", studio, ignore.case = TRUE)) %>% distinct(studio) 
## # A tibble: 4 × 1
##   studio                
##   <chr>                 
## 1 MGM/United Artists    
## 2 MGM                   
## 3 MGM Home Entertainment
## 4 MGM/UA
movies_df %>% 
  filter(grepl("miramax", studio, ignore.case = TRUE)) %>% distinct(studio) 
## # A tibble: 2 × 1
##   studio       
##   <chr>        
## 1 Miramax Films
## 2 Miramax
movies_df %>% 
  filter(grepl("disney|buena vista", studio, ignore.case = TRUE)) %>% distinct(studio) 
## # A tibble: 8 × 1
##   studio                        
##   <chr>                         
## 1 Buena Vista Distribution Compa
## 2 Walt Disney Home Entertainment
## 3 Walt Disney Pictures          
## 4 Buena Vista Pictures          
## 5 Buena Vista Internationa      
## 6 Walt Disney Productions       
## 7 Disney                        
## 8 Buena Vista

Now we will group these together with their respective parent company.

# Find patterns for studio values and replace with parent studio
movies_df <- movies_df %>% 
  mutate(studio = case_when(grepl("Paramount", studio, ignore.case = TRUE) ~ "Paramount",
                            grepl("Universal", studio, ignore.case = TRUE) ~ "Universal",
                            grepl("Fox", studio, ignore.case = TRUE) ~ "Fox",
                            grepl("Warner", studio, ignore.case = TRUE) ~ "Warner",
                            grepl("Sony", studio, ignore.case = TRUE) ~ "Sony",
                            grepl("MGM", studio, ignore.case = TRUE) ~ "MGM",
                            grepl("Miramax", studio, ignore.case = TRUE) ~ "Miramax",
                            grepl("Disney|Buena Vista", studio, ignore.case = TRUE) ~ "Disney",
                            TRUE ~ studio)
  )

movies_df %>% count(studio) %>% arrange(desc(n)) %>% .[1:30,]
## # A tibble: 30 × 2
##    studio              n
##    <chr>           <int>
##  1 Warner             71
##  2 Paramount          60
##  3 Sony               55
##  4 Fox                51
##  5 Universal          39
##  6 MGM                27
##  7 Disney             25
##  8 Miramax            25
##  9 IFC Films          13
## 10 New Line Cinema    10
## # ℹ 20 more rows

Now, with these studios binned together with their parent organizations, we can address the 8 NA values for studio.

# NA studio values
movies_df %>% filter(is.na(studio))
## # A tibble: 8 × 24
##   title title_type genre runtime mpaa_rating studio thtr_rel_year thtr_rel_month
##   <chr> <chr>      <chr>   <dbl> <chr>       <chr>          <dbl>          <dbl>
## 1 Dirt… Documenta… Come…      94 R           <NA>            2007              9
## 2 Cave… Feature F… Acti…      91 PG          <NA>            1981              4
## 3 Atta… TV Movie   Other      90 R           <NA>            1993             12
## 4 Oliv… Feature F… Anim…      74 G           <NA>            1988             11
## 5 The … Feature F… Drama      97 R           <NA>            2001             10
## 6 Inse… Feature F… Drama     117 NC-17       <NA>            1975              1
## 7 Inbr… Feature F… Art …      90 R           <NA>            2011              8
## 8 Deat… Feature F… Horr…      87 R           <NA>            1972              1
## # ℹ 16 more variables: thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <dbl>, critics_rating <chr>, critics_score <dbl>,
## #   audience_rating <chr>, audience_score <dbl>, best_pic_nom <chr>,
## #   best_pic_win <chr>, best_actor_win <chr>, best_actress_win <chr>,
## #   best_dir_win <chr>, top200_box <chr>

With a quick internet search for the 8 titles, we’ve replaced the NA values.

# Insert studio values found on web search
movies_df <- mutate(movies_df, 
                    studio = case_when(title == "Dirty Sanchez: The Movie" ~ "MTV",
                                       title == "Caveman" ~ "Turman-Foster Company",
                                       title == "Attack of the 50 Foot Woman" ~ "Woolner Brothers",
                                       title == "Oliver & Company" ~ "Disney",
                                       title == "The Man Who Sued God" ~ "Disney",
                                       title == "Inserts" ~ "United Artists",
                                       title == "Inbred" ~ "New Flesh Films",
                                       title == "Death Line (Raw Meat)" ~ "American International Pictures",
                                       TRUE ~ studio
                                       )
                    )

One last look at the top 20.

# Top 20 studios
movies_df %>% count(studio) %>% arrange(desc(n)) %>% .[1:20,]
## # A tibble: 20 × 2
##    studio                          n
##    <chr>                       <int>
##  1 Warner                         71
##  2 Paramount                      60
##  3 Sony                           55
##  4 Fox                            51
##  5 Universal                      39
##  6 Disney                         27
##  7 MGM                            27
##  8 Miramax                        25
##  9 IFC Films                      13
## 10 New Line Cinema                10
## 11 Magnolia Pictures               9
## 12 HBO Video                       8
## 13 Columbia Pictures               7
## 14 Lions Gate Films                7
## 15 New Line Home Entertainment     7
## 16 The Weinstein Company           7
## 17 United Artists                  7
## 18 First Run Features              6
## 19 Orion Home Video                5
## 20 Focus Features                  4

We will put the top 8 into a vector named major_studios and then change all other studio values to “Other”.

# Put top 8 studios into a character vector
major_studios <- movies_df %>% count(studio) %>% arrange(desc(n)) %>% .[1:8,1] %>% pull()

# Change studio value to "Other" if not in the top 8 studios
movies_df <- mutate(movies_df, studio = if_else(grepl(paste0(major_studios, collapse="|"), 
                                                      studio, ignore.case = TRUE), studio, "Other"))

We can consider the studio variable ready to change to a nominal factor with 9 levels into the model.

movies_df <- movies_df %>% mutate(studio = as.factor(studio))

The next categorical variables we want to assess for categorical data typing is thtr_rel_year and dvd_rel_year. This is a question of whether there is a linear relationship between thtr_rel_year/dvd_rel_year and imdb_rating (the response variable). If we find a relationship over many years that imdb_rating increases/decreases linearly, then we would change our data type to numeric. On the other hand if we find that not to be the case (which we suspect it won’t), then these yearly variables could be considered for nominal categories by each year, or levels with multiple year spans.

First, we check for any NA values present in these 6 time variables. To get more clarity on the rows with NA values, we add title and studio from the original data.

movies %>% select(title, studio, thtr_rel_year, thtr_rel_month, thtr_rel_day, 
                     dvd_rel_year, dvd_rel_month, dvd_rel_day) %>% 
  filter(rowSums(is.na(select(., -studio))) > 0)
## # A tibble: 8 × 8
##   title            studio thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
##   <chr>            <chr>          <dbl>          <dbl>        <dbl>        <dbl>
## 1 Charlie: The Li… Warne…          2004              2           13           NA
## 2 Streets of Gold  Live …          1986             11           14           NA
## 3 The Squeeze      HBO V…          1987              7           10           NA
## 4 Electric Dreams  MGM             1984              7           20           NA
## 5 Porky's Revenge  20th …          1985              3           22           NA
## 6 Teen Wolf Too    Param…          1987             11           20           NA
## 7 The Last Remake… MCA U…          1977              7           15           NA
## 8 Let It Be        Unite…          1970              5           20           NA
## # ℹ 2 more variables: dvd_rel_month <dbl>, dvd_rel_day <dbl>

These 8 titles appear to be TV productions or distributed by Home Video studios. We are going to use our discretion to copy the time values from the thtr_rel_ variables to the dvd_rel_ variables for these 8 specific titles.

# The titles with DVD release times to be updated
titles <- c("Charlie: The Life and Art of Charles Chaplin", 
            "Streets of Gold",
            "The Squeeze",
            "Electric Dreams",
            "Porky's Revenge",
            "Teen Wolf Too",
            "The Last Remake of Beau Geste",
            "Let It Be")

# Replace dvd_rel values with thtr_rel values for titles in titles vector
movies_df <- movies_df %>%
  mutate(dvd_rel_year = if_else(title %in% titles, thtr_rel_year, dvd_rel_year),
         dvd_rel_month = if_else(title %in% titles, thtr_rel_month, dvd_rel_month),
         dvd_rel_day = if_else(title %in% titles, thtr_rel_day, dvd_rel_day))

Verify that there are no more NA values in these columns.

# Select 8 time columns and filter for any rows with an NA
movies_df %>% select(thtr_rel_year, thtr_rel_month, thtr_rel_day, 
                     dvd_rel_year, dvd_rel_month, dvd_rel_day) %>% 
  filter(rowSums(is.na(.)) > 0)
## # A tibble: 0 × 6
## # ℹ 6 variables: thtr_rel_year <dbl>, thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## #   dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>

Assessing linear associations for time variables.

movies_df %>% ggplot(aes(x = thtr_rel_year, y = imdb_rating)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "red") +
  theme_bw()

movies_df %>% ggplot(aes(x = dvd_rel_year, y = imdb_rating)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "red") +
  theme_bw()

We see no linear relationship from year values and the response variable, therefore if we wish to leave these variables in the model, they must be typed to nominal factors. Note that ordinal factors should be typed to numeric variables, not factors. We also see higher variations in lower scores indicating that our response variable is skewed to the left.

We will now perform the same test for months and days.

imdb_rating ~ thtr_rel_month

movies_df %>% ggplot(aes(x = thtr_rel_month, y = imdb_rating)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "red") +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  theme_bw()

imdb_rating ~ dvd_rel_month

movies_df %>% ggplot(aes(x = dvd_rel_month, y = imdb_rating)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "red") +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  theme_bw()

imdb_rating ~ thtr_rel_day

movies_df %>% ggplot(aes(x = thtr_rel_day, y = imdb_rating)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "red") +
  scale_x_continuous(breaks = 1:31) +
  theme_bw()

imdb_rating ~ dvd_rel_day

movies_df %>% ggplot(aes(x = dvd_rel_day, y = imdb_rating)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "red") +
  scale_x_continuous(breaks = 1:31) +
  theme_bw()

We conclude that all 6 of these time variables must be nominal factors and not numeric variables, if we wish to include them in the model.

We wish to convert the yearly variables into decade categories to reduce dimensionality.

movies_df <- movies_df %>%
  mutate(thtr_rel_decade = paste0((thtr_rel_year %/% 10) * 10, "'s"),
         .before = thtr_rel_year) %>%
  mutate(dvd_rel_decade = paste0((dvd_rel_year %/% 10) * 10, "'s"),
         .before = dvd_rel_year) %>% 
  select(-thtr_rel_year, -dvd_rel_year)

Now we’ll convert all 6 to factors.

# Mutate variables with as.factor
movies_df <- movies_df %>% mutate(thtr_rel_decade = as.factor(thtr_rel_decade),
                                  thtr_rel_month = as.factor(thtr_rel_month),
                                  thtr_rel_day = as.factor(thtr_rel_day),
                                  dvd_rel_decade = as.factor(dvd_rel_decade),
                                  dvd_rel_month = as.factor(dvd_rel_month),
                                  dvd_rel_day = as.factor(dvd_rel_day)
                                  )

The next group of variables: title_type, genre, and mpaa_rating we are going to type as nominal categories by using a summary table.

This custom function, unique_with_counts() will return a summary table that shows us the counts for each distinct value, including NA values.

unique_with_counts <- function(data) {
  # Getting the unique values and counts for each column
  list_columns <- lapply(data, function(col) {
    tbl <- table(col, useNA = "ifany")
    tibble(value = as.vector(names(tbl)), count = as.vector(tbl))
  })
  
  # Getting the maximum number of unique values (including NA) in any column
  max_length <- max(sapply(list_columns, nrow))
  
  # Making all tibbles the same length (with NA for missing)
  list_columns <- lapply(list_columns, function(tbl) {
    add_rows <- max_length - nrow(tbl)
    if (add_rows > 0) {
      tbl <- rbind(tbl, tibble(value = rep(NA, add_rows), count = rep(NA, add_rows)))
    }
    arrange(tbl, desc(tbl[,2]))
  })
  
  # Binding all the tibbles column-wise
  output <- do.call(cbind, list_columns)
  
  # Creating new column names
  new_col_names <- c(rbind(paste0(names(data), "_value"), paste0(names(data), "_count")))
  colnames(output) <- as.vector(new_col_names)
  
  # Returning the final data frame
  return(output)
}
movies_df %>% select(title_type, genre, mpaa_rating) %>% unique_with_counts()
##    title_type_value title_type_count               genre_value genre_count
## 1      Feature Film              591                     Drama         305
## 2       Documentary               55                    Comedy          87
## 3          TV Movie                5        Action & Adventure          65
## 4              <NA>               NA        Mystery & Suspense          59
## 5              <NA>               NA               Documentary          52
## 6              <NA>               NA                    Horror          23
## 7              <NA>               NA                     Other          16
## 8              <NA>               NA Art House & International          14
## 9              <NA>               NA Musical & Performing Arts          12
## 10             <NA>               NA                 Animation           9
## 11             <NA>               NA Science Fiction & Fantasy           9
##    mpaa_rating_value mpaa_rating_count
## 1                  R               329
## 2              PG-13               133
## 3                 PG               118
## 4            Unrated                50
## 5                  G                19
## 6              NC-17                 2
## 7               <NA>                NA
## 8               <NA>                NA
## 9               <NA>                NA
## 10              <NA>                NA
## 11              <NA>                NA

We can see there are no NA values present here. title_type looks good, genre doesn’t appear to have any similar or repeating levels, and mpaa_rating has 2 “NC-17” observations which we will change to “R”.

# Change "NC-17" values to "R"
movies_df <- movies_df %>% mutate(mpaa_rating = if_else(mpaa_rating == "NC-17", "R", mpaa_rating))


# Mutate variables with as.factor
movies_df <- movies_df %>% mutate(title_type = as.factor(title_type),
                                  genre = as.factor(genre),
                                  mpaa_rating = as.factor(mpaa_rating)
                                  )

Ordinal Categories

Our final 2 categorical variables represent ordinal categories.

Let’s view their summary.

movies_df %>% select(critics_rating, audience_rating) %>% unique_with_counts()
##   critics_rating_value critics_rating_count audience_rating_value
## 1               Rotten                  307               Upright
## 2                Fresh                  209               Spilled
## 3      Certified Fresh                  135                  <NA>
##   audience_rating_count
## 1                   376
## 2                   275
## 3                    NA

These look ready to convert with ordinary/integer encoding.

Let’s look at how critics_rating is derived.

Icon Score Description
Certified Fresh 100-75% Certified Fresh: Wide-release films with a score of 75% or higher that are reviewed by at least 80 critics, of whom 5 are “Top Critics”, are given this seal. The “Certified Fresh” seal remains until the score drops below 70%. Films with limited releases require only 40 reviews (including 5 from “Top Critics”) to qualify for this seal. For TV shows, only individual seasons are eligible for consideration, and each must have at least 20 critic reviews.
Fresh 100-60% Fresh: Films or TV shows with a score of 60% or higher that do not meet the requirements for the “Certified Fresh” seal.
Rotten 59-0% Rotten: Films or TV shows with a score of 59% or lower receive this seal.

Since the 3 levels of critics_rating are not equidistant and are weighted by types of critics it would not be appropriate to convert these to numerical values with ordinary/integer encoding. Instead we want to leave them as a ordinal categorical factors in order for them each to get their own reference level in the model. Since I was not able to find information on how audience_rating was derived (other than it being calculated from audience_score) it too will be converted to an ordinal factor.

movies_df <- movies_df %>% 
  mutate(critics_rating = factor(critics_rating, levels = c("Rotten", "Fresh", "Certified Fresh")),
         audience_rating = as.factor(audience_rating)
         )

Binary Variables

We use our summary table to show the 6 “yes/no” value variables.

select(movies_df, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box) %>%
  unique_with_counts()
##   best_pic_nom_value best_pic_nom_count best_pic_win_value best_pic_win_count
## 1                 no                629                 no                644
## 2                yes                 22                yes                  7
##   best_actor_win_value best_actor_win_count best_actress_win_value
## 1                   no                  558                     no
## 2                  yes                   93                    yes
##   best_actress_win_count best_dir_win_value best_dir_win_count top200_box_value
## 1                    579                 no                608               no
## 2                     72                yes                 43              yes
##   top200_box_count
## 1              636
## 2               15

We wish to convert from categories to numeric values (0,1) with ordinary/integer encoding.

movies_df <- movies_df %>% 
  mutate(best_pic_nom = if_else(best_pic_nom == "yes", 1, 0),
         best_pic_win = if_else(best_pic_win == "yes", 1, 0),
         best_actor_win = if_else(best_actor_win == "yes", 1, 0),
         best_actress_win = if_else(best_actress_win == "yes", 1, 0),
         best_dir_win = if_else(best_dir_win == "yes", 1, 0),
         top200_box = if_else(top200_box == "yes", 1, 0)
         )

Numeric Variables

Since our summary table is not practical for numeric variables, we will filter for NA values manually.

# Filter any rows with an NA numeric column
movies_df %>% select(title, runtime, imdb_rating, imdb_num_votes, critics_score, audience_score) %>% 
  filter(rowSums(is.na(.)) > 0)
## # A tibble: 1 × 6
##   title          runtime imdb_rating imdb_num_votes critics_score audience_score
##   <chr>            <dbl>       <dbl>          <dbl>         <dbl>          <dbl>
## 1 The End of Am…      NA         7.5            739            80             72

The runtime of 74 minutes for the movie, “The End of America”, can be found here.

movies_df <- movies_df %>% mutate(runtime = if_else(is.na(runtime), 74, runtime))

We will view the distributoin of each numeric variable with a historgram.

runtime imdb_rating imdb_num_votes critics_score audience_score

ggplot(movies_df, aes(x = runtime)) +
  geom_histogram(binwidth = 1, 
                 color = "red", alpha = 0.7) +
  labs(title = paste("Histogram of runtime")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) 

Let’s look closer at the outliers on both sides.

# runtime outliers in original data to retrieve rt_url again
movies %>% select(title, runtime, rt_url) %>% filter(runtime < 60 | runtime > 180)
## # A tibble: 5 × 3
##   title                                              runtime rt_url             
##   <chr>                                                <dbl> <chr>              
## 1 Africa: The Serengeti                                   39 //www.rottentomato…
## 2 Hotel Terminus: The Life and Times of Klaus Barbie     267 //www.rottentomato…
## 3 The Godfather, Part II                                 202 //www.rottentomato…
## 4 Sea Monsters: A Prehistoric Adventure                   40 //www.rottentomato…
## 5 Titanic                                                194 //www.rottentomato…

All of these runtime values are accurate.

ggplot(movies_df, aes(x = imdb_rating)) +
  geom_histogram(binwidth = 0.1, fill = "red", color = "black", alpha = 0.7, 
                 bins=length(unique(movies_df$imdb_rating))) +
  labs(title = "Histogram of imdb_rating") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

Let’s verify the outliers again.

# imdb_rating outliers in original data to retrieve imdb_url again
movies %>% filter(imdb_rating < 4 | imdb_rating > 8.8) %>% select(title, imdb_rating, imdb_url) %>% 
  arrange(imdb_rating)
## # A tibble: 21 × 3
##    title                                   imdb_rating imdb_url                 
##    <chr>                                         <dbl> <chr>                    
##  1 Disaster Movie                                  1.9 http://www.imdb.com/titl…
##  2 Epic Movie                                      2.3 http://www.imdb.com/titl…
##  3 Battlefield Earth                               2.4 http://www.imdb.com/titl…
##  4 Viva Knievel!                                   2.7 http://www.imdb.com/titl…
##  5 Doogal                                          2.8 http://www.imdb.com/titl…
##  6 Teen Wolf Too                                   3.1 http://www.imdb.com/titl…
##  7 Jack and Jill                                   3.4 http://www.imdb.com/titl…
##  8 Carnosaur                                       3.4 http://www.imdb.com/titl…
##  9 Dumb and Dumberer: When Harry Met Lloyd         3.4 http://www.imdb.com/titl…
## 10 Galaxina                                        3.4 http://www.imdb.com/titl…
## # ℹ 11 more rows

While not exact, due to the fluidity of this variable in real time and when this data was collected, but they are all in very comparable ranges.

We wish to remove the ratings lower than 4.0 from the data to lower the skew.

movies_df <- movies_df %>% filter(imdb_rating > 4)
ggplot(movies_df, aes(x = imdb_num_votes)) +
  geom_histogram(binwidth = 100, fill = "red", color = "red", alpha = 0.7) +
  labs(title = "Histogram of imdb_num_votes") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

This amount of skew requires a closer look.

ggplot(movies_df %>% filter(imdb_num_votes < 50000), aes(x = imdb_num_votes)) +
  geom_histogram(binwidth = 100, fill = "red", alpha = 0.7) +
  labs(title = "Histogram of imdb_num_votes") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

movies_df %>% filter(imdb_num_votes > 500000)
## # A tibble: 10 × 24
##    title             title_type genre runtime mpaa_rating studio thtr_rel_decade
##    <chr>             <fct>      <fct>   <dbl> <fct>       <fct>  <fct>          
##  1 Django Unchained  Feature F… Other     165 R           Other  2010's         
##  2 The Hangover      Feature F… Come…     100 R           Warner 2000's         
##  3 V for Vendetta    Feature F… Drama     132 R           Warner 2000's         
##  4 The Godfather, P… Feature F… Myst…     202 R           Param… 1970's         
##  5 Memento           Feature F… Myst…     113 R           Other  2000's         
##  6 Good Will Hunting Feature F… Drama     126 R           Miram… 1990's         
##  7 The Prestige      Feature F… Drama     130 PG-13       Disney 2000's         
##  8 No Country for O… Feature F… Drama     122 R           Miram… 2000's         
##  9 Titanic           Feature F… Drama     194 PG-13       Param… 1990's         
## 10 The Hunger Games  Feature F… Drama     142 PG-13       Other  2010's         
## # ℹ 17 more variables: thtr_rel_month <fct>, thtr_rel_day <fct>,
## #   dvd_rel_decade <fct>, dvd_rel_month <fct>, dvd_rel_day <fct>,
## #   imdb_rating <dbl>, imdb_num_votes <dbl>, critics_rating <fct>,
## #   critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## #   best_pic_nom <dbl>, best_pic_win <dbl>, best_actor_win <dbl>,
## #   best_actress_win <dbl>, best_dir_win <dbl>, top200_box <dbl>

We can verify this skew by these popular films and apply a few log transformations to imdb_num_votes to reduce the skew.

ggplot(movies_df, aes(x = log(imdb_num_votes))) +
  geom_histogram(binwidth = .1, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Histogram of imdb_num_votes") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(movies_df, aes(x = log2(imdb_num_votes))) +
  geom_histogram(binwidth = .2, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Histogram of imdb_num_votes") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(movies_df, aes(x = log10(imdb_num_votes))) +
  geom_histogram(binwidth = .05, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Histogram of imdb_num_votes") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

The base log transformation appears to depict the most normal distribution.

movies_df$imdb_num_votes <- log(movies_df$imdb_num_votes)
ggplot(movies_df, aes(x = critics_score)) +
  geom_histogram(binwidth = 1, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Histogram of critics_score") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

critics_score is not skewed.

ggplot(movies_df, aes(x = audience_score)) +
  geom_histogram(binwidth = 1, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Histogram of audience_score") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

The skew in audience_score is bearable for our purposes.

Final Variable Assessment

Are there any rows with an NA value left?

movies_df %>% filter(rowSums(is.na(.)) > 0)
## # A tibble: 0 × 24
## # ℹ 24 variables: title <chr>, title_type <fct>, genre <fct>, runtime <dbl>,
## #   mpaa_rating <fct>, studio <fct>, thtr_rel_decade <fct>,
## #   thtr_rel_month <fct>, thtr_rel_day <fct>, dvd_rel_decade <fct>,
## #   dvd_rel_month <fct>, dvd_rel_day <fct>, imdb_rating <dbl>,
## #   imdb_num_votes <dbl>, critics_rating <fct>, critics_score <dbl>,
## #   audience_rating <fct>, audience_score <dbl>, best_pic_nom <dbl>,
## #   best_pic_win <dbl>, best_actor_win <dbl>, best_actress_win <dbl>, …

Are there any duplicated rows?

movies_df[duplicated(movies_df), ]
## # A tibble: 1 × 24
##   title       title_type  genre       runtime mpaa_rating studio thtr_rel_decade
##   <chr>       <fct>       <fct>         <dbl> <fct>       <fct>  <fct>          
## 1 Man on Wire Documentary Documentary      94 PG-13       Other  2000's         
## # ℹ 17 more variables: thtr_rel_month <fct>, thtr_rel_day <fct>,
## #   dvd_rel_decade <fct>, dvd_rel_month <fct>, dvd_rel_day <fct>,
## #   imdb_rating <dbl>, imdb_num_votes <dbl>, critics_rating <fct>,
## #   critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## #   best_pic_nom <dbl>, best_pic_win <dbl>, best_actor_win <dbl>,
## #   best_actress_win <dbl>, best_dir_win <dbl>, top200_box <dbl>

Drop the 1 duplicated row.

movies_df <- movies_df %>% distinct()

Final review of all variable types for the model.

str(movies_df)
## tibble [630 × 24] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:630] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:630] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 5 levels "G","PG","PG-13",..: 4 3 4 2 4 5 3 4 5 5 ...
##  $ studio          : Factor w/ 9 levels "Disney","Fox",..: 5 9 7 5 5 5 6 3 5 5 ...
##  $ thtr_rel_decade : Factor w/ 5 levels "1970's","1980's",..: 5 4 3 3 4 4 2 3 5 5 ...
##  $ thtr_rel_month  : Factor w/ 12 levels "1","2","3","4",..: 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : Factor w/ 31 levels "1","2","3","4",..: 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_decade  : Factor w/ 5 levels "1970's","1980's",..: 5 4 4 4 4 5 4 4 5 5 ...
##  $ dvd_rel_month   : Factor w/ 12 levels "1","2","3","4",..: 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : Factor w/ 31 levels "1","2","3","4",..: 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:630] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : num [1:630] 6.8 9.42 10.02 10.47 7.78 ...
##  $ critics_rating  : Factor w/ 3 levels "Rotten","Fresh",..: 1 3 3 3 1 2 1 1 2 3 ...
##  $ critics_score   : num [1:630] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:630] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : num [1:630] 0 0 0 0 0 0 0 0 0 0 ...
##  $ best_pic_win    : num [1:630] 0 0 0 0 0 0 0 0 0 0 ...
##  $ best_actor_win  : num [1:630] 0 0 0 1 0 0 0 1 0 0 ...
##  $ best_actress_win: num [1:630] 0 0 0 0 0 0 0 0 0 0 ...
##  $ best_dir_win    : num [1:630] 0 0 0 1 0 0 0 0 0 0 ...
##  $ top200_box      : num [1:630] 0 0 0 0 0 0 0 0 0 0 ...

This concludes our EDA and Data Wrangling of all variables in the data.


Multicollinearity of Independent Variables & Linear Relations to Y

We’re going to start with 2 custom functions to create a correlation heat map for all variables (numeric or factor).

calculate_association takes any 2 variables and performs a series of else if statements to calculate association/correlation.

  1. If both variables are identical, returns 1.

  2. If both variables are numeric, returns Pearson correlation.

  3. If variables are 1 factor & 1 numeric,

    3.1. If factor = 2 levels, returns point-biserial correlation.

    3.2. If factor > 2 levels, returns eta-squared of ANOVA.

  4. If both variables are factors (any levels), returns Cramer’s V correlation.

calculate_association <- function(x, y) {
  if (identical(x, y)) { # TEST 1
    return(c("identity" = 1))
    
  } else if (is.numeric(x) && is.numeric(y)) {       # TEST 2
    return(c("pearson" = cor(x, y, use = "pairwise.complete.obs"))) 
    
  } else if ((is.numeric(x) && is.factor(y)) || (is.factor(x) && is.numeric(y))) { # TEST 3
    # Swap if necessary to ensure x is factor and y is numeric
    if (is.numeric(x)) {
      temp <- x
      x <- y
      y <- temp
    }
    
    if (length(levels(x)) == 2) { # TEST 3.1
      # Point-Biserial Correlation for binary factor
      return(c("point-biserial" = cor(y, as.integer(x) - 1)))
    } else {                      # TEST 3.2
      # Use eta-squared for factors with more than two levels
      anova_result <- summary(aov(y ~ x))  
      sum_sq <- anova_result[[1]][["Sum Sq"]]
      eta_squared <- sum_sq[1] / sum(sum_sq)
      return(c("eta-squared" = eta_squared))
    }
    
  } else {                        # TEST 4
    # Assuming both variables are factors
    tbl <- table(x, y)
    chi_squared <- chisq.test(tbl)$statistic
    n <- sum(tbl)
    k <- ncol(tbl)
    r <- nrow(tbl)
    cramers_v <- sqrt((chi_squared / n) / min(k - 1, r - 1))
    return(c("cramer's-v" = cramers_v))
  }
}

custom_corr_plot takes a data frame and iterates each pair of variables through the calculate_association function and then melts the matrix to a long format suitable for ggplot, and plotted with geom_tile. Text labels of correlation values are removed if = 1, black if > 0.5 or < -0.5, else gray.

custom_corr_plot <- function(data) {
  var_names <- names(data) 
  n <- length(var_names)
  assoc_matrix <- matrix(NA, ncol=n, nrow=n)
  
  for (i in 1:n) {
    for (j in 1:n) {
      assoc_matrix[i, j] <- calculate_association(data[[var_names[i]]], data[[var_names[j]]])
    }
  }
  
  colnames(assoc_matrix) <- var_names
  rownames(assoc_matrix) <- var_names
  
  # Melt the matrix
  melted <- reshape2::melt(assoc_matrix)
  
  # Plot the heatmap
  p <- ggplot(data = melted, aes(x = Var2, y = Var1, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                         midpoint = 0, limit = c(-1,1), space = "Lab", 
                         name="Association") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                     size = 10, hjust = 1),
          axis.text.y = element_text(size = 10),
          axis.title = element_blank()) +
    coord_fixed()
  
  # Add text labels for associations >= 0.5 or <= -0.5, but not equal to 1, in black
  p <- p + geom_text(data = subset(melted, (value >= 0.5 | value <= -0.5) & value != 1), 
                    aes(x = Var2, y = Var1, label = round(value, 2)), 
                    size = 4, color = "black")
  
  # Add text labels for associations < 0.5 and > -0.5 and not equal to 1, in gray
  p <- p + geom_text(data = subset(melted, value < 0.5 & value > -0.5 & value != 1), 
                    aes(x = Var2, y = Var1, label = round(value, 2)), 
                    size = 4, color = "darkgray")
  
return(p)

}
movies_df %>% select(-title) %>% custom_corr_plot()

The custom correlation plot reveals the release date variables have the lowest association with the response variable, followed by studio.

There is also very high coliniarity between (critics_rating & audience_rating), (critics_score & critics_rating), as well as (audience_score & critics_score).

We know the audience_rating and critics_rating are derived from audience_score and critics_score respectively. And that audience_rating is a 2 level factor, while critics_score is a 3 level factor.

Since critics_score has 3 levels, and we actually have some clarity on how it is calculated, unlike audience_rating, we will omit audience_rating and critics_score from the model.

movies_df <- movies_df %>% select(-thtr_rel_decade, -thtr_rel_month, -thtr_rel_day, 
                                  -dvd_rel_decade, -dvd_rel_month, -dvd_rel_day,
                                  -audience_rating, -critics_score)

Modeling

We are going to build a Multiple Linear Regression model using the backwardation method to optimize the p-values of the independent variables. At the end we will pick the model with the highest \(R^{2}_{\text{adj}}\).

Dependent Variable

  1. imdb_rating (numeric: 1.0 - 10.0)

Independent Variables

  1. title_type (categorical: 3 levels)

  2. genre (categorical: 11 levels)

  3. runtime (numeric)

  4. mpaa_rating (categorical: 5 levels)

  5. critics_rating (categorical: 3 levels)

  6. audience_score (numeric)

  7. studio (categorical: 9 levels)

  8. imdb_num_votes (numeric (log))

  9. best_pic_nom (binary)

  10. best_pic_win (binary)

  11. best_actor_win (binary)

  12. best_actress_win (binary)

  13. best_dir_win (binary)

  14. top200_box (binary)

We are going to split the training and test sets at a 70/30 ratio.

df <- movies_df %>% select(-title)

set.seed(321)
sample <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7,0.3))
train  <- df[sample, ]
test   <- df[!sample, ]

Full Model

# Full model has 14 independent variables
full_model <- lm(imdb_rating ~ ., data = train)

# Model summary
summary(full_model)
## 
## Call:
## lm(formula = imdb_rating ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.58045 -0.19563  0.01808  0.22459  0.96109 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.790920   0.242004  15.665  < 2e-16 ***
## title_typeFeature Film         -0.361593   0.150837  -2.397 0.016967 *  
## title_typeTV Movie             -0.066415   0.302173  -0.220 0.826143    
## genreAnimation                 -0.703019   0.206352  -3.407 0.000722 ***
## genreArt House & International  0.306149   0.150763   2.031 0.042935 *  
## genreComedy                    -0.116980   0.076940  -1.520 0.129185    
## genreDocumentary                0.290625   0.161918   1.795 0.073408 .  
## genreDrama                      0.114183   0.069190   1.650 0.099654 .  
## genreHorror                     0.114201   0.112601   1.014 0.311081    
## genreMusical & Performing Arts  0.133706   0.144280   0.927 0.354623    
## genreMystery & Suspense         0.160031   0.085144   1.880 0.060881 .  
## genreOther                     -0.008754   0.136196  -0.064 0.948781    
## genreScience Fiction & Fantasy  0.023126   0.153195   0.151 0.880085    
## runtime                         0.003823   0.001036   3.691 0.000254 ***
## mpaa_ratingPG                  -0.126689   0.140587  -0.901 0.368042    
## mpaa_ratingPG-13               -0.156071   0.147017  -1.062 0.289049    
## mpaa_ratingR                   -0.065670   0.142603  -0.461 0.645399    
## mpaa_ratingUnrated             -0.112043   0.156357  -0.717 0.474040    
## studioFox                      -0.065947   0.127433  -0.518 0.605083    
## studioMGM                      -0.219163   0.141272  -1.551 0.121591    
## studioMiramax                   0.047381   0.140632   0.337 0.736355    
## studioOther                    -0.006358   0.116212  -0.055 0.956397    
## studioParamount                -0.036883   0.125350  -0.294 0.768725    
## studioSony                      0.049284   0.129577   0.380 0.703886    
## studioUniversal                 0.013468   0.135414   0.099 0.920823    
## studioWarner                    0.002171   0.123915   0.018 0.986031    
## imdb_num_votes                  0.061954   0.015397   4.024 6.82e-05 ***
## critics_ratingFresh             0.275438   0.047344   5.818 1.20e-08 ***
## critics_ratingCertified Fresh   0.308775   0.061666   5.007 8.23e-07 ***
## audience_score                  0.031330   0.001305  24.015  < 2e-16 ***
## best_pic_nom                    0.023351   0.115398   0.202 0.839746    
## best_pic_win                   -0.015571   0.248181  -0.063 0.950003    
## best_actor_win                  0.033094   0.053414   0.620 0.535879    
## best_actress_win                0.029660   0.058730   0.505 0.613815    
## best_dir_win                    0.046658   0.074883   0.623 0.533578    
## top200_box                     -0.050425   0.120443  -0.419 0.675685    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3699 on 409 degrees of freedom
## Multiple R-squared:  0.8486, Adjusted R-squared:  0.8356 
## F-statistic: 65.49 on 35 and 409 DF,  p-value: < 2.2e-16
# Predictions
predictions <- predict(full_model, newdata = test)

# Residuals
residuals <- test$imdb_rating - predictions

# RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("Root Mean Squared Error: ", rmse))
## [1] "Root Mean Squared Error:  0.412579536589289"
# RMSE Percentage from Mean
mean <- mean(train$imdb_rating)
pct_rmse <- (rmse / mean) * 100

# R-Squared
ssr <- sum((test$imdb_rating - predictions)^2)
sst <- sum((test$imdb_rating - mean(test$imdb_rating))^2)
rsq <- 1 - (ssr / sst)
print(paste("R-Squared: ", rsq))
## [1] "R-Squared:  0.817427989671253"
# Adjusted R-Squared
n <- length(test$imdb_rating) # Number of observations
p <- length(full_model$model) - 1 # Number of predictors (excluding the constant term)
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)
print(paste("Adjusted R-Squared: ", adj_rsq))
## [1] "Adjusted R-Squared:  0.80239264764418"
# Table to store results
results <- tibble("model" = NA, "rmse" = NA, "pct_rmse" = NA, "rsq" = NA, "adj_rsq" = NA) 

# Store results
results <- rbind(results, c("Full Model", rmse, pct_rmse, rsq, adj_rsq)) %>% drop_na()

The full model has a RMSE of 0.41, meaning our mean error for predictions of imdb_rating are 0.41, the Adjusted R-Squared for this model was 0.80, meaning 80% of the variations in the data can be explained by this model.

studio and mpaa_rating have no significant levels in the model and will be removed for the next model.

Model 2

# Model 2 has 12 independent variables
model2 <- lm(imdb_rating ~ . , 
                 data = train %>% select(-studio, -mpaa_rating))

# Model summary
summary(model2)
## 
## Call:
## lm(formula = imdb_rating ~ ., data = train %>% select(-studio, 
##     -mpaa_rating))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61115 -0.19415  0.01798  0.23983  0.91816 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.7042453  0.2084548  17.770  < 2e-16 ***
## title_typeFeature Film         -0.3757960  0.1478562  -2.542 0.011391 *  
## title_typeTV Movie             -0.0702280  0.3012984  -0.233 0.815809    
## genreAnimation                 -0.7007946  0.1956723  -3.581 0.000382 ***
## genreArt House & International  0.3328247  0.1478415   2.251 0.024886 *  
## genreComedy                    -0.1268088  0.0751275  -1.688 0.092168 .  
## genreDocumentary                0.2723952  0.1584076   1.720 0.086243 .  
## genreDrama                      0.1202687  0.0665707   1.807 0.071534 .  
## genreHorror                     0.1302281  0.1095729   1.189 0.235304    
## genreMusical & Performing Arts  0.1295539  0.1424317   0.910 0.363561    
## genreMystery & Suspense         0.1817804  0.0824206   2.206 0.027957 *  
## genreOther                     -0.0165435  0.1344770  -0.123 0.902149    
## genreScience Fiction & Fantasy  0.0453383  0.1519427   0.298 0.765552    
## runtime                         0.0035995  0.0009956   3.615 0.000336 ***
## imdb_num_votes                  0.0608258  0.0145287   4.187 3.45e-05 ***
## critics_ratingFresh             0.2803119  0.0468880   5.978 4.81e-09 ***
## critics_ratingCertified Fresh   0.3304061  0.0595687   5.547 5.15e-08 ***
## audience_score                  0.0315804  0.0012922  24.440  < 2e-16 ***
## best_pic_nom                    0.0051778  0.1145458   0.045 0.963967    
## best_pic_win                    0.0094640  0.2454067   0.039 0.969256    
## best_actor_win                  0.0300258  0.0528444   0.568 0.570207    
## best_actress_win                0.0256085  0.0578457   0.443 0.658208    
## best_dir_win                    0.0582637  0.0740399   0.787 0.431770    
## top200_box                     -0.0620662  0.1156506  -0.537 0.591779    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3705 on 421 degrees of freedom
## Multiple R-squared:  0.8436, Adjusted R-squared:  0.8351 
## F-statistic: 98.76 on 23 and 421 DF,  p-value: < 2.2e-16
# Predictions
predictions <- predict(model2, newdata = test)

# Residuals
residuals <- test$imdb_rating - predictions

# RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("Root Mean Squared Error: ", rmse))
## [1] "Root Mean Squared Error:  0.416620609888723"
# RMSE Percentage from Mean
mean <- mean(train$imdb_rating)
pct_rmse <- (rmse / mean) * 100

# R-Squared
ssr <- sum((test$imdb_rating - predictions)^2)
sst <- sum((test$imdb_rating - mean(test$imdb_rating))^2)
rsq <- 1 - (ssr / sst)
print(paste("R-Squared: ", rsq))
## [1] "R-Squared:  0.813834015692678"
# Adjusted R-Squared
n <- length(test$imdb_rating) # Number of observations
p <- length(model2$model) - 1 # Number of predictors (excluding the constant term)
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)
print(paste("Adjusted R-Squared: ", adj_rsq))
## [1] "Adjusted R-Squared:  0.80084569120612"
# Store results
results <- rbind(results, c("Model 2", rmse, pct_rmse, rsq, adj_rsq)) 

Since the p-value for top200_box, best_pic_nom, and best_pic_win are all extraordinarily high, we remove all 3 on the next iteration.

Model 3

# Model 3 has 9 independent variables
model3 <- lm(imdb_rating ~ . , 
                 data = train %>% select(-studio, -mpaa_rating,
                                         -top200_box, -best_pic_nom, -best_pic_win))

# Model summary
summary(model3)
## 
## Call:
## lm(formula = imdb_rating ~ ., data = train %>% select(-studio, 
##     -mpaa_rating, -top200_box, -best_pic_nom, -best_pic_win))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.60980 -0.19122  0.01764  0.24198  0.91975 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.7077604  0.2058612  18.011  < 2e-16 ***
## title_typeFeature Film         -0.3745175  0.1473637  -2.541 0.011395 *  
## title_typeTV Movie             -0.0699695  0.3002957  -0.233 0.815872    
## genreAnimation                 -0.6941699  0.1946752  -3.566 0.000404 ***
## genreArt House & International  0.3381596  0.1470126   2.300 0.021921 *  
## genreComedy                    -0.1214962  0.0742509  -1.636 0.102520    
## genreDocumentary                0.2781773  0.1575180   1.766 0.078115 .  
## genreDrama                      0.1263327  0.0653955   1.932 0.054047 .  
## genreHorror                     0.1362665  0.1086765   1.254 0.210579    
## genreMusical & Performing Arts  0.1360975  0.1412795   0.963 0.335936    
## genreMystery & Suspense         0.1879618  0.0813948   2.309 0.021409 *  
## genreOther                     -0.0096501  0.1332275  -0.072 0.942291    
## genreScience Fiction & Fantasy  0.0438712  0.1514347   0.290 0.772185    
## runtime                         0.0035830  0.0009803   3.655 0.000289 ***
## imdb_num_votes                  0.0598607  0.0143637   4.167 3.73e-05 ***
## critics_ratingFresh             0.2797171  0.0467129   5.988 4.53e-09 ***
## critics_ratingCertified Fresh   0.3272039  0.0582933   5.613 3.60e-08 ***
## audience_score                  0.0315893  0.0012821  24.639  < 2e-16 ***
## best_actor_win                  0.0292446  0.0525009   0.557 0.577800    
## best_actress_win                0.0258559  0.0572403   0.452 0.651710    
## best_dir_win                    0.0605614  0.0728118   0.832 0.406017    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3693 on 424 degrees of freedom
## Multiple R-squared:  0.8435, Adjusted R-squared:  0.8362 
## F-statistic: 114.3 on 20 and 424 DF,  p-value: < 2.2e-16
# Predictions
predictions <- predict(model3, newdata = test)

# Residuals
residuals <- test$imdb_rating - predictions

# RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("Root Mean Squared Error: ", rmse))
## [1] "Root Mean Squared Error:  0.417005268418771"
# RMSE Percentage from Mean
mean <- mean(train$imdb_rating)
pct_rmse <- (rmse / mean) * 100

# R-Squared
ssr <- sum((test$imdb_rating - predictions)^2)
sst <- sum((test$imdb_rating - mean(test$imdb_rating))^2)
rsq <- 1 - (ssr / sst)
print(paste("R-Squared: ", rsq))
## [1] "R-Squared:  0.813490089394011"
# Adjusted R-Squared
n <- length(test$imdb_rating) # Number of observations
p <- length(model3$model) - 1 # Number of predictors (excluding the constant term)
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)
print(paste("Adjusted R-Squared: ", adj_rsq))
## [1] "Adjusted R-Squared:  0.803898151134274"
# Store results
results <- rbind(results, c("Model 3", rmse, pct_rmse, rsq, adj_rsq)) 

The last 3 binary variables are dropped for the next iteration.

Model 4

# Model 4 has 6 independent variables
model4 <- lm(imdb_rating ~ . , 
                 data = train %>% select(-studio, -mpaa_rating,
                                         -top200_box, -best_pic_nom, -best_pic_win,
                                         -best_actor_win, -best_actress_win, -best_dir_win))

# Model summary
summary(model4)
## 
## Call:
## lm(formula = imdb_rating ~ ., data = train %>% select(-studio, 
##     -mpaa_rating, -top200_box, -best_pic_nom, -best_pic_win, 
##     -best_actor_win, -best_actress_win, -best_dir_win))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61836 -0.19334  0.02005  0.24308  0.90478 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.6650865  0.2020456  18.140  < 2e-16 ***
## title_typeFeature Film         -0.3736089  0.1470474  -2.541 0.011415 *  
## title_typeTV Movie             -0.0691443  0.2987544  -0.231 0.817082    
## genreAnimation                 -0.6938945  0.1942445  -3.572 0.000394 ***
## genreArt House & International  0.3397787  0.1462813   2.323 0.020660 *  
## genreComedy                    -0.1169948  0.0736708  -1.588 0.113009    
## genreDocumentary                0.2801549  0.1570645   1.784 0.075184 .  
## genreDrama                      0.1327046  0.0644934   2.058 0.040231 *  
## genreHorror                     0.1337597  0.1083805   1.234 0.217819    
## genreMusical & Performing Arts  0.1323506  0.1408817   0.939 0.348033    
## genreMystery & Suspense         0.1993319  0.0802253   2.485 0.013350 *  
## genreOther                     -0.0045928  0.1327615  -0.035 0.972420    
## genreScience Fiction & Fantasy  0.0450725  0.1509859   0.299 0.765450    
## runtime                         0.0038872  0.0009404   4.133 4.30e-05 ***
## imdb_num_votes                  0.0615548  0.0142459   4.321 1.94e-05 ***
## critics_ratingFresh             0.2836416  0.0464545   6.106 2.30e-09 ***
## critics_ratingCertified Fresh   0.3288815  0.0581081   5.660 2.78e-08 ***
## audience_score                  0.0315546  0.0012772  24.706  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3686 on 427 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.8368 
## F-statistic: 134.9 on 17 and 427 DF,  p-value: < 2.2e-16
# Predictions
predictions <- predict(model4, newdata = test)

# Residuals
residuals <- test$imdb_rating - predictions

# RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("Root Mean Squared Error: ", rmse))
## [1] "Root Mean Squared Error:  0.418305186764801"
# RMSE Percentage from Mean
mean <- mean(train$imdb_rating)
pct_rmse <- (rmse / mean) * 100

# R-Squared
ssr <- sum((test$imdb_rating - predictions)^2)
sst <- sum((test$imdb_rating - mean(test$imdb_rating))^2)
rsq <- 1 - (ssr / sst)
print(paste("R-Squared: ", rsq))
## [1] "R-Squared:  0.812325473210955"
# Adjusted R-Squared
n <- length(test$imdb_rating) # Number of observations
p <- length(model4$model) - 1 # Number of predictors (excluding the constant term)
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)
print(paste("Adjusted R-Squared: ", adj_rsq))
## [1] "Adjusted R-Squared:  0.805999365566381"
# Store results
results <- rbind(results, c("Model 4", rmse, pct_rmse, rsq, adj_rsq)) 

Let’s see how each model compares to the full.

# Change results columns to numeric and round
results %>% mutate_at(vars(2:5), as.numeric) %>% 
  mutate_if(is.numeric, round, digits = 4)
## # A tibble: 4 × 5
##   model       rmse pct_rmse   rsq adj_rsq
##   <chr>      <dbl>    <dbl> <dbl>   <dbl>
## 1 Full Model 0.413     6.27 0.817   0.802
## 2 Model 2    0.417     6.33 0.814   0.801
## 3 Model 3    0.417     6.34 0.814   0.804
## 4 Model 4    0.418     6.36 0.812   0.806

While the full model provided the lowest RMSE, indicating the smallest average prediction error, it also had the lowest adjusted R-squared value. This suggests that, despite its lower prediction error, the full model may be overfitting to the training data and may not generalize as well to new data. On the other hand, Model 4, despite having a slightly higher RMSE, had the highest adjusted R-squared value, suggesting that it may provide a better balance between fit and generalizability.


Predictions

Our final test of predictions is based upon 3 films released in 2016 that were not included in the original data.

Title URLs
Doctor Strange Rotten Tomatoes
IMDb
Zoolander 2 Rotten Tomatoes
IMDb
Deadpool Rotten Tomatoes
IMDb
# New data frame for outside data

new_test <- tibble(title = c("Doctor Strange","Zoolander 2", "Deadpool"),
              title_type = c("Feature Film", "Feature Film", "Feature Film"),
                   genre = c("Action & Adventure", "Comedy", "Action & Adventure"),
                 runtime = c(115, 102, 108),
             mpaa_rating = c("PG-13", "PG-13", "R"),
                  studio = c("Disney", "Paramount", "Fox"),
           thtr_rel_year = c(2016, 2016, 2016),
          thtr_rel_month = c(NA, NA, NA),
            thtr_rel_day = c(NA, NA, NA),
            dvd_rel_year = c(NA, NA, NA),
           dvd_rel_month = c(NA, NA, NA),
             dvd_rel_day = c(NA, NA, NA),
             imdb_rating = c(7.5, 4.7, 8),
          imdb_num_votes = log(c(769000, 73000, 1100000)),
          critics_rating = factor(c("Certified Fresh", "Rotten", "Certified Fresh"), 
                                  levels = c("Rotten", "Fresh", "Certified Fresh")),
           critics_score = c(NA, NA, NA),
         audience_rating = c(NA, NA, NA),
          audience_score = c(85, 20, 90),
            best_pic_nom = c(0, 0, 0),
            best_pic_win = c(0, 0, 0),
          best_actor_win = c(0, 0, 1),
        best_actress_win = c(1, 0, 1),
            best_dir_win = c(0, 0, 0),
              top200_box = c(0, 1, 0),
                director = c(NA, NA, NA),
                  actor1 = c(NA, NA, NA),
                  actor2 = c(NA, NA, NA),
                  actor3 = c(NA, NA, NA),
                  actor4 = c(NA, NA, NA),
                  actor5 = c(NA, NA, NA),
                imdb_url = c(NA, NA, NA),
                  rt_url = c(NA, NA, NA))


# Predictions for new data with all 4 models

new_test$full_model_y <- predict(full_model, newdata = new_test) 
new_test$full_model_res <- new_test$imdb_rating - new_test$full_model_y

new_test$model2_y <- predict(model2, newdata = new_test)
new_test$model2_res <- new_test$imdb_rating - new_test$model2_y

new_test$model3_y <- predict(model3, newdata = new_test)
new_test$model3_res <- new_test$imdb_rating - new_test$model3_y

new_test$model4_y <- predict(model4, newdata = new_test)
new_test$model4_res <- new_test$imdb_rating - new_test$model4_y

How did each model perform on new data?

# All model predictions
new_test %>% select(title, imdb_rating, 
                    full_model_y, 
                    model2_y, 
                    model3_y, 
                    model4_y)  %>% 
  mutate_if(is.numeric, round, digits = 3)
## # A tibble: 3 × 6
##   title          imdb_rating full_model_y model2_y model3_y model4_y
##   <chr>                <dbl>        <dbl>    <dbl>    <dbl>    <dbl>
## 1 Doctor Strange         7.5         7.55     7.61     7.60     7.58
## 2 Zoolander 2            4.7         4.78     4.82     4.88     4.89
## 3 Deadpool               8           7.76     7.79     7.78     7.74
# All models residuals
new_test %>% select(title, imdb_rating, 
                    full_model_res,
                    model2_res,
                    model3_res,
                    model4_res)  %>% 
  mutate_if(is.numeric, round, digits = 3)
## # A tibble: 3 × 6
##   title          imdb_rating full_model_res model2_res model3_res model4_res
##   <chr>                <dbl>          <dbl>      <dbl>      <dbl>      <dbl>
## 1 Doctor Strange         7.5         -0.054     -0.107     -0.095     -0.084
## 2 Zoolander 2            4.7         -0.079     -0.119     -0.179     -0.191
## 3 Deadpool               8            0.236      0.208      0.222      0.264
# Average residuals per model on new data
tibble(full_model = abs(new_test$full_model_res),
       model2     = abs(new_test$model2_res),
       model3     = abs(new_test$model3_res),
       model4     = abs(new_test$model4_res)) %>% 
  colSums() / 3
## full_model     model2     model3     model4 
##  0.1232441  0.1449902  0.1652650  0.1795911

The Full Model appears to outperform on our test data and our outside data despite its lower \(R^{2}_{\text{adj}}\). The RMSE for all 4 models in the outside data is significantly lower than the 0.41 error in the test data. Therefore, Model 4 will be chosen as the final parsimonious model.


Diagnostics

We put the model data in its own data frame to make plotting the diagnostics easier.

# Put residuals & fitted values into data frame for ggplot 

model_data <- data.frame(residuals = model4$residuals, 
                         fitted_values = model4$fitted.values, 
                         index = 1:length(model4$residuals))

Linear Relelationships Between Y & X (numeric)

# plot(full_model$residuals ~ train$runtime)

ggplot(train, aes(x = runtime, y = model_data$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color="red", linetype="dashed") +
  labs(y="Residuals") +
  theme_bw()

# plot(full_model$residuals ~ train$imdb_num_votes)

ggplot(train, aes(x = imdb_num_votes, y = model_data$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(y = "Residuals") +
  theme_bw()

# plot(full_model$residuals ~ train$audience_score)

ggplot(train, aes(x = audience_score, y = model_data$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(y = "Residuals") +
  theme_bw()

All 3 numeric variables are randomly dispersed when plotted against the model residuals. This implies linearity in the numeric predictors and the response variable, and homoscedasticity which implies a constant variance accross all 3 numeric variables.

The residuals of the model, when plotted against each of the three numeric predictors, are randomly dispersed. This pattern suggests

  1. Linearity - each predictor has a linear relationship with the response variable.

  2. Homoscedasticity - the variance of the errors is constant across all levels of these predictors.


Nearly Normal Residuals

# hist(full_model$residuals)

ggplot(model_data, aes(x = residuals)) +
  geom_histogram(binwidth = 0.1, fill = "red", color = "black", alpha = 0.7) +
  labs(title = paste("Histogram of Residuals")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) 

A histogram of residuals reveals nearly normalization with a slight left skew and a mean = 0.

ggplot(model_data, aes(sample = residuals)) +
  geom_qq_line(color = "red", linetype = "dashed") +
  geom_qq(size = 2, color = "black", alpha = 0.5) +
  labs(title = "Normal Q-Q Plot of Residuals",
           x = "Theoretical Quantiles",
           y = "Sample Quantiles") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

The Normal Q-Q Plot suggests normality with a majority of points along the line with a slight skew still visible on the left.


Constant Variability of Reisudals

# plot(full_model$residuals ~ full_model$fitted.values)

ggplot(model_data, aes(x = fitted_values, y = residuals)) +
  geom_point(color = "black", fill = "darkgray", size = 2) +
  geom_hline(yintercept = 0, color="red", linetype = "dashed") +
  labs(x = "Fitted Values", y = "Residuals") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

# plot(abs(full_model$residuals) ~ full_model$fitted.values)

ggplot(model_data, aes(x = fitted_values, y = abs(residuals))) +
  geom_point(color = "black", fill = "darkgray", size = 2) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Fitted Values", y = "Residuals") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

A check for constant variability of residuals along the predicted values suggests that the model does not perform any better or worse in certain ranges of the predicted values.


Independence of Residuals

# plot(full_model$residuals)

ggplot(model_data, aes(x = index, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Index", y = "Residuals") +
  theme_bw()

The random scatter around 0 implies that the residuals in the model are independent of themselves.


Conclusion

The primary objective of this analysis was to develop a Multiple Linear Regression model capable of predicting IMDb movie ratings using various independent variables related to the movie’s characteristics and critics.

The full model, which included all 14 independent variables, achieved a Root Mean Squared Error (RMSE) of 0.4109 and an Adjusted R-squared value of 0.802. This model performed relatively well in terms of prediction accuracy, as evidenced by its low RMSE.

Through a series of iterations, we eliminated variables with high p-values, multicollinearity issues, and minimal contributions to the model’s performance. This process led to the development of Model 4, which included six independent variables: title_type, genre, runtime, critics_rating, audience_score, and imdb_num_votes. Despite having a slightly higher RMSE of 0.418, Model 4 achieved the highest Adjusted R-squared value of 0.806, indicating a better balance between fit and generalizability.

When tested on a small set of new movies released in 2016, the full model outperformed the other models in terms of prediction accuracy, suggesting that it may be better suited for predicting ratings for new movies not included in the original dataset.

Overall, the analysis successfully developed multiple regression models capable of predicting IMDb movie ratings with reasonable accuracy. The choice between the full model and Model 4 would depend on the specific requirements and constraints of the use case. If the primary objective is to achieve the highest possible prediction accuracy, the full model may be preferred, despite the risk of overfitting. On the other hand, if generalization and interpretability are more important considerations, Model 4 could be the better choice due to its higher Adjusted R-squared value and simpler structure.

Future attempts at model construction could involve exploring alternative modeling techniques, such as ensemble methods or machine learning algorithms, to potentially improve prediction accuracy further. Additionally, incorporating more diverse and up-to-date data sources, as well as considering additional movie-related features, could enhance the models’ predictive capabilities.